“Large data” workflows using pandas
Dask emphasizes the following virtues:
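As a rough illustration of why Dask comes up in "large data with pandas" discussions, here is a minimal sketch of the pandas-like API working out of core; the file pattern and column names are hypothetical, not from the original question:

```python
# Minimal sketch of dask.dataframe as a drop-in for pandas on data that
# does not fit in memory. "data-*.csv", "key" and "value" are placeholders.
import dask.dataframe as dd

# Lazily build a DataFrame over many CSV files without loading them all.
df = dd.read_csv("data-*.csv")

# Operations look like pandas but only build a task graph.
result = df.groupby("key")["value"].mean()

# Nothing is read or computed until .compute() is called.
print(result.compute())
```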
I have a reasonably sized (18 GB compressed) HDF5 dataset and am looking to optimize reading rows for speed. The shape is (639038, 10000). I will be reading a selection of rows (say ~1000 rows) many times, scattered across the dataset, so I can't use x:(x+1000) to slice rows.
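One common approach is to pass a sorted list of row indices to h5py, which supports fancy indexing along one axis as long as the indices are increasing. A sketch, assuming the dataset is stored under the name "data" in the file:

```python
# Sketch: reading ~1000 scattered rows from a large HDF5 dataset with h5py.
# The file path and the dataset name ("data") are assumptions.
import numpy as np
import h5py

# h5py fancy indexing requires the indices to be in increasing order.
row_indices = np.sort(np.random.choice(639038, size=1000, replace=False))

with h5py.File("big_dataset.h5", "r") as f:
    dset = f["data"]                 # shape (639038, 10000)
    rows = dset[row_indices, :]      # -> ndarray of shape (1000, 10000)
```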
I am trying to read data from an HDF5 file in Python. I can open the file using h5py, but I cannot figure out how to access the data within it.
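For reference, a minimal sketch of inspecting an HDF5 file's contents with h5py; the file and dataset names below are placeholders, so list the keys of your own file first:

```python
# Sketch: listing the contents of an HDF5 file and reading one dataset.
# "example.h5" and "some_dataset" are placeholders for your own names.
import h5py

with h5py.File("example.h5", "r") as f:
    # Top-level groups and datasets are exposed like dictionary keys.
    print(list(f.keys()))

    dset = f["some_dataset"]     # pick a key from the list above
    print(dset.shape, dset.dtype)

    data = dset[()]              # read the whole dataset into a NumPy array
```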
I’m trying to save bottleneck values to a newly created HDF5 file. The bottleneck values come in batches of shape (120, 10, 10, 2048). Saving a single batch alone takes up more than 16 GB, and Python seems to freeze on that one batch. Based on recent findings (see update), it seems that HDF5 using a lot of memory is okay, but the freezing appears to be a glitch.
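A common pattern for this kind of workload is to create a chunked, resizable dataset once and append each batch as it arrives, so only one batch is held in memory at a time. A sketch, with the file name, dataset name, dtype, and the batch generator all assumed rather than taken from the original question:

```python
# Sketch: appending batches of shape (120, 10, 10, 2048) to a resizable
# HDF5 dataset. Names, dtype and the batch source are assumptions.
import numpy as np
import h5py

batch_shape = (120, 10, 10, 2048)

def iter_batches(num_batches=3):
    """Placeholder for however the bottleneck batches are actually produced."""
    for _ in range(num_batches):
        yield np.random.rand(*batch_shape).astype("float32")

with h5py.File("bottlenecks.h5", "w") as f:
    dset = f.create_dataset(
        "bottleneck_values",
        shape=(0,) + batch_shape[1:],
        maxshape=(None,) + batch_shape[1:],   # unlimited along axis 0
        chunks=(1,) + batch_shape[1:],        # one sample per chunk (~800 KB)
        dtype="float32",
    )
    for batch in iter_batches():
        n = dset.shape[0]
        dset.resize(n + batch.shape[0], axis=0)
        dset[n:] = batch                      # write the batch, then discard it
```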
I have a struct array created by MATLAB and stored in a v7.3 format MAT-file:
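Since v7.3 MAT-files are HDF5 containers, they can be opened with h5py as well. A rough sketch; the variable and field names below are hypothetical, and note that MATLAB stores struct fields as groups, with strings and cell arrays appearing as object references that need extra dereferencing:

```python
# Sketch: opening a MATLAB v7.3 .mat file with h5py. "data.mat",
# "my_struct" and "some_field" are hypothetical names.
import h5py

with h5py.File("data.mat", "r") as f:
    print(list(f.keys()))             # top-level MATLAB variables

    struct_group = f["my_struct"]     # a struct appears as an HDF5 group
    print(list(struct_group.keys()))  # struct fields appear as datasets/groups

    values = struct_group["some_field"][()]   # numeric fields read directly
```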