I noticed Pandas now has support for Sparse Matrices and Arrays. Currently, I create DataFrame()s like this:
return DataFrame(matrix.toarray(), columns=features, index=observations)
Is there a way to create a SparseDataFrame() with a scipy.sparse.csc_matrix() or csr_matrix()? Converting to dense format kills RAM badly. Thanks!
Answers:
Method 1
A direct conversion is not supported at the moment. Contributions are welcome!
Try this. It should be OK on memory, as a SparseSeries is much like a csc_matrix (for one column) and is pretty space efficient.
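In [33]: import numpy as np
In [34]: import pandas as pd
In [35]: from scipy.sparse import csc_matrix
In [36]: row = np.array([0,2,2,0,1,2])
In [37]: col = np.array([0,0,1,2,2,2])
In [38]: data = np.array([1,2,3,4,5,6],dtype='float64')
In [39]: m = csc_matrix( (data,(row,col)), shape=(3,3) )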
In [40]: m
Out[40]:
<3x3 sparse matrix of type '<type 'numpy.float64'>'
with 6 stored elements in Compressed Sparse Column format>
In [46]: pd.SparseDataFrame([ pd.SparseSeries(m[i].toarray().ravel())
                              for i in np.arange(m.shape[0]) ])
Out[46]:
   0  1  2
0  1  0  4
1  0  0  5
2  2  3  6
In [47]: df = pd.SparseDataFrame([ pd.SparseSeries(m[i].toarray().ravel())
                                   for i in np.arange(m.shape[0]) ])
In [48]: type(df)
Out[48]: pandas.sparse.frame.SparseDataFrame
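SparseSeries and SparseDataFrame were removed in pandas 1.0, so the session above only runs on older releases. Here is a minimal sketch of the same one-column-at-a-time idea for current pandas. It assumes pd.arrays.SparseArray with fill_value=0 as a stand-in for SparseSeries, which is a swap that is not part of the original answer. The row values are inferred from the printed matrix.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csc_matrix

# Same 3x3 matrix as in the session above (row is inferred from its output).
row = np.array([0, 2, 2, 0, 1, 2])
col = np.array([0, 0, 1, 2, 2, 2])
data = np.array([1, 2, 3, 4, 5, 6], dtype="float64")
m = csc_matrix((data, (row, col)), shape=(3, 3))

# Densify one column at a time and store it as a SparseArray,
# so only a single dense column exists in memory at any moment.
df = pd.DataFrame(
    {
        j: pd.arrays.SparseArray(m[:, j].toarray().ravel(), fill_value=0)
        for j in range(m.shape[1])
    }
)
```

Slicing columns is the cheap direction for a csc_matrix, which is why this sketch walks over columns rather than rows.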
Method 2
As of pandas v0.20.0, you can pass a scipy.sparse matrix directly to the SparseDataFrame constructor.
An example from the pandas docs:
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

arr = np.random.random(size=(1000, 5))
arr[arr < .9] = 0
sp_arr = csr_matrix(arr)
sdf = pd.SparseDataFrame(sp_arr)
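On pandas 1.0 and later the SparseDataFrame class no longer exists. The replacement, available since pandas 0.25, is the DataFrame.sparse.from_spmatrix constructor. A hedged sketch follows. The features and observations lists are hypothetical placeholders for the labels mentioned in the question, and the seeded generator is only there to make the example repeatable.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Build a mostly-zero matrix, as in the docs example above.
rng = np.random.default_rng(0)
arr = rng.random(size=(1000, 5))
arr[arr < 0.9] = 0
sp_arr = csr_matrix(arr)

# Hypothetical labels standing in for the question's `features` and `observations`.
features = ["f%d" % i for i in range(sp_arr.shape[1])]
observations = ["obs%d" % i for i in range(sp_arr.shape[0])]

# Every column becomes a sparse column; the matrix is never fully densified.
sdf = pd.DataFrame.sparse.from_spmatrix(
    sp_arr, index=observations, columns=features
)

# The .sparse accessor converts back to a scipy COO matrix when needed.
back = sdf.sparse.to_coo()
```

The result is an ordinary DataFrame whose columns have a sparse dtype, so the usual indexing and selection work on it.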
Method 3
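A much shorter version, but note that toarray() materializes the full dense array, so it does not avoid the memory cost raised in the question: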
df = pd.DataFrame(m.toarray())
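To get a feel for what the dense route costs, here is a rough comparison sketch. It assumes pandas 0.25 or later (for DataFrame.sparse.from_spmatrix) and uses a made-up 2000x50 matrix with about 1% non-zero entries. The sizes and the density are illustrative assumptions, not figures from the question.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csc_matrix

# Illustrative matrix: roughly 1% of the entries are non-zero.
rng = np.random.default_rng(0)
arr = rng.random(size=(2000, 50))
arr[arr < 0.99] = 0
m = csc_matrix(arr)

# Dense route: every zero is stored explicitly as a float64.
dense_df = pd.DataFrame(m.toarray())

# Sparse route: only the non-zero values and their positions are stored.
sparse_df = pd.DataFrame.sparse.from_spmatrix(m)

dense_bytes = int(dense_df.memory_usage(index=False).sum())
sparse_bytes = int(sparse_df.memory_usage(index=False).sum())
print("dense bytes:", dense_bytes)
print("sparse bytes:", sparse_bytes)
```

The gap grows with the share of zeros, so for a matrix that is mostly non-zero the dense version can be the better choice.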
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.