I have a large dataframe with 423244 lines. I want to split it into 4. I tried the following code, which gave this error: ValueError: array split does not result in an equal division
for item in np.split(df, 4):
    print(item)
How can I split this dataframe into 4 groups?
Answers:
Method 1
Use np.array_split:
Docstring: Split an array into multiple sub-arrays. Please refer to the ``split`` documentation. The only difference between these functions is that ``array_split`` allows `indices_or_sections` to be an integer that does *not* equally divide the axis.
In [1]: import pandas as pd
   ...: from numpy.random import randn
In [2]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
   ...:                           'foo', 'bar', 'foo', 'foo'],
   ...:                    'B' : ['one', 'one', 'two', 'three',
   ...:                           'two', 'two', 'one', 'three'],
   ...:                    'C' : randn(8), 'D' : randn(8)})
In [3]: print(df)
A B C D
0 foo one -0.174067 -0.608579
1 bar one -0.860386 -1.210518
2 foo two 0.614102 1.689837
3 bar three -0.284792 -1.071160
4 foo two 0.843610 0.803712
5 bar two -1.514722 0.870861
6 foo one 0.131529 -0.968151
7 foo three -1.002946 -0.257468
In [4]: import numpy as np
In [5]: np.array_split(df, 3)
Out[5]:
[ A B C D
0 foo one -0.174067 -0.608579
1 bar one -0.860386 -1.210518
2 foo two 0.614102 1.689837,
A B C D
3 bar three -0.284792 -1.071160
4 foo two 0.843610 0.803712
5 bar two -1.514722 0.870861,
A B C D
6 foo one 0.131529 -0.968151
7 foo three -1.002946 -0.257468]
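For a row count that does not divide evenly, a minimal sketch along the same lines follows. It splits row positions rather than the DataFrame itself, on the assumption (worth checking against your installed versions) that some recent pandas releases emit a deprecation warning when a DataFrame is passed straight to np.array_split. The helper name split_into_n and the stand-in frame are hypothetical.

```python
import numpy as np
import pandas as pd


def split_into_n(df, n):
    """Return n consecutive sub-DataFrames whose lengths differ by at most one row."""
    # np.array_split accepts an n that does not divide the length evenly.
    position_blocks = np.array_split(np.arange(len(df)), n)
    return [df.iloc[block] for block in position_blocks]


# Stand-in data: 10 rows do not divide evenly by 4.
df = pd.DataFrame({"value": range(10)})
parts = split_into_n(df, 4)
print([len(part) for part in parts])  # [3, 3, 2, 2]
```

Concatenating the parts in order gives back the original frame, so no rows are dropped or duplicated.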
Method 2
I wanted to do the same. I first had problems with the split function, then problems installing pandas 0.15.2, so I went back to my old version and wrote a small function that works very well. I hope this helps!
# input   - df: a DataFrame, chunk_size: the number of rows per chunk
# output  - a list of DataFrames
# purpose - splits the DataFrame into smaller chunks
def split_dataframe(df, chunk_size=10000):
    chunks = list()
    # Ceiling division, so no empty trailing chunk appears when
    # len(df) is an exact multiple of chunk_size.
    num_chunks = (len(df) + chunk_size - 1) // chunk_size
    for i in range(num_chunks):
        chunks.append(df[i*chunk_size:(i+1)*chunk_size])
    return chunks
Method 3
Be aware that np.array_split(df, 3) splits the dataframe into 3 sub-dataframes, while the split_dataframe function defined in @elixir’s answer, when called as split_dataframe(df, chunk_size=3), splits the dataframe every chunk_size rows.
Example:
With np.array_split:
df = pd.DataFrame([1,2,3,4,5,6,7,8,9,10,11], columns=['TEST'])
df_split = np.array_split(df, 3)
…you get 3 sub-dataframes:
df_split[0] # 1, 2, 3, 4
df_split[1] # 5, 6, 7, 8
df_split[2] # 9, 10, 11
With split_dataframe:
df_split2 = split_dataframe(df, chunk_size=3)
…you get 4 sub-dataframes:
df_split2[0] # 1, 2, 3
df_split2[1] # 4, 5, 6
df_split2[2] # 7, 8, 9
df_split2[3] # 10, 11
Hope I’m right, and that this is useful.
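To make the distinction concrete, the following sketch prints the chunk lengths for both strategies on the same 11-row frame. The helpers by_count and by_size are hypothetical names. by_size mirrors the slicing idea in split_dataframe rather than reproducing it, and by_count splits row positions instead of passing the DataFrame to np.array_split directly.

```python
import numpy as np
import pandas as pd


def by_count(df, n_chunks):
    """Fixed number of chunks; the chunk length is derived from the data."""
    blocks = np.array_split(np.arange(len(df)), n_chunks)
    return [df.iloc[block] for block in blocks]


def by_size(df, chunk_size):
    """Fixed chunk length; the number of chunks is derived from the data."""
    return [df.iloc[i:i + chunk_size] for i in range(0, len(df), chunk_size)]


df = pd.DataFrame({"TEST": range(1, 12)})  # values 1..11
print([len(c) for c in by_count(df, 3)])  # [4, 4, 3]
print([len(c) for c in by_size(df, 3)])   # [3, 3, 3, 2]
```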
Method 4
I guess now we can use plain iloc with range for this.
chunk_size = df.shape[0] // 4  # rows per chunk
# If the row count is not a multiple of 4, the loop yields a short fifth chunk.
for start in range(0, df.shape[0], chunk_size):
    df_subset = df.iloc[start:start + chunk_size]
    process_data(df_subset)
....
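When the row count is not a multiple of 4, deriving the chunk size by integer division leaves a short extra chunk at the end. One way around that, sketched here with a hypothetical helper named iter_n_chunks, is to compute the slice boundaries with integer arithmetic so that exactly n chunks come out.

```python
import pandas as pd


def iter_n_chunks(df, n):
    """Yield exactly n consecutive iloc slices that together cover df."""
    total = len(df)
    for k in range(n):
        start = total * k // n
        stop = total * (k + 1) // n
        yield df.iloc[start:stop]


df = pd.DataFrame({"value": range(10)})
sizes = [len(chunk) for chunk in iter_n_chunks(df, 4)]
print(sizes)  # [2, 3, 2, 3]
```

The chunk lengths differ by at most one row, and the boundaries never skip or repeat a row because each stop value is the next start value.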
Method 5
Caution:
np.array_split doesn’t work with numpy 1.9.0. I checked: it works with 1.8.1.
Error:
Dataframe has no ‘size’ attribute
Method 6
You can use a list comprehension to do this in a single line:
n = 4  # number of rows per chunk, not the number of chunks
chunks = [df[i:i+n] for i in range(0, df.shape[0], n)]
Method 7
You can use groupby, assuming you have an integer enumerated index:
import math
df = pd.DataFrame(dict(sample=np.arange(99)))
rows_per_subframe = math.ceil(len(df) / 4.)
subframes = [i[1] for i in df.groupby(np.arange(len(df))//rows_per_subframe)]
Note: iterating over a groupby yields (key, dataframe) tuples, so the dataframe is the 2nd element of each tuple, hence the slightly complicated extraction.
>>> len(subframes), [len(i) for i in subframes]
(4, [25, 25, 25, 24])
Method 8
Building on @elixir’s answer, I’d suggest using a generator to avoid loading all the chunks in memory:
def chunkit(df, chunk_size=10000):
    num_chunks = len(df) // chunk_size
    if len(df) % chunk_size != 0:
        num_chunks += 1
    for i in range(num_chunks):
        yield df[i*chunk_size:(i + 1) * chunk_size]
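A usage sketch for such a generator: each chunk is consumed before the next one is sliced, so only the running results are kept around. The generator here is a compact stand-in with the hypothetical name iter_chunks rather than the exact function above, and the column name is made up for illustration.

```python
import pandas as pd


def iter_chunks(df, chunk_size=10000):
    """Lazily yield consecutive slices of at most chunk_size rows."""
    for start in range(0, len(df), chunk_size):
        yield df.iloc[start:start + chunk_size]


df = pd.DataFrame({"value": range(25)})

running_total = 0
chunk_lengths = []
for chunk in iter_chunks(df, chunk_size=10):
    running_total += int(chunk["value"].sum())
    chunk_lengths.append(len(chunk))

print(chunk_lengths)  # [10, 10, 5]
print(running_total)  # 300
```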
Method 9
I like one-liners, so @LucyDrops’ answer works for me.
However, there is one important thing: add a .copy() if the chunks should be copies of the original df parts:
chunks = [df[i:i+n].copy() for i in range(0,df.shape[0],n)]
Otherwise there is a high chance of getting the following warning when the chunks are processed further (in a loop, for example):
A value is trying to be set on a copy of a slice from a DataFrame.
(see the details in the Pandas documentation)
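A minimal sketch of the effect, assuming a small made-up frame: because each chunk is an explicit copy, adding a column to it leaves the original frame untouched.

```python
import pandas as pd

df = pd.DataFrame({"value": range(6)})
n = 2  # rows per chunk

# .copy() makes every chunk an independent DataFrame.
chunks = [df[i:i + n].copy() for i in range(0, df.shape[0], n)]

for chunk in chunks:
    chunk["doubled"] = chunk["value"] * 2  # modifies the copy only

print(list(df.columns))               # ['value']
print(list(chunks[0].columns))        # ['value', 'doubled']
print(chunks[2]["doubled"].tolist())  # [8, 10]
```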
Method 10
I also experienced np.array_split not working with Pandas DataFrame. My solution was to only split the index of the DataFrame and then introduce a new column with the “group” label:
indexes = np.array_split(df.index, N, axis=0)  # N is the number of groups
for i, index in enumerate(indexes):
    df.loc[index, 'group'] = i
This makes groupby operations very convenient, such as for calculating the mean value of each group:
df.groupby(by='group').mean()
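A runnable sketch of the same idea, under a few assumptions: it labels rows by position rather than by index value (so it does not rely on the index being unique), N and the column names are made up, and the mean is taken over a single numeric column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]})
N = 3  # number of groups

# Split the row positions into N nearly equal blocks and label each block.
labels = np.empty(len(df), dtype=int)
for i, block in enumerate(np.array_split(np.arange(len(df)), N)):
    labels[block] = i
df["group"] = labels

group_means = df.groupby("group")["value"].mean()
print(group_means.tolist())  # [2.0, 4.5, 6.5]
```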
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.