I have to create a function which would split provided dataframe into chunks of needed size. For instance if dataframe contains 1111 rows, I want to be able to specify chunk size of 400 rows, and get three smaller dataframes with sizes of 400, 400 and 311. Is there a convenience function to do the job? What would be the best way to store and iterate over sliced dataframe?
Example DataFrame
import numpy as np import pandas as pd test = pd.concat([pd.Series(np.random.rand(1111)), pd.Series(np.random.rand(1111))], axis = 1)
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
You can take the floor division of a sequence up to the amount of rows in the dataframe, and use it to groupby splitting the dataframe into equally sized chunks:
n = 400
for g, df in test.groupby(np.arange(len(test)) // n):
print(df.shape)
# (400, 2)
# (400, 2)
# (311, 2)
Method 2
A more pythonic way to break large dataframes into smaller chunks based on fixed number of rows is to use list comprehension:
n = 400 #chunk row size list_df = [test[i:i+n] for i in range(0,test.shape[0],n)] [i.shape for i in list_df]
Output:
[(400, 2), (400, 2), (311, 2)]
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0