I have a large dataframe (>3MM rows) that I’m trying to pass through a function (the one below is largely simplified), and I keep getting a Memory Error message.
I think I’m passing too large of a dataframe into the function, so I’m trying to:
1) Slice the dataframe into smaller chunks (preferably sliced by AcctName)
2) Pass each chunk into the function
3) Concatenate the dataframes back into one large dataframe
def trans_times_2(df):
    df['Double_Transaction'] = df['Transaction'] * 2
large_df
AcctName  Timestamp  Transaction
ABC       12/1       12.12
ABC       12/2       20.89
ABC       12/3       51.93
DEF       12/2       13.12
DEF       12/8       9.93
DEF       12/9       92.09
GHI       12/1       14.33
GHI       12/6       21.99
GHI       12/12      98.81
I know that my function works properly, since it will work on a smaller dataframe (e.g. 40,000 rows). I tried the following, but I was unsuccessful with concatenating the small dataframes back into one large dataframe.
def split_df(df):
    new_df = []
    AcctNames = df.AcctName.unique()
    DataFrameDict = {elem: pd.DataFrame for elem in AcctNames}
    key_list = [k for k in DataFrameDict.keys()]
    new_df = []
    for key in DataFrameDict.keys():
        DataFrameDict[key] = df[:][df.AcctNames == key]
        trans_times_2(DataFrameDict[key])
    rejoined_df = pd.concat(new_df)
How I envision the dataframes being split:
df1
AcctName  Timestamp  Transaction  Double_Transaction
ABC       12/1       12.12        24.24
ABC       12/2       20.89        41.78
ABC       12/3       51.93        103.86

df2
AcctName  Timestamp  Transaction  Double_Transaction
DEF       12/2       13.12        26.24
DEF       12/8       9.93         19.86
DEF       12/9       92.09        184.18

df3
AcctName  Timestamp  Transaction  Double_Transaction
GHI       12/1       14.33        28.66
GHI       12/6       21.99        43.98
GHI       12/12      98.81        197.62
Answers:
Method 1
You can use a list comprehension to split your dataframe into smaller dataframes contained in a list.
n = 200000  # chunk row size
list_df = [df[i:i+n] for i in range(0, df.shape[0], n)]
Or use numpy array_split (note that here n is the number of chunks, not the number of rows per chunk):
list_df = np.array_split(df, n)
You can access the chunks with:
list_df[0]
list_df[1]
etc...
Then you can assemble the chunks back into one dataframe using pd.concat(list_df).
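Putting the pieces together, here is a minimal sketch of the split, apply, and concatenate steps on a tiny stand-in frame. The helper name `process_in_chunks` and the sample data are assumptions for illustration. `trans_times_2` follows the question, but in this sketch it works on a copy and returns the result so that the transformed pieces can be collected.

```python
import pandas as pd


def trans_times_2(chunk: pd.DataFrame) -> pd.DataFrame:
    # Work on a copy so the slice of the original frame is not modified.
    out = chunk.copy()
    out["Double_Transaction"] = out["Transaction"] * 2
    return out


def process_in_chunks(df: pd.DataFrame, n: int) -> pd.DataFrame:
    # Hypothetical helper: split into chunks of n rows, transform each, re-stack.
    pieces = [trans_times_2(df[i:i + n]) for i in range(0, df.shape[0], n)]
    return pd.concat(pieces)


# Small stand-in for large_df (values taken from the question).
df = pd.DataFrame({
    "AcctName": ["ABC", "ABC", "ABC", "DEF", "DEF"],
    "Transaction": [12.12, 20.89, 51.93, 13.12, 9.93],
})
result = process_in_chunks(df, n=2)
```

Because each chunk keeps its original index labels, the concatenated result lines up row for row with the input.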
By AcctName
list_df = []
for n, g in df.groupby('AcctName'):
    list_df.append(g)
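As a sketch of the per-account variant, the groups can be transformed and re-stacked in one pass. `double_by_account` is a hypothetical name, and the copy inside the loop is an assumption made here to avoid modifying slices of the original frame.

```python
import pandas as pd


def double_by_account(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical helper: one piece per AcctName, transformed, then re-stacked.
    pieces = []
    for name, group in df.groupby("AcctName"):
        group = group.copy()
        group["Double_Transaction"] = group["Transaction"] * 2
        pieces.append(group)
    return pd.concat(pieces)


# Made-up rows in a deliberately mixed account order.
df = pd.DataFrame({
    "AcctName": ["GHI", "ABC", "DEF", "ABC"],
    "Timestamp": ["12/1", "12/1", "12/2", "12/2"],
    "Transaction": [14.33, 12.12, 13.12, 20.89],
})
result = double_by_account(df)
```

By default groupby sorts by the group key, so the rows come back ordered by account rather than in their original order. The original index labels are preserved, so `result.sort_index()` restores the input order if that matters.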
Method 2
I’d suggest using the more_itertools package (an extra dependency). It handles edge cases such as an uneven partition of the dataframe and returns an iterator, which makes things a tiny bit more efficient.
(updated using code from @Acumenus)
from more_itertools import sliced

CHUNK_SIZE = 5

index_slices = sliced(range(len(df)), CHUNK_SIZE)

for index_slice in index_slices:
    chunk = df.iloc[index_slice]  # your dataframe chunk ready for use
Method 3
I love @ScottBoston's answer, although I still haven’t memorized the incantation. Here’s a more verbose function that does the same thing:
def chunkify(df: pd.DataFrame, chunk_size: int):
    start = 0
    length = df.shape[0]

    # If DF is smaller than the chunk, return the DF
    if length <= chunk_size:
        yield df[:]
        return

    # Yield individual chunks
    while start + chunk_size <= length:
        yield df[start:chunk_size + start]
        start = start + chunk_size

    # Yield the remainder chunk, if needed
    if start < length:
        yield df[start:]
To rebuild the data frame, accumulate each chunk in a list, then call pd.concat(chunks, axis=0). The chunks are row slices, so they have to be stacked along the row axis.
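A quick round-trip check of the rebuild step, as a sketch: the generator `iter_row_chunks` below is a hypothetical compact stand-in for a row chunker, and the data is made up. Row chunks are stacked with the default `axis=0` of `pd.concat`.

```python
import pandas as pd


def iter_row_chunks(df: pd.DataFrame, chunk_size: int):
    # Hypothetical stand-in for a row chunker: yields consecutive row slices.
    for start in range(0, len(df), chunk_size):
        yield df.iloc[start:start + chunk_size]


df = pd.DataFrame({"Transaction": [12.12, 20.89, 51.93, 13.12, 9.93, 92.09, 14.33]})

chunks = list(iter_row_chunks(df, chunk_size=3))
rebuilt = pd.concat(chunks)  # axis=0 is the default: stack rows back together

sizes = [len(c) for c in chunks]   # the last chunk holds the remainder
same = rebuilt.equals(df)          # True when values, index, and columns match
```

Comparing the rebuilt frame against the original with `equals` is a cheap way to confirm that no rows were dropped or duplicated at the chunk boundaries.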
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.