I am running a basic script that loops over a nested dictionary, grabs data from each record, and appends it to a Pandas DataFrame. The data looks something like this:
data = {"SomeCity": {"Date1": {record1, record2, record3, ...}, "Date2": {}, ...}, ...}
In total it has a few million records. The script itself looks like this:
city = ["SomeCity"]
df = DataFrame({}, columns=['Date', 'HouseID', 'Price'])
for city in cities:
for dateRun in data[city]:
for record in data[city][dateRun]:
recSeries = Series([record['Timestamp'],
record['Id'],
record['Price']],
index = ['Date', 'HouseID', 'Price'])
FredDF = FredDF.append(recSeries, ignore_index=True)
This runs painfully slow, however. Before I look for a way to parallelize it, I just want to make sure I’m not missing something obvious that would make this perform faster as it is, as I’m still quite new to Pandas.
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
I also used the dataframe’s append function inside a loop and I was perplexed how slow it ran.
A useful example for those who are suffering, based on the correct answer on this page.
Python version: 3
Pandas version: 0.20.3
# the dictionary to pass to pandas dataframe
d = {}
# a counter to use to add entries to "dict"
i = 0
# Example data to loop and append to a dataframe
data = [{"foo": "foo_val_1", "bar": "bar_val_1"},
{"foo": "foo_val_2", "bar": "bar_val_2"}]
# the loop
for entry in data:
# add a dictionary entry to the final dictionary
d[i] = {"col_1_title": entry['foo'], "col_2_title": entry['bar']}
# increment the counter
i = i + 1
# create the dataframe using 'from_dict'
# important to set the 'orient' parameter to "index" to make the keys as rows
df = DataFrame.from_dict(d, "index")
The “from_dict” function: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.from_dict.html
Method 2
Appending rows to lists is far more efficient than to a DataFrame.
Hence you would want to
- append the rows to a list.
- Then convert it into
DataFrameand - set the index as required.
Method 3
I think the best way to do it is, if you know the data you are going to receive, allocate before hand.
import numpy as np import pandas as pd random_matrix = np.random.randn(100, 100) insert_df = pd.DataFrame(random_matrix) df = pd.DataFrame(columns=range(100), index=range(200)) df.loc[range(100), df.columns] = random_matrix df.loc[range(100, 200), df.columns] = random_matrix
This is the pattern that I think makes the most sense. append will be faster if
you have a very small dataframe, but it doesn’t scale.
In [1]: import numpy as np; import pandas as pd In [2]: random_matrix = np.random.randn(100, 100) ...: insert_df = pd.DataFrame(random_matrix) ...: df = pd.DataFrame(np.random.randn(100, 100)) In [2]: %timeit df.append(insert_df) 272 µs ± 2.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) In [3]: %timeit df.loc[range(100), df.columns] = random_matrix 493 µs ± 4.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) In [4]: %timeit df.loc[range(100), df.columns] = insert_df 821 µs ± 8.68 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
When we run this with a 100,000 row dataframe, we see much more dramatic results.
In [1]: df = pd.DataFrame(np.random.randn(100_000, 100)) In [2]: %timeit df.append(insert_df) 17.9 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) In [3]: %timeit df.loc[range(100), df.columns] = random_matrix 465 µs ± 13.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) In [4]: %timeit df.loc[range(99_900, 100_000), df.columns] = random_matrix 465 µs ± 5.75 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) In [5]: %timeit df.loc[range(99_900, 100_000), df.columns] = insert_df 1.02 ms ± 3.42 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So we can see an append is about 17 times slower than an insert with a dataframe, and 35 times slower than an insert with a numpy array.
Method 4
Another way is to make it into a list and then use pd.concat
import pandas as pd
df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
'num_wings': [2, 0, 0, 0],
'num_specimen_seen': [10, 2, 1, 8]},
index=['falcon', 'dog', 'spider', 'fish'])
def append(df):
df_out = df.copy()
for i in range(1000):
df_out = df_out.append(df)
return df_out
def concat(df):
df_list = []
for i in range(1001):
df_list.append(df)
return pd.concat(df_list)
# some testing
df2 = concat(df)
df3 = append(df)
pd.testing.assert_frame_equal(df2,df3)
%timeit concat(df):
20.2 ms ± 794 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit append(df)
275 ms ± 2.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
It is the recommended way to concatenate rows in pandas now:
Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once. link
Method 5
I ran into a similar problem where I had to append many times to a DataFrame, but did not know the values in advance of the appends. I wrote a lightweight DataFrame like data structure that is just blists() under the hood. I use that to accumulate all of the data and then when it is complete transform the output into a Pandas DataFrame. Here is a link to my project, all open source so I hope it helps others:
https://pypi.python.org/pypi/raccoon
Method 6
In my case I was loading a large number of dataframes with the same columns from different files and wanted to append them to create one large data frame.
My solution was to first load all the dataframes into a list, and then use
all_dfs = [] for i in all_files: all_dfs.append(/* load df from file */) master_df = pd.concat(all_dfs, ignore_index=True)
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0