I have a list of 4 pandas dataframes, each containing a day of tick data, that I want to merge into a single data frame. I cannot understand the behavior of concat on my timestamps. See details below:
data
[<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 35228 entries, 2013-03-28 00:00:07.089000+02:00 to 2013-03-28 18:59:20.357000+02:00
Data columns:
Price       4040  non-null values
Volume      4040  non-null values
BidQty      35228  non-null values
BidPrice    35228  non-null values
AskPrice    35228  non-null values
AskQty      35228  non-null values
dtypes: float64(6),
 <class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 33088 entries, 2013-04-01 00:03:17.047000+02:00 to 2013-04-01 18:59:58.175000+02:00
Data columns:
Price       3969  non-null values
Volume      3969  non-null values
BidQty      33088  non-null values
BidPrice    33088  non-null values
AskPrice    33088  non-null values
AskQty      33088  non-null values
dtypes: float64(6),
 <class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 50740 entries, 2013-04-02 00:03:27.470000+02:00 to 2013-04-02 18:59:58.172000+02:00
Data columns:
Price       7326  non-null values
Volume      7326  non-null values
BidQty      50740  non-null values
BidPrice    50740  non-null values
AskPrice    50740  non-null values
AskQty      50740  non-null values
dtypes: float64(6),
 <class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 60799 entries, 2013-04-03 00:03:06.994000+02:00 to 2013-04-03 18:59:58.180000+02:00
Data columns:
Price       8258  non-null values
Volume      8258  non-null values
BidQty      60799  non-null values
BidPrice    60799  non-null values
AskPrice    60799  non-null values
AskQty      60799  non-null values
dtypes: float64(6)]
Using append I get:
pd.DataFrame().append(data)
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 179855 entries, 2013-03-28 00:00:07.089000+02:00 to 2013-04-03 18:59:58.180000+02:00
Data columns:
AskPrice    179855  non-null values
AskQty      179855  non-null values
BidPrice    179855  non-null values
BidQty      179855  non-null values
Price       23593  non-null values
Volume      23593  non-null values
dtypes: float64(6)
Using concat I get:
pd.concat(data)
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 179855 entries, 2013-03-27 22:00:07.089000+02:00 to 2013-04-03 16:59:58.180000+02:00
Data columns:
Price       23593  non-null values
Volume      23593  non-null values
BidQty      179855  non-null values
BidPrice    179855  non-null values
AskPrice    179855  non-null values
AskQty      179855  non-null values
dtypes: float64(6)
Notice how the index changes when using concat. Why is that happening and how would I go about using concat to reproduce the results obtained using append? (Since concat seems so much faster; 24.6 ms per loop vs 3.02 s per loop)
Answers:
Method 1
Pandas concat vs append vs join vs merge
- Concat gives the flexibility to join along an axis (all rows or all columns).
- Append is a specific case of concat (axis=0, join='outer'). It was deprecated in pandas 1.4 and removed in 2.0, so use concat instead.
- Join is based on the indexes (set by set_index); the how parameter takes values such as 'left', 'right', 'inner' or 'outer'.
- Merge is based on particular columns of each of the two dataframes, specified through parameters such as 'left_on', 'right_on' and 'on'.
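The differences in the list above can be seen side by side in a minimal sketch. The frames `left` and `right` and their column names are made up for illustration; they are not from the question.

```python
import pandas as pd

# Two small illustrative frames sharing a "key" column (hypothetical data).
left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

# concat: stack rows (axis=0) or place frames next to each other (axis=1).
stacked = pd.concat([left, right], axis=0, ignore_index=True)
side_by_side = pd.concat([left, right], axis=1)

# join: aligns on the index, so move "key" into the index first.
joined = left.set_index("key").join(right.set_index("key"), how="outer")

# merge: aligns on named columns.
merged = pd.merge(left, right, on="key", how="inner")

print(stacked.shape, side_by_side.shape)
print(joined)
print(merged)
```

Row-wise concat fills columns that are missing from one frame with NaN, while join and merge match rows by index or key values.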
Method 2
What you are doing with append and concat is almost equivalent. The difference is the empty DataFrame. For some reason this causes a big slowdown; I am not sure exactly why and will have to look into it at some point. Below is a recreation of basically what you did.
I almost always use concat (though in this case they are equivalent, except for the empty frame); if you don't use the empty frame they will be the same speed.
In [17]: df1 = pd.DataFrame(dict(A = range(10000)),index=pd.date_range('20130101',periods=10000,freq='s'))
In [18]: df1
Out[18]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 10000 entries, 2013-01-01 00:00:00 to 2013-01-01 02:46:39
Freq: S
Data columns (total 1 columns):
A 10000 non-null values
dtypes: int64(1)
In [19]: df4 = pd.DataFrame()
The concat
In [20]: %timeit pd.concat([df1,df2,df3])
1000 loops, best of 3: 270 us per loop
This is the equivalent of your append
In [21]: %timeit pd.concat([df4,df1,df2,df3])
10 loops, best of 3: 56.8 ms per loop
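If a list of frames may contain empty ones, a simple way to sidestep the issue is to drop them before calling concat. The following is a minimal sketch with made-up frames; whether an empty frame still causes a slowdown in a given pandas version is not verified here.

```python
import pandas as pd

# Small illustrative frames with a DatetimeIndex (hypothetical data).
idx = pd.date_range("2013-01-01", periods=5, freq="s")
df1 = pd.DataFrame({"A": range(5)}, index=idx)
df2 = df1.copy()
empty = pd.DataFrame()

frames = [empty, df1, df2]

# Keep only frames that actually hold data, then concatenate once.
non_empty = [f for f in frames if not f.empty]
result = pd.concat(non_empty)

print(len(result))
```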
Method 3
I have implemented a tiny benchmark (please find the code on Gist) to evaluate pandas' concat and append. I updated the code snippet and the results after the comment by ssk08. Thanks a lot!
The benchmark ran on a Mac OS X 10.13 system with Python 3.6.2 and pandas 0.20.3.
+--------+---------------------------------+---------------------------------+
|        | ignore_index=False              | ignore_index=True               |
+--------+---------------------------------+---------------------------------+
| size   | append | concat | append/concat | append | concat | append/concat |
+--------+--------+--------+---------------+--------+--------+---------------+
| small  | 0.4635 | 0.4891 | 94.77 %       | 0.4056 | 0.3314 | 122.39 %      |
+--------+--------+--------+---------------+--------+--------+---------------+
| medium | 0.5532 | 0.6617 | 83.60 %       | 0.3605 | 0.3521 | 102.37 %      |
+--------+--------+--------+---------------+--------+--------+---------------+
| large  | 0.9558 | 0.9442 | 101.22 %      | 0.6670 | 0.6749 | 98.84 %       |
+--------+--------+--------+---------------+--------+--------+---------------+
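For the small and medium sizes, append is slightly faster with ignore_index=False and concat is slightly faster with ignore_index=True. For the large size the two are within about 1 % of each other in both settings.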
tl;dr
No significant difference between concat and append.
Method 4
One more thing to keep in mind is that the append() method in pandas doesn't modify the original object. Instead, it creates a new one with the combined data. Because every call allocates a new object and copies the data, it performs poorly when called repeatedly. You had better use the concat() function when you would otherwise do multiple append operations.
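As a minimal sketch of that advice: gather the pieces in a list and combine them with one concat call instead of growing a frame step by step. The `load_chunks` generator is a hypothetical stand-in for whatever produces the individual frames.

```python
import pandas as pd

def load_chunks(n_chunks=4, rows=3):
    """Hypothetical data source that yields small frames one at a time."""
    for i in range(n_chunks):
        start = i * rows
        yield pd.DataFrame({"value": range(start, start + rows)})

# Slow pattern: every step builds a new frame and copies all rows so far.
grown = None
for chunk in load_chunks():
    grown = chunk if grown is None else pd.concat([grown, chunk], ignore_index=True)

# Preferred pattern: collect the pieces, then combine them in a single call.
combined = pd.concat(list(load_chunks()), ignore_index=True)

print(combined.shape)
```

Both patterns give the same frame. The single call copies each piece once, whereas the loop copies the accumulated rows again on every iteration.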
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.