I am trying to calculate time-based aggregations in Pandas based on date values stored in a separate tables.
The top of the first table table_a looks like this:
COMPANY_ID DATE MEASURE
1 2010-01-01 00:00:00 10
1 2010-01-02 00:00:00 10
1 2010-01-03 00:00:00 10
1 2010-01-04 00:00:00 10
1 2010-01-05 00:00:00 10
Here is the code to create the table:
table_a = pd.concat(
[pd.DataFrame({'DATE': pd.date_range("01/01/2010", "12/31/2010", freq="D"),
'COMPANY_ID': 1 , 'MEASURE': 10}),
pd.DataFrame({'DATE': pd.date_range("01/01/2010", "12/31/2010", freq="D"),
'COMPANY_ID': 2 , 'MEASURE': 10})])
The second table, table_b, looks like this:
COMPANY END_DATE
1 2010-03-01 00:00:00
1 2010-06-02 00:00:00
2 2010-03-01 00:00:00
2 2010-06-02 00:00:00
and the code to create it is:
table_b = pd.DataFrame({'END_DATE':pd.to_datetime(['03/01/2010','06/02/2010','03/01/2010','06/02/2010']),
'COMPANY':(1,1,2,2)})
I want to be able to get the sum of the ‘measure’ column for each ‘COMPANY_ID’ for each 30-day period prior to the ‘END_DATE’ in table_b.
This is (I think) the SQL equivalent:
select
b.COMPANY_ID,
b.DATE
sum(a.MEASURE) AS MEASURE_TO_END_DATE
from table_a a, table_b b
where a.COMPANY = b.COMPANY and
a.DATE < b.DATE and
a.DATE > b.DATE - 30
group by b.COMPANY;
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
Well, I can think of a few ways:
- essentially blow up the dataframe by just merging on the exact field (
company)… then filter on the 30-day windows after the merge.
- should be fast but could use lots of memory
- Move the merging and filtering on the 30-day window into a
groupby().
- results in a merge for each group, so slower but should use less memory
Option #1
Suppose your data looks like the following (I expanded your sample data):
print df
company date measure
0 0 2010-01-01 10
1 0 2010-01-15 10
2 0 2010-02-01 10
3 0 2010-02-15 10
4 0 2010-03-01 10
5 0 2010-03-15 10
6 0 2010-04-01 10
7 1 2010-03-01 5
8 1 2010-03-15 5
9 1 2010-04-01 5
10 1 2010-04-15 5
11 1 2010-05-01 5
12 1 2010-05-15 5
print windows
company end_date
0 0 2010-02-01
1 0 2010-03-15
2 1 2010-04-01
3 1 2010-05-15
Create a beginning date for the 30 day windows:
windows['beg_date'] = (windows['end_date'].values.astype('datetime64[D]') -
np.timedelta64(30,'D'))
print windows
company end_date beg_date
0 0 2010-02-01 2010-01-02
1 0 2010-03-15 2010-02-13
2 1 2010-04-01 2010-03-02
3 1 2010-05-15 2010-04-15
Now do a merge and then select based on if date falls within beg_date and end_date:
df = df.merge(windows,on='company',how='left')
df = df[(df.date >= df.beg_date) & (df.date <= df.end_date)]
print df
company date measure end_date beg_date
2 0 2010-01-15 10 2010-02-01 2010-01-02
4 0 2010-02-01 10 2010-02-01 2010-01-02
7 0 2010-02-15 10 2010-03-15 2010-02-13
9 0 2010-03-01 10 2010-03-15 2010-02-13
11 0 2010-03-15 10 2010-03-15 2010-02-13
16 1 2010-03-15 5 2010-04-01 2010-03-02
18 1 2010-04-01 5 2010-04-01 2010-03-02
21 1 2010-04-15 5 2010-05-15 2010-04-15
23 1 2010-05-01 5 2010-05-15 2010-04-15
25 1 2010-05-15 5 2010-05-15 2010-04-15
You can compute the 30 day window sums by grouping on company and end_date:
print df.groupby(['company','end_date']).sum()
measure
company end_date
0 2010-02-01 20
2010-03-15 30
1 2010-04-01 10
2010-05-15 15
Option #2 Move all merging into a groupby. This should be better on memory but I would think much slower:
windows['beg_date'] = (windows['end_date'].values.astype('datetime64[D]') -
np.timedelta64(30,'D'))
def cond_merge(g,windows):
g = g.merge(windows,on='company',how='left')
g = g[(g.date >= g.beg_date) & (g.date <= g.end_date)]
return g.groupby('end_date')['measure'].sum()
print df.groupby('company').apply(cond_merge,windows)
company end_date
0 2010-02-01 20
2010-03-15 30
1 2010-04-01 10
2010-05-15 15
Another option Now if your windows never overlap (like in the example data), you could do something like the following as an alternative that doesn’t blow up a dataframe but is pretty fast:
windows['date'] = windows['end_date']
df = df.merge(windows,on=['company','date'],how='outer')
print df
company date measure end_date
0 0 2010-01-01 10 NaT
1 0 2010-01-15 10 NaT
2 0 2010-02-01 10 2010-02-01
3 0 2010-02-15 10 NaT
4 0 2010-03-01 10 NaT
5 0 2010-03-15 10 2010-03-15
6 0 2010-04-01 10 NaT
7 1 2010-03-01 5 NaT
8 1 2010-03-15 5 NaT
9 1 2010-04-01 5 2010-04-01
10 1 2010-04-15 5 NaT
11 1 2010-05-01 5 NaT
12 1 2010-05-15 5 2010-05-15
This merge essentially inserts your window end dates into the dataframe and then backfilling the end dates (by group) will give you a structure to easily create you summation windows:
df['end_date'] = df.groupby('company')['end_date'].apply(lambda x: x.bfill())
print df
company date measure end_date
0 0 2010-01-01 10 2010-02-01
1 0 2010-01-15 10 2010-02-01
2 0 2010-02-01 10 2010-02-01
3 0 2010-02-15 10 2010-03-15
4 0 2010-03-01 10 2010-03-15
5 0 2010-03-15 10 2010-03-15
6 0 2010-04-01 10 NaT
7 1 2010-03-01 5 2010-04-01
8 1 2010-03-15 5 2010-04-01
9 1 2010-04-01 5 2010-04-01
10 1 2010-04-15 5 2010-05-15
11 1 2010-05-01 5 2010-05-15
12 1 2010-05-15 5 2010-05-15
df = df[df.end_date.notnull()]
df['beg_date'] = (df['end_date'].values.astype('datetime64[D]') -
np.timedelta64(30,'D'))
print df
company date measure end_date beg_date
0 0 2010-01-01 10 2010-02-01 2010-01-02
1 0 2010-01-15 10 2010-02-01 2010-01-02
2 0 2010-02-01 10 2010-02-01 2010-01-02
3 0 2010-02-15 10 2010-03-15 2010-02-13
4 0 2010-03-01 10 2010-03-15 2010-02-13
5 0 2010-03-15 10 2010-03-15 2010-02-13
7 1 2010-03-01 5 2010-04-01 2010-03-02
8 1 2010-03-15 5 2010-04-01 2010-03-02
9 1 2010-04-01 5 2010-04-01 2010-03-02
10 1 2010-04-15 5 2010-05-15 2010-04-15
11 1 2010-05-01 5 2010-05-15 2010-04-15
12 1 2010-05-15 5 2010-05-15 2010-04-15
df = df[(df.date >= df.beg_date) & (df.date <= df.end_date)]
print df.groupby(['company','end_date']).sum()
measure
company end_date
0 2010-02-01 20
2010-03-15 30
1 2010-04-01 10
2010-05-15 15
Another alternative is to resample your first dataframe to daily data and then compute rolling_sums with a 30 day window; and select the dates at the end that you are interested in. This could be quite memory intensive too.
Method 2
There is a very easy, and practical (or maybe the only direct way) to do conditional join in pandas. Since there is no direct way to do conditional join in pandas, you will need an additional library, and that is, pandasql
Install the library pandasql from pip using the command pip install pandasql. This library allows you to manipulate the pandas dataframes using the SQL queries.
import pandas as pd
from pandasql import sqldf
df = pd.read_excel(r'play_data.xlsx')
df
id Name Amount
0 A001 A 100
1 A002 B 110
2 A003 C 120
3 A005 D 150
Now let’s just do a conditional join to compare the Amount of the IDs
# Make your pysqldf object:
pysqldf = lambda q: sqldf(q, globals())
# Write your query in SQL syntax, here you can use df as a normal SQL table
cond_join= '''
select
df_left.*,
df_right.*
from df as df_left
join df as df_right
on
df_left.[Amount] > (df_right.[Amount]+10)
'''
# Now, get your queries results as dataframe using the sqldf object that you created
pysqldf(cond_join)
id Name Amount id Name Amount
0 A003 C 120 A001 A 100
1 A005 D 150 A001 A 100
2 A005 D 150 A002 B 110
3 A005 D 150 A003 C 120
Method 3
I know I am late for the party but here are two solutions. The first one is rather simple but not very general, while the second one should be more universal. In what follows I assume that table_a and table_b objects are already defined as in the original question.
Solution 1
This one is simple. Here we just do a left join and append END_DATE values to table_a and then filter out the rows we are not interested in. So the memory overhead here is size of table_a * number of unique END_DATE values per COMPANY in table_b.
table_c = table_a.merge(table_b, left_on="COMPANY_ID", right_on="COMPANY")
table_c[(table_c["DATE"] - table_c["END_DATE"]).dt.days.between(-30, 0)]
.groupby(["COMPANY", "END_DATE"])["MEASURE"].sum()
## OUTPUT:
COMPANY END_DATE
1 2010-03-01 310
2010-06-02 310
2 2010-03-01 310
2010-06-02 310
Name: MEASURE, dtype: int64
This is quite fast, but could blow up the size of table_a significantly if table_b contained many values.
Solution 2
This one is a bit smarter and operates row-by-row, where to each row in table_b we explicitly map only the relevant subset of table_a. Thus, we get only the data we need, so there is no memory overhead (beyond the memory needed to represent the raw records over which we want to sum).
table_b.groupby(["COMPANY", "END_DATE"])
.apply(lambda g: table_a[
(table_a["COMPANY_ID"] == g["COMPANY"].iloc[0]) &
((table_a["DATE"] - g["END_DATE"].iloc[0]).dt.days.between(-30, 0))
]["MEASURE"].sum())
## OUTPUT:
COMPANY END_DATE
1 2010-03-01 310
2010-06-02 310
2 2010-03-01 310
2010-06-02 310
dtype: int64
Note that in this case for each inequality we use only the relevant subsets of table_a, which will be much more memory efficient. The price is that this soution seems to be about 2-3 times slower (but in general still relatively fast; ~2-3ms runtime on your data).
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0