So my dataframe looks like this:
        date    site country  score
0 2018-01-01  google      us    100
1 2018-01-01  google      ch     50
2 2018-01-02  google      us     70
3 2018-01-03  google      us     60
4 2018-01-02  google      ch     10
5 2018-01-01      fb      us     50
6 2018-01-02      fb      us     55
7 2018-01-03      fb      us    100
8 2018-01-01      fb      es    100
9 2018-01-02      fb      gb    100
Each site has a different score depending on the country. I’m trying to find the 1/3/5-day difference of scores for each site/country combination.
Output should be:
        date    site country  score  diff
8 2018-01-01      fb      es    100   0.0
9 2018-01-02      fb      gb    100   0.0
5 2018-01-01      fb      us     50   0.0
6 2018-01-02      fb      us     55   5.0
7 2018-01-03      fb      us    100  45.0
1 2018-01-01  google      ch     50   0.0
4 2018-01-02  google      ch     10 -40.0
0 2018-01-01  google      us    100   0.0
2 2018-01-02  google      us     70 -30.0
3 2018-01-03  google      us     60 -10.0
I first tried sorting by site/country/date, then grouping by site and country but I’m not able to wrap my head around getting a difference from a grouped object.
Answers:
Method 1
First, sort the DataFrame and then all you need is groupby.diff():
df = df.sort_values(by=['site', 'country', 'date'])
df['diff'] = df.groupby(['site', 'country'])['score'].diff().fillna(0)
df
Out:
        date    site country  score  diff
8 2018-01-01      fb      es    100   0.0
9 2018-01-02      fb      gb    100   0.0
5 2018-01-01      fb      us     50   0.0
6 2018-01-02      fb      us     55   5.0
7 2018-01-03      fb      us    100  45.0
1 2018-01-01  google      ch     50   0.0
4 2018-01-02  google      ch     10 -40.0
0 2018-01-01  google      us    100   0.0
2 2018-01-02  google      us     70 -30.0
3 2018-01-03  google      us     60 -10.0
sort_values sorts string columns alphabetically, so it won't produce an arbitrary ordering on its own. If you need a custom order (google before fb, for example), put the values in a list in the order you want and convert the column to an ordered categorical with those categories. sort_values will then respect that ordering.
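The categorical approach described above can be sketched as follows. This is a minimal example, assuming the sample data from the question; the name site_order and the "google before fb" order are only illustrative.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2018-01-01", "2018-01-01", "2018-01-02", "2018-01-03", "2018-01-02",
        "2018-01-01", "2018-01-02", "2018-01-03", "2018-01-01", "2018-01-02",
    ]),
    "site": ["google"] * 5 + ["fb"] * 5,
    "country": ["us", "ch", "us", "us", "ch", "us", "us", "us", "es", "gb"],
    "score": [100, 50, 70, 60, 10, 50, 55, 100, 100, 100],
})

# Hypothetical custom order: google first, then fb
site_order = ["google", "fb"]
df["site"] = pd.Categorical(df["site"], categories=site_order, ordered=True)

# Sorting now follows the category order instead of alphabetical order
df = df.sort_values(by=["site", "country", "date"])

# observed=True keeps only the site/country pairs that occur in the data
df["diff"] = (
    df.groupby(["site", "country"], observed=True)["score"].diff().fillna(0)
)
print(df)
```

The diff values per group are unchanged; only the row order differs, with the google rows now listed before the fb rows.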
Method 2
You can shift and subtract grouped values:
df.sort_values(['site', 'country', 'date'], inplace=True)
df['diff'] = df['score'] - df.groupby(['site', 'country'])['score'].shift()
Result:
        date    site country  score  diff
8 2018-01-01      fb      es    100   NaN
9 2018-01-02      fb      gb    100   NaN
5 2018-01-01      fb      us     50   NaN
6 2018-01-02      fb      us     55   5.0
7 2018-01-03      fb      us    100  45.0
1 2018-01-01  google      ch     50   NaN
4 2018-01-02  google      ch     10 -40.0
0 2018-01-01  google      us    100   NaN
2 2018-01-02  google      us     70 -30.0
3 2018-01-03  google      us     60 -10.0
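To fill NaN with 0, use df['diff'] = df['diff'].fillna(0). Calling fillna(0, inplace=True) on a selected column is chained assignment and may not update df in recent pandas versions.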
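The question also mentions 3- and 5-day differences. As a rough sketch: groupby.diff accepts a periods argument, but it lags by rows, so it equals an n-day difference only if every site/country group has exactly one row per consecutive day. For data with gaps, one possible alternative is to join each row to the row dated n days earlier. The helper calendar_day_diff and the column names diff_{n}d and cal_diff_1d below are hypothetical, not part of pandas.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2018-01-01", "2018-01-01", "2018-01-02", "2018-01-03", "2018-01-02",
        "2018-01-01", "2018-01-02", "2018-01-03", "2018-01-01", "2018-01-02",
    ]),
    "site": ["google"] * 5 + ["fb"] * 5,
    "country": ["us", "ch", "us", "us", "ch", "us", "us", "us", "es", "gb"],
    "score": [100, 50, 70, 60, 10, 50, 55, 100, 100, 100],
})
df = df.sort_values(["site", "country", "date"])

# Row-based lag: assumes one row per consecutive day within each group
for n in (1, 3, 5):
    df[f"diff_{n}d"] = df.groupby(["site", "country"])["score"].diff(periods=n)


def calendar_day_diff(frame, n):
    """Hypothetical helper: score minus the score dated exactly n days earlier.

    Assumes at most one row per (site, country, date).
    """
    prev = frame[["date", "site", "country", "score"]].copy()
    # Move earlier rows forward by n days so they line up with the later date
    prev["date"] = prev["date"] + pd.Timedelta(days=n)
    merged = frame.merge(
        prev, on=["date", "site", "country"], how="left", suffixes=("", "_prev")
    )
    # merge resets the index, so return positional values in frame's row order
    return (merged["score"] - merged["score_prev"]).to_numpy()


df["cal_diff_1d"] = calendar_day_diff(df, 1)
print(df)
```

The sample data only spans three days, so the 3- and 5-day columns come out as all NaN here; with a longer history they would fill in from the fourth and sixth day of each group onward.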
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.