First, I’m new to pandas, but I’m already falling in love with it. I’m trying to implement the equivalent of the LAG function from Oracle.
Let’s suppose you have this DataFrame:
Date                 Group  Data
2014-05-14 09:10:00  A      1
2014-05-14 09:20:00  A      2
2014-05-14 09:30:00  A      3
2014-05-14 09:40:00  A      4
2014-05-14 09:50:00  A      5
2014-05-14 10:00:00  B      1
2014-05-14 10:10:00  B      2
2014-05-14 10:20:00  B      3
2014-05-14 10:30:00  B      4
If this were an Oracle database and I wanted to lag the Data column, partitioned by the “Group” column and ordered by Date, I could simply use this function:
LAG(Data,1,NULL) OVER (PARTITION BY Group ORDER BY Date ASC) AS Data_lagged
This would result in the following Table:
Date                 Group  Data  Data_lagged
2014-05-14 09:10:00  A      1     Null
2014-05-14 09:20:00  A      2     1
2014-05-14 09:30:00  A      3     2
2014-05-14 09:40:00  A      4     3
2014-05-14 09:50:00  A      5     4
2014-05-14 10:00:00  B      1     Null
2014-05-14 10:10:00  B      2     1
2014-05-14 10:20:00  B      3     2
2014-05-14 10:30:00  B      4     3
In pandas I can set the date to be an index and use the shift method:
db["Data_lagged"] = db.Data.shift(1)
The only issue is that this doesn’t group by a column. Even if I set the two columns Date and Group as indexes, I would still get the “5” in the lagged column.
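The behaviour described above can be reproduced with a small sketch. The frame is rebuilt from the sample data in the question. The name `leaked` is only illustrative.

```python
import pandas as pd

# Rebuild the sample data from the question, with Date as the index.
db = pd.DataFrame({
    "Date": pd.date_range("2014-05-14 09:10", periods=9, freq="10min"),
    "Group": list("AAAAABBBB"),
    "Data": [1, 2, 3, 4, 5, 1, 2, 3, 4],
}).set_index("Date")

# A plain shift moves the whole column down by one row and ignores Group.
db["Data_lagged"] = db["Data"].shift(1)

# The first row of group B picks up the last value of group A.
leaked = db.loc[pd.Timestamp("2014-05-14 10:00:00"), "Data_lagged"]
print(leaked)  # 5.0
```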
Is there a way to implement the equivalent of the LEAD and LAG functions in pandas?
Answers:
Method 1
You could perform a groupby/apply (shift) operation:
In [15]: df['Data_lagged'] = df.groupby(['Group'])['Data'].shift(1)
In [16]: df
Out[16]:
Date Group Data Data_lagged
2014-05-14 09:10:00 A 1 NaN
2014-05-14 09:20:00 A 2 1
2014-05-14 09:30:00 A 3 2
2014-05-14 09:40:00 A 4 3
2014-05-14 09:50:00 A 5 4
2014-05-14 10:00:00 B 1 NaN
2014-05-14 10:10:00 B 2 1
2014-05-14 10:20:00 B 3 2
2014-05-14 10:30:00 B 4 3
[9 rows x 4 columns]
To obtain the ORDER BY Date ASC effect, you must sort the DataFrame first:
df['Data_lagged'] = (df.sort_values(by=['Date'], ascending=True)
.groupby(['Group'])['Data'].shift(1))
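As a minimal, self-contained sketch of why the sort matters, the rows below are deliberately out of date order (a small made-up subset of the question's data). The shifted result keeps the original row labels, so assigning it back aligns each lagged value with the correct row even though the frame itself stays unsorted.

```python
import pandas as pd

# Rows are intentionally not in date order.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2014-05-14 09:30:00",
        "2014-05-14 09:10:00",
        "2014-05-14 09:20:00",
        "2014-05-14 10:10:00",
        "2014-05-14 10:00:00",
    ]),
    "Group": ["A", "A", "A", "B", "B"],
    "Data": [3, 1, 2, 2, 1],
})

# Sort by Date, shift within each Group, then let index alignment
# put each lagged value back on its original row.
df["Data_lagged"] = (df.sort_values(by=["Date"], ascending=True)
                       .groupby(["Group"])["Data"].shift(1))

print(df)
```

The earliest row of each group (09:10 for A, 10:00 for B) gets NaN, and every other row gets the value from the previous timestamp in the same group.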
Method 2
For the lead operation in pandas, just use shift(-1) instead of shift(1):
df['Data_lead'] = df.groupby(['Group'])['Data'].shift(-1)
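The offset and default arguments of Oracle's LAG/LEAD can be approximated with the `periods` and `fill_value` arguments of `shift`. The following is a sketch on the question's sample data. The column name `Data_lag2_default0` is made up for illustration, and `fill_value` on a grouped shift assumes a reasonably recent pandas version.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.date_range("2014-05-14 09:10", periods=9, freq="10min"),
    "Group": list("AAAAABBBB"),
    "Data": [1, 2, 3, 4, 5, 1, 2, 3, 4],
})

# Roughly PARTITION BY Group ORDER BY Date ASC.
grouped = df.sort_values("Date").groupby("Group")["Data"]

# Roughly LEAD(Data, 1, NULL): the last row of each group becomes NaN.
df["Data_lead"] = grouped.shift(-1)

# Roughly LAG(Data, 2, 0): two rows back, with 0 where no such row exists.
df["Data_lag2_default0"] = grouped.shift(2, fill_value=0)

print(df)
```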
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.