How to group values of pandas dataframe and select the latest(by date) from each group?
For example, given a dataframe sorted by date:
id product date 0 220 6647 2014-09-01 1 220 6647 2014-09-03 2 220 6647 2014-10-16 3 826 3380 2014-11-11 4 826 3380 2014-12-09 5 826 3380 2015-05-19 6 901 4555 2014-09-01 7 901 4555 2014-10-05 8 901 4555 2014-11-01
grouping by id or product, and selecting the earliest gives:
id product date 2 220 6647 2014-10-16 5 826 3380 2015-05-19 8 901 4555 2014-11-01
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
You can also use tail with groupby to get the last n values of the group:
df.sort_values('date').groupby('id').tail(1)
id product date
2 220 6647 2014-10-16
8 901 4555 2014-11-01
5 826 3380 2015-05-19
Method 2
use idxmax in groupby and slice df with loc
df.loc[df.groupby('id').date.idxmax()]
id product date
2 220 6647 2014-10-16
5 826 3380 2015-05-19
8 901 4555 2014-11-01
Method 3
I had a similar problem and ended up using drop_duplicates rather than groupby.
It seems to run significatively faster on large datasets when compared with other methods suggested above.
df.sort_values(by="date").drop_duplicates(subset=["id"], keep="last")
id product date
2 220 6647 2014-10-16
8 901 4555 2014-11-01
5 826 3380 2015-05-19
Method 4
Given a dataframe sorted by date, you can obtain what you ask for in a number of ways:
Like this:
df.groupby(['id','product']).last()
like this:
df.groupby(['id','product']).nth(-1)
or like this:
df.groupby(['id','product']).max()
If you don’t want id and product to appear as index use groupby(['id', 'product'], as_index=False).
Alternatively use:
df.groupby(['id','product']).tail(1)
Method 5
To use .tail() as an aggregation method and keep your grouping intact:
df.sort_values('date').groupby('id').apply(lambda x: x.tail(1))
id product date
id
220 2 220 6647 2014-10-16
826 5 826 3380 2015-05-19
901 8 901 4555 2014-11-01
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0