group by pandas dataframe and select latest in each group

How to group values of pandas dataframe and select the latest(by date) from each group?

For example, given a dataframe sorted by date:

    id     product   date
0   220    6647     2014-09-01 
1   220    6647     2014-09-03 
2   220    6647     2014-10-16
3   826    3380     2014-11-11
4   826    3380     2014-12-09
5   826    3380     2015-05-19
6   901    4555     2014-09-01
7   901    4555     2014-10-05
8   901    4555     2014-11-01

grouping by id or product, and selecting the earliest gives:

    id     product   date
2   220    6647     2014-10-16
5   826    3380     2015-05-19
8   901    4555     2014-11-01

Contents hide

Answers:

Method 1

Method 2

Method 3

Method 4

Method 5

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

You can also use tail with groupby to get the last n values of the group:

df.sort_values('date').groupby('id').tail(1)

    id  product date
2   220 6647    2014-10-16
8   901 4555    2014-11-01
5   826 3380    2015-05-19

Method 2

use idxmax in groupby and slice df with loc

df.loc[df.groupby('id').date.idxmax()]

    id  product       date
2  220     6647 2014-10-16
5  826     3380 2015-05-19
8  901     4555 2014-11-01

Method 3

I had a similar problem and ended up using drop_duplicates rather than groupby.

It seems to run significatively faster on large datasets when compared with other methods suggested above.

df.sort_values(by="date").drop_duplicates(subset=["id"], keep="last")

    id  product        date
2  220     6647  2014-10-16
8  901     4555  2014-11-01
5  826     3380  2015-05-19

Method 4

Given a dataframe sorted by date, you can obtain what you ask for in a number of ways:

Like this:

df.groupby(['id','product']).last()

like this:

df.groupby(['id','product']).nth(-1)

or like this:

df.groupby(['id','product']).max()

If you don’t want id and product to appear as index use groupby(['id', 'product'], as_index=False).
Alternatively use:

df.groupby(['id','product']).tail(1)

Method 5

To use .tail() as an aggregation method and keep your grouping intact:

df.sort_values('date').groupby('id').apply(lambda x: x.tail(1))

        id  product date
id              
220 2   220 6647    2014-10-16
826 5   826 3380    2015-05-19
901 8   901 4555    2014-11-01

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes

Article Rating