pandas groupby sort within groups

I want to group my dataframe by two columns and then sort the aggregated results within the groups.

In [167]: df

Out[167]:
   count     job source
0      2   sales      A
1      4   sales      B
2      6   sales      C
3      3   sales      D
4      7   sales      E
5      5  market      A
6      3  market      B
7      2  market      C
8      4  market      D
9      1  market      E


In [168]: df.groupby(['job','source']).agg({'count':sum})

Out[168]:
               count
job    source       
market A           5
       B           3
       C           2
       D           4
       E           1
sales  A           2
       B           4
       C           6
       D           3
       E           7

I would now like to sort the count column in descending order within each group and then take only the top three rows, to get something like:

                count
job     source
market  A           5
        D           4
        B           3
sales   E           7
        C           6
        B           4

Answers:

Method 1

You could also just do it in one go, by doing the sort first and using head to take the first 3 of each group.

In [34]: df.sort_values(['job','count'],ascending=False).groupby('job').head(3)

Out[34]:
   count     job source
4      7   sales      E
2      6   sales      C
1      4   sales      B
5      5  market      A
8      4  market      D
6      3  market      B
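If the counts first need to be summed per (job, source) pair, the same sort-then-head idea still applies. Here is a minimal sketch, assuming the dataframe from the question; the names totals and top3 are only illustrative:

```python
import pandas as pd

# The dataframe from the question.
df = pd.DataFrame({
    "count": [2, 4, 6, 3, 7, 5, 3, 2, 4, 1],
    "job": ["sales"] * 5 + ["market"] * 5,
    "source": list("ABCDE") * 2,
})

# Aggregate first; as_index=False keeps 'job' and 'source' as ordinary columns.
totals = df.groupby(["job", "source"], as_index=False)["count"].sum()

# Sort by job (ascending) and count (descending), then keep 3 rows per job.
top3 = (
    totals.sort_values(["job", "count"], ascending=[True, False])
          .groupby("job")
          .head(3)
)
print(top3)
```

Passing a list to ascending keeps the job groups in alphabetical order while the counts inside each group run from largest to smallest.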

Method 2

What you want is in fact another groupby (on the result of the first groupby): sort each group and take its first three elements.

Starting from the result of the first groupby:

In [60]: df_agg = df.groupby(['job','source']).agg({'count':sum})

We group by the first level of the index:

In [63]: g = df_agg['count'].groupby('job', group_keys=False)

Then we want to sort (‘order’) each group and take the first three elements:

In [64]: res = g.apply(lambda x: x.sort_values(ascending=False).head(3))

However, there is a shortcut function for this, nlargest:

In [65]: g.nlargest(3)
Out[65]:
job     source
market  A         5
        D         4
        B         3
sales   E         7
        C         6
        B         4
dtype: int64

So in one go, this looks like:

df_agg['count'].groupby('job', group_keys=False).nlargest(3)
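Note that nlargest returns a Series, while the question asks for a dataframe with a count column. A self-contained sketch of the whole pipeline, assuming the question's data; the call to to_frame at the end is my own addition, not part of the answer above:

```python
import pandas as pd

# The dataframe from the question.
df = pd.DataFrame({
    "count": [2, 4, 6, 3, 7, 5, 3, 2, 4, 1],
    "job": ["sales"] * 5 + ["market"] * 5,
    "source": list("ABCDE") * 2,
})

# First groupby: one summed count per (job, source) pair.
df_agg = df.groupby(["job", "source"]).agg({"count": "sum"})

# Second groupby on the 'job' index level: keep the 3 largest counts per job,
# then turn the resulting Series back into a one-column DataFrame.
top3 = (
    df_agg["count"]
    .groupby(level="job", group_keys=False)
    .nlargest(3)
    .to_frame("count")
)
print(top3)
```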

Method 3

Here’s another example of taking the top 3 in sorted order, and of sorting within the groups:

In [43]: import pandas as pd                                                                                                                                                       

In [44]:  df = pd.DataFrame({"name":["Foo", "Foo", "Baar", "Foo", "Baar", "Foo", "Baar", "Baar"], "count_1":[5,10,12,15,20,25,30,35], "count_2" :[100,150,100,25,250,300,400,500]})

In [45]: df                                                                                                                                                                        
Out[45]: 
   count_1  count_2  name
0        5      100   Foo
1       10      150   Foo
2       12      100  Baar
3       15       25   Foo
4       20      250  Baar
5       25      300   Foo
6       30      400  Baar
7       35      500  Baar


### Top 3 on sorted order:
In [46]: df.groupby(["name"])["count_1"].nlargest(3)                                                                                                                               
Out[46]: 
name   
Baar  7    35
      6    30
      4    20
Foo   5    25
      3    15
      1    10
dtype: int64


### Sorting within groups based on column "count_1":
In [48]: df.groupby(["name"]).apply(lambda x: x.sort_values(["count_1"], ascending = False)).reset_index(drop=True)
Out[48]: 
   count_1  count_2  name
0       35      500  Baar
1       30      400  Baar
2       20      250  Baar
3       12      100  Baar
4       25      300   Foo
5       15       25   Foo
6       10      150   Foo
7        5      100   Foo

Method 4

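Try this instead. It is a simple way to group, sum, and sort in descending order. Note that it sorts the group totals overall rather than within each group, and that the column names come from a different dataframe than the one in the question: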

df.groupby(['companyName'])['overallRating'].sum().sort_values(ascending=False).head(20)
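To see how this differs from a per-group sort, here is a rough sketch of the same pattern applied to the question's dataframe. It assumes job/source and count play the roles of companyName and overallRating; the names totals and overall are illustrative:

```python
import pandas as pd

# The dataframe from the question.
df = pd.DataFrame({
    "count": [2, 4, 6, 3, 7, 5, 3, 2, 4, 1],
    "job": ["sales"] * 5 + ["market"] * 5,
    "source": list("ABCDE") * 2,
})

# Sum per (job, source), then sort ALL totals together and keep the top 3.
totals = df.groupby(["job", "source"])["count"].sum()
overall = totals.sort_values(ascending=False).head(3)
print(overall)
```

The three rows that come back are the largest totals across the whole table (two from sales, one from market), not the top three of each job.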

Method 5

If you don’t need to sum a column, then use @tvashtar’s answer. If you do need to sum, then you can use @joris’ answer or this one which is very similar to it.

df.groupby(['job']).apply(lambda x: (x.groupby('source')
                                      .sum()
                                      .sort_values('count', ascending=False))
                                     .head(3))

Method 6

I was getting this error without using “by”:

TypeError: sort_values() missing 1 required positional argument: ‘by’

So, I changed it to this and now it’s working:

df.groupby(['job','source']).agg({'count':sum}).sort_values(by='count',ascending=False).head(20)

Method 7

You can do it in one line:

df.groupby(['job']).apply(lambda x: x.sort_values(['count'], ascending=False).head(3)
.drop('job', axis=1))

apply() takes each group produced by groupby and passes it to the lambda function as x.
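One way to picture what happens to each group is to write the loop out by hand. This is only a sketch of the idea, assuming the question's dataframe; it is not how pandas implements apply internally, and the names pieces and result are made up for the illustration:

```python
import pandas as pd

# The dataframe from the question.
df = pd.DataFrame({
    "count": [2, 4, 6, 3, 7, 5, 3, 2, 4, 1],
    "job": ["sales"] * 5 + ["market"] * 5,
    "source": list("ABCDE") * 2,
})

pieces = []
for job, group in df.groupby("job"):
    # 'group' is the sub-DataFrame for one job, i.e. what the lambda sees as x.
    top = group.sort_values("count", ascending=False).head(3)
    pieces.append(top)

# Stitch the per-group results back together, one group after another.
result = pd.concat(pieces)
print(result)
```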

Method 8

@joris answer helped a lot.
This is what worked for me.

df.groupby(['job'])['count'].nlargest(3)

Method 9

When the grouped dataframe contains more than one aggregated column, the other methods drop the remaining columns.

import numpy as np
import pandas as pd

edf = pd.DataFrame({"job":["sales", "sales", "sales", "sales", "sales",
                           "market", "market", "market", "market", "market"],
                    "source":["A", "B", "C", "D", "E", "A", "B", "C", "D", "E"],
                    "count":[2, 4,6,3,7,5,3,2,4,1],
                    "other_col":[1,2,3,4,56,6,3,4,6,11]})

gdf = edf.groupby(["job", "source"]).agg({"count":sum, "other_col":np.mean})
gdf.groupby(level=0, group_keys=False).apply(lambda g:g.sort_values("count", ascending=False))

This keeps other_col and also orders the rows by the count column within each group.
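The same result can probably be reached without apply, because sort_values accepts index level names alongside column labels. A sketch under that assumption, using the same data as above; the head(3) step is added here to get the top three rows the question asks for, and top3 is an illustrative name:

```python
import pandas as pd

edf = pd.DataFrame({
    "job": ["sales"] * 5 + ["market"] * 5,
    "source": list("ABCDE") * 2,
    "count": [2, 4, 6, 3, 7, 5, 3, 2, 4, 1],
    "other_col": [1, 2, 3, 4, 56, 6, 3, 4, 6, 11],
})

# String aggregation names avoid passing Python/NumPy callables to agg.
gdf = edf.groupby(["job", "source"]).agg({"count": "sum", "other_col": "mean"})

# 'job' is an index level and 'count' is a column; sort_values accepts both.
ordered = gdf.sort_values(["job", "count"], ascending=[True, False])

# head(3) on the 'job' level keeps the first three rows of each group.
top3 = ordered.groupby(level="job").head(3)
print(top3)
```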

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
