I want to group my dataframe by two columns and then sort the aggregated results within the groups.
In [167]: df
Out[167]:
   count     job source
0      2   sales      A
1      4   sales      B
2      6   sales      C
3      3   sales      D
4      7   sales      E
5      5  market      A
6      3  market      B
7      2  market      C
8      4  market      D
9      1  market      E
In [168]: df.groupby(['job','source']).agg({'count':sum})
Out[168]:
               count
job    source
market A           5
       B           3
       C           2
       D           4
       E           1
sales  A           2
       B           4
       C           6
       D           3
       E           7
I would now like to sort the count column in descending order within each of the groups and then take only the top three rows, to get something like:
               count
job    source
market A           5
       D           4
       B           3
sales  E           7
       C           6
       B           4
Answers:
Method 1
You could also just do it in one go, by doing the sort first and using head to take the first 3 of each group.
In [34]: df.sort_values(['job','count'],ascending=False).groupby('job').head(3)
Out[34]:
   count     job source
4      7   sales      E
2      6   sales      C
1      4   sales      B
5      5  market      A
8      4  market      D
6      3  market      B
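As a self-contained sketch of the same idea, the snippet below rebuilds the question's frame, sorts, and keeps the first three rows per job. Sorting `job` ascending and moving `job` and `source` into the index are assumptions made here so the result matches the layout asked for in the question; this only works without a sum because each (job, source) pair occurs once in the data.

```python
import pandas as pd

# The frame from the question.
df = pd.DataFrame({
    "count": [2, 4, 6, 3, 7, 5, 3, 2, 4, 1],
    "job": ["sales"] * 5 + ["market"] * 5,
    "source": list("ABCDE") * 2,
})

# Sort by job (ascending) and count (descending), then keep the
# first 3 rows of each job; head() preserves the sorted row order.
top3 = (
    df.sort_values(["job", "count"], ascending=[True, False])
      .groupby("job")
      .head(3)
      .set_index(["job", "source"])
)
print(top3)
```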
Method 2
What you want to do is actually again a groupby (on the result of the first groupby): sort and take the first three elements per group.
Starting from the result of the first groupby:
In [60]: df_agg = df.groupby(['job','source']).agg({'count':sum})
We group by the first level of the index:
In [63]: g = df_agg['count'].groupby('job', group_keys=False)
Then we want to sort (‘order’) each group and take the first three elements:
In [64]: res = g.apply(lambda x: x.sort_values(ascending=False).head(3))
However, there is a shortcut function for this, nlargest:
In [65]: g.nlargest(3)
Out[65]:
job     source
market  A         5
        D         4
        B         3
sales   E         7
        C         6
        B         4
dtype: int64
So in one go, this looks like:
df_agg['count'].groupby('job', group_keys=False).nlargest(3)
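For readers who want to run this end to end, here is a minimal sketch that rebuilds the question's frame and applies the aggregate-then-nlargest steps. Using the string `"sum"` and `level="job"` instead of the builtin `sum` and a bare `'job'` are choices made for this sketch, not part of the answer above.

```python
import pandas as pd

# The frame from the question.
df = pd.DataFrame({
    "count": [2, 4, 6, 3, 7, 5, 3, 2, 4, 1],
    "job": ["sales"] * 5 + ["market"] * 5,
    "source": list("ABCDE") * 2,
})

# Step 1: one summed count per (job, source) pair.
df_agg = df.groupby(["job", "source"]).agg({"count": "sum"})

# Step 2: regroup the aggregated Series by the 'job' index level and
# keep the 3 largest values per job. group_keys=False avoids adding
# 'job' to the index a second time.
top3 = df_agg["count"].groupby(level="job", group_keys=False).nlargest(3)
print(top3)
```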
Method 3
Here’s another example of taking the top 3 in sorted order, and of sorting within the groups:
In [43]: import pandas as pd
In [44]: df = pd.DataFrame({"name":["Foo", "Foo", "Baar", "Foo", "Baar", "Foo", "Baar", "Baar"], "count_1":[5,10,12,15,20,25,30,35], "count_2" :[100,150,100,25,250,300,400,500]})
In [45]: df
Out[45]:
   count_1  count_2  name
0        5      100   Foo
1       10      150   Foo
2       12      100  Baar
3       15       25   Foo
4       20      250  Baar
5       25      300   Foo
6       30      400  Baar
7       35      500  Baar
### Top 3 on sorted order:
In [46]: df.groupby(["name"])["count_1"].nlargest(3)
Out[46]:
name
Baar  7    35
      6    30
      4    20
Foo   5    25
      3    15
      1    10
dtype: int64
### Sorting within groups based on column "count_1":
In [48]: df.groupby(["name"]).apply(lambda x: x.sort_values(["count_1"], ascending = False)).reset_index(drop=True)
Out[48]:
   count_1  count_2  name
0       35      500  Baar
1       30      400  Baar
2       20      250  Baar
3       12      100  Baar
4       25      300   Foo
5       15       25   Foo
6       10      150   Foo
7        5      100   Foo
Method 4
Try this instead, which is a simple way to do a groupby and sort the results in descending order:
df.groupby(['companyName'])['overallRating'].sum().sort_values(ascending=False).head(20)
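The column names `companyName` and `overallRating` do not appear in the question, so the sketch below applies the same pattern to the question's frame instead (an assumption for illustration). It suggests that this pattern ranks the group totals against each other rather than sorting rows inside each group.

```python
import pandas as pd

# The frame from the question.
df = pd.DataFrame({
    "count": [2, 4, 6, 3, 7, 5, 3, 2, 4, 1],
    "job": ["sales"] * 5 + ["market"] * 5,
    "source": list("ABCDE") * 2,
})

# One total per job, with the jobs ordered from largest to smallest total.
totals = df.groupby("job")["count"].sum().sort_values(ascending=False)
print(totals)
```

The result has one row per job, so it answers "which groups are largest" rather than "which rows are largest within each group".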
Method 5
If you don’t need to sum a column, then use @tvashtar’s answer. If you do need to sum, then you can use @joris’ answer or this one which is very similar to it.
df.groupby(['job']).apply(lambda x: (x.groupby('source')
                                      .sum()
                                      .sort_values('count', ascending=False))
                                     .head(3))
Method 6
I was getting this error without using “by”:
TypeError: sort_values() missing 1 required positional argument: 'by'
So, I changed it to this and now it’s working:
df.groupby(['job','source']).agg({'count':sum}).sort_values(by='count',ascending=False).head(20)
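A likely reason for the error is that `agg({...})` returns a DataFrame, whose `sort_values` requires `by`, whereas selecting the column and summing returns a Series, whose `sort_values` takes no `by`. The sketch below, built on the question's frame, shows both cases; the variable names are made up for this example.

```python
import pandas as pd

# The frame from the question.
df = pd.DataFrame({
    "count": [2, 4, 6, 3, 7, 5, 3, 2, 4, 1],
    "job": ["sales"] * 5 + ["market"] * 5,
    "source": list("ABCDE") * 2,
})

agg_df = df.groupby(["job", "source"]).agg({"count": "sum"})  # DataFrame
agg_s = df.groupby(["job", "source"])["count"].sum()          # Series

# DataFrame.sort_values must be told which column to sort on.
top_df = agg_df.sort_values(by="count", ascending=False).head(3)

# Series.sort_values has a single column of values, so there is no 'by'.
top_s = agg_s.sort_values(ascending=False).head(3)

# Omitting 'by' on the DataFrame reproduces the TypeError.
try:
    agg_df.sort_values(ascending=False)
    raised = False
except TypeError:
    raised = True

print(top_df, top_s, raised, sep="\n")
```

Note that both calls sort across all rows, so the top entries can come from any job rather than from each job separately.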
Method 7
You can do it in one line –
df.groupby(['job']).apply(lambda x: x.sort_values(['count'], ascending=False).head(3)
                                     .drop('job', axis=1))
apply() takes each group produced by groupby and passes it to the lambda function as x.
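To make the role of `x` concrete, here is a rough sketch that spells out the same per-group work as an explicit loop over the groups of the question's frame. It is an illustration of the idea only, not how pandas implements `apply()` internally, and the names `pieces` and `result` are made up for this example.

```python
import pandas as pd

# The frame from the question.
df = pd.DataFrame({
    "count": [2, 4, 6, 3, 7, 5, 3, 2, 4, 1],
    "job": ["sales"] * 5 + ["market"] * 5,
    "source": list("ABCDE") * 2,
})

pieces = []
for job, group in df.groupby("job"):
    # 'group' is the sub-DataFrame for one job, i.e. what the lambda
    # receives as x; sort it and keep its top 3 rows.
    pieces.append(group.sort_values("count", ascending=False).head(3))

# Stitch the per-group results back together.
result = pd.concat(pieces)
print(result)
```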
Method 8
@joris' answer helped a lot.
This is what worked for me.
df.groupby(['job'])['count'].nlargest(3)
Method 9
When the grouped dataframe contains more than one aggregated column, the other methods drop the remaining columns.
import numpy as np
import pandas as pd

edf = pd.DataFrame({"job":["sales", "sales", "sales", "sales", "sales",
                           "market", "market", "market", "market", "market"],
                    "source":["A", "B", "C", "D", "E", "A", "B", "C", "D", "E"],
                    "count":[2, 4,6,3,7,5,3,2,4,1],
                    "other_col":[1,2,3,4,56,6,3,4,6,11]})
# Sum count and average other_col for each (job, source) pair.
gdf = edf.groupby(["job", "source"]).agg({"count":sum, "other_col":np.mean})
# Sort each job's rows by count, keeping every aggregated column.
gdf.groupby(level=0, group_keys=False).apply(lambda g:g.sort_values("count", ascending=False))
This keeps other_col and also orders the rows by the count column within each group.
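The question also asks for only the top three rows per group. One way to get that while keeping other_col, sketched here under the assumption that chaining `.head(3)` inside the lambda is acceptable, is shown below; the string aggregators `"sum"` and `"mean"` are a substitution made for this sketch.

```python
import pandas as pd

edf = pd.DataFrame({
    "job": ["sales"] * 5 + ["market"] * 5,
    "source": list("ABCDE") * 2,
    "count": [2, 4, 6, 3, 7, 5, 3, 2, 4, 1],
    "other_col": [1, 2, 3, 4, 56, 6, 3, 4, 6, 11],
})

# Sum count and average other_col for each (job, source) pair.
gdf = edf.groupby(["job", "source"]).agg({"count": "sum", "other_col": "mean"})

# Within each job, sort by count (descending) and keep the top 3 rows;
# every aggregated column is carried along.
top3 = gdf.groupby(level="job", group_keys=False).apply(
    lambda g: g.sort_values("count", ascending=False).head(3)
)
print(top3)
```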
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.