I have a dataframe like this:
>>> df = pd.DataFrame({'user_id':['a','a','s','s','s'],
'session':[4,5,4,5,5],
'revenue':[-1,0,1,2,1]})
>>> df
revenue session user_id
0 -1 4 a
1 0 5 a
2 1 4 s
3 2 5 s
4 1 5 s
Each value of session and revenue represents a kind of type, and I want to count the rows for each combination. For example, the number of rows with revenue=-1 and session=4 for user_id=a is 1.
I found that simply calling count() after groupby() doesn't output the result I want.
>>> df.groupby('user_id').count()
revenue session
user_id
a 2 2
s 3 3
How can I do that?
Answers:
Method 1
You seem to want to group by several columns at once:
df.groupby(['revenue','session','user_id'])['user_id'].count()
should give you what you want.
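As a rough sketch of how this plays out on the sample frame from the question: the variant below uses size() instead of count() (the two agree here because the sample has no missing values) and flattens the result into a table. The column name 'count' and the variable names are illustrative choices, not part of the original answer.

```python
import pandas as pd

# The sample frame from the question.
df = pd.DataFrame({'user_id': ['a', 'a', 's', 's', 's'],
                   'session': [4, 5, 4, 5, 5],
                   'revenue': [-1, 0, 1, 2, 1]})

# One row per (user_id, session, revenue) combination, with its row count.
# 'count' is just an illustrative name for the new column.
counts = (df.groupby(['user_id', 'session', 'revenue'])
            .size()
            .reset_index(name='count'))
print(counts)

# Look up a single combination: user_id='a', session=4, revenue=-1.
by_key = counts.set_index(['user_id', 'session', 'revenue'])['count']
n = by_key.loc[('a', 4, -1)]
print(n)
```

Putting user_id first in the key list only changes the order of the index levels, not the counts.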
Method 2
pandas >= 1.1: df.value_counts is available!
From pandas 1.1, this will be my recommended method for counting the number of rows in groups (i.e., the group size). To count the number of non-NaN rows in a group for a specific column, check out the accepted answer.
Old
df.groupby(['A', 'B']).size() # df.groupby(['A', 'B'])['C'].count()
New [✓]
df.value_counts(subset=['A', 'B'])
Note that size and count are not identical: the former counts all rows per group, while the latter counts non-null rows only. See this other answer of mine for more.
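To make the size/count difference concrete, here is a small sketch on a made-up frame (the frame `toy` and its columns A, B, C are hypothetical, chosen only so that one row has a missing value in C). It assumes pandas 1.1 or later for `DataFrame.value_counts`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the second row has a missing value in column C.
toy = pd.DataFrame({'A': ['x', 'x', 'y'],
                    'B': [1, 1, 2],
                    'C': [10.0, np.nan, 30.0]})

# size() counts every row in each (A, B) group.
sizes = toy.groupby(['A', 'B']).size()

# count() on column C skips the row where C is NaN.
non_null = toy.groupby(['A', 'B'])['C'].count()

# value_counts(subset=...) counts rows per combination, like size().
vc = toy.value_counts(subset=['A', 'B'], sort=False)

print(sizes)
print(non_null)
print(vc)
```

For the group (x, 1), size() and value_counts report 2 rows while count() on C reports 1.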
Minimal Example
pd.__version__
# '1.1.0.dev0+2004.g8d10bfb6f'
df = pd.DataFrame({'num_legs': [2, 4, 4, 6],
'num_wings': [2, 0, 0, 0]},
index=['falcon', 'dog', 'cat', 'ant'])
df
num_legs num_wings
falcon 2 2
dog 4 0
cat 4 0
ant 6 0
df.value_counts(subset=['num_legs', 'num_wings'], sort=False)
num_legs  num_wings
2         2            1
4         0            2
6         0            1
dtype: int64
Compare this output with
df.groupby(['num_legs', 'num_wings'])['num_legs'].size()
num_legs  num_wings
2         2            1
4         0            2
6         0            1
Name: num_legs, dtype: int64
Performance
It’s also faster if you don’t sort the result:
%timeit df.groupby(['num_legs', 'num_wings'])['num_legs'].count()
%timeit df.value_counts(subset=['num_legs', 'num_wings'], sort=False)
640 µs ± 28.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
568 µs ± 6.88 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Method 3
I struggled with the same issue and made use of the solution provided above. You can actually designate any of the columns to count:
df.groupby(['revenue','session','user_id'])['revenue'].count()
and
df.groupby(['revenue','session','user_id'])['session'].count()
would give the same answer.
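A quick way to check this on the question's sample frame is sketched below. The variable names are illustrative, and the comparison looks only at the index and the counts, since the two results carry different Series names.

```python
import pandas as pd

# The sample frame from the question.
df = pd.DataFrame({'user_id': ['a', 'a', 's', 's', 's'],
                   'session': [4, 5, 4, 5, 5],
                   'revenue': [-1, 0, 1, 2, 1]})

keys = ['revenue', 'session', 'user_id']

# Count two different columns over the same groups.
by_revenue = df.groupby(keys)['revenue'].count()
by_session = df.groupby(keys)['session'].count()

# Same group labels and same counts; only the Series name differs.
same_index = by_revenue.index.equals(by_session.index)
same_counts = by_revenue.tolist() == by_session.tolist()
print(same_index, same_counts)
```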
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.