Pandas, groupby and count

I have a dataframe say like this

>>> df = pd.DataFrame({'user_id':['a','a','s','s','s'],
                    'session':[4,5,4,5,5],
                    'revenue':[-1,0,1,2,1]})

>>> df
   revenue  session user_id
0       -1        4       a
1        0        5       a
2        1        4       s
3        2        5       s
4        1        5       s

And each value of session and revenue represents a kind of type, and I want to count the number of each kind say the number of revenue=-1 and session=4 of user_id=a is 1.

And I found simple call count() function after groupby() can’t output the result I want.

>>> df.groupby('user_id').count()
         revenue  session
user_id
a              2        2
s              3        3

How can I do that?

Contents hide

Answers:

Method 1

Method 2

pandas >= 1.1: df.value_counts is available!

Minimal Example

Performance

Method 3

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

You seem to want to group by several columns at once:

df.groupby(['revenue','session','user_id'])['user_id'].count()

should give you what you want

Method 2

pandas >= 1.1: `df.value_counts` is available!

From pandas 1.1, this will be my recommended method for counting the number of rows in groups (i.e., the group size). To count the number of non-nan rows in a group for a specific column, check out the accepted answer.

Old

df.groupby(['A', 'B']).size()   # df.groupby(['A', 'B'])['C'].count()

New [✓]

df.value_counts(subset=['A', 'B'])

Note that size and count are not identical, the former counts all rows per group, the latter counts non-null rows only. See this other answer of mine for more.

Minimal Example

pd.__version__
# '1.1.0.dev0+2004.g8d10bfb6f'

df = pd.DataFrame({'num_legs': [2, 4, 4, 6],
                   'num_wings': [2, 0, 0, 0]},
                  index=['falcon', 'dog', 'cat', 'ant'])
df
        num_legs  num_wings
falcon         2          2
dog            4          0
cat            4          0
ant            6          0

df.value_counts(subset=['num_legs', 'num_wings'], sort=False)

num_legs  num_wings
2         2            1
4         0            2
6         0            1
dtype: int64

Compare this output with

df.groupby(['num_legs', 'num_wings'])['num_legs'].size()

num_legs  num_wings
2         2            1
4         0            2
6         0            1
Name: num_legs, dtype: int64

Performance

It’s also faster if you don’t sort the result:

%timeit df.groupby(['num_legs', 'num_wings'])['num_legs'].count()
%timeit df.value_counts(subset=['num_legs', 'num_wings'], sort=False)

640 µs ± 28.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
568 µs ± 6.88 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Method 3

I struggled with the same issue, made use of the solution provided above. You can actually designate any of the columns to count:

df.groupby(['revenue','session','user_id'])['revenue'].count()

and

df.groupby(['revenue','session','user_id'])['session'].count()

would give the same answer.

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes

Article Rating