I have the following dataframe:
df = pd.DataFrame([
(1, 1, 'term1'),
(1, 2, 'term2'),
(1, 1, 'term1'),
(1, 1, 'term2'),
(2, 2, 'term3'),
(2, 3, 'term1'),
(2, 2, 'term1')
], columns=['id', 'group', 'term'])
I want to group it by id and group and calculate the number of each term for this id, group pair.
So in the end I am going to get something like this:
I was able to achieve what I want by looping over all the rows with df.iterrows() and creating a new dataframe, but this is clearly inefficient. (If it helps, I know the list of all terms beforehand and there are ~10 of them).
It looks like I have to group by and then count values, so I tried that with df.groupby(['id', 'group']).value_counts() which does not work because value_counts operates on the groupby series and not a dataframe.
Anyway I can achieve this without looping?
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
I use groupby and size
df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0)
Timing
1,000,000 rows
df = pd.DataFrame(dict(id=np.random.choice(100, 1000000),
group=np.random.choice(20, 1000000),
term=np.random.choice(10, 1000000)))
Method 2
using pivot_table() method:
In [22]: df.pivot_table(index=['id','group'], columns='term', aggfunc='size', fill_value=0) Out[22]: term term1 term2 term3 id group 1 1 2 1 0 2 0 1 0 2 2 1 0 1 3 1 0 0
Timing against 700K rows DF:
In [24]: df = pd.concat([df] * 10**5, ignore_index=True) In [25]: df.shape Out[25]: (700000, 3) In [3]: %timeit df.groupby(['id', 'group', 'term'])['term'].size().unstack(fill_value=0) 1 loop, best of 3: 226 ms per loop In [4]: %timeit df.pivot_table(index=['id','group'], columns='term', aggfunc='size', fill_value=0) 1 loop, best of 3: 236 ms per loop In [5]: %timeit pd.crosstab([df.id, df.group], df.term) 1 loop, best of 3: 355 ms per loop In [6]: %timeit df.groupby(['id','group','term'])['term'].size().unstack().fillna(0).astype(int) 1 loop, best of 3: 232 ms per loop In [7]: %timeit df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0) 1 loop, best of 3: 231 ms per loop
Timing against 7M rows DF:
In [9]: df = pd.concat([df] * 10, ignore_index=True) In [10]: df.shape Out[10]: (7000000, 3) In [11]: %timeit df.groupby(['id', 'group', 'term'])['term'].size().unstack(fill_value=0) 1 loop, best of 3: 2.27 s per loop In [12]: %timeit df.pivot_table(index=['id','group'], columns='term', aggfunc='size', fill_value=0) 1 loop, best of 3: 2.3 s per loop In [13]: %timeit pd.crosstab([df.id, df.group], df.term) 1 loop, best of 3: 3.37 s per loop In [14]: %timeit df.groupby(['id','group','term'])['term'].size().unstack().fillna(0).astype(int) 1 loop, best of 3: 2.28 s per loop In [15]: %timeit df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0) 1 loop, best of 3: 1.89 s per loop
Method 3
Instead of remembering lengthy solutions, how about the one that pandas has built in for you:
df.groupby(['id', 'group', 'term']).count()
Method 4
You can use crosstab:
print (pd.crosstab([df.id, df.group], df.term)) term term1 term2 term3 id group 1 1 2 1 0 2 0 1 0 2 2 1 0 1 3 1 0 0
Another solution with groupby with aggregating size, reshaping by unstack:
df.groupby(['id', 'group', 'term'])['term'].size().unstack(fill_value=0) term term1 term2 term3 id group 1 1 2 1 0 2 0 1 0 2 2 1 0 1 3 1 0 0
Timings:
df = pd.concat([df]*10000).reset_index(drop=True) In [48]: %timeit (df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0)) 100 loops, best of 3: 12.4 ms per loop In [49]: %timeit (df.groupby(['id', 'group', 'term'])['term'].size().unstack(fill_value=0)) 100 loops, best of 3: 12.2 ms per loop
Method 5
If you want to use value_counts you can use it on a given series, and resort to the following:
df.groupby(["id", "group"])["term"].value_counts().unstack(fill_value=0)
or in an equivalent fashion, using the .agg method:
df.groupby(["id", "group"]).agg({"term": "value_counts"}).unstack(fill_value=0)
Another option is to directly use value_counts on the DataFrame itself without resorting to groupby:
df.value_counts().unstack(fill_value=0)
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0



