I need to count unique ID values in every domain.
I have data:
ID, domain 123, 'vk.com' 123, 'vk.com' 123, 'twitter.com' 456, 'vk.com' 456, 'facebook.com' 456, 'vk.com' 456, 'google.com' 789, 'twitter.com' 789, 'vk.com'
I try df.groupby(['domain', 'ID']).count()
But I want to get
domain, count vk.com 3 twitter.com 2 facebook.com 1 google.com 1
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
You need nunique:
df = df.groupby('domain')['ID'].nunique()
print (df)
domain
'facebook.com' 1
'google.com' 1
'twitter.com' 2
'vk.com' 3
Name: ID, dtype: int64
If you need to strip ' characters:
df = df.ID.groupby([df.domain.str.strip("'")]).nunique()
print (df)
domain
facebook.com 1
google.com 1
twitter.com 2
vk.com 3
Name: ID, dtype: int64
Or as Jon Clements commented:
df.groupby(df.domain.str.strip("'"))['ID'].nunique()
You can retain the column name like this:
df = df.groupby(by='domain', as_index=False).agg({'ID': pd.Series.nunique})
print(df)
domain ID
0 fb 1
1 ggl 1
2 twitter 2
3 vk 3
The difference is that nunique() returns a Series and agg() returns a DataFrame.
Method 2
Generally to count distinct values in single column, you can use Series.value_counts:
df.domain.value_counts() #'vk.com' 5 #'twitter.com' 2 #'facebook.com' 1 #'google.com' 1 #Name: domain, dtype: int64
To see how many unique values in a column, use Series.nunique:
df.domain.nunique() # 4
To get all these distinct values, you can use unique or drop_duplicates, the slight difference between the two functions is that unique return a numpy.array while drop_duplicates returns a pandas.Series:
df.domain.unique() # array(["'vk.com'", "'twitter.com'", "'facebook.com'", "'google.com'"], dtype=object) df.domain.drop_duplicates() #0 'vk.com' #2 'twitter.com' #4 'facebook.com' #6 'google.com' #Name: domain, dtype: object
As for this specific problem, since you’d like to count distinct value with respect to another variable, besides groupby method provided by other answers here, you can also simply drop duplicates firstly and then do value_counts():
import pandas as pd df.drop_duplicates().domain.value_counts() # 'vk.com' 3 # 'twitter.com' 2 # 'facebook.com' 1 # 'google.com' 1 # Name: domain, dtype: int64
Method 3
df.domain.value_counts()
>>> df.domain.value_counts() vk.com 5 twitter.com 2 google.com 1 facebook.com 1 Name: domain, dtype: int64
Method 4
If I understand correctly, you want the number of different IDs for every domain. Then you can try this:
output = df.drop_duplicates()
output.groupby('domain').size()
Output:
domain facebook.com 1 google.com 1 twitter.com 2 vk.com 3 dtype: int64
You could also use value_counts, which is slightly less efficient. But the best is Jezrael’s answer using nunique:
%timeit df.drop_duplicates().groupby('domain').size()
1000 loops, best of 3: 939 µs per loop
%timeit df.drop_duplicates().domain.value_counts()
1000 loops, best of 3: 1.1 ms per loop
%timeit df.groupby('domain')['ID'].nunique()
1000 loops, best of 3: 440 µs per loop
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0