I have a data frame with three string columns. I know that the only one value in the 3rd column is valid for every combination of the first two. To clean the data I have to group by data frame by first two columns and select most common value of the third column for each combination.
My code:
import pandas as pd
from scipy import stats
source = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'],
'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
'Short name' : ['NY','New','Spb','NY']})
print source.groupby(['Country','City']).agg(lambda x: stats.mode(x['Short name'])[0])
Last line of code doesn’t work, it says “Key error ‘Short name’” and if I try to group only by City, then I got an AssertionError. What can I do fix it?
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
Pandas >= 0.16
pd.Series.mode is available!
Use groupby, GroupBy.agg, and apply the pd.Series.mode function to each group:
source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode) Country City Russia Sankt-Petersburg Spb USA New-York NY Name: Short name, dtype: object
If this is needed as a DataFrame, use
source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode).to_frame()
Short name
Country City
Russia Sankt-Petersburg Spb
USA New-York NY
The useful thing about Series.mode is that it always returns a Series, making it very compatible with agg and apply, especially when reconstructing the groupby output. It is also faster.
# Accepted answer. %timeit source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0]) # Proposed in this post. %timeit source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode) 5.56 ms ± 343 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 2.76 ms ± 387 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Dealing with Multiple Modes
Series.mode also does a good job when there are multiple modes:
source2 = source.append(
pd.Series({'Country': 'USA', 'City': 'New-York', 'Short name': 'New'}),
ignore_index=True)
# Now `source2` has two modes for the
# ("USA", "New-York") group, they are "NY" and "New".
source2
Country City Short name
0 USA New-York NY
1 USA New-York New
2 Russia Sankt-Petersburg Spb
3 USA New-York NY
4 USA New-York New
source2.groupby(['Country','City'])['Short name'].agg(pd.Series.mode) Country City Russia Sankt-Petersburg Spb USA New-York [NY, New] Name: Short name, dtype: object
Or, if you want a separate row for each mode, you can use GroupBy.apply:
source2.groupby(['Country','City'])['Short name'].apply(pd.Series.mode)
Country City
Russia Sankt-Petersburg 0 Spb
USA New-York 0 NY
1 New
Name: Short name, dtype: object
If you don’t care which mode is returned as long as it’s either one of them, then you will need a lambda that calls mode and extracts the first result.
source2.groupby(['Country','City'])['Short name'].agg(
lambda x: pd.Series.mode(x)[0])
Country City
Russia Sankt-Petersburg Spb
USA New-York NY
Name: Short name, dtype: object
Alternatives to (not) consider
You can also use statistics.mode from python, but…
source.groupby(['Country','City'])['Short name'].apply(statistics.mode) Country City Russia Sankt-Petersburg Spb USA New-York NY Name: Short name, dtype: object
…it does not work well when having to deal with multiple modes; a StatisticsError is raised. This is mentioned in the docs:
If data is empty, or if there is not exactly one most common value,
StatisticsError is raised.
But you can see for yourself…
statistics.mode([1, 2]) # --------------------------------------------------------------------------- # StatisticsError Traceback (most recent call last) # ... # StatisticsError: no unique mode; found 2 equally common values
Method 2
You can use value_counts() to get a count series, and get the first row:
import pandas as pd
source = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'],
'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
'Short name' : ['NY','New','Spb','NY']})
source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])
In case you are wondering about performing other agg functions in the .agg()
try this.
# Let's add a new col, account
source['account'] = [1,2,3,3]
source.groupby(['Country','City']).agg(mod = ('Short name',
lambda x: x.value_counts().index[0]),
avg = ('account', 'mean')
)
Method 3
For agg, the lambba function gets a Series, which does not have a 'Short name' attribute.
stats.mode returns a tuple of two arrays, so you have to take the first element of the first array in this tuple.
With these two simple changements:
source.groupby(['Country','City']).agg(lambda x: stats.mode(x)[0][0])
returns
Short name Country City Russia Sankt-Petersburg Spb USA New-York NY
Method 4
A little late to the game here, but I was running into some performance issues with HYRY’s solution, so I had to come up with another one.
It works by finding the frequency of each key-value, and then, for each key, only keeping the value that appears with it most often.
There’s also an additional solution that supports multiple modes.
On a scale test that’s representative of the data I’m working with, this reduced runtime from 37.4s to 0.5s!
Here’s the code for the solution, some example usage, and the scale test:
import numpy as np
import pandas as pd
import random
import time
test_input = pd.DataFrame(columns=[ 'key', 'value'],
data= [[ 1, 'A' ],
[ 1, 'B' ],
[ 1, 'B' ],
[ 1, np.nan ],
[ 2, np.nan ],
[ 3, 'C' ],
[ 3, 'C' ],
[ 3, 'D' ],
[ 3, 'D' ]])
def mode(df, key_cols, value_col, count_col):
'''
Pandas does not provide a `mode` aggregation function
for its `GroupBy` objects. This function is meant to fill
that gap, though the semantics are not exactly the same.
The input is a DataFrame with the columns `key_cols`
that you would like to group on, and the column
`value_col` for which you would like to obtain the mode.
The output is a DataFrame with a record per group that has at least one mode
(null values are not counted). The `key_cols` are included as columns, `value_col`
contains a mode (ties are broken arbitrarily and deterministically) for each
group, and `count_col` indicates how many times each mode appeared in its group.
'''
return df.groupby(key_cols + [value_col]).size()
.to_frame(count_col).reset_index()
.sort_values(count_col, ascending=False)
.drop_duplicates(subset=key_cols)
def modes(df, key_cols, value_col, count_col):
'''
Pandas does not provide a `mode` aggregation function
for its `GroupBy` objects. This function is meant to fill
that gap, though the semantics are not exactly the same.
The input is a DataFrame with the columns `key_cols`
that you would like to group on, and the column
`value_col` for which you would like to obtain the modes.
The output is a DataFrame with a record per group that has at least
one mode (null values are not counted). The `key_cols` are included as
columns, `value_col` contains lists indicating the modes for each group,
and `count_col` indicates how many times each mode appeared in its group.
'''
return df.groupby(key_cols + [value_col]).size()
.to_frame(count_col).reset_index()
.groupby(key_cols + [count_col])[value_col].unique()
.to_frame().reset_index()
.sort_values(count_col, ascending=False)
.drop_duplicates(subset=key_cols)
print test_input
print mode(test_input, ['key'], 'value', 'count')
print modes(test_input, ['key'], 'value', 'count')
scale_test_data = [[random.randint(1, 100000),
str(random.randint(123456789001, 123456789100))] for i in range(1000000)]
scale_test_input = pd.DataFrame(columns=['key', 'value'],
data=scale_test_data)
start = time.time()
mode(scale_test_input, ['key'], 'value', 'count')
print time.time() - start
start = time.time()
modes(scale_test_input, ['key'], 'value', 'count')
print time.time() - start
start = time.time()
scale_test_input.groupby(['key']).agg(lambda x: x.value_counts().index[0])
print time.time() - start
Running this code will print something like:
key value 0 1 A 1 1 B 2 1 B 3 1 NaN 4 2 NaN 5 3 C 6 3 C 7 3 D 8 3 D key value count 1 1 B 2 2 3 C 2 key count value 1 1 2 [B] 2 3 2 [C, D] 0.489614009857 9.19386196136 37.4375009537
Hope this helps!
Method 5
The two top answers here suggest:
df.groupby(cols).agg(lambda x:x.value_counts().index[0])
or, preferably
df.groupby(cols).agg(pd.Series.mode)
However both of these fail in simple edge cases, as demonstrated here:
df = pd.DataFrame({
'client_id':['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C'],
'date':['2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01'],
'location':['NY', 'NY', 'LA', 'LA', 'DC', 'DC', 'LA', np.NaN]
})
The first:
df.groupby(['client_id', 'date']).agg(lambda x:x.value_counts().index[0])
yields IndexError (because of the empty Series returned by group C). The second:
df.groupby(['client_id', 'date']).agg(pd.Series.mode)
returns ValueError: Function does not reduce, since the first group returns a list of two (since there are two modes). (As documented here, if the first group returned a single mode this would work!)
Two possible solutions for this case are:
import scipy x.groupby(['client_id', 'date']).agg(lambda x: scipy.stats.mode(x)[0])
And the solution given to me by cs95 in the comments here:
def foo(x):
m = pd.Series.mode(x);
return m.values[0] if not m.empty else np.nan
df.groupby(['client_id', 'date']).agg(foo)
However, all of these are slow and not suited for large datasets. A solution I ended up using which a) can deal with these cases and b) is much, much faster, is a lightly modified version of abw33’s answer (which should be higher):
def get_mode_per_column(dataframe, group_cols, col):
return (dataframe.fillna(-1) # NaN placeholder to keep group
.groupby(group_cols + [col])
.size()
.to_frame('count')
.reset_index()
.sort_values('count', ascending=False)
.drop_duplicates(subset=group_cols)
.drop(columns=['count'])
.sort_values(group_cols)
.replace(-1, np.NaN)) # restore NaNs
group_cols = ['client_id', 'date']
non_grp_cols = list(set(df).difference(group_cols))
output_df = get_mode_per_column(df, group_cols, non_grp_cols[0]).set_index(group_cols)
for col in non_grp_cols[1:]:
output_df[col] = get_mode_per_column(df, group_cols, col)[col].values
Essentially, the method works on one col at a time and outputs a df, so instead of concat, which is intensive, you treat the first as a df, and then iteratively add the output array (values.flatten()) as a column in the df.
Method 6
Formally, the correct answer is the @eumiro Solution.
The problem of @HYRY solution is that when you have a sequence of numbers like [1,2,3,4] the solution is wrong, i. e., you don’t have the mode.
Example:
>>> import pandas as pd
>>> df = pd.DataFrame(
{
'client': ['A', 'B', 'A', 'B', 'B', 'C', 'A', 'D', 'D', 'E', 'E', 'E', 'E', 'E', 'A'],
'total': [1, 4, 3, 2, 4, 1, 2, 3, 5, 1, 2, 2, 2, 3, 4],
'bla': [10, 40, 30, 20, 40, 10, 20, 30, 50, 10, 20, 20, 20, 30, 40]
}
)
If you compute like @HYRY you obtain:
>>> print(df.groupby(['client']).agg(lambda x: x.value_counts().index[0]))
total bla
client
A 4 30
B 4 40
C 1 10
D 3 30
E 2 20
Which is clearly wrong (see the A value that should be 1 and not 4) because it can’t handle with unique values.
Thus, the other solution is correct:
>>> import scipy.stats
>>> print(df.groupby(['client']).agg(lambda x: scipy.stats.mode(x)[0][0]))
total bla
client
A 1 10
B 4 40
C 1 10
D 3 30
E 2 20
Method 7
If you don’t want to include NaN values, using Counter is much much faster than pd.Series.mode or pd.Series.value_counts()[0]:
def get_most_common(srs):
x = list(srs)
my_counter = Counter(x)
return my_counter.most_common(1)[0][0]
df.groupby(col).agg(get_most_common)
should work. This will fail when you have NaN values, as each NaN will be counted separately.
Method 8
If you want another approach for solving it that is does not depend on value_counts or scipy.stats you can use the Counter collection
from collections import Counter get_most_common = lambda values: max(Counter(values).items(), key = lambda x: x[1])[0]
Which can be applied to the above example like this
src = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'],
'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
'Short_name' : ['NY','New','Spb','NY']})
src.groupby(['Country','City']).agg(get_most_common)
Method 9
The problem here is the performance, if you have a lot of rows it will be a problem.
If it is your case, please try with this:
import pandas as pd
source = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'],
'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
'Short_name' : ['NY','New','Spb','NY']})
source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])
source.groupby(['Country','City']).Short_name.value_counts().groupby['Country','City']).first()
Method 10
A slightly clumsier but faster approach for larger datasets involves getting the counts for a column of interest, sorting the counts highest to lowest, and then de-duplicating on a subset to only retain the largest cases. The code example is following:
>>> import pandas as pd
>>> source = pd.DataFrame(
{
'Country': ['USA', 'USA', 'Russia', 'USA'],
'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
'Short name': ['NY', 'New', 'Spb', 'NY']
}
)
>>> grouped_df = source
.groupby(['Country','City','Short name'])[['Short name']]
.count()
.rename(columns={'Short name':'count'})
.reset_index()
.sort_values('count', ascending=False)
.drop_duplicates(subset=['Country', 'City'])
.drop('count', axis=1)
>>> print(grouped_df)
Country City Short name
1 USA New-York NY
0 Russia Sankt-Petersburg Spb
Method 11
Use DataFrame.value_counts for fast solution
The top 3 answers here:
source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])source.groupby(['Country','City']).agg(lambda x: stats.mode(x)[0])
are incredibly slow for large datasets.
Solution using collections.Counter is much faster (20-40x faster than the top 3 methods)
source.groupby(['Country', 'City'])['Short name'].agg(lambda srs: Counter(list(srs)).most_common(1)[0][0])
but still very slow.
Solutions by abw333 and Josh Friedlander are much faster (~10x faster than the method using Counter). These solutions can be further optimized by using value_counts instead (DataFrame.value_counts is available since pandas 1.1.0.). value_counts sorts the output in descending order by default, so there’s no need to call sort_values so the code is a little shorter.
source.value_counts(['Country', 'City', 'Short name']).reset_index(name='Count').drop_duplicates(['Country', 'City'])
To make the function account for NaNs like in Josh Friedlander’s function, simply turn off dropna parameter:
source.value_counts(['Country', 'City', 'Short name'], dropna=False).reset_index(name='Count').drop_duplicates(['Country', 'City'])
Using abw333’s setup, if we test the runtime difference, for a DataFrame with 1mil rows, value_counts is marginally faster than abw333’s solution
scale_test_data = [[random.randint(1, 100),
str(random.randint(100, 900)),
str(random.randint(0,2))] for i in range(1000000)]
source = pd.DataFrame(data=scale_test_data, columns=['Country', 'City', 'Short name'])
%timeit u = source.value_counts(['Country', 'City', 'Short name']).reset_index(name='Count').drop_duplicates(['Country', 'City'])
# 709 ms ± 3.88 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit v = mode(source, ['Country', 'City'], 'Short name', 'count')
# 728 ms ± 5.49 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0