How do I filter a pandas DataFrame based on value counts?

I’m working in Python with a pandas DataFrame of video games, each with a genre. I’m trying to remove any video game whose genre appears fewer than some number of times in the DataFrame, but I have no clue how to go about this. I did find a StackOverflow question that seems to be related, but I can’t decipher the solution at all (possibly because I’ve never heard of R and my memory of functional programming is rusty at best).

Help?

Answers:

Method 1

Use groupby filter:

In [11]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])

In [12]: df
Out[12]:
   A  B
0  1  2
1  1  4
2  5  6

In [13]: df.groupby("A").filter(lambda x: len(x) > 1)
Out[13]:
   A  B
0  1  2
1  1  4

I recommend reading the split-apply-combine section of the docs.
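Applied to the situation in the question, the same idea might look like the sketch below. The column names, the sample rows, and the min_count threshold are assumptions made up for illustration, since the original DataFrame is not shown.

```python
import pandas as pd

# Hypothetical data: titles, genres, and the threshold are illustrative only.
games = pd.DataFrame({
    "title": ["Doom", "Quake", "Halo", "Myst", "FIFA", "NBA 2K"],
    "genre": ["Shooter", "Shooter", "Shooter", "Puzzle", "Sports", "Sports"],
})
min_count = 2

# filter() calls the function once per genre group and keeps the rows of
# every group for which it returns True; the original row order is preserved.
kept = games.groupby("genre").filter(lambda g: len(g) >= min_count)
print(kept)
```

Here the single Puzzle game is dropped because its genre appears only once.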

Method 2

A faster solution should be GroupBy.transform with 'size', which returns the per-group count as a Series the same length as the original df, so you can filter by boolean indexing:

df1 = df[df.groupby("A")['A'].transform('size') > 1]

Or use Series.map with Series.value_counts:

df1 = df[df['A'].map(df['A'].value_counts()) > 1]
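To see why both one-liners select the same rows, it can help to look at the intermediate Series. The sketch below reuses the small df from Method 1; the variable names sizes, counts, and mask are only illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=["A", "B"])

# Per-row group size, aligned with the original index.
sizes = df.groupby("A")["A"].transform("size")

# Same numbers, obtained by looking each value up in the value_counts table.
counts = df["A"].map(df["A"].value_counts())

# Boolean mask: True where the value in A occurs more than once.
mask = sizes > 1
df1 = df[mask]
print(df1)
```

Both sizes and counts hold one count per row, so comparing either of them to a threshold gives a mask that can index df directly.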

Method 3

@jezael's solution works very well. Here is a different approach to filtering based on value counts:

For example, if the dataset is:

df = pd.DataFrame({'a': [1,2,3,3,1,6], 'b': [11,2,33,4,55,6]})

Convert the counts to a dictionary and save it:

count_freq = dict(df['a'].value_counts())

Create a new column as a copy of the target column, then map the dictionary onto the newly created column:

df['count_freq'] = df['a']
df['count_freq'] = df['count_freq'].map(count_freq)

Now we have a new column holding the count frequency, so you can define a threshold and filter easily on this column:

df[df.count_freq>1]
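Put together, the steps above might look like the following sketch. It uses assign so that the original df is left unchanged, and drops the helper column at the end; the threshold value and the names with_counts, filtered, and result are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 3, 1, 6], "b": [11, 2, 33, 4, 55, 6]})
threshold = 1  # hypothetical cut-off: keep values seen more than once

# Step 1: value -> number of occurrences, as a plain dict.
count_freq = df["a"].value_counts().to_dict()

# Step 2: attach the count to every row without modifying df itself.
with_counts = df.assign(count_freq=df["a"].map(count_freq))

# Step 3: filter on the helper column, then drop it again.
filtered = with_counts[with_counts["count_freq"] > threshold]
result = filtered.drop(columns="count_freq")
print(result)
```

The values 2 and 6 each occur once in column a, so their rows are removed.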

Method 4

Additionally, in case one wants to filter and keep a ‘count’ column:

attr = 'A'
limit = 10
df2 = df.groupby(attr)[attr].agg(count='count')
df2 = df2.loc[df2['count'] > limit].reset_index()
print(df2)

# outputs one row per value of 'A' whose count is > 10, with columns 'A' and 'count'
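Note that df2 is a summary table with one row per group, not the original rows. If the goal is still to filter the original DataFrame, the surviving keys can be fed to isin, as in this sketch. The sample data, the lower limit, and the names counts, frequent, and rows are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data; limit is lowered so the tiny example keeps something.
df = pd.DataFrame({"A": ["x", "x", "x", "y", "z", "z"], "B": range(6)})
attr = "A"
limit = 1

# One row per distinct value of A, with its count in a 'count' column.
counts = df.groupby(attr)[attr].agg(count="count")
frequent = counts.loc[counts["count"] > limit].reset_index()

# Go back to the original rows whose key survived the threshold.
rows = df[df[attr].isin(frequent[attr])]
print(frequent)
print(rows)
```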

Method 5

I might be a little late to this party but:

df = pd.DataFrame(df_you_have.groupby(['IdA', 'SomeOtherA'])['theA_you_want_to_count'].count())
df.reset_index(inplace=True)

This is how you create a new DataFrame of counts, and then you just filter it on the count column…

df[df['theA_you_want_to_count'] > 100]
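A runnable sketch of the same pattern with two grouping keys is shown below. All column names and rows are made up for illustration, and the count column is given an explicit name (n_titles, an assumption) through reset_index(name=...) so it is not confused with the column being counted.

```python
import pandas as pd

# Hypothetical data standing in for df_you_have.
games = pd.DataFrame({
    "genre": ["Shooter", "Shooter", "Shooter", "Puzzle", "Sports"],
    "platform": ["PC", "PC", "Xbox", "PC", "PS"],
    "title": ["Doom", "Quake", "Halo", "Myst", "FIFA"],
})

# Count titles per (genre, platform) pair and flatten back to columns.
counts = (
    games.groupby(["genre", "platform"])["title"]
    .count()
    .reset_index(name="n_titles")
)

# Keep only the combinations that occur more than once.
frequent = counts[counts["n_titles"] > 1]
print(frequent)
```

One design point: count() skips missing values in the counted column, whereas size() counts every row in the group, so the two can differ when that column contains NaN.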

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
