Count the frequency that a value occurs in a dataframe column

I have a dataset

category
cat a
cat b
cat a

I’d like to be able to return something like (showing unique values and frequency)

category   freq 
cat a       2
cat b       1

Contents hide

Answers:

Method 1

Method 2

Method 3

Method 4

Method 5

Method 6

Method 7

Method 8

Method 9

Method 10

Method 11

Method 12

Timers

Method 13

Method 14

Method 15

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

Use groupby and count:

In [37]:
df = pd.DataFrame({'a':list('abssbab')})
df.groupby('a').count()

Out[37]:

   a
a   
a  2
b  3
s  2

[3 rows x 1 columns]

See the online docs: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html

Also value_counts() as @DSM has commented, many ways to skin a cat here

In [38]:
df['a'].value_counts()

Out[38]:

b    3
a    2
s    2
dtype: int64

If you wanted to add frequency back to the original dataframe use transform to return an aligned index:

In [41]:
df['freq'] = df.groupby('a')['a'].transform('count')
df

Out[41]:

   a freq
0  a    2
1  b    3
2  s    2
3  s    2
4  b    3
5  a    2
6  b    3

[7 rows x 2 columns]

Method 2

If you want to apply to all columns you can use:

df.apply(pd.value_counts)

This will apply a column based aggregation function (in this case value_counts) to each of the columns.

Method 3

df.category.value_counts()

This short little line of code will give you the output you want.

If your column name has spaces you can use

df['category'].value_counts()

Method 4

df.apply(pd.value_counts).fillna(0)

value_counts – Returns object containing counts of unique values

apply – count frequency in every column. If you set axis=1, you get frequency in every row

fillna(0) – make output more fancy. Changed NaN to 0

Method 5

In 0.18.1 groupby together with count does not give the frequency of unique values:

>>> df
   a
0  a
1  b
2  s
3  s
4  b
5  a
6  b

>>> df.groupby('a').count()
Empty DataFrame
Columns: []
Index: [a, b, s]

However, the unique values and their frequencies are easily determined using size:

>>> df.groupby('a').size()
a
a    2
b    3
s    2

With df.a.value_counts() sorted values (in descending order, i.e. largest value first) are returned by default.

Method 6

Using list comprehension and value_counts for multiple columns in a df

[my_series[c].value_counts() for c in list(my_series.select_dtypes(include=['O']).columns)]

https://stackoverflow.com/a/28192263/786326

Method 7

If your DataFrame has values with the same type, you can also set return_counts=True in numpy.unique().

index, counts = np.unique(df.values,return_counts=True)

np.bincount() could be faster if your values are integers.

Method 8

As everyone said, the faster solution is to do:

df.column_to_analyze.value_counts()

But if you want to use the output in your dataframe, with this schema:

df input:

category
cat a
cat b
cat a

df output: 

category   counts
cat a        2
cat b        1 
cat a        2

you can do this:

df['counts'] = df.category.map(df.category.value_counts())
df

Method 9

Without any libraries, you could do this instead:

def to_frequency_table(data):
    frequencytable = {}
    for key in data:
        if key in frequencytable:
            frequencytable[key] += 1
        else:
            frequencytable[key] = 1
    return frequencytable

Example:

to_frequency_table([1,1,1,1,2,3,4,4])
>>> {1: 4, 2: 1, 3: 1, 4: 2}

Method 10

You can also do this with pandas by broadcasting your columns as categories first, e.g. dtype="category" e.g.

cats = ['client', 'hotel', 'currency', 'ota', 'user_country']

df[cats] = df[cats].astype('category')

and then calling describe:

df[cats].describe()

This will give you a nice table of value counts and a bit more :):

    client  hotel   currency    ota user_country
count   852845  852845  852845  852845  852845
unique  2554    17477   132 14  219
top 2198    13202   USD Hades   US
freq    102562  8847    516500  242734  340992

Method 11

I believe this should work fine for any DataFrame columns list.

def column_list(x):
    column_list_df = []
    for col_name in x.columns:
        y = col_name, len(x[col_name].unique())
        column_list_df.append(y)
return pd.DataFrame(column_list_df)

column_list_df.rename(columns={0: "Feature", 1: "Value_count"})

The function “column_list” checks the columns names and then checks the uniqueness of each column values.

Method 12

@metatoaster has already pointed this out.
Go for Counter. It’s blazing fast.

import pandas as pd
from collections import Counter
import timeit
import numpy as np

df = pd.DataFrame(np.random.randint(1, 10000, (100, 2)), columns=["NumA", "NumB"])

Timers

%timeit -n 10000 df['NumA'].value_counts()
# 10000 loops, best of 3: 715 µs per loop

%timeit -n 10000 df['NumA'].value_counts().to_dict()
# 10000 loops, best of 3: 796 µs per loop

%timeit -n 10000 Counter(df['NumA'])
# 10000 loops, best of 3: 74 µs per loop

%timeit -n 10000 df.groupby(['NumA']).count()
# 10000 loops, best of 3: 1.29 ms per loop

Cheers!

Method 13

The following code creates frequency table for the various values in a column called “Total_score” in a dataframe called “smaller_dat1”, and then returns the number of times the value “300” appears in the column.

valuec = smaller_dat1.Total_score.value_counts()
valuec.loc[300]

Method 14

n_values = data.income.value_counts()

First unique value count

n_at_most_50k = n_values[0]

Second unique value count

n_greater_50k = n_values[1]

n_values

Output:

<=50K    34014
>50K     11208

Name: income, dtype: int64

Output:

n_greater_50k,n_at_most_50k:-
(11208, 34014)

Method 15

your data:

|category|
cat a
cat b
cat a

solution:

 df['freq'] = df.groupby('category')['category'].transform('count')
 df =  df.drop_duplicates()

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes

Article Rating