Given a Pandas DataFrame that has multiple columns with categorical values (0 or 1), is it possible to conveniently get the value_counts for every column at the same time?
For example, suppose I generate a DataFrame as follows:
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 2, (10, 4)), columns=list('abcd'))
I can get a DataFrame like this:
   a  b  c  d
0  0  1  1  0
1  1  1  1  1
2  1  1  1  0
3  0  1  0  0
4  0  0  0  1
5  0  1  1  0
6  0  1  1  1
7  1  0  1  0
8  1  0  1  1
9  0  1  1  0
How do I get the value counts for every column in one step and obtain the following?
   a  b  c  d
0  6  3  2  6
1  4  7  8  4
My current solution is:
pieces = []
for col in df.columns:
    tmp_series = df[col].value_counts()
    tmp_series.name = col
    pieces.append(tmp_series)
df_value_counts = pd.concat(pieces, axis=1)
But there must be a simpler way, like stacking, pivoting, or groupby?
Answers:
Method 1
Just call apply and pass pd.Series.value_counts:
In [212]:
df = pd.DataFrame(np.random.randint(0, 2, (10, 4)), columns=list('abcd'))
df.apply(pd.Series.value_counts)
Out[212]:
a b c d
0 4 6 4 3
1 6 4 6 7
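As a quick check of this approach against the table in the question, the sketch below rebuilds the question's DataFrame from its printed rows (typed in by hand here, so it does not depend on the random seed) and counts each column. The names rows and counts are only illustrative.

```python
import pandas as pd

# Rows copied from the DataFrame printed in the question.
rows = [
    [0, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 0, 1, 1],
    [0, 1, 1, 0],
]
df = pd.DataFrame(rows, columns=list("abcd"))

# apply() calls value_counts on each column and aligns the results
# on the distinct values (0 and 1 here).
counts = df.apply(pd.Series.value_counts)
print(counts)
```

The row labelled 0 should read 6, 3, 2, 6 and the row labelled 1 should read 4, 7, 8, 4, which is the result the question asks for.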
Method 2
There is also a fairly interesting, more advanced way of solving this problem with crosstab and melt:
df = pd.DataFrame({'a': ['table', 'chair', 'chair', 'lamp', 'bed'],
                   'b': ['lamp', 'candle', 'chair', 'lamp', 'bed'],
                   'c': ['mirror', 'mirror', 'mirror', 'mirror', 'mirror']})
df
a b c
0 table lamp mirror
1 chair candle mirror
2 chair chair mirror
3 lamp lamp mirror
4 bed bed mirror
We can first melt the DataFrame:
df1 = df.melt(var_name='columns', value_name='index')
df1
   columns   index
0        a   table
1        a   chair
2        a   chair
3        a    lamp
4        a     bed
5        b    lamp
6        b  candle
7        b   chair
8        b    lamp
9        b     bed
10       c  mirror
11       c  mirror
12       c  mirror
13       c  mirror
14       c  mirror
And then use the crosstab function to count the values for each column. This keeps the counts as ints, which is not the case for the currently selected answer: there, a value that is missing from a column shows up as NaN and turns that column's counts into floats.
pd.crosstab(index=df1['index'], columns=df1['columns'])
columns  a  b  c
index
bed      1  1  0
candle   0  1  0
chair    2  1  0
lamp     1  2  0
mirror   0  0  5
table    1  0  0
Or in one line, which uses ** to expand the column names into parameter names (this is advanced):
pd.crosstab(**df.melt(var_name='columns', value_name='index'))
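The dtype remark above can be checked on a small case. The sketch below uses a made-up two-column frame in which the value 0 never occurs in column b; the names via_apply, via_apply_int and via_crosstab are illustrative only.

```python
import pandas as pd

# Small illustrative frame: the value 0 never appears in column "b".
df = pd.DataFrame({"a": [0, 1, 1], "b": [1, 1, 1]})

# Column-wise value_counts leaves a NaN where a value is absent,
# which forces the affected column to a float dtype.
via_apply = df.apply(pd.Series.value_counts)

# Filling the gaps with 0 lets the counts be cast back to integers.
via_apply_int = via_apply.fillna(0).astype(int)

# melt + crosstab fills absent combinations with 0 and stays integer.
long_form = df.melt(var_name="columns", value_name="index")
via_crosstab = pd.crosstab(index=long_form["index"], columns=long_form["columns"])

print(via_apply.dtypes)
print(via_crosstab.dtypes)
```

If integer counts are all that is needed, fillna(0).astype(int) after apply is a small alternative to switching to crosstab.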
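Also, value_counts is available as a top-level function, so the currently selected answer can be shortened to the following. Note that pd.value_counts is deprecated in recent pandas releases (2.1 and later), where pd.Series.value_counts remains the supported spelling: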
df.apply(pd.value_counts)
Method 3
To get the counts only for specific columns:
df[['a', 'b']].apply(pd.Series.value_counts)
where df is the name of your dataframe and ‘a’ and ‘b’ are the columns for which you want to count the values.
Method 4
This solution selects all categorical columns and builds a single DataFrame with all value counts at once:
df = pd.DataFrame({
    'fruits': ['apple', 'mango', 'apple', 'mango', 'mango', 'pear', 'mango'],
    'vegetables': ['cucumber', 'eggplant', 'tomato', 'tomato', 'tomato', 'tomato', 'pumpkin'],
    'sauces': ['chili', 'chili', 'ketchup', 'ketchup', 'chili', '1000 islands', 'chili']})
cat_cols = df.select_dtypes(include=object).columns.tolist()
# Name the counts explicitly: the default name of the value_counts result
# differs between pandas versions, so renaming column 0 is not reliable.
(df[cat_cols]
    .melt(var_name='column', value_name='value')
    .value_counts()
    .rename('counts')
    .to_frame()
    .sort_values(by=['column', 'counts']))
                         counts
column     value
fruits     pear               1
           apple              2
           mango              4
sauces     1000 islands       1
           ketchup            2
           chili              4
vegetables pumpkin            1
           eggplant           1
           cucumber           1
           tomato             4
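A similar long-format table can also be produced with groupby after melting, and then pivoted into one column per original column. This is only a sketch on a cut-down, made-up version of the data; long_counts and wide_counts are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "fruits": ["apple", "mango", "apple", "mango"],
    "sauces": ["chili", "chili", "ketchup", "chili"],
})

# Long format: one entry per (column, value) pair with its count.
long_counts = (
    df.melt(var_name="column", value_name="value")
      .groupby(["column", "value"])
      .size()
      .rename("counts")
)

# Wide format: one column per original column, 0 where a value never occurs.
wide_counts = long_counts.unstack("column", fill_value=0)

print(long_counts)
print(wide_counts)
```

The long form stays compact when the columns share few values, while the wide form is easier to compare across columns.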
Method 5
You can also try this code, where heart is the name of your DataFrame:
for i in heart.columns:
    x = heart[i].value_counts()
    print("Column name is:", i, "and its value counts are:", x)
Method 6
Your solution wrapped in one line looks even simpler than using groupby, stacking etc:
# keys= labels each result with its column name, whatever name value_counts gives the Series
pd.concat([df[column].value_counts() for column in df], axis=1, keys=df.columns)
Method 7
This is what worked for me:
for column in df.columns:
    print("\n" + column)
    print(df[column].value_counts())
Method 8
You can use a lambda function:
df.apply(lambda x: x.value_counts())
Method 9
I ran into this while looking for a better way of doing what I was already doing. It turns out that calling df.apply(pd.value_counts) on a DataFrame whose columns each have many distinct values of their own results in a pretty substantial performance hit.
In this case, it is better to simply iterate over the non-numeric columns in a dictionary comprehension, and leave it as a dictionary:
types_to_count = {"object", "category", "string"}
result = {
    col: df[col].value_counts()
    # compare dtype names as strings so they can match the names above
    for col in df.columns[df.dtypes.astype(str).isin(types_to_count)]
}
The filtering by types_to_count helps to ensure you don’t try to take the value_counts of continuous data.
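A variant of the same idea, sketched here on made-up data, lets select_dtypes do the filtering by excluding numeric columns instead of listing dtype names. The column names color, grade and price are hypothetical.

```python
import pandas as pd

# Illustrative frame mixing categorical-like and continuous columns.
df = pd.DataFrame({
    "color": ["red", "blue", "red", "red"],
    "grade": pd.Categorical(["low", "high", "low", "mid"]),
    "price": [1.5, 2.25, 3.0, 4.75],
})

# Keep every column that is not numeric, then count each one separately.
non_numeric = df.select_dtypes(exclude="number").columns
result = {col: df[col].value_counts() for col in non_numeric}

for col, counts in result.items():
    print(col, counts.to_dict())
```

Keeping the result as a dictionary avoids aligning many unrelated indexes into one wide, mostly empty DataFrame.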
Method 10
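Another possible solution:
df = pd.DataFrame(np.random.randint(0, 2, (10, 4)), columns=list('abcd'))
# start from an empty DataFrame; an empty Series would add a spurious extra column
l1 = pd.DataFrame()
for var in df.columns:
    # label each count Series with the column it came from
    l2 = df[var].value_counts().rename(var)
    l1 = pd.concat([l1, l2], axis=1)
l1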
Method 11
Sometimes the columns form a hierarchy, with one column subdividing another. In that case I recommend grouping by them and then counting:
# note: "_id" is whatever column you have to make the counts with len()
cat_cols = ['column_1', 'column_2']
df.groupby(cat_cols).agg(count=('_id', lambda x: len(x)))
                       count
column_1   column_2
category_1 Excelent       19
           Good           11
           Bad             1
category_2 Happy          48
           Good mood     158
           Serious        62
           Sad            10
           Depressed       8
Bonus: you can change len(x) to x.nunique() or other lambda functions you want.
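For plain row counts per group, groupby(...).size() should give the same numbers as the len lambda without needing a helper column. A minimal sketch, using made-up data and the placeholder column names column_1, column_2 and _id from the snippet above:

```python
import pandas as pd

# Made-up data; the column names mirror the placeholders used above.
df = pd.DataFrame({
    "_id": [1, 2, 3, 4, 5],
    "column_1": ["category_1", "category_1", "category_1", "category_2", "category_2"],
    "column_2": ["Good", "Good", "Bad", "Happy", "Sad"],
})
cat_cols = ["column_1", "column_2"]

# Named aggregation with len, as in the answer above.
with_len = df.groupby(cat_cols).agg(count=("_id", lambda x: len(x)))

# size() counts the rows in each group directly.
with_size = df.groupby(cat_cols).size().rename("count").to_frame()

print(with_size)
```

The named-aggregation form is still the one to use when the lambda is swapped for something like x.nunique().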
Method 12
Applying the value_counts function gave me unexpected results that were not very readable. This approach, by contrast, is simple and easy to read:
df[["col1", "col2", "col3"]].value_counts()
Here is an example of results if the cols have boolean values:
col1   col2   col3
False  False  False    1000
       True   False    1000
True   False  False    1000
              True     1000
       True   False    1000
              True     1000
dtype: int64
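Note that calling value_counts on a DataFrame counts each distinct combination of values across the selected columns, not each column on its own. The sketch below contrasts the two on made-up data, using "yes"/"no" strings in place of booleans; combo_counts and per_column are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["yes", "yes", "no", "yes"],
    "col2": ["no", "no", "no", "yes"],
})

# DataFrame.value_counts counts each distinct row,
# i.e. each combination of values across the columns.
combo_counts = df[["col1", "col2"]].value_counts()

# Per-column counts, for comparison.
per_column = df.apply(pd.Series.value_counts)

print(combo_counts)
print(per_column)
```

Which of the two is wanted depends on whether the columns should be read together or independently.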
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.