I want to get the percentage of each value in a df column. Say I have a df with columns (col1, col2, col3, gender), where the gender column has the values M, F, or Other. I want to get the percentage of M, F, and Other values in the df.
I have tried this, which gives me the number of M, F, and Other instances, but I want these as a percentage of the total number of values in the df.
df.groupby('gender').size()
Can someone help?
Answers:
Method 1
Use value_counts with normalize=True:
df['gender'].value_counts(normalize=True) * 100
With normalize=True, each value in the result is a fraction in the range (0, 1], and the fractions sum to 1. Multiplying by 100 converts them to percentages.
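As a minimal sketch of the approach above, the following uses a small made-up DataFrame (the ten sample rows are an assumption for illustration; only the column name gender comes from the question) and rounds the result to one decimal place:

```python
import pandas as pd

# Hypothetical sample data; only the column name 'gender' comes from the question.
df = pd.DataFrame(
    {"gender": ["M", "F", "F", "Other", "M", "F", "F", "M", "F", "F"]}
)

# normalize=True returns each value's share of the non-missing rows (shares sum to 1).
# Multiplying by 100 turns the shares into percentages; round(1) tidies the display.
pct = df["gender"].value_counts(normalize=True).mul(100).round(1)

print(pct)
```

By default value_counts sorts by frequency, so the most common value is listed first.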
Method 2
If you only need the M and F values from the gender column (and not from any other column), you can try value_counts() and count() as follows:
df = pd.DataFrame({'gender':['M','M','F', 'F', 'F']})
# Percentage calculation
(df['gender'].value_counts()/df['gender'].count())*100
Result:
F    60.0
M    40.0
Name: gender, dtype: float64
Or, using groupby:
(df.groupby('gender').size()/df['gender'].count())*100
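One detail worth checking with these variants is how missing values are treated: count() excludes NaN, and value_counts and groupby drop NaN by default, so the percentages are shares of the non-missing rows. The sketch below, on a made-up five-row frame with one missing entry (the data is an assumption for illustration), compares that with dropna=False, which reports NaN as its own category:

```python
import pandas as pd

# Hypothetical data with one missing gender value.
df = pd.DataFrame({"gender": ["M", "F", "F", None, "M"]})

# Share of non-missing rows: count() ignores NaN, and groupby drops NaN keys by default.
by_group = df.groupby("gender").size() / df["gender"].count() * 100

# Same figures via value_counts, which also drops NaN by default.
non_missing = df["gender"].value_counts(normalize=True) * 100

# Share of all rows, with NaN reported as its own category.
all_rows = df["gender"].value_counts(normalize=True, dropna=False) * 100

print(by_group)
print(non_missing)
print(all_rows)
```

Which denominator is right depends on whether missing entries should count toward the total.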
Method 3
Let’s say there are 200 values, of which 120 are categorized as M and 80 as F.
1)
df['gender'].value_counts()
output: M=120 F=80
2)
df['gender'].value_counts(normalize=True)
output: M=0.60 F=0.40
3)
df['gender'].value_counts(normalize=True)*100 # converts the output to percentages
output: M=60 F=40
Note that the keyword is lowercase: normalize=True. Writing Normalize=True raises a TypeError.
Method 4
Find the percentage of each class in the target column to check whether it is imbalanced.
g = data[Target_col_Y]
df = pd.concat([g.value_counts(),
g.value_counts(normalize=True).mul(100)],axis=1,keys=('counts','percentage'))
print (df)
   counts  percentage
0   36548   88.734583
1    4640   11.265417
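To see how large the imbalance is, take the row-to-row difference and read the maximum of the last (percentage) column. Because value_counts lists the most frequent class first, the difference here is negative (11.27 - 88.73), so its absolute value is the gap between the classes.
df1 = df.diff(periods=1, axis=0)
difvalue = df1[[list(df1.columns)[-1]]].max()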
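The steps in this method can be sketched end to end as follows. The target series y and its ten values are made up for illustration, and the gap is computed as max minus min of the percentage column rather than with diff, which is a swapped-in simplification and not the original answer's code:

```python
import pandas as pd

# Hypothetical binary target: eight rows of class 0 and two rows of class 1.
y = pd.Series([0] * 8 + [1] * 2, name="target")

# Put counts and percentages side by side; keys become the column names.
summary = pd.concat(
    [y.value_counts(), y.value_counts(normalize=True).mul(100)],
    axis=1,
    keys=("counts", "percentage"),
)

# Gap between the largest and smallest class share, in percentage points.
# This is always non-negative, unlike a raw row-to-row diff.
gap = summary["percentage"].max() - summary["percentage"].min()

print(summary)
print(gap)
```

A gap near 0 suggests balanced classes, while a large gap points to imbalance. What counts as "large" depends on the task.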
Method 5
If the Gender column is coded as 0 (male) and 1 (female), its mean is the share of females:
print('(Gender Male=0):\n{}%'.format(100 - round(df['Gender'].mean()*100, 2)))
print('(Gender Female=1):\n{}%'.format(round(df['Gender'].mean()*100, 2)))
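A minimal runnable sketch of this idea, assuming Gender holds only the values 0 and 1 with no missing entries (the five sample rows are made up for illustration):

```python
import pandas as pd

# Hypothetical data: 0 = male, 1 = female.
df = pd.DataFrame({"Gender": [0, 1, 1, 0, 1]})

# The mean of a 0/1 column is the fraction of rows equal to 1.
female_pct = round(df["Gender"].mean() * 100, 2)
male_pct = round(100 - female_pct, 2)

print("(Gender Male=0):\n{}%".format(male_pct))
print("(Gender Female=1):\n{}%".format(female_pct))
```

This shortcut only works for a two-valued 0/1 coding. For a column with M, F, and Other, value_counts(normalize=True) is the more general choice.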
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.