I am trying to groupby a dataframe with 4000+ rows by ‘summing’ a column called ‘From Email Address’.
sum_df = df.groupby(['From Email Address' ], as_index=False).agg({'Name':'first', 'From Email Address':'first',
'Country':'first', 'Subject':'first','Gross': 'sum'}) #grouped by dataframe'
An ‘From email address’ called ‘[email protected]’ has 200+ entries(rows) in the file. The value returned for that particular row is peculiar:
-10-10-10-49-10-8-30-5.8-79-90-10-10-49-10-468-7-90-10-79-5-10-6.66-1,500.00-49-10-10-8-49-4.99-10-10-10-2-4.74-49-49-90-7.5-35-10-90-7.5-7.5-17.15-10-9-42.5-10-10-468-49-33.2-49-10-15.8-10-3.22-49-5-140.4-6.13-10-7.75-10-51.35-10-49-90-3.22-90-10-5-2.9-6-79-74-6.66-49-468-10-10-49-8.38-10-2.6-149-250-10-31.6-10-49-7-10-10-5-36.5-66.2-49-3-35.7-10-10-10-5-49-5-10-49-10-10-79-406.38-29.5-10-49-49-90-10-10-10-49-50-4.7-25.48-10-3.22-10-10-10-3.22-1.6-49-49-49-49-49
whereas all other rows outputs correct value.
Can anyone please give insight about how to solve this?
Thanks!
I tried:
sum_df2 = df.groupby('From Email Address', as_index=False)['Gross'].sum()
The same value returned for this too.
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
It appears that the numbers are being stored as strings. For example, there are fragments like 1,500.00.
Because of the way the + operator acts on strings (concatenation), the summation is concatenating a bunch of them.
The solution is to make sure you first convert your data to numeric types.
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0