Say my data looks like this:
date,name,id,dept,sale1,sale2,sale3,total_sale
1/1/17,John,50,Sales,50.0,60.0,70.0,180.0
1/1/17,Mike,21,Engg,43.0,55.0,2.0,100.0
1/1/17,Jane,99,Tech,90.0,80.0,70.0,240.0
1/2/17,John,50,Sales,60.0,70.0,80.0,210.0
1/2/17,Mike,21,Engg,53.0,65.0,12.0,130.0
1/2/17,Jane,99,Tech,100.0,90.0,80.0,270.0
1/3/17,John,50,Sales,40.0,50.0,60.0,150.0
1/3/17,Mike,21,Engg,53.0,55.0,12.0,120.0
1/3/17,Jane,99,Tech,80.0,70.0,60.0,210.0
I want a new column, average, that holds the mean of total_sale for each (name, id, dept) tuple.
I tried:
df.groupby(['name', 'id', 'dept'])['total_sale'].mean()
And this does return a series with the mean:
name  id  dept
Jane  99  Tech     240.000000
John  50  Sales    180.000000
Mike  21  Engg     116.666667
Name: total_sale, dtype: float64
But how would I reference the data? The series is one-dimensional, with shape (3,). Ideally I would like this put back into a dataframe with proper columns so I can look values up by name/id/dept.
Answers:
Method 1
If you call .reset_index() on the series that you have, it will get you a dataframe like you want (each level of the index will be converted into a column):
df.groupby(['name', 'id', 'dept'])['total_sale'].mean().reset_index()
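As a small sketch of what this gives you, the snippet below rebuilds a trimmed-down copy of the question's data (only the columns needed here, which is a simplification) and looks up one group's mean by ordinary column names. The variable names means and jane_mean are made up for the illustration.

```python
import pandas as pd

# Trimmed-down copy of the question's data (only the columns used here).
df = pd.DataFrame({
    "name": ["John", "Mike", "Jane"] * 3,
    "id": [50, 21, 99] * 3,
    "dept": ["Sales", "Engg", "Tech"] * 3,
    "total_sale": [180.0, 100.0, 240.0, 210.0, 130.0, 270.0, 150.0, 120.0, 210.0],
})

# reset_index() turns each level of the group index into a regular column.
means = df.groupby(["name", "id", "dept"])["total_sale"].mean().reset_index()

# The group keys are now ordinary columns, so boolean filtering works as usual.
jane_mean = means.loc[means["name"] == "Jane", "total_sale"].iloc[0]

print(list(means.columns))
print(jane_mean)
```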
EDIT: to respond to the OP’s comment, adding this column back to your original dataframe is a little trickier. The grouped result has fewer rows than the original dataframe, so you can’t assign it directly as a new column. However, if you give the original dataframe the same index, pandas aligns on that index during assignment and fills in the values for you. Try this:
import pandas as pd

cols = ['date', 'name', 'id', 'dept', 'sale1', 'sale2', 'sale3', 'total_sale']
data = [
    ['1/1/17', 'John', 50, 'Sales', 50.0, 60.0, 70.0, 180.0],
    ['1/1/17', 'Mike', 21, 'Engg', 43.0, 55.0, 2.0, 100.0],
    ['1/1/17', 'Jane', 99, 'Tech', 90.0, 80.0, 70.0, 240.0],
    ['1/2/17', 'John', 50, 'Sales', 60.0, 70.0, 80.0, 210.0],
    ['1/2/17', 'Mike', 21, 'Engg', 53.0, 65.0, 12.0, 130.0],
    ['1/2/17', 'Jane', 99, 'Tech', 100.0, 90.0, 80.0, 270.0],
    ['1/3/17', 'John', 50, 'Sales', 40.0, 50.0, 60.0, 150.0],
    ['1/3/17', 'Mike', 21, 'Engg', 53.0, 55.0, 12.0, 120.0],
    ['1/3/17', 'Jane', 99, 'Tech', 80.0, 70.0, 60.0, 210.0]
]
df = pd.DataFrame(data, columns=cols)

# Group means, indexed by (name, id, dept). Don't reset the index!
mean_col = df.groupby(['name', 'id', 'dept'])['total_sale'].mean()
df = df.set_index(['name', 'id', 'dept'])  # make the same index here
df['mean_col'] = mean_col  # values are aligned on the shared index
df = df.reset_index()  # to take the hierarchical index off again
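A shorter alternative to the index-alignment approach above, which is not part of the original answer, is groupby(...).transform('mean'). It returns one value per original row, so the result can be assigned directly. The sketch below uses a trimmed-down copy of the question's data, and the column name average is taken from the question's wording.

```python
import pandas as pd

# Trimmed-down copy of the question's data (only the columns used here).
df = pd.DataFrame({
    "date": ["1/1/17"] * 3 + ["1/2/17"] * 3 + ["1/3/17"] * 3,
    "name": ["John", "Mike", "Jane"] * 3,
    "id": [50, 21, 99] * 3,
    "dept": ["Sales", "Engg", "Tech"] * 3,
    "total_sale": [180.0, 100.0, 240.0, 210.0, 130.0, 270.0, 150.0, 120.0, 210.0],
})

# transform("mean") broadcasts each group's mean back to every row of that
# group, so the result has the same length and index as df.
df["average"] = df.groupby(["name", "id", "dept"])["total_sale"].transform("mean")

print(df[["date", "name", "total_sale", "average"]])
```

Because the result is already aligned with the original rows, no set_index/reset_index round trip is needed.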
Method 2
Add .to_frame() to convert the series into a one-column dataframe (the group keys stay in the index):
df.groupby(['name', 'id', 'dept'])['total_sale'].mean().to_frame()
Method 3
You are very close. You simply need to wrap the column name in a second set of brackets, [['total_sale']], to tell pandas to select a dataframe rather than a series:
df.groupby(['name', 'id', 'dept'])[['total_sale']].mean()
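If you want the group keys back as ordinary columns, pass as_index=False (numeric_only=True skips the non-numeric date column, which recent pandas versions otherwise refuse to average):
df.groupby(['name', 'id', 'dept'], as_index=False).mean(numeric_only=True)[['name', 'id', 'dept', 'total_sale']]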
Method 4
The answer is in two lines of code:
The first line creates a frame indexed hierarchically by name, id, and dept:
df_mean = df.groupby(['name', 'id', 'dept'])[['total_sale']].mean()
The second line converts it to a dataframe with four columns (‘name’, ‘id’, ‘dept’, ‘total_sale’):
df_mean = df_mean.reset_index()
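If the goal is to get the mean next to every original row, one possible follow-up (an addition here, not part of the original answer) is to merge the four-column frame back onto the original data. The sketch assumes a trimmed-down copy of the question's data; the names keys and merged, and the renamed column average, are illustrative.

```python
import pandas as pd

# Trimmed-down copy of the question's data (only the columns used here).
df = pd.DataFrame({
    "date": ["1/1/17"] * 3 + ["1/2/17"] * 3 + ["1/3/17"] * 3,
    "name": ["John", "Mike", "Jane"] * 3,
    "id": [50, 21, 99] * 3,
    "dept": ["Sales", "Engg", "Tech"] * 3,
    "total_sale": [180.0, 100.0, 240.0, 210.0, 130.0, 270.0, 150.0, 120.0, 210.0],
})

keys = ["name", "id", "dept"]

# Same two steps as above: group means, then keys back as columns.
df_mean = df.groupby(keys)[["total_sale"]].mean().reset_index()

# Rename so the mean does not clash with the original total_sale column.
df_mean = df_mean.rename(columns={"total_sale": "average"})

# Left merge keeps every original row and attaches its group's mean.
merged = df.merge(df_mean, on=keys, how="left")

print(merged)
```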
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.