Pandas groupby mean – into a dataframe?

Say my data looks like this:

date,name,id,dept,sale1,sale2,sale3,total_sale
1/1/17,John,50,Sales,50.0,60.0,70.0,180.0
1/1/17,Mike,21,Engg,43.0,55.0,2.0,100.0
1/1/17,Jane,99,Tech,90.0,80.0,70.0,240.0
1/2/17,John,50,Sales,60.0,70.0,80.0,210.0
1/2/17,Mike,21,Engg,53.0,65.0,12.0,130.0
1/2/17,Jane,99,Tech,100.0,90.0,80.0,270.0
1/3/17,John,50,Sales,40.0,50.0,60.0,150.0
1/3/17,Mike,21,Engg,53.0,55.0,12.0,120.0
1/3/17,Jane,99,Tech,80.0,70.0,60.0,210.0

I want a new column average, which is the average of total_sale for each name,id,dept tuple

I tried

df.groupby(['name', 'id', 'dept'])['total_sale'].mean()

And this does return a series with the mean:

name  id  dept 
Jane  99  Tech     240.000000
John  50  Sales    180.000000
Mike  21  Engg     116.666667
Name: total_sale, dtype: float64

but how would I reference the data? The series is a one dimensional one of shape (3,). Ideally I would like this put back into a dataframe with proper columns so I can reference properly by name/id/dept.

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

If you call .reset_index() on the series that you have, it will get you a dataframe like you want (each level of the index will be converted into a column):

df.groupby(['name', 'id', 'dept'])['total_sale'].mean().reset_index()

EDIT: to respond to the OP’s comment, adding this column back to your original dataframe is a little trickier. You don’t have the same number of rows as in the original dataframe, so you can’t assign it as a new column yet. However, if you set the index the same, pandas is smart and will fill in the values properly for you. Try this:

cols = ['date','name','id','dept','sale1','sale2','sale3','total_sale']
data = [
['1/1/17', 'John', 50, 'Sales', 50.0, 60.0, 70.0, 180.0],
['1/1/17', 'Mike', 21, 'Engg', 43.0, 55.0, 2.0, 100.0],
['1/1/17', 'Jane', 99, 'Tech', 90.0, 80.0, 70.0, 240.0],
['1/2/17', 'John', 50, 'Sales', 60.0, 70.0, 80.0, 210.0],
['1/2/17', 'Mike', 21, 'Engg', 53.0, 65.0, 12.0, 130.0],
['1/2/17', 'Jane', 99, 'Tech', 100.0, 90.0, 80.0, 270.0],
['1/3/17', 'John', 50, 'Sales', 40.0, 50.0, 60.0, 150.0],
['1/3/17', 'Mike', 21, 'Engg', 53.0, 55.0, 12.0, 120.0],
['1/3/17', 'Jane', 99, 'Tech', 80.0, 70.0, 60.0, 210.0]
]
df = pd.DataFrame(data, columns=cols)

mean_col = df.groupby(['name', 'id', 'dept'])['total_sale'].mean() # don't reset the index!
df = df.set_index(['name', 'id', 'dept']) # make the same index here
df['mean_col'] = mean_col
df = df.reset_index() # to take the hierarchical index off again

Method 2

Adding to_frame

df.groupby(['name', 'id', 'dept'])['total_sale'].mean().to_frame()

Method 3

You are very close. You simply need to add a set of brackets around [['total_sale']] to tell python to select as a dataframe and not a series:

df.groupby(['name', 'id', 'dept'])[['total_sale']].mean()

If you want all columns:

df.groupby(['name', 'id', 'dept'], as_index=False).mean()[['name', 'id', 'dept', 'total_sale']]

Method 4

The answer is in two lines of code:

The first line creates the hierarchical frame.

df_mean = df.groupby(['name', 'id', 'dept'])[['total_sale']].mean()

The second line converts it to a dataframe with four columns(‘name’, ‘id’, ‘dept’, ‘total_sale’)

df_mean = df_mean.reset_index()


All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes
Article Rating
Subscribe
Notify of
guest

0 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
0
Would love your thoughts, please comment.x
()
x