What is the best way to do a groupby on a Pandas dataframe, but exclude some columns from that groupby? e.g. I have the following dataframe:
Code Country Item_Code Item Ele_Code Unit Y1961 Y1962 Y1963
2 Afghanistan 15 Wheat 5312 Ha 10 20 30
2 Afghanistan 25 Maize 5312 Ha 10 20 30
4 Angola 15 Wheat 7312 Ha 30 40 50
4 Angola 25 Maize 7312 Ha 30 40 50
I want to groupby the column Country and Item_Code and only compute the sum of the rows falling under the columns Y1961, Y1962 and Y1963. The resulting dataframe should look like this:
Code Country Item_Code Item Ele_Code Unit Y1961 Y1962 Y1963
2 Afghanistan 15 C3 5312 Ha 20 40 60
4 Angola 25 C4 7312 Ha 60 80 100
Right now I am doing this:
df.groupby('Country').sum()
However this adds up the values in the Item_Code column as well. Is there any way I can specify which columns to include in the sum() operation and which ones to exclude?
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
You can select the columns of a groupby:
In [11]: df.groupby(['Country', 'Item_Code'])[["Y1961", "Y1962", "Y1963"]].sum()
Out[11]:
Y1961 Y1962 Y1963
Country Item_Code
Afghanistan 15 10 20 30
25 10 20 30
Angola 15 30 40 50
25 30 40 50
Note that the list passed must be a subset of the columns otherwise you’ll see a KeyError.
Method 2
The agg function will do this for you. Pass the columns and function as a dict with column, output:
df.groupby(['Country', 'Item_Code']).agg({'Y1961': np.sum, 'Y1962': [np.sum, np.mean]}) # Added example for two output columns from a single input column
This will display only the group by columns, and the specified aggregate columns. In this example I included two agg functions applied to ‘Y1962’.
To get exactly what you hoped to see, included the other columns in the group by, and apply sums to the Y variables in the frame:
df.groupby(['Code', 'Country', 'Item_Code', 'Item', 'Ele_Code', 'Unit']).agg({'Y1961': np.sum, 'Y1962': np.sum, 'Y1963': np.sum})
Method 3
If you are looking for a more generalized way to apply to many columns, what you can do is to build a list of column names and pass it as the index of the grouped dataframe. In your case, for example:
columns = ['Y'+str(i) for year in range(1967, 2011)]
df.groupby('Country')[columns].agg('sum')
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0