I’m starting with input data like this:
df1 = pandas.DataFrame( { "Name" : ["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"] , "City" : ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"] } )
Which when printed appears as this:
       City     Name
0   Seattle    Alice
1   Seattle      Bob
2  Portland  Mallory
3   Seattle  Mallory
4   Seattle      Bob
5  Portland  Mallory
Grouping is simple enough:
g1 = df1.groupby( [ "Name", "City"] ).count()
and printing yields a GroupBy object:
                  City  Name
Name    City
Alice   Seattle      1     1
Bob     Seattle      2     2
Mallory Portland     2     2
        Seattle      1     1
But what I want eventually is another DataFrame object that contains all the rows in the GroupBy object. In other words I want to get the following result:
                  City  Name
Name    City
Alice   Seattle      1     1
Bob     Seattle      2     2
Mallory Portland     2     2
Mallory Seattle      1     1
I can’t quite see how to accomplish this in the pandas documentation. Any hints would be welcome.
Answers:
Method 1
g1 here is a DataFrame. It has a hierarchical index, though:
In [19]: type(g1)
Out[19]: pandas.core.frame.DataFrame

In [20]: g1.index
Out[20]:
MultiIndex([('Alice', 'Seattle'), ('Bob', 'Seattle'), ('Mallory', 'Portland'),
       ('Mallory', 'Seattle')], dtype=object)
Perhaps you want something like this?
In [21]: g1.add_suffix('_Count').reset_index()
Out[21]:
      Name      City  City_Count  Name_Count
0    Alice   Seattle           1           1
1      Bob   Seattle           2           2
2  Mallory  Portland           2           2
3  Mallory   Seattle           1           1
Or something like:
In [36]: DataFrame({'count' : df1.groupby( [ "Name", "City"] ).size()}).reset_index()
Out[36]:
      Name      City  count
0    Alice   Seattle      1
1      Bob   Seattle      2
2  Mallory  Portland      2
3  Mallory   Seattle      1
Method 2
I want to slightly change the answer given by Wes, because version 0.16.2 requires as_index=False. If you don’t set it, you get an empty dataframe.
Aggregation functions will not return the groups that you are aggregating over if they are named columns, when as_index=True, the default. The grouped columns will be the indices of the returned object.

Passing as_index=False will return the groups that you are aggregating over, if they are named columns.

Aggregating functions are ones that reduce the dimension of the returned objects, for example: mean, sum, size, count, std, var, sem, describe, first, last, nth, min, max. This is what happens when you do for example DataFrame.sum() and get back a Series.

nth can act as a reducer or a filter.
import pandas as pd

df1 = pd.DataFrame({"Name":["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"],
                    "City":["Seattle","Seattle","Portland","Seattle","Seattle","Portland"]})
print(df1)
#
#       City     Name
#0   Seattle    Alice
#1   Seattle      Bob
#2  Portland  Mallory
#3   Seattle  Mallory
#4   Seattle      Bob
#5  Portland  Mallory
#
g1 = df1.groupby(["Name", "City"], as_index=False).count()
print(g1)
#
#                  City  Name
#Name    City
#Alice   Seattle      1     1
#Bob     Seattle      2     2
#Mallory Portland     2     2
#        Seattle      1     1
#
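The effect of as_index is easier to see when the frame has a column that is not one of the group keys. Here is a minimal sketch, assuming a recent pandas version; the "Sales" column and its values are made up for illustration and are not part of the question.

```python
import pandas as pd

# Hypothetical data: "Sales" is an illustrative value column, not from the question.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Mallory", "Mallory", "Bob", "Mallory"],
    "City": ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"],
    "Sales": [10, 20, 30, 40, 50, 60],
})

# Default (as_index=True): the group keys become a MultiIndex.
indexed = df.groupby(["Name", "City"]).sum()

# as_index=False: the group keys stay as ordinary columns.
flat = df.groupby(["Name", "City"], as_index=False).sum()

print(indexed.index.names)  # group keys live in the index
print(list(flat.columns))   # group keys live in the columns
```

Calling indexed.reset_index() should give the same flat layout as passing as_index=False.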
EDIT:
In version 0.17.1 and later you can use subset in count and reset_index with parameter name in size:
print(df1.groupby(["Name", "City"], as_index=False ).count())
#IndexError: list index out of range

print(df1.groupby(["Name", "City"]).count())
#Empty DataFrame
#Columns: []
#Index: [(Alice, Seattle), (Bob, Seattle), (Mallory, Portland), (Mallory, Seattle)]

print(df1.groupby(["Name", "City"])[['Name','City']].count())
#                  Name  City
#Name    City
#Alice   Seattle      1     1
#Bob     Seattle      2     2
#Mallory Portland     2     2
#        Seattle      1     1

print(df1.groupby(["Name", "City"]).size().reset_index(name='count'))
#      Name      City  count
#0    Alice   Seattle      1
#1      Bob   Seattle      2
#2  Mallory  Portland      2
#3  Mallory   Seattle      1
The difference between count and size is that size counts NaN values while count does not.
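A small sketch of that difference, assuming a hypothetical "Score" column with one missing value (the data is invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data: "Score" has one missing value in Bob's group.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Bob"],
    "Score": [1.0, np.nan, 3.0],
})

grouped = df.groupby("Name")

# size() counts rows per group, NaN included.
sizes = grouped.size().reset_index(name="size")

# count() counts only the non-NaN values of the selected column.
counts = grouped["Score"].count().reset_index(name="count")

print(sizes)
print(counts)
```

Bob's group has two rows but only one non-missing Score, so the two results differ for that group only.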
Method 3
The key is to use the reset_index() method.
Use:
import pandas

df1 = pandas.DataFrame( { "Name" : ["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"] ,
                          "City" : ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"] } )
g1 = df1.groupby( [ "Name", "City"] ).count().reset_index()
Now you have your new DataFrame in g1.
Method 4
Simply, this should do the task:
import pandas as pd

grouped_df = df1.groupby( [ "Name", "City"] )
pd.DataFrame(grouped_df.size().reset_index(name = "Group_Count"))
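Here, grouped_df.size() returns the number of rows in each group, and reset_index(name = "Group_Count") turns the group keys back into ordinary columns and gives the count column the name you want.
Finally, the pandas DataFrame() constructor is called to create a DataFrame object (reset_index() already returns a DataFrame, so this last step is optional).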
Method 5
Maybe I misunderstand the question but if you want to convert the groupby back to a dataframe you can use .to_frame(). I wanted to reset the index when I did this so I included that part as well.
example code unrelated to question
df = df['TIME'].groupby(df['Name']).min()
df = df.to_frame()
# After to_frame(), "Name" is the index and "TIME" is already a column,
# so only the index needs resetting.
df = df.reset_index()
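To make the snippet above reproducible, here is a minimal sketch with made-up data. The column names "Name" and "TIME" mirror the snippet; the values are hypothetical.

```python
import pandas as pd

# Hypothetical data; only the column names come from the snippet above.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Bob"],
    "TIME": [5, 3, 2, 7],
})

# Grouping a single column returns a Series indexed by the group key.
earliest = df["TIME"].groupby(df["Name"]).min()

# to_frame() turns the Series into a one-column DataFrame;
# reset_index() then moves "Name" out of the index into a column.
result = earliest.to_frame().reset_index()

print(result)
```

As far as I can tell, calling earliest.reset_index() directly on the Series gives the same DataFrame, so to_frame() is mostly useful when you want to keep "Name" as the index.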
Method 6
I found this worked for me.
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "Name" : ["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"] ,
    "City" : ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"]})

df1['City_count'] = 1
df1['Name_count'] = 1

df1.groupby(['Name', 'City'], as_index=False).count()
Method 7
Below solution may be simpler:
df1.reset_index().groupby( [ "Name", "City"],as_index=False ).count()
Method 8
I summed the Qty values for each group and stored the result in a DataFrame:
almo_grp_data = pd.DataFrame({
    'Qty_cnt' : almo_slt_models_data.groupby( ['orderDate','Item','State Abv'] )['Qty'].sum()
}).reset_index()
Method 9
These solutions only partially worked for me because I was doing multiple aggregations. Here is a sample output of my grouped by that I wanted to convert to a dataframe:
Because I wanted more than the count provided by reset_index(), I wrote a manual method for converting the image above into a dataframe. I understand this is not the most pythonic/pandas way of doing this as it is quite verbose and explicit, but it was all I needed. Basically, use the reset_index() method explained above to start a “scaffolding” dataframe, then loop through the group pairings in the grouped dataframe, retrieve the indices, perform your calculations against the ungrouped dataframe, and set the value in your new aggregated dataframe.
df_grouped = df[['Salary Basis', 'Job Title', 'Hourly Rate', 'Male Count', 'Female Count']]
# Keep the default as_index=True so that size() returns a Series,
# which to_frame() below can convert (with as_index=False, newer pandas
# versions return a DataFrame from size() and to_frame() fails).
df_grouped = df_grouped.groupby(['Salary Basis', 'Job Title'])

# Grouped gives us the indices we want for each grouping
# We cannot convert a groupedby object back to a dataframe, so we need to do it manually
# Create a new dataframe to work against
df_aggregated = df_grouped.size().to_frame('Total Count').reset_index()
df_aggregated['Male Count'] = 0
df_aggregated['Female Count'] = 0
df_aggregated['Job Rate'] = 0

def manualAggregations(indices_array):
    temp_df = df.iloc[indices_array]
    return {
        'Male Count': temp_df['Male Count'].sum(),
        'Female Count': temp_df['Female Count'].sum(),
        'Job Rate': temp_df['Hourly Rate'].max()
    }

for name, group in df_grouped:
    ix = df_grouped.indices[name]
    calcDict = manualAggregations(ix)

    for key in calcDict:
        #Salary Basis, Job Title
        columns = list(name)
        df_aggregated.loc[(df_aggregated['Salary Basis'] == columns[0]) &
                          (df_aggregated['Job Title'] == columns[1]), key] = calcDict[key]
If a dictionary isn’t your thing, the calculations could be applied inline in the for loop:
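# Assign through a single .loc call; chained indexing
# (df_aggregated['Male Count'].loc[...] = ...) may write to a copy.
df_aggregated.loc[(df_aggregated['Salary Basis'] == columns[0]) &
                  (df_aggregated['Job Title'] == columns[1]), 'Male Count'] = df['Male Count'].iloc[ix].sum()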
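As an alternative to the manual loop (not the approach used in this answer), pandas named aggregation can compute several aggregations in one groupby call. This is a sketch assuming pandas 0.25 or later; the column names are taken from the answer above, but the data is invented.

```python
import pandas as pd

# Hypothetical data using the column names from the answer above.
df = pd.DataFrame({
    "Salary Basis": ["Hourly", "Hourly", "Salaried", "Hourly"],
    "Job Title": ["Clerk", "Clerk", "Analyst", "Driver"],
    "Hourly Rate": [15.0, 17.5, 40.0, 20.0],
    "Male Count": [1, 0, 1, 1],
    "Female Count": [0, 1, 0, 0],
})

# Named aggregation: each entry is output_column=(input_column, function).
# The dict is unpacked with ** because the output names contain spaces.
df_aggregated = (
    df.groupby(["Salary Basis", "Job Title"])
      .agg(**{
          "Total Count": ("Hourly Rate", "size"),
          "Male Count": ("Male Count", "sum"),
          "Female Count": ("Female Count", "sum"),
          "Job Rate": ("Hourly Rate", "max"),
      })
      .reset_index()  # move the group keys back into ordinary columns
)

print(df_aggregated)
```

The result is a plain DataFrame with one row per (Salary Basis, Job Title) pair, which is the same shape the loop builds by hand.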
Method 10
grouped = df.groupby(['Team', 'Year'])['W'].count().reset_index()
team_wins_df = pd.DataFrame(grouped)
team_wins_df = team_wins_df.rename({'W': 'Wins'}, axis=1)
team_wins_df['Wins'] = team_wins_df['Wins'].astype(np.int32)
team_wins_df.reset_index()
print(team_wins_df)
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.