I have a pandas dataframe with several rows that are near duplicates of each other, except for one value. My goal is to merge or “coalesce” these rows into a single row, without summing the numerical values.
Here is an example of what I’m working with:
Name  Sid   Use_Case  Revenue
A     xx01  Voice     $10.00
A     xx01  SMS       $10.00
B     xx02  Voice     $5.00
C     xx03  Voice     $15.00
C     xx03  SMS       $15.00
C     xx03  Video     $15.00
And here is what I would like:
Name  Sid   Use_Case           Revenue
A     xx01  Voice, SMS         $10.00
B     xx02  Voice              $5.00
C     xx03  Voice, SMS, Video  $15.00
I don’t want to sum the “Revenue” column because my table is the result of a pivot over several time periods, where “Revenue” simply ends up listed multiple times instead of having a different value per “Use_Case”.
What would be the best way to tackle this issue? I’ve looked into the groupby() function but I still don’t understand it very well.
Answers:
Method 1
I think you can use groupby with agg, taking 'first' for the columns that repeat (Sid and Revenue) and the custom function ', '.join for Use_Case:
df = df.groupby('Name').agg({'Sid': 'first',
                             'Use_Case': ', '.join,
                             'Revenue': 'first'}).reset_index()
# change column order
print(df[['Name', 'Sid', 'Use_Case', 'Revenue']])
Name Sid Use_Case Revenue
0 A xx01 Voice, SMS $10.00
1 B xx02 Voice $5.00
2 C xx03 Voice, SMS, Video $15.00
A nice idea from the comments (thanks, Goyo): group by every column that repeats and join only Use_Case:
df = df.groupby(['Name', 'Sid', 'Revenue'])['Use_Case'].apply(', '.join).reset_index()
# change column order
print(df[['Name', 'Sid', 'Use_Case', 'Revenue']])
Name Sid Use_Case Revenue
0 A xx01 Voice, SMS $10.00
1 B xx02 Voice $5.00
2 C xx03 Voice, SMS, Video $15.00
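For anyone who wants to try this without building the table by hand, here is a minimal self-contained sketch using the sample data from the question. The variable name merged and the sort=False argument are my own choices, not part of the original answer; sort=False just keeps the groups in the order they first appear.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Name": ["A", "A", "B", "C", "C", "C"],
    "Sid": ["xx01", "xx01", "xx02", "xx03", "xx03", "xx03"],
    "Use_Case": ["Voice", "SMS", "Voice", "Voice", "SMS", "Video"],
    "Revenue": ["$10.00", "$10.00", "$5.00", "$15.00", "$15.00", "$15.00"],
})

# Group on the columns that repeat and join only the column that differs.
merged = (
    df.groupby(["Name", "Sid", "Revenue"], sort=False)["Use_Case"]
      .agg(", ".join)
      .reset_index()
)

# Restore the original column order.
merged = merged[["Name", "Sid", "Use_Case", "Revenue"]]
print(merged)
```

Because Revenue is part of the grouping key, it is never aggregated, so nothing gets summed.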
Method 2
You can groupby and apply the list function:
>>> df['Use_Case'].groupby([df.Name, df.Sid, df.Revenue]).apply(list).reset_index()
Name Sid Revenue 0
0 A xx01 $10.00 [Voice, SMS]
1 B xx02 $5.00 [Voice]
2 C xx03 $15.00 [Voice, SMS, Video]
(In case you are concerned about duplicates, use set instead of list.)
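One caveat with set is that it does not keep the original order of the values. If order matters, a possible sketch is to deduplicate with dict.fromkeys, which keeps the first occurrence of each value. The helper join_unique and the repeated "Voice" row below are made up for illustration.

```python
import pandas as pd

# Made-up data: "Voice" appears twice for Name A.
df = pd.DataFrame({
    "Name": ["A", "A", "A", "B"],
    "Sid": ["xx01", "xx01", "xx01", "xx02"],
    "Use_Case": ["Voice", "SMS", "Voice", "Voice"],
    "Revenue": ["$10.00", "$10.00", "$10.00", "$5.00"],
})

def join_unique(values):
    # dict.fromkeys drops repeats while keeping first-seen order.
    return ", ".join(dict.fromkeys(values))

deduped = (
    df.groupby(["Name", "Sid", "Revenue"], sort=False)["Use_Case"]
      .agg(join_unique)
      .reset_index()
)
print(deduped)
```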
Method 3
I was using some code that I didn’t think was optimal and eventually found jezrael’s answer. But after using it and running a timeit test, I actually went back to what I was doing, which was:
cmnts = {}
for i, row in df.iterrows():
    while True:
        try:
            if row['Use_Case']:
                cmnts[row['Name']].append(row['Use_Case'])
            else:
                cmnts[row['Name']].append('n/a')
            break
        except KeyError:
            # first time this Name is seen: create its list, then retry
            cmnts[row['Name']] = []
df.drop_duplicates('Name', inplace=True)
df['Use_Case'] = ['; '.join(v) for v in cmnts.values()]
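The same idea can be written a little more compactly. The following is only a sketch of the dictionary approach, not the code that was timed: it assumes collections.defaultdict in place of the try/except retry loop, and it looks the joined strings up by Name instead of relying on the dictionary order matching the row order. The names collected and result are hypothetical.

```python
from collections import defaultdict

import pandas as pd

# Sample data based on the question, with one empty Use_Case added for illustration.
df = pd.DataFrame({
    "Name": ["A", "A", "B", "C", "C", "C"],
    "Sid": ["xx01", "xx01", "xx02", "xx03", "xx03", "xx03"],
    "Use_Case": ["Voice", "SMS", "", "Voice", "SMS", "Video"],
    "Revenue": ["$10.00", "$10.00", "$5.00", "$15.00", "$15.00", "$15.00"],
})

# Collect the use cases per name; empty values become 'n/a'.
collected = defaultdict(list)
for name, use_case in zip(df["Name"], df["Use_Case"]):
    collected[name].append(use_case if use_case else "n/a")

# Keep the first row per name and replace Use_Case with the joined string.
result = df.drop_duplicates("Name").copy()
result["Use_Case"] = result["Name"].map(lambda n: "; ".join(collected[n]))
print(result)
```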
According to my timeit test with 100 runs, the iterate-and-replace method was roughly seven times faster than the groupby method.
import pandas as pd
from my_stuff import time_something

df = pd.DataFrame({'a': [i / (i % 4 + 1) for i in range(1, 10001)],
                   'b': [i for i in range(1, 10001)]})
runs = 100

# each snippet is passed to timeit as a string
interim_dict = 'txt = {}\n' \
               'for i, row in df.iterrows():\n' \
               '    try:\n' \
               "        txt[row['a']].append(row['b'])\n\n" \
               '    except KeyError:\n' \
               "        txt[row['a']] = []\n" \
               "df.drop_duplicates('a', inplace=True)\n" \
               "df['b'] = ['; '.join(v) for v in txt.values()]"
grouping = "new_df = df.groupby('a')['b'].apply(str).apply('; '.join).reset_index()"

print(time_something(interim_dict, runs, beg_string='Interim Dict', glbls=globals()))
print(time_something(grouping, runs, beg_string='Group By', glbls=globals()))
yields:
Interim Dict
  Total: 59.1164s
  Avg: 591163748.5887ns
Group By
  Total: 430.6203s
  Avg: 4306203366.1827ns
where time_something is a function which times a snippet with timeit and returns the result in the above format.
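Since time_something is not shown, here is a rough sketch of how a similar comparison could be run with only the standard-library timeit module. The data, the column names key and val, and the functions with_groupby and with_dict are all made up for illustration, and the numbers will depend on your data and pandas version. Note that the grouped snippet above calls apply(str) on each group before joining, so it may not be doing the same work as the loop; the sketch below joins the values directly in both versions.

```python
import timeit

import pandas as pd

# Made-up data: 1000 rows spread over 50 keys, with string values to join.
df = pd.DataFrame({
    "key": [i % 50 for i in range(1000)],
    "val": [str(i) for i in range(1000)],
})

def with_groupby(frame):
    # Join the values of each group directly.
    return frame.groupby("key")["val"].agg("; ".join).reset_index()

def with_dict(frame):
    # Collect values per key in a plain dict, then map them back.
    collected = {}
    for key, val in zip(frame["key"], frame["val"]):
        collected.setdefault(key, []).append(val)
    out = frame.drop_duplicates("key").copy()
    out["val"] = out["key"].map(lambda k: "; ".join(collected[k]))
    return out.reset_index(drop=True)

# Check that both approaches give the same joined strings before timing them.
same_result = with_groupby(df)["val"].tolist() == with_dict(df)["val"].tolist()

runs = 20
t_groupby = timeit.timeit(lambda: with_groupby(df), number=runs)
t_dict = timeit.timeit(lambda: with_dict(df), number=runs)
print(f"same result: {same_result}")
print(f"groupby: {t_groupby:.4f}s total over {runs} runs")
print(f"dict:    {t_dict:.4f}s total over {runs} runs")
```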
Method 4
Following the answers by @jezrael and @leoschet, I would like to provide a more general example for the case where the dataframe has many more columns, something I had to do recently.
Specifically, my dataframe had a total of 184 columns.
The REF column is the one to group by. Of the other columns, only one, called IDS, differed between rows (the remaining 182 were identical), and I wanted to collapse its elements into a list id1, id2, id3…
So:
# Create a dictionary {df_all_columns_name : 'first', 'IDS': join} for agg
# Also avoid REF column in dictionary (inserted after aggregation)
columns_collapse = {c: 'first' if c != 'IDS' else ', '.join for c in my_df.columns.tolist() if c != 'REF'}
my_df = my_df.groupby('REF').agg(columns_collapse).reset_index()
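To see the dictionary comprehension at work, here is a small sketch with a made-up four-column dataframe standing in for the 184-column one. The columns col_a and col_b and the name collapsed are hypothetical.

```python
import pandas as pd

# Made-up stand-in: REF is the key, IDS differs, the other columns repeat.
my_df = pd.DataFrame({
    "REF": ["r1", "r1", "r2"],
    "IDS": ["id1", "id2", "id3"],
    "col_a": [1, 1, 2],
    "col_b": ["x", "x", "y"],
})

# 'first' for every column except IDS, which is joined; REF is the group key.
columns_collapse = {
    c: ", ".join if c == "IDS" else "first"
    for c in my_df.columns
    if c != "REF"
}

collapsed = my_df.groupby("REF", sort=False).agg(columns_collapse).reset_index()
print(collapsed)
```

The dictionary is built from whatever columns the dataframe has, so nothing needs to be listed by hand.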
I hope this is also useful to someone!
Regards!
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0