pandas – Merge nearly duplicate rows based on column value

I have a pandas dataframe with several rows that are near duplicates of each other, except for one value. My goal is to merge or “coalesce” these rows into a single row, without summing the numerical values.

Here is an example of what I’m working with:

Name   Sid   Use_Case  Revenue
A      xx01  Voice     $10.00
A      xx01  SMS       $10.00
B      xx02  Voice     $5.00
C      xx03  Voice     $15.00
C      xx03  SMS       $15.00
C      xx03  Video     $15.00

And here is what I would like:

Name   Sid   Use_Case            Revenue
A      xx01  Voice, SMS          $10.00
B      xx02  Voice               $5.00
C      xx03  Voice, SMS, Video   $15.00

The reason I don’t want to sum the “Revenue” column is because my table is the result of doing a pivot over several time periods where “Revenue” simply ends up getting listed multiple times instead of having a different value per “Use_Case”.

What would be the best way to tackle this issue? I’ve looked into the groupby() function but I still don’t understand it very well.

Contents hide

Answers:

Method 1

Method 2

Method 3

Method 4

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

I think you can use groupby with aggregate first and custom function ', '.join:

df = df.groupby('Name').agg({'Sid':'first', 
                             'Use_Case': ', '.join, 
                             'Revenue':'first' }).reset_index()

#change column order                           
print df[['Name','Sid','Use_Case','Revenue']]                              
  Name   Sid           Use_Case Revenue
0    A  xx01         Voice, SMS  $10.00
1    B  xx02              Voice   $5.00
2    C  xx03  Voice, SMS, Video  $15.00

Nice idea from comment, thanks Goyo:

df = df.groupby(['Name','Sid','Revenue'])['Use_Case'].apply(', '.join).reset_index()

#change column order                           
print df[['Name','Sid','Use_Case','Revenue']]                              
  Name   Sid           Use_Case Revenue
0    A  xx01         Voice, SMS  $10.00
1    B  xx02              Voice   $5.00
2    C  xx03  Voice, SMS, Video  $15.00

Method 2

You can groupby and apply the list function:

>>> df['Use_Case'].groupby([df.Name, df.Sid, df.Revenue]).apply(list).reset_index()
    Name    Sid     Revenue     0
0   A   xx01    $10.00  [Voice, SMS]
1   B   xx02    $5.00   [Voice]
2   C   xx03    $15.00  [Voice, SMS, Video]

(In case you are concerned about duplicates, use set instead of list.)

Method 3

I was using some code that I didn’t think was optimal and eventually found jezrael’s answer. But after using it and running a timeit test, I actually went back to what I was doing, which was:

cmnts = {}
for i, row in df.iterrows():
    while True:
        try:
            if row['Use_Case']:
                cmnts<div class="su-row"></div>].append(row['Use_Case'])

            else:
                cmnts<div class="su-row"></div>].append('n/a')

            break

        except KeyError:
            cmnts<div class="su-row"></div>] = []

df.drop_duplicates('Name', inplace=True)
df['Use_Case'] = ['; '.join(v) for v in cmnts.values()]

According to my 100 run timeit test, the iterate and replace method is an order of magnitude faster than the groupby method.

import pandas as pd
from my_stuff import time_something

df = pd.DataFrame({'a': [i / (i % 4 + 1) for i in range(1, 10001)],
                   'b': [i for i in range(1, 10001)]})

runs = 100

interim_dict = 'txt = {}n' 
               'for i, row in df.iterrows():n' 
               '    try:n' 
               "        txt<div class="su-row"></div>].append(row['b'])nn" 
               '    except KeyError:n' 
               "        txt<div class="su-row"></div>] = []n" 
               "df.drop_duplicates('a', inplace=True)n" 
               "df['b'] = ['; '.join(v) for v in txt.values()]"

grouping = "new_df = df.groupby('a')['b'].apply(str).apply('; '.join).reset_index()"

print(time_something(interim_dict, runs, beg_string='Interim Dict', glbls=globals()))
print(time_something(grouping, runs, beg_string='Group By', glbls=globals()))

yields:

Interim Dict
  Total: 59.1164s
  Avg: 591163748.5887ns

Group By
  Total: 430.6203s
  Avg: 4306203366.1827ns

where time_something is a function which times a snippet with timeit and returns the result in the above format.

Method 4

Following @jezrael and @leoschet answers, I would like to provide a more general example in case there are many more columns in the dataframe, something I had to do recently.

Specifically, my dataframe had a total of 184 columns.

The column REF is the one that should be used as a reference for the groupby and only another one, called IDS, of the remaining 182, was different and I wanted to collapse its elements into a list id1, id2, id3…

So:

# Create a dictionary {df_all_columns_name : 'first', 'IDS': join} for agg
# Also avoid REF column in dictionary (inserted after aggregation)
columns_collapse = {c: 'first' if c != 'IDS' else ', '.join for c in my_df.columns.tolist() if c != 'REF'}
my_df = my_df.groupby('REF').agg(columns_collapse).reset_index()

I hope this is also useful to someone!

Regards!

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes

Article Rating