The pandas drop_duplicates function is great for “uniquifying” a dataframe. However, one of the keyword arguments to pass is take_last=True or take_last=False, while I would like to drop all rows which are duplicates across a subset of columns. Is this possible?
A B C 0 foo 0 A 1 foo 1 A 2 foo 1 B 3 bar 1 A
As an example, I would like to drop rows which match on columns A and C so this should drop rows 0 and 1.
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
This is much easier in pandas now with drop_duplicates and the keep parameter.
import pandas as pd
df = pd.DataFrame({"A":["foo", "foo", "foo", "bar"], "B":[0,1,1,1], "C":["A","A","B","A"]})
df.drop_duplicates(subset=['A', 'C'], keep=False)
Method 2
Just want to add to Ben’s answer on drop_duplicates:
keep : {‘first’, ‘last’, False}, default ‘first’
- first : Drop duplicates except for the first occurrence.
- last : Drop duplicates except for the last occurrence.
- False : Drop all duplicates.
So setting keep to False will give you desired answer.
DataFrame.drop_duplicates(*args, **kwargs) Return DataFrame with
duplicate rows removed, optionally only considering certain columnsParameters: subset : column label or sequence of labels, optional
Only consider certain columns for identifying duplicates, by default
use all of the columns keep : {‘first’, ‘last’, False}, default
‘first’ first : Drop duplicates except for the first occurrence. last
: Drop duplicates except for the last occurrence. False : Drop all
duplicates. take_last : deprecated inplace : boolean, default False
Whether to drop duplicates in place or to return a copy cols : kwargs
only argument of subset [deprecated] Returns: deduplicated :
DataFrame
Method 3
If you want result to be stored in another dataset:
df.drop_duplicates(keep=False)
or
df.drop_duplicates(keep=False, inplace=False)
If same dataset needs to be updated:
df.drop_duplicates(keep=False, inplace=True)
Above examples will remove all duplicates and keep one, similar to DISTINCT * in SQL
Method 4
use groupby and filter
import pandas as pd
df = pd.DataFrame({"A":["foo", "foo", "foo", "bar"], "B":[0,1,1,1], "C":["A","A","B","A"]})
df.groupby(["A", "C"]).filter(lambda df:df.shape[0] == 1)
Method 5
Try these various things
df = pd.DataFrame({"A":["foo", "foo", "foo", "bar","foo"], "B":[0,1,1,1,1], "C":["A","A","B","A","A"]})
>>>df.drop_duplicates( "A" , keep='first')
or
>>>df.drop_duplicates( keep='first')
or
>>>df.drop_duplicates( keep='last')
Method 6
Actually, drop rows 0 and 1 only requires (any observations containing matched A and C is kept.):
In [335]:
df['AC']=df.A+df.C
In [336]:
print df.drop_duplicates('C', take_last=True) #this dataset is a special case, in general, one may need to first drop_duplicates by 'c' and then by 'a'.
A B C AC
2 foo 1 B fooB
3 bar 1 A barA
[2 rows x 4 columns]
But I suspect what you really want is this (one observation containing matched A and C is kept.):
In [337]:
print df.drop_duplicates('AC')
A B C AC
0 foo 0 A fooA
2 foo 1 B fooB
3 bar 1 A barA
[3 rows x 4 columns]
Edit:
Now it is much clearer, therefore:
In [352]:
DG=df.groupby(['A', 'C'])
print pd.concat([DG.get_group(item) for item, value in DG.groups.items() if len(value)==1])
A B C
2 foo 1 B
3 bar 1 A
[2 rows x 3 columns]
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0