set difference for pandas

A simple pandas question:

Is there a drop_duplicates() functionality to drop every row involved in the duplication?

An equivalent question is the following: Does pandas have a set difference for dataframes?

For example:

In [5]: df1 = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})

In [6]: df2 = pd.DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})

In [7]: df1
Out[7]: 
   col1  col2
0     1     2
1     2     3
2     3     4

In [8]: df2
Out[8]: 
   col1  col2
0     4     6
1     2     3
2     5     5

so maybe something like df2.set_diff(df1) will produce this:

   col1  col2
0     4     6
2     5     5

However, I don’t want to rely on indexes because in my case, I have to deal with dataframes that have distinct indexes.

By the way, I initially thought about an extension of the current drop_duplicates() method, but now I realize that the second approach using properties of set theory would be far more useful in general. Both approaches solve my current problem, though.

Thanks!

Contents hide

Answers:

Method 1

Method 2

Method 3

Method 4

Method 5

Method 6

Method 7

Method 8

Method 9

Method 10

Method 11

Method 12

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

Bit convoluted but if you want to totally ignore the index data. Convert the contents of the dataframes to sets of tuples containing the columns:

ds1 = set(map(tuple, df1.values))
ds2 = set(map(tuple, df2.values))

This step will get rid of any duplicates in the dataframes as well (index ignored)

set([(1, 2), (3, 4), (2, 3)])   # ds1

can then use set methods to find anything. Eg to find differences:

ds1.difference(ds2)

gives:
set([(1, 2), (3, 4)])

can take that back to dataframe if needed. Note have to transform set to list 1st as set cannot be used to construct dataframe:

pd.DataFrame(list(ds1.difference(ds2)))

Method 2

Here’s another answer that keeps the index and does not require identical indexes in two data frames. (EDIT: make sure there is no duplicates in df2 beforehand)

pd.concat([df2, df1, df1]).drop_duplicates(keep=False)

It is fast and the result is

   col1  col2
0     4     6
2     5     5

Method 3

from pandas import  DataFrame

df1 = DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
df2 = DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})


print(df2[~df2.isin(df1).all(1)])
print(df2[(df2!=df1)].dropna(how='all'))
print(df2[~(df2==df1)].dropna(how='all'))

Method 4

Apply by the columns of the object you want to map (df2); find the rows that are not in the set (isin is like a set operator)

In [32]: df2.apply(lambda x: df2.loc[~x.isin(df1[x.name]),x.name])
Out[32]: 
   col1  col2
0     4     6
2     5     5

Same thing, but include all values in df1, but still per column in df2

In [33]: df2.apply(lambda x: df2.loc[~x.isin(df1.values.ravel()),x.name])
Out[33]: 
   col1  col2
0   NaN     6
2     5     5

2nd example

In [34]: g = pd.DataFrame({'x': [1.2,1.5,1.3], 'y': [4,4,4]})

In [35]: g.columns=df1.columns

In [36]: g
Out[36]: 
   col1  col2
0   1.2     4
1   1.5     4
2   1.3     4

In [32]: g.apply(lambda x: g.loc[~x.isin(df1[x.name]),x.name])
Out[32]: 
   col1  col2
0   1.2   NaN
1   1.5   NaN
2   1.3   NaN

Note, in 0.13, there will be an isin operator on the frame level, so something like: df2.isin(df1) should be possible

Method 5

There are 3 methods which work, but two of them have some flaws.

Method 1 (Hash method):

It worked for all cases I tested.

df1.loc[:, "hash"] = df1.apply(lambda x: hash(tuple(x)), axis = 1)
df2.loc[:, "hash"] = df2.apply(lambda x: hash(tuple(x)), axis = 1)
df1 = df1.loc[~df1["hash"].isin(df2["hash"]), :]

Method 2 (Dict method):

It fails if DataFrames contain datetime columns.

df1 = df1.loc[~df1.isin(df2.to_dict(orient="list")).all(axis=1), :]

Method 3 (MultiIndex method):

I encountered cases when it failed on columns with None’s or NaN’s.

df1 = df1.loc[~df1.set_index(list(df1.columns)).index.isin(df2.set_index(list(df2.columns)).index)

Method 6

Get the indices of the intersection with a merge, then drop them:

>>> df_all = pd.DataFrame(np.arange(8).reshape((4,2)), columns=['A','B']); df_all
   A  B
0  0  1
1  2  3
2  4  5
3  6  7
>>> df_completed = df_all.iloc[::2]; df_completed
   A  B
0  0  1
2  4  5
>>> merged = pd.merge(df_all.reset_index(), df_completed); merged
   index  A  B
0      0  0  1
1      2  4  5
>>> df_pending = df_all.drop(merged['index']); df_pending
   A  B
1  2  3
3  6  7

Method 7

Numpy’s setdiff1d would work and perhaps be faster.

For each column:
np.setdiff1(df1.col1.values, df2.col1.values)

So something like:

setdf = pd.DataFrame({
    col: np.setdiff1d(getattr(df1, col).values, getattr(df2, col).values)
    for col in df1.columns
})

numpy.setdiff1d docs

Method 8

Assumption:

df1 and df2 have identical columns

it is a set operation so duplicates are ignored

sets are not extremely large so you do not worry about memory

union = pd.concat([df1,df2])
sym_diff = union[~union.duplicated(keep=False)]
union_of_df1_and_sym_diff = pd.concat([df1, sym_diff])
diff = union_of_df1_and_sym_diff[union_of_df1_and_sym_diff.duplicated()]

Method 9

Edit: You can now make MultiIndex objects directly from data frames as of pandas 0.24.0 which greatly simplifies the syntax of this answer

df1mi = pd.MultiIndex.from_frame(df1)
df2mi = pd.MultiIndex.from_frame(df2)
dfdiff = df2mi.difference(df1mi).to_frame().reset_index(drop=True)

Original Answer

Pandas MultiIndex objects have fast set operations implemented as methods, so you can convert the DataFrames to MultiIndexes, use the difference() method, then convert the result back to a DataFrame. This solution should be much faster (by ~100x or more from my brief testing) than the solutions given here so far, and it will not depend on the row indexing of the original frames. As Piotr mentioned for his answer, this will fail with null values, since np.nan != np.nan. Any row in df2 with a null value will always appear in the difference. Also, the columns should be in the same order for both DataFrames.

df1mi = pd.MultiIndex.from_arrays(df1.values.transpose(), names=df1.columns)
df2mi = pd.MultiIndex.from_arrays(df2.values.transpose(), names=df2.columns)
dfdiff = df2mi.difference(df1mi).to_frame().reset_index(drop=True)

Method 10

I’m not sure how pd.concat() implicitly joins overlapping columns but I had to do a little tweak on @radream’s answer.

Conceptually, a set difference (symmetric) on multiple columns is a set union (outer join) minus a set intersection (or inner join):

df1 = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
df2 = pd.DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})
o = pd.merge(df1, df2, how='outer')
i = pd.merge(df1, df2)
set_diff = pd.concat([o, i]).drop_duplicates(keep=False)

This yields:

   col1  col2
0     1     2
2     3     4
3     4     6
4     5     5

Method 11

In Pandas 1.1.0 you can count unique rows with value_counts and find difference between counts:

df1 = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
df2 = pd.DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})

diff = df2.value_counts().sub(df1.value_counts(), fill_value=0)

Result:

col1  col2
1     2      -1.0
2     3       0.0
3     4      -1.0
4     6       1.0
5     5       1.0
dtype: float64

Get positive counts:

diff[diff > 0].reset_index(name='counts')


   col1  col2  counts
0     4     6     1.0
1     5     5     1.0

Method 12

this should work even if you have multiple columns in both dataframes. But make sure that the column names of both the dataframes are the exact same.

set_difference = pd.concat([df2, df1, df1]).drop_duplicates(keep=False)

With multiple columns you can also use:

col_names=['col_1','col_2']
set_difference = pd.concat([df2[col_names], df1[col_names], 
df1[col_names]]).drop_duplicates(keep=False)

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes

Article Rating