I have a data frame with two columns, A and B. The order of A and B is unimportant in this context; for example, I would consider (0,50) and (50,0) to be duplicates. In pandas, what is an efficient way to remove these duplicates from a dataframe?
import pandas as pd
# Initial data frame.
data = pd.DataFrame({'A': [0, 10, 11, 21, 22, 35, 5, 50],
'B': [50, 22, 35, 5, 10, 11, 21, 0]})
data
A B
0 0 50
1 10 22
2 11 35
3 21 5
4 22 10
5 35 11
6 5 21
7 50 0
# Desired output with "duplicates" removed.
data2 = pd.DataFrame({'A': [0, 5, 10, 11],
'B': [50, 21, 22, 35]})
data2
A B
0 0 50
1 5 21
2 10 22
3 11 35
Ideally, the output would be sorted by values of column A.
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
You can sort each row of the data frame before dropping the duplicates:
data.apply(lambda r: sorted(r), axis = 1).drop_duplicates() # A B #0 0 50 #1 10 22 #2 11 35 #3 5 21
If you prefer the result to be sorted by column A:
data.apply(lambda r: sorted(r), axis = 1).drop_duplicates().sort_values('A')
# A B
#0 0 50
#3 5 21
#1 10 22
#2 11 35
Method 2
Here is bit uglier, but faster solution:
In [44]: pd.DataFrame(np.sort(data.values, axis=1), columns=data.columns).drop_duplicates()
Out[44]:
A B
0 0 50
1 10 22
2 11 35
3 5 21
Timing: for 8K rows DF
In [50]: big = pd.concat([data] * 10**3, ignore_index=True) In [51]: big.shape Out[51]: (8000, 2) In [52]: %timeit big.apply(lambda r: sorted(r), axis = 1).drop_duplicates() 1 loop, best of 3: 3.04 s per loop In [53]: %timeit pd.DataFrame(np.sort(big.values, axis=1), columns=big.columns).drop_duplicates() 100 loops, best of 3: 3.96 ms per loop In [59]: %timeit big.apply(np.sort, axis = 1).drop_duplicates() 1 loop, best of 3: 2.69 s per loop
Method 3
Now this solution works,
data.set_index(['A','B']).stack().drop_duplicates().unstack().reset_index()
More columns could be added as well as per necessity.
e.g.
data.set_index(['A','B', 'C']).stack().drop_duplicates().unstack().reset_index()
Method 4
df.T.apply(sorted).T.drop_duplicates()
Method 5
Here is a bit lengthy solution, but might be helpful for beginners –
Creating new columns for sorting values from Column A & B across row –
data['C'] = np.where(data['A']<data['B'] , data['A'], data['B']) data['D'] = np.where(data['A']>data['B'] , data['A'], data['B'])
Removing Duplicates & sorting as per column ‘C’ as requested in question & renaming the columns
data2 = data[['C', 'D']].drop_duplicates().sort_values('C')
data2.columns = ['A', 'B']
data2
PS – “np.where” function works similar to If formula in excel (Logical Condition, Value if TRUE, Value if FALSE)
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0