I’d like to concatenate two dataframes A, B to a new one without duplicate rows (if rows in B already exist in A, don’t add):
Dataframe A:
I II 0 1 2 1 3 1
Dataframe B:
I II 0 5 6 1 3 1
New Dataframe:
I II 0 1 2 1 3 1 2 5 6
How can I do this?
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
The simplest way is to just do the concatenation, and then drop duplicates.
>>> df1 A B 0 1 2 1 3 1 >>> df2 A B 0 5 6 1 3 1 >>> pandas.concat([df1,df2]).drop_duplicates().reset_index(drop=True) A B 0 1 2 1 3 1 2 5 6
The reset_index(drop=True) is to fix up the index after the concat() and drop_duplicates(). Without it you will have an index of [0,1,0] instead of [0,1,2]. This could cause problems for further operations on this dataframe down the road if it isn’t reset right away.
Method 2
In case you have a duplicate row already in DataFrame A, then concatenating and then dropping duplicate rows, will remove rows from DataFrame A that you might want to keep.
In this case, you will need to create a new column with a cumulative count, and then drop duplicates, it all depends on your use case, but this is common in time-series data
Here is an example:
df_1 = pd.DataFrame([
{'date':'11/20/2015', 'id':4, 'value':24},
{'date':'11/20/2015', 'id':4, 'value':24},
{'date':'11/20/2015', 'id':6, 'value':34},])
df_2 = pd.DataFrame([
{'date':'11/20/2015', 'id':4, 'value':24},
{'date':'11/20/2015', 'id':6, 'value':14},
])
df_1['count'] = df_1.groupby(['date','id','value']).cumcount()
df_2['count'] = df_2.groupby(['date','id','value']).cumcount()
df_tot = pd.concat([df_1,df_2], ignore_index=False)
df_tot = df_tot.drop_duplicates()
df_tot = df_tot.drop(['count'], axis=1)
>>> df_tot
date id value
0 11/20/2015 4 24
1 11/20/2015 4 24
2 11/20/2015 6 34
1 11/20/2015 6 14
Method 3
I’m surprised that pandas doesn’t offer a native solution for this task.
I don’t think that it’s efficient to just drop the duplicates if you work with large datasets (as Rian G suggested).
It is probably most efficient to use sets to find the non-overlapping indices. Then use list comprehension to translate from index to ‘row location’ (boolean), which you need to access rows using iloc[,]. Below you find a function that performs the task. If you don’t choose a specific column (col) to check for duplicates, then indexes will be used, as you requested. If you chose a specific column, be aware that existing duplicate entries in ‘a’ will remain in the result.
import pandas as pd
def append_non_duplicates(a, b, col=None):
if ((a is not None and type(a) is not pd.core.frame.DataFrame) or (b is not None and type(b) is not pd.core.frame.DataFrame)):
raise ValueError('a and b must be of type pandas.core.frame.DataFrame.')
if (a is None):
return(b)
if (b is None):
return(a)
if(col is not None):
aind = a.iloc[:,col].values
bind = b.iloc[:,col].values
else:
aind = a.index.values
bind = b.index.values
take_rows = list(set(bind)-set(aind))
take_rows = [i in take_rows for i in bind]
return(a.append( b.iloc[take_rows,:] ))
# Usage
a = pd.DataFrame([[1,2,3],[1,5,6],[1,12,13]], index=[1000,2000,5000])
b = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], index=[1000,2000,3000])
append_non_duplicates(a,b)
# 0 1 2
# 1000 1 2 3 <- from a
# 2000 1 5 6 <- from a
# 5000 1 12 13 <- from a
# 3000 7 8 9 <- from b
append_non_duplicates(a,b,0)
# 0 1 2
# 1000 1 2 3 <- from a
# 2000 1 5 6 <- from a
# 5000 1 12 13 <- from a
# 2000 4 5 6 <- from b
# 3000 7 8 9 <- from b
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0