Suppose I have two DataFrames like so:
left = pd.DataFrame({'key1': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key2': ['foo', 'bar'], 'rval': [4, 5]})
I want to merge them, so I try something like this:
pd.merge(left, right, left_on='key1', right_on='key2')
And I’m happy
  key1  lval key2  rval
0  foo     1  foo     4
1  bar     2  bar     5
But I’m trying to use the join method, which I’ve been led to believe is pretty similar.
left.join(right, on=['key1', 'key2'])
And I get this:
//anaconda/lib/python2.7/site-packages/pandas/tools/merge.pyc in _validate_specification(self)
406 if self.right_index:
407 if not ((len(self.left_on) == self.right.index.nlevels)):
--> 408 raise AssertionError()
409 self.right_on = [None] * n
410 elif self.right_on is not None:
AssertionError:
What am I missing?
Answers:
Method 1
pandas.merge() is the underlying function used for all merge/join behavior.
DataFrames provide the pandas.DataFrame.merge() and pandas.DataFrame.join() methods as a convenient way to access the capabilities of pandas.merge(). For example, df1.merge(right=df2, ...) is equivalent to pandas.merge(left=df1, right=df2, ...).
These are the main differences between df.join() and df.merge():
- lookup on right table: df1.join(df2) always joins via the index of df2, but df1.merge(df2) can join to one or more columns of df2 (default) or to the index of df2 (with right_index=True).
- lookup on left table: by default, df1.join(df2) uses the index of df1 and df1.merge(df2) uses column(s) of df1. That can be overridden by specifying df1.join(df2, on=key_or_keys) or df1.merge(df2, left_index=True).
- left vs inner join: df1.join(df2) does a left join by default (keeps all rows of df1), but df.merge does an inner join by default (returns only matching rows of df1 and df2).
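The differences listed above can be seen in a small sketch. The frames and column names here are made up for illustration and are not from the question:

```python
import pandas as pd

# Hypothetical example data: "c" has no match in df2
df1 = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
df2 = pd.DataFrame({"key": ["a", "b"], "y": [10, 20]})

# join: looks up df1["key"] in the *index* of the right frame,
# and does a left join by default (all rows of df1 are kept)
joined = df1.join(df2.set_index("key"), on="key")

# merge: matches on the shared column "key",
# and does an inner join by default (only matching rows are kept)
merged = df1.merge(df2)

print(joined)
print(merged)
```

Here joined keeps the unmatched "c" row with a missing y, while merged drops it.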
So, the generic approach is to use pandas.merge(df1, df2) or df1.merge(df2). But for a number of common situations (keeping all rows of df1 and joining to an index in df2), you can save some typing by using df1.join(df2) instead.
Some notes on these issues from the documentation at http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging:
merge is a function in the pandas namespace, and it is also
available as a DataFrame instance method, with the calling DataFrame
being implicitly considered the left object in the join. The related
DataFrame.join method uses merge internally for the
index-on-index and index-on-column(s) joins, but joins on indexes by
default rather than trying to join on common columns (the default
behavior for merge). If you are joining on index, you may wish to
use DataFrame.join to save yourself some typing.
…
These two function calls are completely equivalent:
left.join(right, on=key_or_keys)
pd.merge(left, right, left_on=key_or_keys, right_index=True, how='left', sort=False)
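Applied to the frames from the question, this equivalence suggests one way to make join work: move key2 into the index of right first, then pass only the left-hand column as on. This is a sketch, and right_indexed is just an illustrative name:

```python
import pandas as pd

left = pd.DataFrame({'key1': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key2': ['foo', 'bar'], 'rval': [4, 5]})

# join can only look things up in the index of the right frame,
# so turn the right-hand key column into the index
right_indexed = right.set_index('key2')

# on= names a column of *left*; it is matched against right_indexed's index
via_join = left.join(right_indexed, on='key1')

# The merge call that the quoted documentation describes as equivalent
via_merge = pd.merge(left, right_indexed, left_on='key1',
                     right_index=True, how='left', sort=False)

print(via_join)
print(via_join.equals(via_merge))
```

Note that key2 does not appear as a column in the result, because it was consumed as the index. The original call failed because on=['key1', 'key2'] asks join to match two left-hand columns against a right-hand index that has only one level.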
Method 2
I always use join on indices:
import pandas as pd
left = pd.DataFrame({'key': ['foo', 'bar'], 'val': [1, 2]}).set_index('key')
right = pd.DataFrame({'key': ['foo', 'bar'], 'val': [4, 5]}).set_index('key')
left.join(right, lsuffix='_l', rsuffix='_r')
val_l val_r
key
foo 1 4
bar 2 5
The same functionality can be had by using merge on the columns as follows:
left = pd.DataFrame({'key': ['foo', 'bar'], 'val': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'val': [4, 5]})
left.merge(right, on='key', suffixes=('_l', '_r'))
key val_l val_r
0 foo 1 4
1 bar 2 5
Method 3
From this documentation
pandas provides a single function, merge, as the entry point for all
standard database join operations between DataFrame objects:
merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=True, suffixes=('_x', '_y'), copy=True, indicator=False)
And:
DataFrame.join is a convenient method for combining the columns of two
potentially differently-indexed DataFrames into a single result
DataFrame. Here is a very basic example: The data alignment here is on
the indexes (row labels). This same behavior can be achieved using
merge plus additional arguments instructing it to use the indexes:
result = pd.merge(left, right, left_index=True, right_index=True, how='outer')
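The quoted passage omits its example, so here is a small stand-in with made-up frames (not the ones from the documentation) that compares the two index-on-index spellings:

```python
import pandas as pd

# Hypothetical frames with partially overlapping indexes
left = pd.DataFrame({"A": [1, 2]}, index=["k0", "k1"])
right = pd.DataFrame({"B": [3, 4]}, index=["k1", "k2"])

# Index-on-index outer join, written with merge ...
via_merge = pd.merge(left, right, left_index=True, right_index=True, how="outer")

# ... and the shorter spelling with join
via_join = left.join(right, how="outer")

print(via_join)
print(via_join.equals(via_merge))
```

Both results contain the union of the two indexes, with missing values where a label exists on only one side.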
Method 4
One difference is that merge creates a new index when it joins on columns, while join keeps the left-side index. This can have big consequences for your later transformations if you wrongly assume that merge leaves your index unchanged.
For example:
import pandas as pd
df1 = pd.DataFrame({'org_index': [101, 102, 103, 104],
'date': [201801, 201801, 201802, 201802],
'val': [1, 2, 3, 4]}, index=[101, 102, 103, 104])
df1
date org_index val
101 201801 101 1
102 201801 102 2
103 201802 103 3
104 201802 104 4
–
df2 = pd.DataFrame({'date': [201801, 201802], 'dateval': ['A', 'B']}).set_index('date')
df2
dateval
date
201801 A
201802 B
–
df1.merge(df2, on='date')
date org_index val dateval
0 201801 101 1 A
1 201801 102 2 A
2 201802 103 3 B
3 201802 104 4 B
–
df1.join(df2, on='date')
date org_index val dateval
101 201801 101 1 A
102 201801 102 2 A
103 201802 103 3 B
104 201802 104 4 B
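If you need a column-based merge but also want to keep the left index, one possible workaround (an assumption on my part, not something the answer above proposes) is to move the index into a column before merging and restore it afterwards. The frames below are a reduced version of the example:

```python
import pandas as pd

df1 = pd.DataFrame({'date': [201801, 201801, 201802, 201802],
                    'val': [1, 2, 3, 4]}, index=[101, 102, 103, 104])
df2 = pd.DataFrame({'date': [201801, 201802], 'dateval': ['A', 'B']})

# A plain column merge discards df1's index and builds a fresh 0..n-1 one
merged = df1.merge(df2, on='date')
print(list(merged.index))

# Workaround sketch: reset_index() stores the unnamed index in a column
# called 'index', which is turned back into the index after the merge
kept = df1.reset_index().merge(df2, on='date').set_index('index')
print(list(kept.index))
```

The restored index is named 'index' in this sketch; rename it or clear the name if that matters downstream.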
Method 5
I believe that join() is just a convenience method. Try df1.merge(df2) instead, which allows you to specify left_on and right_on:
In [30]: left.merge(right, left_on="key1", right_on="key2")
Out[30]:
  key1  lval key2  rval
0  foo     1  foo     4
1  bar     2  bar     5
Method 6
- Join: joins on the index by default. (If both frames share a column name, it raises an error in default mode because lsuffix and rsuffix are not defined.)
df_1.join(df_2)
- Merge: joins on the shared column names by default. (If there are no shared column names, it raises an error in default mode.)
df_1.merge(df_2)
- The on parameter has a different meaning in each case:
df_1.merge(df_2, on='column_1')
df_1.join(df_2, on='column_1')  # raises an error while column_1 is still a column of df_2
df_1.join(df_2.set_index('column_1'), on='column_1')
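The cases above can be checked with a short sketch. The column names x and y and the flag variables are invented for illustration:

```python
import pandas as pd

df_1 = pd.DataFrame({"column_1": ["a", "b"], "x": [1, 2]})
df_2 = pd.DataFrame({"column_1": ["a", "b"], "y": [3, 4]})

# join: both frames have column_1 and no suffixes are given -> ValueError
try:
    df_1.join(df_2)
    join_failed = False
except ValueError:
    join_failed = True

# merge: no column names in common -> MergeError (a ValueError subclass)
try:
    df_1[["x"]].merge(df_2[["y"]])
    merge_failed = False
except ValueError:
    merge_failed = True

# on= for join: look up df_1["column_1"] in the index of the right frame
joined = df_1.join(df_2.set_index("column_1"), on="column_1")

# on= for merge: match df_1["column_1"] against df_2["column_1"]
merged = df_1.merge(df_2, on="column_1")

print(join_failed, merge_failed)
print(joined)
print(merged)
```

Passing lsuffix and rsuffix to the first join call is the other way to avoid the overlap error, at the cost of getting two suffixed column_1 columns.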
Method 7
To put it in SQL terms: “pandas merge is to outer/inner join as pandas join is to natural join”. Hence, when you use merge in pandas, you specify which kind of SQL-style join you want, whereas when you use pandas join, you need a matching column label to ensure that it joins.
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.