pandas left join and update existing column

I am new to pandas and can’t seem to get this to work with merge function:

>>> left       >>> right
   a  b   c       a  c   d 
0  1  4   9    0  1  7  13
1  2  5  10    1  2  8  14
2  3  6  11    2  3  9  15
3  4  7  12

With a left join on column a, I would like to update common columns BY THE JOINED KEYS. Note last value in column c is from LEFT table since there is no match.

>>> final       
   a  b   c   d 
0  1  4   7   13
1  2  5   8   14
2  3  6   9   15
3  4  7   12  NAN

How should I do this with Pandas merge function? Thank you.

Contents hide

Answers:

Method 1

Method 2

Method 3

Method 4

join_insertion()

Example Use

Method 5

Method 6

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

You can use merge() between left and right with how='left' on 'a' column.

In [74]: final = left.merge(right, on='a', how='left')

In [75]: final
Out[75]:
   a  b  c_x  c_y   d
0  1  4    9    7  13
1  2  5   10    8  14
2  3  6   11    9  15
3  4  7   12  NaN NaN

Replace NaN value from c_y with c_x value

In [76]: final['c'] = final['c_y'].fillna(final['c_x'])

In [77]: final
Out[77]:
   a  b  c_x  c_y   d   c
0  1  4    9    7  13   7
1  2  5   10    8  14   8
2  3  6   11    9  15   9
3  4  7   12  NaN NaN  12

Drop unwanted columns, and you have the result

In [79]: final.drop(['c_x', 'c_y'], axis=1)
Out[79]:
   a  b   d   c
0  1  4  13   7
1  2  5  14   8
2  3  6  15   9
3  4  7 NaN  12

Method 2

One way to do this is to set the a column as the index and update:

In [11]: left_a = left.set_index('a')

In [12]: right_a = right.set_index('a')

Note: update only does a left join (not merges), so as well as set_index you also need to include the additional columns not present in left_a.

In [13]: res = left_a.reindex(columns=left_a.columns.union(right_a.columns))

In [14]: res.update(right_a)

In [15]: res.reset_index(inplace=True)

In [16]: res
Out[16]:
   a   b   c   d
0  1   4   7  13
1  2   5   8  14
2  3   6   9  15
3  4   7  12 NaN

Method 3

Here’s a way to do it with join:

In [632]: t = left.set_index('a').join(right.set_index('a'), rsuffix='_right')

In [633]: t
Out[633]: 
   b   c  c_right   d
a                    
1  4   9        7  13
2  5  10        8  14
3  6  11        9  15
4  7  12      NaN NaN

Now, we want to set null values of c_right (which is from the right dataframe) with values from c column from the left dataframe. Updated the below process with a method taking from @John Galt’s answer

In [657]: t['c_right'] = t['c_right'].fillna(t['c'])

In [658]: t
Out[658]: 
   b   c  c_right   d
a                    
1  4   9        7  13
2  5  10        8  14
3  6  11        9  15
4  7  12       12 NaN

In [659]: t.drop('c_right', axis=1)
Out[659]: 
   b   c   d
a           
1  4   9  13
2  5  10  14
3  6  11  15
4  7  12 NaN

Method 4

DataFrame.update() is nice, but it doesn’t let you specify columns to join on and more importantly, if the other dataframe has NaN values, those NaN values will not overwrite non-nan values in the original DataFrame. To me, this is undesirable behavior.

Here’s a custom method I rolled to fix these issues. It’s freshly written, so users beware..

join_insertion()

def join_insertion(into_df, from_df, cols, on, by=None, direction=None, mult='error'):
    """
    Suppose A and B are dataframes. A has columns {foo, bar, baz} and B has columns {foo, baz, buz}
    This function allows you to do an operation like:
    "where A and B match via the column foo, insert the values of baz and buz from B into A"
    Note that this'll update A's values for baz and it'll insert buz as a new column.
    This is a lot like DataFrame.update(), but that method annoyingly ignores NaN values in B!

    Optionally, direction can be given as 'backward', 'forward', or nearest to implement a rolling join
    insertion. forward means 'roll into_df values forward to match from_df values', etc. Additionally,
    when doing a rolling join, 'on' should be the roll column and 'by' should be the exact-match columns.
    See pandas.merge_asof() for details.

    Note that 'mult' gets ignored when doing a rolling join. In the case of a rolling join, the first
    appearing record is kept, even if two records match a key from the same distance. Perhaps this
    can be improved...

    :param into_df: dataframe you want to modify
    :param from_df: dataframe with the values you want to insert
    :param cols: list of column names (values to insert)
    :param on: list of column names (values to join on), or a dict of {into:from} column name pairs
    :param by: same format as on; when doing a rolling join insertion, what columns to exact-match on
    :param direction: 'forward', 'backward', or 'nearest'. forward means roll into_df values to match from_df
    :param mult: if a key of into_df matches multiple rows of from_df, how should this be handled?
    an error can be raised, or the first matching value can be inserted, or the last matching value
    can be inserted
    :return: a modified copy of into_df, with updated values using from_df
    """

    # Infer left_on, right_on
    if (isinstance(on, dict)):
        left_on = list(on.keys())
        right_on = list(on.values())
    elif(isinstance(on, list)):
        left_on = on
        right_on = on
    elif(isinstance(on, str)):
        left_on = [on]
        right_on = [on]
    else:
        raise Exception("on should be a list or dictionary")

    # Infer left_by, right_by
    if(by is not None):
        if (isinstance(by, dict)):
            left_by = list(by.keys())
            right_by = list(by.values())
        elif (isinstance(by, list)):
            left_by = by
            right_by = by
        elif (isinstance(by, str)):
            left_by = [by]
            right_by = [by]
        else:
            raise Exception("by should be a list or dictionary")
    else:
        left_by = None
        right_by = None

    # Make cols a list if it isn't already
    if(isinstance(cols, str)):
        cols = [cols]

    # Setup
    A = into_df.copy()
    B = from_df[right_on + cols + ([] if right_by is None else right_by)].copy()

    # Insert row ids
    A['_A_RowId_'] = np.arange(A.shape[0])
    B['_B_RowId_'] = np.arange(B.shape[0])

    # Merge
    if(direction is None):
        A = pd.merge(
            left=A,
            right=B,
            how='left',
            left_on=left_on,
            right_on=right_on,
            suffixes=(None, '_y'),
            indicator=True
        ).sort_values(['_A_RowId_', '_B_RowId_'])

        # Check for rows of A which got duplicated by the merge, and then handle appropriately
        if (mult == 'error'):
            if (A.groupby('_A_RowId_').size().max() > 1):
                raise Exception("At least one key of into_df matched multiple rows of from_df.")
        elif (mult == 'first'):
            A = A.groupby('_A_RowId_').first().reset_index()
        elif (mult == 'last'):
            A = A.groupby('_A_RowId_').last().reset_index()

    else:
        A.sort_values(left_on, inplace=True)
        B.sort_values(right_on, inplace=True)
        A = pd.merge_asof(
            left=A,
            right=B,
            direction=direction,
            left_on=left_on,
            right_on=right_on,
            left_by=left_by,
            right_by=right_by,
            suffixes=(None, '_y')
        ).sort_values(['_A_RowId_', '_B_RowId_'])

    # Insert values from new column(s) into pre-existing column(s)
    mask = A._merge == 'both' if direction is None else np.repeat(True, A.shape[0])
    cols_in_both = list(set(into_df.columns.to_list()).intersection(set(cols)))
    for col in cols_in_both:
        A.loc[mask, col] = A.loc[mask, col + '_y']

    # Drop unwanted columns
    A.drop(columns=list(set(A.columns).difference(set(into_df.columns.to_list() + cols))), inplace=True)

    return A

Example Use

into_df = pd.DataFrame({
    'foo': [1, 2, 3],
    'bar': [4, 5, 6],
    'baz': [7, 8, 9]
})
   foo  bar  baz
0    1    4    7
1    2    5    8
2    3    6    9

from_df = pd.DataFrame({
    'foo': [1, 3, 5, 7, 3],
    'baz': [70, 80, 90, 30, 40],
    'buz': [0, 1, 2, 3, 4]
})
   foo  baz  buz
0    1   70    0
1    3   80    1
2    5   90    2
3    7   30    3
4    3   40    4

# Use it!

join_insertion(into_df, from_df, on='foo', cols=['baz','buz'], mult='error')
  Exception: At least one key of into_df matched multiple rows of from_df.

join_insertion(into_df, from_df, on='foo', cols=['baz','buz'], mult='first')
   foo  bar   baz  buz
0    1    4  70.0  0.0
1    2    5   8.0  NaN
2    3    6  80.0  1.0

join_insertion(into_df, from_df, on='foo', cols=['baz','buz'], mult='last')
   foo  bar   baz  buz
0    1    4  70.0  0.0
1    2    5   8.0  NaN
2    3    6  40.0  4.0

As an aside, this is one of those things I severely miss from R’s data.table package. With data.table, this is as easy as x[y, Foo := i.Foo, on = c("a", "b")]

Method 5

One other way is to use pd.merge like so:

 >>> import pandas as pd

 >>> final = pd.merge(left=right, right=left, 
                      how='outer',
                      left_index=True,
                      right_index=True,
                      on=('a', 'c')
                     ).sort_index(axis=1)

 >>> final       
    a  b   c   d 
 0  1  4   7   13.0
 1  2  5   8   14.0
 2  3  6   9   15.0
 3  4  7   12  NaN

Provide the intersection of both dataframe’s columns to the ‘on=’ parameter of the function.

This does not create unwanted columns that have to be dropped like with Zero’s solution.

The NaN value might change integers to floats in the same column.

Edit: This works for Pandas versions <= 1.1.5

Method 6

Here is another way that should work using combine_first()

right.set_index('a').combine_first(left.set_index('a')).reset_index()

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes

Article Rating