Coalesce values from 2 columns into a single column in a pandas dataframe

I’m looking for a method that behaves similarly to coalesce in T-SQL. I have 2 columns (column A and B) that are sparsely populated in a pandas dataframe. I’d like to create a new column using the following rules:

If the value in column A is not null, use that value for the new column C
If the value in column A is null, use the value in column B for the new column C

Like I mentioned, this can be accomplished in MS SQL Server via the coalesce function. I haven’t found a good pythonic method for this; does one exist?

Contents hide

Answers:

Method 1

Method 2

Coalesce for multiple columns with DataFrame.bfill

Method 3

Method 4

Case #1: Non-mutually Exclusive NaNs

Case #2: Mutually Exclusive Positioned NaNs

Method 5

Build dummy data

coalesce a1 a2, a3 into a new column A

Method 6

Method 7

Method 8

Method 9

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

use combine_first():

In [16]: df = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=list('ab'))

In [17]: df.loc[::2, 'a'] = np.nan

In [18]: df
Out[18]:
     a  b
0  NaN  0
1  5.0  5
2  NaN  8
3  2.0  8
4  NaN  3
5  9.0  4
6  NaN  7
7  2.0  0
8  NaN  6
9  2.0  5

In [19]: df['c'] = df.a.combine_first(df.b)

In [20]: df
Out[20]:
     a  b    c
0  NaN  0  0.0
1  5.0  5  5.0
2  NaN  8  8.0
3  2.0  8  2.0
4  NaN  3  3.0
5  9.0  4  9.0
6  NaN  7  7.0
7  2.0  0  2.0
8  NaN  6  6.0
9  2.0  5  2.0

Method 2

Coalesce for multiple columns with `DataFrame.bfill`

All these methods work for two columns and are fine with maybe three columns, but they all require method chaining if you have n columns when n > 2:

example dataframe:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1':[np.NaN, 2, 4, 5, np.NaN],
                   'col2':[np.NaN, 5, 1, 0, np.NaN],
                   'col3':[2, np.NaN, 9, 1, np.NaN],
                   'col4':[np.NaN, 10, 11, 4, 8]})

print(df)

   col1  col2  col3  col4
0   NaN   NaN   2.0   NaN
1   2.0   5.0   NaN  10.0
2   4.0   1.0   9.0  11.0
3   5.0   0.0   1.0   4.0
4   NaN   NaN   NaN   8.0

Using DataFrame.bfill over the index axis (axis=1) we can get the values in a generalized way even for a big n amount of columns

Plus, this would also work for string type columns !!

df['coalesce'] = df.bfill(axis=1).iloc[:, 0]

   col1  col2  col3  col4  coalesce
0   NaN   NaN   2.0   NaN       2.0
1   2.0   5.0   NaN  10.0       2.0
2   4.0   1.0   9.0  11.0       4.0
3   5.0   0.0   1.0   4.0       5.0
4   NaN   NaN   NaN   8.0       8.0

Using the Series.combine_first (accepted answer), it can get quite cumbersome and would eventually be undoable when amount of columns grow

df['coalesce'] = (
    df['col1'].combine_first(df['col2'])
        .combine_first(df['col3'])
        .combine_first(df['col4'])
)

   col1  col2  col3  col4  coalesce
0   NaN   NaN   2.0   NaN       2.0
1   2.0   5.0   NaN  10.0       2.0
2   4.0   1.0   9.0  11.0       4.0
3   5.0   0.0   1.0   4.0       5.0
4   NaN   NaN   NaN   8.0       8.0

Method 3

Try this also.. easier to remember:

df['c'] = np.where(df["a"].isnull(), df["b"], df["a"] )

This is slighty faster: df['c'] = np.where(df["a"].isnull() == True, df["b"], df["a"] )

%timeit df['d'] = df.a.combine_first(df.b)
1000 loops, best of 3: 472 µs per loop


%timeit  df['c'] = np.where(df["a"].isnull(), df["b"], df["a"] )
1000 loops, best of 3: 291 µs per loop

Method 4

combine_first is the most straightforward option. There are a couple of others which I outline below. I’m going to outline a few more solutions, some applicable to different cases.

Case #1: Non-mutually Exclusive NaNs

Not all rows have NaNs, and these NaNs are not mutually exclusive between columns.

df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, np.nan, 5.0, 7.0, np.nan],
    'b': [5.0, 3.0, np.nan, 4.0, np.nan, 6.0, 7.0]})      
df

     a    b
0  1.0  5.0
1  2.0  3.0
2  3.0  NaN
3  NaN  4.0
4  5.0  NaN
5  7.0  6.0
6  NaN  7.0

Let’s combine first on a.

Series.mask

df['a'].mask(pd.isnull, df['b'])
# df['a'].mask(df['a'].isnull(), df['b'])

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    7.0
6    7.0
Name: a, dtype: float64

Series.where

df['a'].where(pd.notnull, df['b'])

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    7.0
6    7.0
Name: a, dtype: float64

You can use similar syntax using np.where.

Alternatively, to combine first on b, switch the conditions around.

Case #2: Mutually Exclusive Positioned NaNs

All rows have NaNs which are mutually exclusive between columns.

df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, np.nan, 5.0, np.nan, np.nan],
    'b': [np.nan, np.nan, np.nan, 4.0, np.nan, 6.0, 7.0]})
df

     a    b
0  1.0  NaN
1  2.0  NaN
2  3.0  NaN
3  NaN  4.0
4  5.0  NaN
5  NaN  6.0
6  NaN  7.0

Series.update

This method works in-place, modifying the original DataFrame. This is an efficient option for this use case.

df['b'].update(df['a'])
# Or, to update "a" in-place,
# df['a'].update(df['b'])
df

     a    b
0  1.0  1.0
1  2.0  2.0
2  3.0  3.0
3  NaN  4.0
4  5.0  5.0
5  NaN  6.0
6  NaN  7.0

Series.add

df['a'].add(df['b'], fill_value=0)

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
6    7.0
dtype: float64

DataFrame.fillna + DataFrame.sum

df.fillna(0).sum(1)

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
6    7.0
dtype: float64

Method 5

I encountered this problem with but wanted to coalesce multiple columns, picking the first non-null from several columns. I found the following helpful:

Build dummy data

import pandas as pd
df = pd.DataFrame({'a1': [None, 2, 3, None],
                   'a2': [2, None, 4, None],
                   'a3': [4, 5, None, None],
                   'a4': [None, None, None, None],
                   'b1': [9, 9, 9, 999]})

df

    a1   a2   a3    a4   b1
0  NaN  2.0  4.0  None    9
1  2.0  NaN  5.0  None    9
2  3.0  4.0  NaN  None    9
3  NaN  NaN  NaN  None  999

coalesce a1 a2, a3 into a new column A

def get_first_non_null(dfrow, columns_to_search):
    for c in columns_to_search:
        if pd.notnull(dfrow[c]):
            return dfrow[c]
    return None

# sample usage:
cols_to_search = ['a1', 'a2', 'a3']
df['A'] = df.apply(lambda x: get_first_non_null(x, cols_to_search), axis=1)

print(df)

    a1   a2   a3    a4   b1    A
0  NaN  2.0  4.0  None    9  2.0
1  2.0  NaN  5.0  None    9  2.0
2  3.0  4.0  NaN  None    9  3.0
3  NaN  NaN  NaN  None  999  NaN

Method 6

I’m thinking a solution like this,

def coalesce(s: pd.Series, *series: List[pd.Series]):
    """coalesce the column information like a SQL coalesce."""
    for other in series:
        s = s.mask(pd.isnull, other)        
    return s

because given a DataFrame with columns with ['a', 'b', 'c'], you can use it like a SQL coalesce,

df['d'] = coalesce(df.a, df.b, df.c)

Method 7

For a more general case, where there are no NaNs but you want the same behavior:

Merge ‘left’, but override ‘right’ values where possible

Method 8

Good code, put you have a typo for python 3, correct one looks like this

    """coalesce the column information like a SQL coalesce."""
    for other in series:
        s = s.mask(pd.isnull, other)        
    return s

Method 9

Consider using DuckDB for efficient SQL on Pandas. It’s performant, simple, and feature-packed. https://duckdb.org/2021/05/14/sql-on-pandas.html

Sample Dataframe:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A':[1,np.NaN, 3, 4, 5],
                   'B':[np.NaN, 2, 3, 4, np.NaN]})

Coalesce using DuckDB:

import duckdb
out_df = duckdb.query("""SELECT A,B,coalesce(A,B) as C from df""").to_df()
print(out_df)

Output:

     A    B    c
0  1.0  NaN  1.0
1  NaN  2.0  2.0
2  3.0  3.0  3.0
3  4.0  4.0  4.0
4  5.0  NaN  5.0

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes

Article Rating