Find the max of two or more columns with pandas

I have a dataframe with columns A,B. I need to create a column C such that for every record / row:

C = max(A, B).

How should I go about doing this?

Contents hide

Answers:

Method 1

Method 2

Method 3

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

You can get the maximum like this:

>>> import pandas as pd
>>> df = pd.DataFrame({"A": [1,2,3], "B": [-2, 8, 1]})
>>> df
   A  B
0  1 -2
1  2  8
2  3  1
>>> df[["A", "B"]]
   A  B
0  1 -2
1  2  8
2  3  1
>>> df[["A", "B"]].max(axis=1)
0    1
1    8
2    3

and so:

>>> df["C"] = df[["A", "B"]].max(axis=1)
>>> df
   A  B  C
0  1 -2  1
1  2  8  8
2  3  1  3

If you know that “A” and “B” are the only columns, you could even get away with

>>> df["C"] = df.max(axis=1)

And you could use .apply(max, axis=1) too, I guess.

Method 2

@DSM’s answer is perfectly fine in almost any normal scenario. But if you’re the type of programmer who wants to go a little deeper than the surface level, you might be interested to know that it is a little faster to call numpy functions on the underlying .to_numpy() (or .values for <0.24) array instead of directly calling the (cythonized) functions defined on the DataFrame/Series objects.

For example, you can use ndarray.max() along the first axis.

# Data borrowed from @DSM's post.
df = pd.DataFrame({"A": [1,2,3], "B": [-2, 8, 1]})
df
   A  B
0  1 -2
1  2  8
2  3  1

df['C'] = df[['A', 'B']].values.max(1)
# Or, assuming "A" and "B" are the only columns, 
# df['C'] = df.values.max(1) 
df

   A  B  C
0  1 -2  1
1  2  8  8
2  3  1  3

If your data has NaNs, you will need numpy.nanmax:

df['C'] = np.nanmax(df.values, axis=1)
df

   A  B  C
0  1 -2  1
1  2  8  8
2  3  1  3

You can also use numpy.maximum.reduce. numpy.maximum is a ufunc (Universal Function), and every ufunc has a reduce:

df['C'] = np.maximum.reduce(df['A', 'B']].values, axis=1)
# df['C'] = np.maximum.reduce(df[['A', 'B']], axis=1)
# df['C'] = np.maximum.reduce(df, axis=1)
df

   A  B  C
0  1 -2  1
1  2  8  8
2  3  1  3

np.maximum.reduce and np.max appear to be more or less the same (for most normal sized DataFrames)—and happen to be a shade faster than DataFrame.max. I imagine this difference roughly remains constant, and is due to internal overhead (indexing alignment, handling NaNs, etc).

The graph was generated using perfplot. Benchmarking code, for reference:

import pandas as pd
import perfplot

np.random.seed(0)
df_ = pd.DataFrame(np.random.randn(5, 1000))

perfplot.show(
    setup=lambda n: pd.concat([df_] * n, ignore_index=True),
    kernels=[
        lambda df: df.assign(new=df.max(axis=1)),
        lambda df: df.assign(new=df.values.max(1)),
        lambda df: df.assign(new=np.nanmax(df.values, axis=1)),
        lambda df: df.assign(new=np.maximum.reduce(df.values, axis=1)),
    ],
    labels=['df.max', 'np.max', 'np.maximum.reduce', 'np.nanmax'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N (* len(df))',
    logx=True,
    logy=True)

Method 3

For finding max among multiple columsn would be:

df[['A','B']].max(axis=1).max(axis=0)

Example:

df = 

                         A      B
timestamp                                
2019-11-20 07:00:16  14.037880  15.217879
2019-11-20 07:01:03  14.515359  15.878632
2019-11-20 07:01:33  15.056502  16.309152
2019-11-20 07:02:03  15.533981  16.740607
2019-11-20 07:02:34  17.221073  17.195145

print(df[['A','B']].max(axis=1).max(axis=0))
17.221073

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes

Article Rating