If I’ve got a DataFrame in pandas which looks something like:
     A    B    C
0    1  NaN    2
1  NaN    3  NaN
2  NaN    4    5
3  NaN  NaN  NaN
How can I get the first non-null value from each row? E.g. for the above, I’d like to get: [1, 3, 4, None] (or equivalent Series).
Answers:
Method 1
Back-fill the NaNs along each row with fillna, so that every NaN takes the next valid value to its right, then take the leftmost column:
df.fillna(method='bfill', axis=1).iloc[:, 0]
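On recent pandas releases, fillna(method=...) is deprecated (since pandas 2.1, as far as I know) in favour of calling bfill directly. A minimal sketch of the same idea, assuming the example frame from the question:

```python
import numpy as np
import pandas as pd

# Example frame from the question.
df = pd.DataFrame({
    "A": [1, np.nan, np.nan, np.nan],
    "B": [np.nan, 3, 4, np.nan],
    "C": [2, np.nan, 5, np.nan],
})

# Back-fill along each row (axis=1), then keep the leftmost column.
# A row that is entirely NaN stays NaN.
first_non_null = df.bfill(axis=1).iloc[:, 0]
print(first_non_null.tolist())
```

The resulting Series keeps the name of the leftmost column, so you may want to rename it.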
Method 2
This is a really messy way to do it: first use first_valid_index to get the first valid column for each row, convert the returned Series to a DataFrame so we can call apply row-wise, and use this to index back into the original df:
In [160]:
def func(x):
    if x.values[0] is None:
        return None
    else:
        return df.loc[x.name, x.values[0]]

pd.DataFrame(df.apply(lambda x: x.first_valid_index(), axis=1)).apply(func, axis=1)
Out[160]:
0 1
1 3
2 4
3 NaN
dtype: float64
EDIT
A slightly cleaner way:
In [12]:
def func(x):
    if x.first_valid_index() is None:
        return None
    else:
        return x[x.first_valid_index()]

df.apply(func, axis=1)
Out[12]:
0 1
1 3
2 4
3 NaN
dtype: float64
Method 3
Here is another way to do it:
In [183]: df.stack().groupby(level=0).first().reindex(df.index)
Out[183]:
0     1
1     3
2     4
3   NaN
dtype: float64
The idea here is to use stack to move the columns into a row index level:
In [184]: df.stack()
Out[184]:
0  A    1
   C    2
1  B    3
2  B    4
   C    5
dtype: float64
Now, if you group by the first row level — i.e. the original index — and take the first value from each group, you essentially get the desired result:
In [185]: df.stack().groupby(level=0).first()
Out[185]:
0    1
1    3
2    4
dtype: float64
All we need to do is reindex the result (using the original index) so as to include rows that are completely NaN:
df.stack().groupby(level=0).first().reindex(df.index)
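The three steps above can be sketched as a self-contained script, assuming the example frame from the question. Depending on the pandas version, stack may or may not drop the NaN entries; as far as I can tell the final result is the same either way, because first skips nulls and reindex restores any row that went missing.

```python
import numpy as np
import pandas as pd

# Example frame from the question.
df = pd.DataFrame({
    "A": [1, np.nan, np.nan, np.nan],
    "B": [np.nan, 3, 4, np.nan],
    "C": [2, np.nan, 5, np.nan],
})

# 1. Move the column labels into an inner row-index level.
stacked = df.stack()

# 2. Group by the original row index and take the first non-null value.
first_per_row = stacked.groupby(level=0).first()

# 3. Restore rows that may have been dropped because they were all NaN.
result = first_per_row.reindex(df.index)
print(result.tolist())
```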
Method 4
I’m going to weigh in here, as I think this is a good deal faster than any of the proposed methods. argmin gives the index of the first False value in each row of the result of np.isnan in a vectorized way, which is the hard part. It still relies on a Python loop to extract the values, but the lookup is very quick:
def get_first_non_null(df):
    a = df.values
    # column position of the first non-NaN entry in each row
    col_index = np.isnan(a).argmin(axis=1)
    return [a[row, col] for row, col in enumerate(col_index)]
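To make the argmin step concrete, here is a small sketch on the values from the question, using NumPy only. The array a and the variable names are illustrative. Note that np.isnan assumes a numeric dtype; an object-dtype frame would need a different null test.

```python
import numpy as np

# Values of the example frame from the question.
a = np.array([
    [1.0, np.nan, 2.0],
    [np.nan, 3.0, np.nan],
    [np.nan, 4.0, 5.0],
    [np.nan, np.nan, np.nan],
])

# True where the entry is NaN.
is_null = np.isnan(a)

# argmin returns the position of the first minimum; since False < True,
# that is the first non-NaN column in each row. For an all-NaN row every
# entry is True, so argmin falls back to position 0.
col_index = is_null.argmin(axis=1)
print(col_index)
```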
EDIT:
Here’s a fully vectorized solution which can be a good deal faster again, depending on the shape of the input. Updated benchmarks are below.
def get_first_non_null_vec(df):
    a = df.values
    n_rows, n_cols = a.shape
    col_index = np.isnan(a).argmin(axis=1)
    flat_index = n_cols * np.arange(n_rows) + col_index
    return a.ravel()[flat_index]
If a row is completely null then the corresponding value will be null also.
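A worked sketch of the flat-index arithmetic on the values from the question may help here. It also shows why an all-null row yields null: argmin returns 0 for that row, so the value picked is the row's first entry, which is NaN. The array a is illustrative.

```python
import numpy as np

# Values of the example frame from the question.
a = np.array([
    [1.0, np.nan, 2.0],
    [np.nan, 3.0, np.nan],
    [np.nan, 4.0, 5.0],
    [np.nan, np.nan, np.nan],
])
n_rows, n_cols = a.shape

# First non-NaN column per row (0 for the all-NaN row).
col_index = np.isnan(a).argmin(axis=1)

# Position of element (row, col) in the row-major flattened array
# is row * n_cols + col.
flat_index = n_cols * np.arange(n_rows) + col_index

first_values = a.ravel()[flat_index]
print(flat_index, first_values)
```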
Here’s some benchmarking against unutbu’s solution:
df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 1500), p=(0.01, 0.99)))
#%timeit df.stack().groupby(level=0).first().reindex(df.index)
%timeit get_first_non_null(df)
%timeit get_first_non_null_vec(df)
1 loops, best of 3: 220 ms per loop
100 loops, best of 3: 16.2 ms per loop
100 loops, best of 3: 12.6 ms per loop

In [109]:
df = pd.DataFrame(np.random.choice([1, np.nan], (100000, 150), p=(0.01, 0.99)))
#%timeit df.stack().groupby(level=0).first().reindex(df.index)
%timeit get_first_non_null(df)
%timeit get_first_non_null_vec(df)
1 loops, best of 3: 246 ms per loop
10 loops, best of 3: 48.2 ms per loop
100 loops, best of 3: 15.7 ms per loop

df = pd.DataFrame(np.random.choice([1, np.nan], (1000000, 15), p=(0.01, 0.99)))
%timeit df.stack().groupby(level=0).first().reindex(df.index)
%timeit get_first_non_null(df)
%timeit get_first_non_null_vec(df)
1 loops, best of 3: 326 ms per loop
1 loops, best of 3: 326 ms per loop
10 loops, best of 3: 35.7 ms per loop
Method 5
This is nothing new, but it’s a combination of the best bits of @yangie’s approach with a list comprehension, and @EdChum’s df.apply approach that I think is easiest to understand.
First, which columns do we want to pick our values from?
In [95]: pick_cols = df.apply(pd.Series.first_valid_index, axis=1)

In [96]: pick_cols
Out[96]:
0       A
1       B
2       B
3    None
dtype: object
Now how do we pick the values?
In [100]: [df.loc[k, v] if v is not None else None
   ....:  for k, v in pick_cols.items()]
Out[100]: [1.0, 3.0, 4.0, None]
This is ok, but we really want the index to match that of the original DataFrame:
In [98]: pd.Series({k: df.loc[k, v] if v is not None else None
   ....:            for k, v in pick_cols.items()})
Out[98]:
0 1
1 3
2 4
3 NaN
dtype: float64
Method 6
groupby in axis=1
If we pass a callable that returns the same value for every column, all columns are grouped together. This lets us use the groupby first method, which makes this easy:
df.groupby(lambda x: 'Z', 1).first()
     Z
0  1.0
1  3.0
2  4.0
3  NaN
This returns a DataFrame whose single column is named after the value returned by the callable.
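Grouping with axis=1 is deprecated in recent pandas releases (since 2.1, as far as I know), and the suggested replacement is to group the transposed frame. A minimal sketch of that variant, assuming the example frame from the question and the same constant label 'Z':

```python
import numpy as np
import pandas as pd

# Example frame from the question.
df = pd.DataFrame({
    "A": [1, np.nan, np.nan, np.nan],
    "B": [np.nan, 3, 4, np.nan],
    "C": [2, np.nan, 5, np.nan],
})

# Transpose so the original columns become rows, put every row in one
# group named "Z", take the first non-null value per (original) row,
# then transpose back.
result = df.T.groupby(lambda label: "Z").first().T
print(result)
```

The transposes copy the data, so this is a readability trade-off rather than a performance improvement.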
lookup, notna, and idxmax
df.lookup(df.index, df.notna().idxmax(1))
array([ 1.,  3.,  4., nan])
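DataFrame.lookup was deprecated and, to my knowledge, removed in pandas 2.0. A sketch of an equivalent that keeps the notna and idxmax idea but does the final pick with NumPy indexing, assuming the example frame from the question (variable names are illustrative):

```python
import numpy as np
import pandas as pd

# Example frame from the question.
df = pd.DataFrame({
    "A": [1, np.nan, np.nan, np.nan],
    "B": [np.nan, 3, 4, np.nan],
    "C": [2, np.nan, 5, np.nan],
})

# Label of the first non-null column per row. For an all-NaN row the
# mask is all False, so idxmax returns the first column label.
col_labels = df.notna().idxmax(axis=1)

# Translate labels to integer positions and pick one value per row.
col_pos = df.columns.get_indexer(col_labels)
values = df.to_numpy()[np.arange(len(df)), col_pos]
print(values)
```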
argmin and slicing
v = df.values
v[np.arange(len(df)), np.isnan(v).argmin(1)]
array([ 1.,  3.,  4., nan])
Method 7
Here is a one line solution:
[row[row.first_valid_index()] if row.first_valid_index() is not None else None for _, row in df.iterrows()]
Edit:
This solution iterates over the rows of df. row.first_valid_index() returns the label of the first non-NA/null value, which is then used as an index to get the first non-null item in each row.
If there is no non-null value in the row, row.first_valid_index() returns None, which cannot be used as an index, so an if-else expression is needed.
I packed everything into a list comprehension for brevity.
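One detail worth checking in the row-wise approach is how the None case is tested. A plain truthiness test would also treat a falsy label such as 0 as missing, so comparing against None explicitly is safer. A small sketch with a hypothetical frame df0 whose column labels are integers (the frame and the helper name are illustrative, not from the original answer):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with integer column labels, including the falsy label 0.
df0 = pd.DataFrame(
    [[np.nan, 2.0], [5.0, np.nan], [np.nan, np.nan]],
    columns=[0, 1],
)

def first_non_null(row):
    """Return the first non-null value in a row, or None if there is none."""
    label = row.first_valid_index()
    # Compare with None explicitly: label 0 is valid but falsy.
    return row.loc[label] if label is not None else None

result = [first_non_null(row) for _, row in df0.iterrows()]
print(result)
```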
Method 8
JoeCondron’s answer (EDIT: before his last edit!) is cool but there is margin for significant improvement by avoiding the non-vectorized enumeration:
def get_first_non_null_vect(df):
    a = df.values
    col_index = np.isnan(a).argmin(axis=1)
    return a[np.arange(a.shape[0]), col_index]
The improvement is small if the DataFrame is relatively flat:
In [4]: df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 1500), p=(0.01, 0.99)))

In [5]: %timeit get_first_non_null(df)
10 loops, best of 3: 34.9 ms per loop

In [6]: %timeit get_first_non_null_vect(df)
10 loops, best of 3: 31.6 ms per loop
… but can be relevant on slim DataFrames:
In [7]: df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 15), p=(0.1, 0.9)))

In [8]: %timeit get_first_non_null(df)
100 loops, best of 3: 3.75 ms per loop

In [9]: %timeit get_first_non_null_vect(df)
1000 loops, best of 3: 718 µs per loop
Compared to JoeCondron’s vectorized version, the runtime is very similar (this is still slightly quicker for slim DataFrames, and slightly slower for large ones).
Method 9
df=pandas.DataFrame({'A':[1, numpy.nan, numpy.nan, numpy.nan], 'B':[numpy.nan, 3, 4, numpy.nan], 'C':[2, numpy.nan, 5, numpy.nan]})
df
     A    B    C
0  1.0  NaN  2.0
1  NaN  3.0  NaN
2  NaN  4.0  5.0
3  NaN  NaN  NaN
df.apply(lambda x: numpy.nan if all(x.isnull()) else x[x.first_valid_index()], axis=1).tolist()
[1.0, 3.0, 4.0, nan]
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.