Modifying a subset of rows in a pandas dataframe

Assume I have a pandas DataFrame with two columns, A and B. I’d like to modify this DataFrame (or create a copy) so that B is always NaN whenever A is 0. How would I achieve that?

I tried the following

df['A'==0]['B'] = np.nan

and

df['A'==0]['B'].values.fill(np.nan)

without success.

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

Use .loc for label based indexing:

df.loc[df.A==0, 'B'] = np.nan

The df.A==0 expression creates a boolean series that indexes the rows, 'B' selects the column. You can also use this to transform a subset of a column, e.g.:

df.loc[df.A==0, 'B'] = df.loc[df.A==0, 'B'] / 2

I don’t know enough about pandas internals to know exactly why that works, but the basic issue is that sometimes indexing into a DataFrame returns a copy of the result, and sometimes it returns a view on the original object. According to documentation here, this behavior depends on the underlying numpy behavior. I’ve found that accessing everything in one operation (rather than [one][two]) is more likely to work for setting.

Method 2

Here is from pandas docs on advanced indexing:

The section will explain exactly what you need! Turns out df.loc (as .ix has been deprecated — as many have pointed out below) can be used for cool slicing/dicing of a dataframe. And. It can also be used to set things.

df.loc[selection criteria, columns I want] = value

So Bren’s answer is saying ‘find me all the places where df.A == 0, select column B and set it to np.nan

Method 3

Starting from pandas 0.20 ix is deprecated. The right way is to use df.loc

here is a working example

>>> import pandas as pd 
>>> import numpy as np 
>>> df = pd.DataFrame({"A":[0,1,0], "B":[2,0,5]}, columns=list('AB'))
>>> df.loc[df.A == 0, 'B'] = np.nan
>>> df
   A   B
0  0 NaN
1  1   0
2  0 NaN
>>>

Explanation:

As explained in the doc here, .loc is primarily label based, but may also be used with a boolean array.

So, what we are doing above is applying df.loc[row_index, column_index] by:

  • Exploiting the fact that loc can take a boolean array as a mask that tells pandas which subset of rows we want to change in row_index
  • Exploiting the fact loc is also label based to select the column using the label 'B' in the column_index

We can use logical, condition or any operation that returns a series of booleans to construct the array of booleans. In the above example, we want any rows that contain a 0, for that we can use df.A == 0, as you can see in the example below, this returns a series of booleans.

>>> df = pd.DataFrame({"A":[0,1,0], "B":[2,0,5]}, columns=list('AB'))
>>> df 
   A  B
0  0  2
1  1  0
2  0  5
>>> df.A == 0 
0     True
1    False
2     True
Name: A, dtype: bool
>>>

Then, we use the above array of booleans to select and modify the necessary rows:

>>> df.loc[df.A == 0, 'B'] = np.nan
>>> df
   A   B
0  0 NaN
1  1   0
2  0 NaN

For more information check the advanced indexing documentation here.

Method 4

For a massive speed increase, use NumPy’s where function.

Setup

Create a two-column DataFrame with 100,000 rows with some zeros.

df = pd.DataFrame(np.random.randint(0,3, (100000,2)), columns=list('ab'))

Fast solution with numpy.where

df['b'] = np.where(df.a.values == 0, np.nan, df.b.values)

Timings

%timeit df['b'] = np.where(df.a.values == 0, np.nan, df.b.values)
685 µs ± 6.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df.loc[df['a'] == 0, 'b'] = np.nan
3.11 ms ± 17.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Numpy’s where is about 4x faster

Method 5

To replace multiples columns convert to numpy array using .values:

df.loc[df.A==0, ['B', 'C']] = df.loc[df.A==0, ['B', 'C']].values / 2

Method 6

To modify a DataFrame in Pandas you can use “syntactic sugar” operators like +=, *=, /= etc. So instead of:

df.loc[df.A == 0, 'B'] = df.loc[df.A == 0, 'B'] / 2

You can write:

df.loc[df.A == 0, 'B'] /= 2

To replace values with NaN you can use Pandas methods mask or where. For example:

df  = pd.DataFrame({'A': [1, 2, 3], 'B': [0, 0, 4]})

   A  B
0  1  0
1  2  0
2  3  4

df['A'].mask(df['B'] == 0, inplace=True) # other=np.nan by default
# df['A'].where(df['B'] != 0, inplace=True)

Result:

     A  B
0  NaN  0
1  NaN  0
2  3.0  4

Method 7

Alternatives:

no 1 looks best to me, but oddly I can’t find the supporting documentation for it

  1. Filter column as series (note: filter comes after column being written to, not before)

dataframe.column[filter condition]=values to change to

df.B[df.A==0] = np.nan
  1. loc

dataframe.loc[filter condition, column to change]=values to change to

df.loc[df.A == 0, 'B'] = np.nan
  1. numpy where

dataframe.column=np.where(filter condition, values if true, values if false)

import numpy as np
df.B = np.where(df.A== 0, np.nan, df.B)
  1. apply lambda

dataframe.column=df.apply(lambda row: value if condition true else value if false, use rows not columns)

df.B = df.apply(lambda x: np.nan if x['A']==0 else x['B'],axis=1)
  1. zip and list syntax

dataframe.column=[valuse if condition is true else value if false for elements a,b in list from zip function of columns a and b]

df.B = [np.nan if a==0 else b for a,b in zip(df.A,df.B)]


All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes
Article Rating
Subscribe
Notify of
guest

0 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
0
Would love your thoughts, please comment.x
()
x