Pandas mask with composite expression behaviour

This question was previously asked (and then deleted) by another user. I was working on a solution so that I could post an answer when the question disappeared. I also can't make sense of pandas' behaviour, so I would appreciate some clarity. The original question was something along the lines of:

How can I replace every negative value except those in a given list with NaN in a Pandas dataframe?

My setup to reproduce the scenario is the following:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A' : [x for x in range(4)],
    'B' : [x for x in range(-2, 2)]
})

This should technically only be a matter of passing the correct boolean expression to DataFrame.where. My attempted solution looks like:

df[df >= 0 | df.isin([-2])] 

which produces:

index A B
0 0 NaN
1 1 NaN
2 2 0
3 3 1

which also replaces the number in the list (-2) with NaN!

Moreover, if I mask the dataframe with each of the two conditions separately, I get the correct behaviour:

With df[df >= 0] (identical to the compound result):

index A B
0 0 NaN
1 1 NaN
2 2 0
3 3 1

With df[df.isin([-2])] (only the listed value is kept):

index A B
0 NaN -2.0
1 NaN NaN
2 NaN NaN
3 NaN NaN

So it seems like I am either

  1. running into some undefined behaviour as a result of performing logic on NaN values, or
  2. getting something wrong myself.

Can anyone clarify this situation for me?

Answers:

Method 1

Solution

df[(df >= 0) | (df.isin([-2]))]

Explanation

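In Python, bitwise OR, |, has a higher operator precedence than comparison operators like >=: https://docs.python.org/3/reference/expressions.html#operator-precedence

The attempted expression df >= 0 | df.isin([-2]) is therefore parsed as df >= (0 | df.isin([-2])), not as (df >= 0) | df.isin([-2]). The mask is computed before any NaN is introduced, so no logic on NaN values is involved.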
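The precedence rule can be checked without pandas. The following is a minimal sketch using plain scalars; the names value and in_list are illustrative stand-ins for one cell of the dataframe and for the result of isin on that cell, not anything from the original question.

```python
# Plain-Python illustration of the precedence rule (scalars only, no pandas).
value = -2       # stands in for a single cell of the dataframe
in_list = True   # stands in for "this cell is in the list of values to keep"

# Without parentheses, | binds tighter than >=.
unparenthesized = value >= 0 | in_list    # parsed as value >= (0 | in_list)
explicit_wrong = value >= (0 | in_list)   # 0 | True is 1, so this is -2 >= 1
parenthesized = (value >= 0) | in_list    # False | True

print(unparenthesized, explicit_wrong, parenthesized)  # False False True
```

The first two expressions agree, which shows how the unparenthesized form is grouped, and only the parenthesized form keeps the listed value.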

When filtering a pandas DataFrame on multiple boolean conditions, you need to enclose each condition in parentheses. More from the boolean indexing section of the pandas user guide:

Another common operation is the use of boolean vectors to filter the
data. The operators are: | for or, & for and, and ~ for not. These
must be grouped by using parentheses, since by default Python will
evaluate an expression such as df['A'] > 2 & df['B'] < 3 as df['A'] > (2 & df['B']) < 3, while the desired evaluation order is (df['A'] > 2) & (df['B'] < 3).
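
Applied to the setup from the question, the parenthesized mask can be sketched as follows. The variable names allowed, keep, masked and same_via_where are illustrative and not part of the original post.

```python
import pandas as pd

df = pd.DataFrame({
    'A': list(range(4)),
    'B': list(range(-2, 2)),
})

allowed = [-2]  # negative values that should survive the mask

# Each condition is parenthesized before combining with |.
keep = (df >= 0) | df.isin(allowed)

# Indexing with a boolean DataFrame keeps cells where the mask is True
# and fills the rest with NaN; DataFrame.where does the same explicitly.
masked = df[keep]
same_via_where = df.where(keep)

print(masked)
```

Here only the -1 in column B is replaced with NaN, while the -2 is kept.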


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
