Ignoring NaNs with str.contains

I want to find rows that contain a string, like so:

DF[DF.col.str.contains("foo")]

However, this fails because some elements are NaN:

ValueError: cannot index with vector containing NA / NaN values

So I resort to the obfuscated

DF[DF.col.notnull()][DF.col.dropna().str.contains("foo")]

Is there a better way?

Answers:


Method 1

There’s a flag for that:

In [11]: df = pd.DataFrame([["foo1"], ["foo2"], ["bar"], [np.nan]], columns=['a'])

In [12]: df.a.str.contains("foo")
Out[12]:
0     True
1     True
2    False
3      NaN
Name: a, dtype: object

In [13]: df.a.str.contains("foo", na=False)
Out[13]:
0     True
1     True
2    False
3    False
Name: a, dtype: bool

See the str.contains docs:

na : default NaN, fill value for missing values.


So you can do the following:

In [21]: df.loc[df.a.str.contains("foo", na=False)]
Out[21]:
      a
0  foo1
1  foo2
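
As a self-contained script, the session above might look like the sketch below. Because na=False yields a plain boolean mask, the mask can also be inverted with ~ to get the non-matching rows. That inversion and the variable names (mask, matches, non_matches) are additions for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([["foo1"], ["foo2"], ["bar"], [np.nan]], columns=["a"])

# na=False makes missing values count as "no match", so the mask holds only booleans
mask = df["a"].str.contains("foo", na=False)

matches = df.loc[mask]       # rows whose value contains "foo"
non_matches = df.loc[~mask]  # everything else, including the NaN row

print(matches)
print(non_matches)
```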

Method 2

In addition to the answers above: for column names that are not a single word, and so cannot be accessed as an attribute, use bracket notation:

df[df['Product ID'].str.contains("foo") == True]

Hope this helps.
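
A small sketch of the bracket form, assuming a column named 'Product ID' as in the answer. The sample values are invented, and na=False is used here as an alternative to the == True comparison.

```python
import numpy as np
import pandas as pd

# "Product ID" is the column name from the answer; the values are made up
df = pd.DataFrame({"Product ID": ["foo-1", "bar-2", np.nan, "foo-3"]})

# A name with a space cannot be written as df.<name>, so index with brackets
result = df[df["Product ID"].str.contains("foo", na=False)]
print(result)
```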

Method 3

df[df.col.str.contains("foo").fillna(False)]

Method 4

I’m not 100% on why (actually came here to search for the answer), but this also works, and doesn’t require replacing all nan values.

import pandas as pd
import numpy as np

df = pd.DataFrame([["foo1"], ["foo2"], ["bar"], [np.nan]], columns=['a'])

newdf = df.loc[df['a'].str.contains('foo') == True]

Works with or without .loc.

I have no idea why this works. As I understand it, when you’re indexing with brackets, pandas evaluates whatever’s inside the brackets as either True or False. I can’t tell why making the expression inside the brackets ‘extra boolean’ has any effect at all.
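
A likely explanation: == True is an elementwise comparison, and NaN compared with True is False. The comparison therefore turns a True/False/NaN result into a purely boolean mask, which pandas accepts as an indexer. A minimal sketch using the same frame (the names raw and mask are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([["foo1"], ["foo2"], ["bar"], [np.nan]], columns=["a"])

# Without na=..., the NaN row may come back as NaN rather than True/False
raw = df["a"].str.contains("foo")

# Elementwise comparison: NaN == True evaluates to False, so no NaN survives
mask = raw == True  # noqa: E712 (the explicit comparison is the point here)

print(mask.tolist())
newdf = df.loc[mask]
print(newdf)
```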

Method 5

You can also use the query method to filter a DataFrame with a boolean expression over its columns, as follows:

df.query('a.str.contains("foo", na=False)')

Note you might not get performance improvement, but it is more readable (arguably).
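
A runnable sketch of the query form. Passing engine="python" is an assumption made here for safety, since string accessor methods inside query() may not be handled by every engine; the original answer does not pass it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([["foo1"], ["foo2"], ["bar"], [np.nan]], columns=["a"])

# engine="python" is a cautious choice, not something the answer requires
result = df.query('a.str.contains("foo", na=False)', engine="python")
print(result)
```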

Method 6

You can also pass a regular-expression pattern. Note that na=False is still needed to skip the NaN rows:

DF[DF.col.str.contains(pat='(foo)', regex=True, na=False)]
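
A regular expression is mostly useful when several alternatives should match. A sketch, assuming a hypothetical pattern foo|baz and invented sample data, with na=False so that the NaN row counts as no match:

```python
import numpy as np
import pandas as pd

# Sample data is made up for illustration
DF = pd.DataFrame({"col": ["foo1", "bar", np.nan, "baz2"]})

# Keeps rows containing either "foo" or "baz"; the NaN row is dropped
result = DF[DF.col.str.contains(r"foo|baz", regex=True, na=False)]
print(result)
```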


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
