I want to find rows that contain a string, like so:
DF[DF.col.str.contains("foo")]
However, this fails because some elements are NaN:
ValueError: cannot index with vector containing NA / NaN values
So I resort to the obfuscated
DF[DF.col.notnull()][DF.col.dropna().str.contains("foo")]
Is there a better way?
Answers:
Method 1
There’s a flag for that:
In [11]: df = pd.DataFrame([["foo1"], ["foo2"], ["bar"], [np.nan]], columns=['a'])
In [12]: df.a.str.contains("foo")
Out[12]:
0 True
1 True
2 False
3 NaN
Name: a, dtype: object
In [13]: df.a.str.contains("foo", na=False)
Out[13]:
0 True
1 True
2 False
3 False
Name: a, dtype: bool
See the str.contains docs:
na : default NaN, fill value for missing values.
So you can do the following:
In [21]: df.loc[df.a.str.contains("foo", na=False)]
Out[21]:
a
0 foo1
1 foo2
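For anyone who wants this as a plain script rather than an IPython session, here is a minimal sketch of the same idea. The helper name rows_containing is made up for illustration, and passing regex=False is my assumption that you want a literal substring match rather than a regular expression:

```python
import numpy as np
import pandas as pd


def rows_containing(frame, column, text):
    """Return the rows of `frame` whose `column` contains `text`.

    Hypothetical helper, not part of pandas. Missing values count as
    "no match" because of na=False.
    """
    # regex=False treats `text` as a literal substring (an assumption
    # made here; drop it to get the default regex behaviour).
    mask = frame[column].str.contains(text, na=False, regex=False)
    return frame[mask]


df = pd.DataFrame({"a": ["foo1", "foo2", "bar", np.nan]})
result = rows_containing(df, "a", "foo")
print(result)
```

The mask contains only True and False, so it can be passed straight to df[...] or df.loc[...] without the NaN error from the question.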
Method 2
In addition to the above answers: for column names that are not a single word (for example, names containing a space), attribute access does not work, so use bracket access instead:
df[df['Product ID'].str.contains("foo") == True]
Hope this helps.
Method 3
df[df.col.str.contains("foo").fillna(False)]
Method 4
I’m not 100% on why (actually came here to search for the answer), but this also works, and doesn’t require replacing all nan values.
import pandas as pd
import numpy as np
df = pd.DataFrame([["foo1"], ["foo2"], ["bar"], [np.nan]], columns=['a'])
newdf = df.loc[df['a'].str.contains('foo') == True]
Works with or without .loc.
I have no idea why this works, as I understand it when you’re indexing with brackets pandas evaluates whatever’s inside the bracket as either True or False. I can’t tell why making the phrase inside the brackets ‘extra boolean’ has any effect at all.
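A likely explanation, sketched below: on an object column with missing values, str.contains returns a mix of True, False and NaN, and the == comparison is applied element by element. NaN == True evaluates to False, so the comparison yields a pure boolean Series that is safe to index with. The mask here is built by hand (my assumption, to keep the example independent of the pandas version and string dtype in use):

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for what str.contains("foo") returns on an
# object column that has one missing value.
mask = pd.Series([True, True, False, np.nan], dtype=object)

# Elementwise comparison: NaN == True is False, so the NaN is gone.
strict = mask == True  # noqa: E712

print(strict.tolist())  # [True, True, False, False]
print(strict.dtype)     # bool
```

So the comparison is not making the mask "extra boolean"; it is what replaces the NaN with False, much like na=False or .fillna(False) do.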
Method 5
You can also use the query method to filter a DataFrame with a boolean expression, as follows:
df.query('a.str.contains("foo", na=False)')
Note that you might not get a performance improvement, but it is (arguably) more readable.
Method 6
You can also use a regex pattern (na=False is still needed, otherwise the NaN error from the question comes back):
DF[DF.col.str.contains(pat='(foo)', regex=True, na=False)]
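Because the pattern is interpreted as a regular expression by default, one call can match several alternatives. The following is only a sketch with made-up sample data; the column name a and the values are assumptions, not from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": ["foo1", "foo2", "bar", "baz", np.nan]})

# "foo|bar" matches any value containing either "foo" or "bar".
# na=False makes the missing value count as "no match".
mask = df["a"].str.contains(r"foo|bar", regex=True, na=False)
matches = df[mask]
print(matches)
```

No capture group is used here, since str.contains only reports whether the pattern matched and does not return the groups.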
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0.