I’ve done some searching and can’t figure out how to filter a dataframe by
df["col"].str.contains(word)
however I’m wondering if there is a way to do the reverse: filter a dataframe by that set’s compliment. eg: to the effect of
!(df["col"].str.contains(word))
Can this be done through a DataFrame method?
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
You can use the invert (~) operator (which acts like a not for boolean data):
new_df = df[~df["col"].str.contains(word)]
where new_df is the copy returned by RHS.
contains also accepts a regular expression…
If the above throws a ValueError, the reason is likely because you have mixed datatypes, so use na=False:
new_df = df[~df["col"].str.contains(word, na=False)]
Or,
new_df = df[df["col"].str.contains(word) == False]
Method 2
I was having trouble with the not (~) symbol as well, so here’s another way from another StackOverflow thread:
df[df["col"].str.contains('this|that')==False]
Method 3
You can use Apply and Lambda :
df[df["col"].apply(lambda x: word not in x)]
Or if you want to define more complex rule, you can use AND:
df[df["col"].apply(lambda x: word_1 not in x and word_2 not in x)]
Method 4
I hope the answers are already posted
I am adding the framework to find multiple words and negate those from dataFrame.
Here 'word1','word2','word3','word4' = list of patterns to search
df = DataFrame
column_a = A column name from DataFrame df
values_to_remove = ['word1','word2','word3','word4'] pattern = '|'.join(values_to_remove) result = df.loc[~df['column_a'].str.contains(pattern, case=False)]
Method 5
I had to get rid of the NULL values before using the command recommended by Andy above. An example:
df = pd.DataFrame(index = [0, 1, 2], columns=['first', 'second', 'third'])
df.ix[:, 'first'] = 'myword'
df.ix[0, 'second'] = 'myword'
df.ix[2, 'second'] = 'myword'
df.ix[1, 'third'] = 'myword'
df
first second third
0 myword myword NaN
1 myword NaN myword
2 myword myword NaN
Now running the command:
~df["second"].str.contains(word)
I get the following error:
TypeError: bad operand type for unary ~: 'float'
I got rid of the NULL values using dropna() or fillna() first and retried the command with no problem.
Method 6
Additional to nanselm2’s answer, you can use 0 instead of False:
df["col"].str.contains(word)==0
Method 7
To negate your query use ~. Using query has the advantage of returning the valid observations of df directly:
df.query('~col.str.contains("word").values')
Method 8
To compliment to the above question, if someone wants to remove all the rows with strings, one could do:
df_new=df[~df['col_name'].apply(lambda x: isinstance(x, str))]
Method 9
somehow ‘.contains’ didn’t work for me but when I tried with ‘.isin’ as mentioned by @kenan in the answer (How to drop rows from pandas data frame that contains a particular string in a particular column?) it works. Adding further, if you want to look at the entire dataframe and remove those rows which has the specific word (or set of words) just use the loop below
for col in df.columns:
df = df[~df[col].isin(['string or string list separeted by comma'])]
just remove ~ to get the dataframe that contains the word
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0