pandas dataframe str.contains() AND operation

I have a df (Pandas Dataframe) with three rows:

some_col_name
"apple is delicious"
"banana is delicious"
"apple and banana both are delicious"

The function df.col_name.str.contains("apple|banana") will catch all of the rows:

"apple is delicious",
"banana is delicious",
"apple and banana both are delicious".

How do I apply an AND operator to the str.contains() method, so that it only grabs strings that contain BOTH “apple” and “banana”?

"apple and banana both are delicious"

I’d like to grab strings that contain 10-20 different words (grape, watermelon, berry, orange, …).

Answers:


Method 1

You can do that as follows:

df[(df['col_name'].str.contains('apple')) & (df['col_name'].str.contains('banana'))]
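One caveat, as a rough sketch: if the column holds missing values, str.contains() returns a missing value for those rows rather than True or False, which can get in the way when the mask is used for indexing. The na=False argument of str.contains() treats them as non-matches. The DataFrame below, including its None row, is made up for illustration.

```python
import pandas as pd

# Illustrative data (an assumption); the None row stands in for a missing value.
df = pd.DataFrame({'col_name': ["apple is delicious",
                                "banana is delicious",
                                None,
                                "apple and banana both are delicious"]})

# na=False makes missing values count as "no match", so both masks are purely boolean.
has_apple = df['col_name'].str.contains('apple', na=False)
has_banana = df['col_name'].str.contains('banana', na=False)

# Element-wise AND of the two masks keeps rows that contain both words.
both = df[has_apple & has_banana]
print(both)
```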

Method 2

You can also do it with a regular expression that uses lookaheads:

df[df['col_name'].str.contains(r'^(?=.*apple)(?=.*banana)')]

You can then build your list of words into a regex string like so:

base = r'^{}'
expr = '(?=.*{})'
words = ['apple', 'banana', 'cat']  # example
base.format(''.join(expr.format(w) for w in words))

will render:

'^(?=.*apple)(?=.*banana)(?=.*cat)'

You can then build the pattern dynamically from any list of words.
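If the words might contain regex metacharacters (for example "c++" or "a.b"), it is probably worth escaping them first. Here is a minimal sketch of that idea; build_all_words_pattern is a hypothetical helper name, not part of pandas, and the sample data is assumed.

```python
import re
import pandas as pd

def build_all_words_pattern(words):
    """Return a regex that matches strings containing every word (hypothetical helper)."""
    # re.escape keeps characters such as '+' or '.' from being read as regex syntax.
    return '^' + ''.join('(?=.*{})'.format(re.escape(w)) for w in words)

pattern = build_all_words_pattern(['apple', 'banana'])
print(pattern)  # ^(?=.*apple)(?=.*banana)

# Sample data, assumed for illustration.
df = pd.DataFrame({'col_name': ["apple is delicious",
                                "banana is delicious",
                                "apple and banana both are delicious"]})

matches = df[df['col_name'].str.contains(pattern)]
print(matches)
```

Note that "." does not match a newline by default, so this sketch assumes single-line strings.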

Method 3

df = pd.DataFrame({'col': ["apple is delicious",
                           "banana is delicious",
                           "apple and banana both are delicious"]})

targets = ['apple', 'banana']

# Any word from `targets` is present in the sentence.
>>> df.col.apply(lambda sentence: any(word in sentence for word in targets))
0    True
1    True
2    True
Name: col, dtype: bool

# All words from `targets` are present in the sentence.
>>> df.col.apply(lambda sentence: all(word in sentence for word in targets))
0    False
1    False
2     True
Name: col, dtype: bool
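Keep in mind that `word in sentence` is a substring test, so "pineapple" counts as containing "apple". If whole words matter, something along these lines might help. This is only a sketch: has_all_words is a hypothetical helper, the data is made up, and it assumes the target words are given in lowercase.

```python
import re
import pandas as pd

# Made-up data to show the difference between substring and whole-word matching.
df = pd.DataFrame({'col': ["pineapple and banana smoothie",
                           "apple and banana both are delicious"]})
targets = ['apple', 'banana']

def has_all_words(sentence, words):
    """True if every word appears as a whole token in the sentence (hypothetical helper)."""
    # Split the lowercased sentence into alphanumeric tokens.
    tokens = set(re.findall(r'\w+', sentence.lower()))
    return set(words) <= tokens

# Substring test: "pineapple" satisfies the check for "apple".
substring_mask = df['col'].apply(lambda s: all(w in s for w in targets))

# Whole-word test: only the second row has "apple" as its own token.
word_mask = df['col'].apply(lambda s: has_all_words(s, targets))

print(substring_mask.tolist())
print(word_mask.tolist())
```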

Method 4

This works without the leading ^ anchor:

df.col.str.contains(r'(?=.*apple)(?=.*banana)',regex=True)

Method 5

If you only want to use native methods and avoid writing regexps, here is a vectorized version with no lambdas involved:

import numpy as np

targets = ['apple', 'banana', 'strawberry']
# Build one boolean mask per target; np.vstack needs a list, not a generator.
fruit_masks = [df['col'].str.contains(string) for string in targets]
combined_mask = np.vstack(fruit_masks).all(axis=0)
df[combined_mask]
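A variation on the same idea, in case it is useful: np.logical_and.reduce can AND a list of boolean arrays without stacking them explicitly. The sketch below assumes a small sample DataFrame and passes regex=False so the targets are treated as plain substrings.

```python
import numpy as np
import pandas as pd

# Sample data, assumed for illustration.
df = pd.DataFrame({'col': ["apple is delicious",
                           "banana is delicious",
                           "apple and banana both are delicious",
                           "i love apple, banana, and strawberry"]})
targets = ['apple', 'banana', 'strawberry']

# One boolean NumPy array per target word; regex=False requests plain substring matching.
masks = [df['col'].str.contains(word, regex=False).to_numpy() for word in targets]

# Element-wise AND across all masks (reduces along the first axis).
combined_mask = np.logical_and.reduce(masks)

result = df[combined_mask]
print(result)
```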

Method 6

Try this regex

apple.*banana|banana.*apple

Code is:

import pandas as pd

df = pd.DataFrame([[1,"apple is delicious"],[2,"banana is delicious"],[3,"apple and banana both are delicious"]],columns=('ID','String_Col'))

print(df[df['String_Col'].str.contains(r'apple.*banana|banana.*apple')])

Output

   ID                           String_Col
2   3  apple and banana both are delicious
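For more than two words, this kind of pattern needs one alternative per ordering of the words. A quick sketch of generating it with itertools.permutations shows how fast that grows (n! alternatives for n words); any_order_pattern is a hypothetical helper name and "cherry" is just an example word.

```python
import re
from itertools import permutations

def any_order_pattern(words):
    """Join every ordering of the words with '.*' (hypothetical helper)."""
    return '|'.join('.*'.join(re.escape(w) for w in order)
                    for order in permutations(words))

two = any_order_pattern(['apple', 'banana'])
print(two)  # apple.*banana|banana.*apple

# Three words already need 3! = 6 alternatives.
three = any_order_pattern(['apple', 'banana', 'cherry'])
print(len(three.split('|')))  # 6
```

With 10-20 words the pattern becomes impractically large, so the lookahead or mask-based methods are likely a better fit there.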

Method 7

If you want to catch sentences with at least two of the words, maybe this will work (taking the tip from @Alexander). It checks that any target word is present along with a connector such as 'and':

target=['apple','banana','grapes','orange']
connector_list=['and']
df[df.col.apply(lambda sentence: (any(word in sentence for word in target)) & (all(connector in sentence for connector in connector_list)))]

output:

                                   col
2  apple and banana both are delicious

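If you have more than two words to catch and they are separated by a comma ‘,’, then add it to the connector_list and change the second condition from all to any (the output below assumes df also has a fourth row, "orange,banana and apple all are delicious"):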

df[df.col.apply(lambda sentence: (any(word in sentence for word in target)) & (any(connector in sentence for connector in connector_list)))]

output:

                                        col
2        apple and banana both are delicious
3  orange,banana and apple all are delicious
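Another way to express "at least two of the target words", without relying on connector words, would be to count the hits per sentence. This is only a sketch of that alternative, not the approach above; the fourth row is taken from the output shown, and matching is by substring.

```python
import pandas as pd

df = pd.DataFrame({'col': ["apple is delicious",
                           "banana is delicious",
                           "apple and banana both are delicious",
                           "orange,banana and apple all are delicious"]})
target = ['apple', 'banana', 'grapes', 'orange']

# Count how many target words occur in each sentence (substring matching).
hits = df['col'].apply(lambda sentence: sum(word in sentence for word in target))

# Keep the rows where at least two different target words were found.
at_least_two = df[hits >= 2]
print(at_least_two)
```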

Method 8

Enumerating all possibilities for large lists is cumbersome. A better way is to use reduce() and the bitwise AND operator (&).

For example, consider the following DataFrame:

df = pd.DataFrame({'col': ["apple is delicious",
                       "banana is delicious",
                       "apple and banana both are delicious",
                       "i love apple, banana, and strawberry"]})

#                                    col
#0                    apple is delicious
#1                   banana is delicious
#2   apple and banana both are delicious
#3  i love apple, banana, and strawberry

Suppose we wanted to search for all of the following:

targets = ['apple', 'banana', 'strawberry']

We can do:

from functools import reduce  # reduce() is no longer a builtin in Python 3
print(df[reduce(lambda a, b: a&b, (df['col'].str.contains(s) for s in targets))])

#                                    col
#3  i love apple, banana, and strawberry
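If this is going to be reused, the reduce() call could be wrapped in a small function. The sketch below is one possible shape: contains_all is a hypothetical name, operator.and_ stands in for the lambda, and an all-True starting mask is supplied so that an empty target list does not make reduce() raise.

```python
from functools import reduce
import operator
import pandas as pd

def contains_all(series, targets):
    """Boolean mask: True where the string contains every target (hypothetical helper)."""
    # Starting from an all-True mask keeps reduce() from failing on an empty target list.
    start = pd.Series(True, index=series.index)
    masks = (series.str.contains(t, regex=False) for t in targets)
    return reduce(operator.and_, masks, start)

df = pd.DataFrame({'col': ["apple is delicious",
                           "banana is delicious",
                           "apple and banana both are delicious",
                           "i love apple, banana, and strawberry"]})

mask = contains_all(df['col'], ['apple', 'banana', 'strawberry'])
print(df[mask])
```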

Method 9

You can create masks:

apple_mask = df.colname.str.contains('apple')
banana_mask = df.colname.str.contains('banana')
df = df[apple_mask & banana_mask]

Method 10

From @Anzel’s answer, I wrote a function since I’m going to be applying this a lot:

def regify(words, base=r'^{}', expr='(?=.*{})'):
    return base.format(''.join(expr.format(w) for w in words))

So if you have words defined:

words = ['apple', 'banana']

And then call it with something like:

dg = df.loc[
    df['col_name'].str.contains(regify(words), case=False, regex=True)
]

then you should get what you’re after.
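To see what case=False changes here, a small self-contained run might look like the following. The regify function is repeated (without the str() wrappers) so the snippet runs on its own, and the mixed-case data is made up for illustration.

```python
import pandas as pd

def regify(words, base='^{}', expr='(?=.*{})'):
    # One lookahead per word, anchored at the start of the string.
    return base.format(''.join(expr.format(w) for w in words))

# Made-up data with mixed capitalization.
df = pd.DataFrame({'col_name': ["Apple is delicious",
                                "BANANA and apple both are delicious"]})

pattern = regify(['apple', 'banana'])

# Case-sensitive: "BANANA" does not count as "banana", so nothing matches.
case_sensitive = df['col_name'].str.contains(pattern, regex=True)

# Case-insensitive: the second row contains both words.
case_insensitive = df['col_name'].str.contains(pattern, case=False, regex=True)

print(case_sensitive.tolist())
print(case_insensitive.tolist())
```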


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
