I have a df (Pandas Dataframe) with three rows:
some_col_name
"apple is delicious"
"banana is delicious"
"apple and banana both are delicious"
The function df.col_name.str.contains("apple|banana") will catch all of the rows:
"apple is delicious", "banana is delicious", "apple and banana both are delicious".
How do I apply an AND operator to the str.contains() method, so that it only grabs strings that contain BOTH “apple” and “banana”?
"apple and banana both are delicious"
I’d like to grab strings that contain 10-20 different words (grape, watermelon, berry, orange, …, etc.).
Answers:
Method 1
You can do that as follows:
df[(df['col_name'].str.contains('apple')) & (df['col_name'].str.contains('banana'))]
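One caveat with combining masks this way: on many pandas versions, str.contains() returns NaN for missing values, and a mask containing NaN cannot be used for boolean indexing. A minimal sketch, assuming a hypothetical frame that adds one missing entry to the question's rows, which passes na=False so that missing rows count as non-matches:

```python
import pandas as pd

# Hypothetical data: the question's three rows plus one missing value.
df = pd.DataFrame({'col_name': ["apple is delicious",
                                "banana is delicious",
                                "apple and banana both are delicious",
                                None]})

# na=False reports missing values as False instead of NaN,
# so both masks can be combined and used for indexing.
has_apple = df['col_name'].str.contains('apple', na=False)
has_banana = df['col_name'].str.contains('banana', na=False)

both = df[has_apple & has_banana]
print(both)
```

If the column is known to have no missing values, the na argument can be left out.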
Method 2
You can also do it with a regular expression that uses one lookahead per word:
df[df['col_name'].str.contains(r'^(?=.*apple)(?=.*banana)')]
You can then build your list of words into a regex string like so:
base = r'^{}'
expr = '(?=.*{})'
words = ['apple', 'banana', 'cat'] # example
base.format(''.join(expr.format(w) for w in words))
will render:
'^(?=.*apple)(?=.*banana)(?=.*cat)'
You can then pass the generated pattern to str.contains() for any list of words.
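To tie the pieces of this method together, here is a self-contained sketch that builds the lookahead pattern and applies it to the question's data. The use of re.escape is an addition not present in the answer above; it is there on the assumption that some search terms might contain regex metacharacters such as '.' or '+'.

```python
import re

import pandas as pd

df = pd.DataFrame({'col_name': ["apple is delicious",
                                "banana is delicious",
                                "apple and banana both are delicious"]})

words = ['apple', 'banana']

# One lookahead per word; re.escape (an assumption, see above) keeps any
# regex metacharacters inside a search term literal.
pattern = '^' + ''.join('(?=.*{})'.format(re.escape(w)) for w in words)

matches = df[df['col_name'].str.contains(pattern, regex=True)]
print(pattern)
print(matches)
```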
Method 3
df = pd.DataFrame({'col': ["apple is delicious",
"banana is delicious",
"apple and banana both are delicious"]})
targets = ['apple', 'banana']
# At least one word from `targets` is present in the sentence.
>>> df.col.apply(lambda sentence: any(word in sentence for word in targets))
0 True
1 True
2 True
Name: col, dtype: bool
# All words from `targets` are present in sentence.
>>> df.col.apply(lambda sentence: all(word in sentence for word in targets))
0 False
1 False
2 True
Name: col, dtype: bool
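Note that `word in sentence` is a substring test, so 'apple' also matches 'pineapple'. If whole-word matches are wanted, one possible sketch is to split each sentence into tokens first. The sample sentences here are hypothetical, and punctuation handling is deliberately left out.

```python
import pandas as pd

# Hypothetical data chosen to show the difference between the two tests.
df = pd.DataFrame({'col': ["pineapple and banana are delicious",
                           "apple and banana both are delicious"]})
targets = {'apple', 'banana'}

# Substring test, as in the method above: 'apple' is found inside 'pineapple'.
substring = df['col'].apply(lambda s: all(w in s for w in targets))

# Whole-word test: every target must appear as a complete
# whitespace-separated token.
whole_word = df['col'].apply(lambda s: targets.issubset(s.split()))

print(substring.tolist())
print(whole_word.tolist())
```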
Method 4
This also works; str.contains() searches anywhere in the string, so the leading ^ anchor is optional:
df.col.str.contains(r'(?=.*apple)(?=.*banana)',regex=True)
Method 5
If you only want to use native methods and avoid writing regexps, here is a vectorized version with no lambdas involved:
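import numpy as np

targets = ['apple', 'banana', 'strawberry']
# One boolean mask per target; a list is used because np.vstack expects a sequence, not a generator.
fruit_masks = [df['col'].str.contains(string) for string in targets]
# Stack the masks as rows and keep the positions where every mask is True.
combined_mask = np.vstack(fruit_masks).all(axis=0)
df[combined_mask]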
Method 6
Try this regex
apple.*banana|banana.*apple
Code is:
import pandas as pd
df = pd.DataFrame([[1,"apple is delicious"],[2,"banana is delicious"],[3,"apple and banana both are delicious"]],columns=('ID','String_Col'))
print(df[df['String_Col'].str.contains(r'apple.*banana|banana.*apple')])
Output
   ID                           String_Col
2   3  apple and banana both are delicious
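The alternation needs one branch per ordering of the words, so n words require n! branches. A small sketch, using a hypothetical helper named ordered_alternation, that generates the pattern and shows how quickly it grows:

```python
import re
from itertools import permutations


def ordered_alternation(words):
    """Build an 'a.*b|b.*a'-style pattern covering every ordering of words."""
    escaped = [re.escape(w) for w in words]
    return '|'.join('.*'.join(order) for order in permutations(escaped))


two = ordered_alternation(['apple', 'banana'])
three = ordered_alternation(['apple', 'banana', 'grape'])

print(two)                    # the pattern used in this method
print(len(three.split('|')))  # number of branches for three words
```

For the 10-20 words mentioned in the question, this pattern becomes impractically large, which is why the lookahead or mask-based methods scale better.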
Method 7
If you want to catch sentences that mention at least two of the words, this may work (taking the tip from @Alexander). It relies on the heuristic that such sentences join the words with a connector such as 'and':
target = ['apple', 'banana', 'grapes', 'orange']
connector_list = ['and']
df[df.col.apply(lambda sentence: (any(word in sentence for word in target)) & (all(connector in sentence for connector in connector_list)))]
output:
                                   col
2  apple and banana both are delicious
If you have more than two words to catch and they are separated by a comma (','), then add the comma to connector_list and change the second condition from all to any:
df[df.col.apply(lambda sentence: (any(word in sentence for word in target)) & (any(connector in sentence for connector in connector_list)))]
output:
                                         col
2        apple and banana both are delicious
3  orange,banana and apple all are delicious
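If the goal is literally "at least two of the target words", an alternative that does not depend on connectors is to count the matches per sentence. This is a sketch rather than part of the answer above; the fourth row is taken from the output shown, and the threshold of 2 is an assumption.

```python
import pandas as pd

df = pd.DataFrame({'col': ["apple is delicious",
                           "banana is delicious",
                           "apple and banana both are delicious",
                           "orange,banana and apple all are delicious"]})
target = ['apple', 'banana', 'grapes', 'orange']

# Count how many target words occur (as substrings) in each sentence.
hits = df['col'].apply(lambda sentence: sum(word in sentence for word in target))

# Keep the rows that mention at least two target words.
at_least_two = df[hits >= 2]
print(at_least_two)
```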
Method 8
Enumerating all possibilities for large lists is cumbersome. A better way is to use reduce() and the bitwise AND operator (&).
For example, consider the following DataFrame:
df = pd.DataFrame({'col': ["apple is delicious",
"banana is delicious",
"apple and banana both are delicious",
"i love apple, banana, and strawberry"]})
# col
#0 apple is delicious
#1 banana is delicious
#2 apple and banana both are delicious
#3 i love apple, banana, and strawberry
Suppose we wanted to search for all of the following:
targets = ['apple', 'banana', 'strawberry']
We can do:
from functools import reduce  # needed on Python 3

print(df[reduce(lambda a, b: a & b, (df['col'].str.contains(s) for s in targets))])
# col
#3 i love apple, banana, and strawberry
Method 9
You can create one mask per word and combine them:
apple_mask = df.colname.str.contains('apple')
banana_mask = df.colname.str.contains('banana')
df = df[apple_mask & banana_mask]
Method 10
From @Anzel’s answer, I wrote a function since I’m going to be applying this a lot:
def regify(words, base=str(r'^{}'), expr=str('(?=.*{})')):
    return base.format(''.join(expr.format(w) for w in words))
So if you have words defined:
words = ['apple', 'banana']
And then call it with something like:
dg = df.loc[
    df['col_name'].str.contains(regify(words), case=False, regex=True)
]
then you should get what you’re after.
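As a quick check of the case=False argument used above, here is a self-contained sketch with hypothetical mixed-case data. The regify function is repeated (with plain string defaults) only so that the block runs on its own.

```python
import pandas as pd


def regify(words, base='^{}', expr='(?=.*{})'):
    # Same idea as the function above: one lookahead per word.
    return base.format(''.join(expr.format(w) for w in words))


# Hypothetical data with mixed capitalization.
df = pd.DataFrame({'col_name': ["Apple and BANANA both are delicious",
                                "banana is delicious"]})
words = ['apple', 'banana']

# case=False makes the match ignore capitalization.
dg = df.loc[
    df['col_name'].str.contains(regify(words), case=False, regex=True)
]
print(dg)
```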
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.