I have a dataframe, which contains info about movies. It has a column called genre, which contains a list of genres it belongs to. For example:
df['genre'] ## returns 0 ['comedy', 'sci-fi'] 1 ['action', 'romance', 'comedy'] 2 ['documentary'] 3 ['crime','horror'] ...
I want to know how can I query the dataframe, so it returns the movie belongs to a cerain genre?
For example, something may like df['genre'].contains('comedy') returns 0 or 1.
I know for a list, I can do things like:
'comedy' in ['comedy', 'sci-fi']
However, in pandas, I didn’t find something similar, the only thing I know is df['genre'].str.contains(), but it didn’t work for the list type.
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
You can use apply for create mask and then boolean indexing:
mask = df.genre.apply(lambda x: 'comedy' in x)
df1 = df[mask]
print (df1)
genre
0 [comedy, sci-fi]
1 [action, romance, comedy]
Method 2
using sets
df.genre.map(set(['comedy']).issubset) 0 True 1 True 2 False 3 False dtype: bool
df.genre[df.genre.map(set(['comedy']).issubset)] 0 [comedy, sci-fi] 1 [action, romance, comedy] dtype: object
presented in a way I like better
comedy = set(['comedy']) iscomedy = comedy.issubset df[df.genre.map(iscomedy)]
more efficient
comedy = set(['comedy']) iscomedy = comedy.issubset df[[iscomedy(l) for l in df.genre.values.tolist()]]
using str in two passes
slow! and not perfectly accurate!
df[df.genre.str.join(' ').str.contains('comedy')]
Method 3
According to the source code, you can use .str.contains(..., regex=False).
Method 4
You need to set regex=False and .str.contains will work for list values as you would expect:
In : df['genre'].str.contains('comedy', regex=False)
Out:
0 True
1 True
2 False
3 False
Name: genre, dtype: bool
Method 5
A complete example:
import pandas as pd
data = pd.DataFrame([[['foo', 'bar']],
[['bar', 'baz']]], columns=['list_column'])
print(data)
list_column
0 [foo, bar]
1 [bar, baz]
filtered_data = data.loc[
lambda df: df.list_column.apply(
lambda l: 'foo' in l
)
]
print(filtered_data)
list_column
0 [foo, bar]
Method 6
This can be done in all three ways as suggested, using str.contains, set or apply and in. Although using set is the most efficient way to achieve this.
Here’s a performance comparison of the three methods on an extrapolated dataframe with 10,000 rows:
set
%%timeit -n 500 -r 35 df[df.genre.map(set(['comedy']).issubset)] 2.23 ms ± 154 µs per loop (mean ± std. dev. of 35 runs, 500 loops each)
apply & in
%%timeit -n 500 -r 35 df[df.genre.apply(lambda x: 'comedy' in x)] 2.36 ms ± 359 µs per loop (mean ± std. dev. of 35 runs, 500 loops each)
str.contains
%%timeit -n 500 -r 35
df[df['genre'].str.contains('comedy', regex=False)]
2.83 ms ± 299 µs per loop (mean ± std. dev. of 35 runs, 500 loops each)
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0