Python & Pandas: How to query if a list-type column contains something?

I have a dataframe, which contains info about movies. It has a column called genre, which contains a list of genres it belongs to. For example:

df['genre']

## returns 

0       ['comedy', 'sci-fi']
1       ['action', 'romance', 'comedy']
2       ['documentary']
3       ['crime','horror']
...

I want to know how can I query the dataframe, so it returns the movie belongs to a cerain genre?

For example, something may like df['genre'].contains('comedy') returns 0 or 1.

I know for a list, I can do things like:

'comedy' in  ['comedy', 'sci-fi']

However, in pandas, I didn’t find something similar, the only thing I know is df['genre'].str.contains(), but it didn’t work for the list type.

Contents hide

Answers:

Method 1

Method 2

Method 3

Method 4

Method 5

Method 6

set

apply & in

str.contains

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

You can use apply for create mask and then boolean indexing:

mask = df.genre.apply(lambda x: 'comedy' in x)
df1 = df[mask]
print (df1)
                       genre
0           [comedy, sci-fi]
1  [action, romance, comedy]

Method 2

using sets

df.genre.map(set(['comedy']).issubset)

0     True
1     True
2    False
3    False
dtype: bool

df.genre[df.genre.map(set(['comedy']).issubset)]

0             [comedy, sci-fi]
1    [action, romance, comedy]
dtype: object

presented in a way I like better

comedy = set(['comedy'])
iscomedy = comedy.issubset
df[df.genre.map(iscomedy)]

more efficient

comedy = set(['comedy'])
iscomedy = comedy.issubset
df[[iscomedy(l) for l in df.genre.values.tolist()]]

using str in two passes
slow! and not perfectly accurate!

df[df.genre.str.join(' ').str.contains('comedy')]

Method 3

According to the source code, you can use .str.contains(..., regex=False).

Method 4

You need to set regex=False and .str.contains will work for list values as you would expect:

In : df['genre'].str.contains('comedy', regex=False)
Out:
0     True
1     True
2    False
3    False
Name: genre, dtype: bool

Method 5

A complete example:

import pandas as pd

data = pd.DataFrame([[['foo', 'bar']],
                    [['bar', 'baz']]], columns=['list_column'])
print(data)
  list_column
0  [foo, bar]
1  [bar, baz]

filtered_data = data.loc[
    lambda df: df.list_column.apply(
        lambda l: 'foo' in l
    )
]
print(filtered_data)
  list_column
0  [foo, bar]

Method 6

This can be done in all three ways as suggested, using str.contains, set or apply and in. Although using set is the most efficient way to achieve this.

Here’s a performance comparison of the three methods on an extrapolated dataframe with 10,000 rows:

set

%%timeit -n 500 -r 35
df[df.genre.map(set(['comedy']).issubset)]
2.23 ms ± 154 µs per loop (mean ± std. dev. of 35 runs, 500 loops each)

apply & in

%%timeit -n 500 -r 35
df[df.genre.apply(lambda x: 'comedy' in x)]
2.36 ms ± 359 µs per loop (mean ± std. dev. of 35 runs, 500 loops each)

str.contains

%%timeit -n 500 -r 35
df[df['genre'].str.contains('comedy', regex=False)]
2.83 ms ± 299 µs per loop (mean ± std. dev. of 35 runs, 500 loops each)

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes

Article Rating