I am trying to determine whether there is an entry in a Pandas column that has a particular value. I tried to do this with if x in df['id']. I thought this was working, except that when I fed it a value that I knew was not in the column, 43 in df['id'], it still returned True. When I subset to a data frame containing only the entries matching the missing id, df[df['id'] == 43], there are, obviously, no entries in it. How do I determine if a column in a Pandas data frame contains a particular value, and why doesn't my current method work? (FYI, I have the same problem when I use the implementation in this answer to a similar question).
Answers:
Method 1
Using in on a Series checks whether the value is in the index, not among the values:
In [11]: s = pd.Series(list('abc'))
In [12]: s
Out[12]:
0 a
1 b
2 c
dtype: object
In [13]: 1 in s
Out[13]: True
In [14]: 'a' in s
Out[14]: False
One option is to see if it’s in unique values:
In [21]: s.unique()
Out[21]: array(['a', 'b', 'c'], dtype=object)
In [22]: 'a' in s.unique()
Out[22]: True
or a python set:
In [23]: set(s)
Out[23]: {'a', 'b', 'c'}
In [24]: 'a' in set(s)
Out[24]: True
As pointed out by @DSM, it may be more efficient (especially if you’re just doing this for one value) to just use in directly on the values:
In [31]: s.values
Out[31]: array(['a', 'b', 'c'], dtype=object)
In [32]: 'a' in s.values
Out[32]: True
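To tie this back to the question, here is a minimal sketch of why 43 in df['id'] can return True. The data is made up for illustration; the only assumption is that the frame has the default integer index and at least 44 rows, so the label 43 exists in the index even though no id equals 43.

```python
import pandas as pd

# Hypothetical data: 50 rows with ids 100..149, so the value 43 never occurs.
df = pd.DataFrame({"id": range(100, 150)})

# `in` on a Series tests the index labels (0..49 here), not the values.
label_hit = 43 in df["id"]

# Testing against the underlying NumPy array checks the values instead.
value_hit = 43 in df["id"].values

# An equivalent value check that stays inside pandas.
value_hit_eq = bool((df["id"] == 43).any())

print(label_hit, value_hit, value_hit_eq)
```

The first result reflects the index, the other two reflect the column contents.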
Method 2
You can also use pandas.Series.isin although it’s a little bit longer than 'a' in s.values:
In [2]: s = pd.Series(list('abc'))
In [3]: s
Out[3]:
0 a
1 b
2 c
dtype: object
In [3]: s.isin(['a'])
Out[3]:
0 True
1 False
2 False
dtype: bool
In [4]: s[s.isin(['a'])].empty
Out[4]: False
In [5]: s[s.isin(['z'])].empty
Out[5]: True
But this approach can be more flexible if you need to match multiple values at once for a DataFrame (see DataFrame.isin):
>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 4, 7]})
>>> df.isin({'A': [1, 3], 'B': [4, 7, 12]})
       A      B
0   True  False  # Note that B didn't match 1 here.
1  False   True
2   True   True
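If all you need is a single True/False rather than a boolean mask, the mask from isin can be reduced with any(). A small sketch; the helper name column_has_any is my own and not part of pandas:

```python
import pandas as pd

def column_has_any(series: pd.Series, candidates) -> bool:
    """Return True if at least one element of `series` equals one of `candidates`."""
    # isin builds a boolean mask; any() collapses it to a single truth value.
    return bool(series.isin(list(candidates)).any())

s = pd.Series(list("abc"))
print(column_has_any(s, ["a"]))
print(column_has_any(s, ["x", "z"]))
```

This avoids building the filtered Series just to test .empty on it.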
Method 3
found = df[df['Column'].str.contains('Text_to_search')]
print(found.count())
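found.count() returns the number of non-null entries in each column of the matching rows, and len(found) gives the number of matching rows.
If that number is 0 (i.e. found.empty is True), the string was not found in the column.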
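One caveat with this method: str.contains treats the pattern as a regular expression by default and yields a missing value for missing entries. A hedged sketch of a literal, NaN-safe variant follows; the column name and sample strings are invented for the example.

```python
import pandas as pd

# Hypothetical data with a missing value and a string that a regex "." would match.
df = pd.DataFrame({"Column": ["a.b", "axb", None]})

# regex=False matches the text literally; na=False treats missing values as "no match".
mask = df["Column"].str.contains("a.b", regex=False, na=False)

found = df[mask]
n_matches = len(found)          # number of matching rows
has_match = bool(mask.any())    # single True/False answer

print(n_matches, has_match)
```

With the default regex=True, the pattern "a.b" would also match "axb", which is usually not what a plain text search intends.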
Method 4
I did a few simple tests:
In [10]: x = pd.Series(range(1000000))
In [13]: timeit 999999 in x.values
567 µs ± 25.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [24]: timeit 9 in x.values
666 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [16]: timeit (x == 999999).any()
6.86 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [21]: timeit x.eq(999999).any()
7.03 ms ± 33.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [22]: timeit x.eq(9).any()
7.04 ms ± 60 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [15]: timeit x.isin([999999]).any()
9.54 ms ± 291 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [17]: timeit 999999 in set(x)
79.8 ms ± 1.98 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Interestingly, it doesn't matter whether you look up 9 or 999999; it takes about the same amount of time using the in syntax (presumably because in on a NumPy array compares every element in one vectorized pass instead of stopping at the first match):
In [24]: timeit 9 in x.values
666 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [25]: timeit 9999 in x.values
647 µs ± 5.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [26]: timeit 999999 in x.values
642 µs ± 2.11 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [27]: timeit 99199 in x.values
644 µs ± 5.31 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [28]: timeit 1 in x.values
667 µs ± 20.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Seems like using x.values is the fastest, but maybe there is a more elegant way in pandas?
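To reproduce this kind of comparison outside IPython, something along these lines should work with the standard timeit module. It is only a sketch: the series size and repeat count are arbitrary choices of mine, and the absolute numbers will differ by machine and pandas version.

```python
import timeit
import pandas as pd

x = pd.Series(range(100_000))   # smaller than the series above so the script finishes quickly
target = 99_999

# Each candidate is a zero-argument callable returning a truthy/falsy result.
candidates = {
    "in x.values": lambda: target in x.values,
    "(x == v).any()": lambda: (x == target).any(),
    "x.isin([v]).any()": lambda: x.isin([target]).any(),
    "in set(x)": lambda: target in set(x),
}

# Check that all approaches agree before comparing their speed.
results = {name: bool(fn()) for name, fn in candidates.items()}

# Average seconds per call over a handful of runs.
timings = {name: timeit.timeit(fn, number=5) / 5 for name, fn in candidates.items()}

for name in candidates:
    print(f"{name:20s} found={results[name]}  {timings[name] * 1e3:.3f} ms per call")
```

Checking that every approach returns the same answer first guards against timing something that silently tests the index instead of the values.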
Method 5
Or use Series.tolist or Series.any:
>>> s = pd.Series(list('abc'))
>>> s
0 a
1 b
2 c
dtype: object
>>> 'a' in s.tolist()
True
>>> (s=='a').any()
True
Series.tolist makes a list out of a Series. In the other one, I am just getting a boolean Series from a regular Series, then checking whether there are any Trues in the boolean Series.
Method 6
You can try this to check for a particular value x in a particular column named id:
if x in df['id'].values:
    print("found")
Method 7
Simple condition:
if any(str(elem) in ['a','b'] for elem in df['column'].tolist()):
    print("found")
Method 8
Use
df[df['id']==x].index.tolist()
If x is present in id then it’ll return the list of indices where it is present, else it gives an empty list.
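For illustration, here is what that looks like on a small made-up frame with a non-default index. An empty list is falsy, so the result can be used directly in an if.

```python
import pandas as pd

# Hypothetical data with string index labels, to make the returned labels easy to spot.
df = pd.DataFrame({"id": [7, 43, 7]}, index=["r1", "r2", "r3"])

hits = df[df["id"] == 7].index.tolist()     # labels of the rows where id == 7
misses = df[df["id"] == 99].index.tolist()  # no row has id == 99

print(hits, misses)
```

Note that the list holds index labels, not positions, so with a non-default index the entries are whatever labels the frame uses.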
Method 9
Use query() to find the rows where the condition holds and get the number of rows with shape[0]. If there exists at least one entry, this statement is True:
df.query('id == 123').shape[0] > 0
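When the value lives in a Python variable rather than being hard-coded, query can reference it with the @ prefix. A short sketch with made-up data; has_id is a hypothetical helper name:

```python
import pandas as pd

def has_id(df: pd.DataFrame, x) -> bool:
    """Return True if any row of `df` has an id equal to `x`."""
    # @x refers to the local Python variable x inside the query string.
    return not df.query("id == @x").empty

df = pd.DataFrame({"id": [10, 20, 30]})
print(has_id(df, 20))
print(has_id(df, 43))
```

Testing .empty is equivalent to the shape[0] > 0 check above, just phrased the other way around.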
Method 10
Suppose your dataframe has a filename column, and you want to check whether the filename “80900026941984” is present in it.
You can simply write:
if sum(df["filename"].astype("str").str.contains("80900026941984")) > 0:
    print("found")
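Keep in mind that str.contains is a substring test, so it would also report a hit for a longer filename that merely contains those digits. If an exact match is wanted, comparing the string-cast column with eq is one option. The filenames below are invented to show the difference:

```python
import pandas as pd

# Hypothetical filenames: the first one only contains the target as a prefix.
df = pd.DataFrame({"filename": [809000269419840, 12345]})
target = "80900026941984"

as_text = df["filename"].astype(str)

# Substring test: True as soon as any entry contains the target somewhere.
substring_hit = bool(as_text.str.contains(target, regex=False).any())

# Whole-value test: True only if an entry equals the target exactly.
exact_hit = bool(as_text.eq(target).any())

print(substring_hit, exact_hit)
```

Which of the two is right depends on whether partial filenames should count as present.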
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
