I want to count the number of occurrences of each of certain words in a data frame. I currently do it using str.contains:
a = df2[df2['col1'].str.contains("sample")].groupby('col2').size()
n = a.apply(lambda x: 1).sum()
Is there a method to match a regular expression and get the count of occurrences? In my case I have a large dataframe and I want to match around 100 strings.
Answers:
Method 1
Update: Original answer counts those rows which contain a substring.
To count all the occurrences of a substring you can use .str.count:
In [21]: df = pd.DataFrame(['hello', 'world', 'hehe'], columns=['words'])
In [22]: df.words.str.count("he|wo")
Out[22]:
0 1
1 1
2 2
Name: words, dtype: int64
In [23]: df.words.str.count("he|wo").sum()
Out[23]: 4
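For the case in the question (roughly 100 search strings, each tallied separately), one possible sketch is to call .str.count once per term and collect the totals in a Series. The sample data and the terms list here are made up for illustration, and re.escape is used on the assumption that the terms should be matched literally rather than as regex patterns:

```python
import re

import pandas as pd

# Made-up sample data; in the question this would be df2['col1'].
df = pd.DataFrame({"words": ["hello world", "hehe", "world of words"]})

# Hypothetical list of search strings (the question mentions about 100).
terms = ["he", "wo", "zz"]

# One pass over the column per term. re.escape makes each term match
# literally, so characters such as '.' or '+' are not read as regex syntax.
counts = pd.Series(
    {term: df["words"].str.count(re.escape(term)).sum() for term in terms}
)

print(counts)
```

Drop the re.escape call if the terms are meant to be regular expressions themselves.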
The str.contains method accepts a regular expression:
Definition: df.words.str.contains(self, pat, case=True, flags=0, na=nan)
Docstring:
Check whether given pattern is contained in each string in the array
Parameters
----------
pat : string
Character sequence or regular expression
case : boolean, default True
If True, case sensitive
flags : int, default 0 (no flags)
re module flags, e.g. re.IGNORECASE
na : default NaN, fill value for missing values.
For example:
In [11]: df = pd.DataFrame(['hello', 'world'], columns=['words'])
In [12]: df
Out[12]:
words
0 hello
1 world
In [13]: df.words.str.contains(r'[hw]')
Out[13]:
0 True
1 True
Name: words, dtype: bool
In [14]: df.words.str.contains(r'he|wo')
Out[14]:
0 True
1 True
Name: words, dtype: bool
To count the rows that contain a match, you can just sum this boolean Series:
In [15]: df.words.str.contains(r'he|wo').sum()
Out[15]: 2
In [16]: df.words.str.contains(r'he').sum()
Out[16]: 1
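The difference between the two approaches can be seen side by side in a short sketch that reuses the three-word frame from the update at the top of this answer. The second part, with a missing value, is an added illustration of the case and na parameters listed in the docstring:

```python
import pandas as pd

df = pd.DataFrame(["hello", "world", "hehe"], columns=["words"])

# Number of rows with at least one match: each row counts at most once.
rows_matching = df["words"].str.contains(r"he|wo").sum()

# Total number of matches: 'hehe' contributes two.
total_matches = df["words"].str.count(r"he|wo").sum()

# With missing values, na=False treats them as non-matching so the
# result stays boolean; case=False makes the match case-insensitive.
s = pd.Series(["Hello", None, "HELP"])
rows_ignoring_case = s.str.contains("he", case=False, na=False).sum()

print(rows_matching, total_matches, rows_ignoring_case)
```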
Method 2
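To count the total number of matches, use s.str.match(...).str.get(0).count(). Note that this relies on older pandas behavior, in which str.match returned the captured groups; in recent versions str.match returns booleans and str.extract returns the groups.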
If your regex will be matching several unique words, to be tallied individually, use
s.str.match(...).str.get(0).groupby(lambda x: x).count()
It works like this:
In [12]: s
Out[12]:
0 ax
1 ay
2 bx
3 by
4 bz
dtype: object
The match string method handles regular expressions…
In [13]: s.str.match('(b[x-y]+)')
Out[13]:
0 []
1 []
2 (bx,)
3 (by,)
4 []
dtype: object
…but the results, as given, are not very convenient. The string method get takes the matches as strings and converts empty results to NaNs…
In [14]: s.str.match('(b[x-y]+)').str.get(0)
Out[14]:
0 NaN
1 NaN
2 bx
3 by
4 NaN
dtype: object
…which are not counted.
In [15]: s.str.match('(b[x-y]+)').str.get(0).count()
Out[15]: 2
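On a current pandas installation, the same two tallies can probably be reproduced with str.extract, which returns the captured group per row and NaN where nothing matched. This is a sketch of an equivalent, not the original answer's code; value_counts stands in for the groupby(lambda x: x).count() step:

```python
import pandas as pd

s = pd.Series(["ax", "ay", "bx", "by", "bz"])

# expand=False returns a Series (not a DataFrame) for a single capture
# group; rows without a match become NaN.
matched = s.str.extract(r"(b[x-y]+)", expand=False)

# count() skips NaN, so this is the number of rows that matched.
total = matched.count()

# Tally each distinct matched word individually (NaN is dropped by default).
per_word = matched.value_counts()

print(total)
print(per_word)
```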
Method 3
You can use the value_counts method to count how often each distinct value occurs in a column.
import pandas as pd
# URL to .csv file
data_url = 'https://vincentarelbundock.github.io/Rdatasets/csv/carData/Arrests.csv'
# Reading the data
df = pd.read_csv(data_url, index_col=0)
# pandas count distinct values in column
df['sex'].value_counts()
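To try this without downloading the CSV, here is a small self-contained sketch; the column values are made up and only stand in for the 'sex' column of the dataset above:

```python
import pandas as pd

# Made-up stand-in for the 'sex' column of the downloaded dataset.
df = pd.DataFrame({"sex": ["Male", "Female", "Male", "Male"]})

# Absolute frequency of each distinct value, most frequent first.
counts = df["sex"].value_counts()

# normalize=True gives relative frequencies instead of counts.
shares = df["sex"].value_counts(normalize=True)

print(counts)
print(shares)
```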
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.

