Count occurrences of each of certain words in pandas dataframe

I want to count the number of occurrences of each of certain words in a data frame. I currently do it using str.contains:

a = df2[df2['col1'].str.contains("sample")].groupby('col2').size()
n = a.apply(lambda x: 1).sum()

Is there a method to match regular expression and get the count of occurrences? In my case I have a large dataframe and I want to match around 100 strings.

Contents hide

Answers:

Method 1

Method 2

Method 3

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

Update: Original answer counts those rows which contain a substring.

To count all the occurrences of a substring you can use .str.count:

In [21]: df = pd.DataFrame(['hello', 'world', 'hehe'], columns=['words'])

In [22]: df.words.str.count("he|wo")
Out[22]:
0    1
1    1
2    2
Name: words, dtype: int64

In [23]: df.words.str.count("he|wo").sum()
Out[23]: 4

The str.contains method accepts a regular expression:

Definition: df.words.str.contains(self, pat, case=True, flags=0, na=nan)
Docstring:
Check whether given pattern is contained in each string in the array

Parameters
----------
pat : string
    Character sequence or regular expression
case : boolean, default True
    If True, case sensitive
flags : int, default 0 (no flags)
    re module flags, e.g. re.IGNORECASE
na : default NaN, fill value for missing values.

For example:

In [11]: df = pd.DataFrame(['hello', 'world'], columns=['words'])

In [12]: df
Out[12]:
   words
0  hello
1  world

In [13]: df.words.str.contains(r'[hw]')
Out[13]:
0    True
1    True
Name: words, dtype: bool

In [14]: df.words.str.contains(r'he|wo')
Out[14]:
0    True
1    True
Name: words, dtype: bool

To count the occurences you can just sum this boolean Series:

In [15]: df.words.str.contains(r'he|wo').sum()
Out[15]: 2

In [16]: df.words.str.contains(r'he').sum()
Out[16]: 1

Method 2

To count the total number of matches, use s.str.match(...).str.get(0).count().

If your regex will be matching several unique words, to be tallied individually, use
s.str.match(...).str.get(0).groupby(lambda x: x).count()

It works like this:

In [12]: s
Out[12]: 
0    ax
1    ay
2    bx
3    by
4    bz
dtype: object

The match string method handles regular expressions…

In [13]: s.str.match('(b[x-y]+)')
Out[13]: 
0       []
1       []
2    (bx,)
3    (by,)
4       []
dtype: object

…but the results, as given, are not very convenient. The string method get takes the matches as strings and converts empty results to NaNs…

In [14]: s.str.match('(b[x-y]+)').str.get(0)
Out[14]: 
0    NaN
1    NaN
2     bx
3     by
4    NaN
dtype: object

…which are not counted.

In [15]: s.str.match('(b[x-y]+)').str.get(0).count()
Out[15]: 2

Method 3

You can use value_count function.

import pandas as pd

# URL to .csv file
data_url = 'https://vincentarelbundock.github.io/Rdatasets/csv/carData/Arrests.csv'
# Reading the data
df = pd.read_csv(data_url, index_col=0)

# pandas count distinct values in column
df['sex'].value_counts()

Source: link

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes

Article Rating