How to find the count of a word in a string?

I have a string “Hello I am going to I with hello am”. I want to find how many times a word occurs in the string. For example, hello occurs 2 times. I tried this approach, but it only prints characters:

def countWord(input_string):
    d = {}
    for word in input_string:
        try:
            d[word] += 1
        except:
            d[word] = 1

    for k in d.keys():
        print("%s: %d" % (k, d[k]))
print(countWord("Hello I am going to I with Hello am"))

I want to learn how to find the word count.

Answers:

Method 1

If you want to find the count of an individual word, just use count (note that it is case-sensitive and counts substring matches, so "am" is also found inside "ham"):

input_string.count("Hello")

Use collections.Counter and split() to tally up all the words:

from collections import Counter

words = input_string.split()
wordCount = Counter(words)
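As a minimal sketch, here are both calls applied to the string from the question, to show what each one returns. Both are case-sensitive as written, so "Hello" and "hello" are tallied separately.

```python
from collections import Counter

input_string = "Hello I am going to I with hello am"

# str.count counts non-overlapping substring matches, case-sensitively
hello_count = input_string.count("Hello")

# Counter over split() tallies every whitespace-separated word at once
word_count = Counter(input_string.split())

print(hello_count)          # 1
print(word_count["I"])      # 2
print(word_count["hello"])  # 1
```

Iterating over the string directly, as in the question, yields single characters; iterating over input_string.split() yields words.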

Method 2

Counter from collections is your friend:

>>> from collections import Counter
>>> counts = Counter(sentence.lower().split())

Method 3

from collections import Counter
import re

def countWords(text):
    return Counter(re.findall(r"[\w']+", text.lower()))

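Using re.findall is more versatile than split: split() leaves punctuation attached to words (for example "hello,"), whereas this pattern extracts only runs of word characters and apostrophes, so contractions such as "don't" and "I'll" stay intact.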

Demo (using your example):

>>> countWords("Hello I am going to I with hello am")
Counter({'i': 2, 'am': 2, 'hello': 2, 'to': 1, 'going': 1, 'with': 1})

If you expect to be making many of these queries, this will only do O(N) work once, rather than O(N*#queries) work.
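The point about repeated queries can be sketched as follows: build the Counter once, then look up as many words as needed. The sample text and variable names are illustrative and not part of the original answer.

```python
import re
from collections import Counter

text = "Don't panic. I'll say hello, then HELLO again; don't worry."

# One O(N) pass: tokenize on runs of word characters and apostrophes
counts = Counter(re.findall(r"[\w']+", text.lower()))

# Each later lookup is a dictionary access, not another scan of the text
for word in ("hello", "don't", "i'll", "missing"):
    print(word, counts[word])
```

A Counter returns 0 for a word it has never seen, so lookups for absent words do not raise KeyError.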

Method 4

The vector of occurrence counts of words is called bag-of-words.
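To make the term concrete, here is a dependency-free sketch of a bag-of-words vector. The tokenize helper and its rule (lower-cased runs of two or more word characters) are assumptions chosen for illustration, not a definitive tokenizer.

```python
import re
from collections import Counter

documents = ["Hello I am going to I with hello am"]

# Assumption: a token is a run of two or more word characters, lower-cased
def tokenize(doc):
    return re.findall(r"\b\w\w+\b", doc.lower())

# The vocabulary is the sorted set of tokens across all documents
vocabulary = sorted({tok for doc in documents for tok in tokenize(doc)})

# One count vector per document, aligned with the vocabulary order
vectors = []
for doc in documents:
    counts = Counter(tokenize(doc))
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)  # ['am', 'going', 'hello', 'to', 'with']
print(vectors)     # [[2, 1, 2, 1, 1]]
```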

Scikit-learn provides a nice module to compute it, sklearn.feature_extraction.text.CountVectorizer. Example:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer = "word",
                             tokenizer = None,
                             preprocessor = None,
                             stop_words = None,
                             min_df = 1,
                             max_features = 50)

text = ["Hello I am going to I with hello am"]

# Count
train_data_features = vectorizer.fit_transform(text)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vocab = vectorizer.get_feature_names_out()

# Sum up the counts of each vocabulary word
dist = np.sum(train_data_features.toarray(), axis=0)

# For each, print the vocabulary word and the number of times it
# appears in the training set
for tag, count in zip(vocab, dist):
    print(count, tag)

Output (the default token pattern only keeps tokens of two or more characters, so "I" is dropped):

2 am
1 going
2 hello
1 to
1 with

Part of the code was taken from this Kaggle tutorial on bag-of-words.

FYI: How to use sklearn's CountVectorizer() to get ngrams that include any punctuation as separate tokens?

Method 5

Treating Hello and hello as the same word, regardless of case:

>>> from collections import Counter
>>> strs="Hello I am going to I with hello am"
>>> Counter(map(str.lower,strs.split()))
Counter({'i': 2, 'am': 2, 'hello': 2, 'to': 1, 'going': 1, 'with': 1})

Method 6

Here is an alternative, case-insensitive approach:

>>> s = "Hello I am going to I with hello am"
>>> sum(1 for w in s.lower().split() if w == 'Hello'.lower())
2

It matches by converting the string and target into lower-case.

PS: This also avoids the "am ham".count("am") == 2 problem with str.count(), pointed out by @DSM.
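The difference between substring counting and whole-word counting can be seen in a short sketch. The sample sentence is made up for illustration.

```python
s = "I am eating ham and jam"
target = "am"

# Substring counting also finds "am" inside "ham" and "jam"
substring_hits = s.lower().count(target.lower())

# Comparing whole split() tokens only counts the standalone word
word_hits = sum(1 for w in s.lower().split() if w == target.lower())

print(substring_hits)  # 3
print(word_hits)       # 1
```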

Method 7

You can split the string into words and count them. Note that this gives the total number of words in the string, not the count of one particular word:

count = len(my_string.split())

Method 8

You can use Python's regex module re to find all matches of the substring in the string; re.findall returns them as a list:

import re

input_string = "Hello I am going to I with Hello am"

print(len(re.findall('hello', input_string.lower())))

Prints:

2
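Because the pattern above is a plain substring, it would also match inside longer words. A possible refinement, sketched here with a made-up sentence, anchors the match at word boundaries with \b and uses re.IGNORECASE instead of lower().

```python
import re

input_string = "Hello, I said hello to Othello. HELLO!"
word = "hello"

# Plain substring search also matches the tail of "Othello"
loose = len(re.findall(word, input_string.lower()))

# \b anchors the match at word boundaries; re.escape guards against
# regex metacharacters in the word; re.IGNORECASE replaces lower()
pattern = r"\b" + re.escape(word) + r"\b"
strict = len(re.findall(pattern, input_string, flags=re.IGNORECASE))

print(loose)   # 4
print(strict)  # 3
```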

Method 9

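def countSub(pat, string):
    # Count occurrences of pat in string, including overlapping ones
    result = 0
    # Try every start position where pat could still fit
    for i in range(len(string) - len(pat) + 1):
        for j in range(len(pat)):
            if string[i + j] != pat[j]:
                break  # mismatch: pat does not start at i
        else:
            # inner loop finished without break: full match at i
            result += 1
    return result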
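The function above checks every start position, so it counts overlapping matches, whereas str.count does not. The sketch below shows the difference using a hypothetical helper, count_overlapping, which performs the same per-position check with str.startswith. Its handling of an empty pattern is an assumption of this sketch.

```python
def count_overlapping(pat, string):
    """Count occurrences of pat in string, allowing overlaps (hypothetical helper)."""
    if not pat:
        return 0  # assumption: an empty pattern counts as no match
    return sum(
        1
        for i in range(len(string) - len(pat) + 1)
        if string.startswith(pat, i)
    )

print(count_overlapping("aa", "aaaa"))      # 3
print("aaaa".count("aa"))                   # 2
print(count_overlapping("am", "I am ham"))  # 2
```

Like str.count, this counts substrings rather than whole words, so "am" is still found inside "ham".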


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
