I’m trying to create a new column in a DataFrame that contains the word count for the respective row. I’m looking for the total number of words, not frequencies of each distinct word. I assumed there would be a simple/quick way to do this common task, but after googling around and reading a handful of SO posts (1, 2, 3, 4) I’m stuck. I’ve tried the solutions put forward in the linked SO posts, but got lots of attribute errors back.
words = df['col'].split()
df['totalwords'] = len(words)
results in
AttributeError: 'Series' object has no attribute 'split'
and
f = lambda x: len(x["col"].split()) -1
df['totalwords'] = df.apply(f, axis=1)
results in
AttributeError: ("'list' object has no attribute 'split'", 'occurred at index 0')
Answers:
Method 1
str.split + str.len
str.len returns the length of each element, so it works on any non-numeric column: strings, or the lists produced by str.split.
df['totalwords'] = df['col'].str.split().str.len()
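As a minimal sketch of how this behaves, the example below applies str.split().str.len() to a small made-up column that includes a missing value, and also calls str.len() directly on a column that already holds lists. The "'list' object has no attribute 'split'" error in the question hints that the asker's column may contain lists, but that is an assumption. The sample data and the column names text and tokens are illustrative only.

```python
import pandas as pd

# Hypothetical sample data; 'text' and 'tokens' are illustrative column names.
df = pd.DataFrame({
    "text": ["This is one sentence", "and another", None],
    "tokens": [["This", "is", "one", "sentence"], ["and", "another"], []],
})

# Split on whitespace, then take the length of each resulting list.
# Missing values stay missing, so the result is not an integer column.
df["totalwords"] = df["text"].str.split().str.len()

# If missing rows should count as zero words, fill them and cast to int.
df["totalwords_filled"] = df["totalwords"].fillna(0).astype(int)

# When the column already contains lists, str.len() alone counts the items.
df["token_count"] = df["tokens"].str.len()

print(df[["totalwords", "totalwords_filled", "token_count"]])
```

Whether a missing row should count as zero words or stay missing depends on the data, so the fillna step is optional.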
str.count
If your words are single-space separated, you may simply count the spaces plus 1.
df['totalwords'] = df['col'].str.count(' ') + 1
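The space-counting shortcut only agrees with the split-based count when words are separated by exactly one space and there is no leading or trailing whitespace. The sketch below, on made-up strings, compares it with str.split().str.len() and with counting runs of non-whitespace characters using the regular expression \S+ (str.count accepts a regex pattern).

```python
import pandas as pd

# Made-up strings: single space, double space, padded, and empty.
s = pd.Series(["one two", "one  two", " padded ", ""])

by_spaces = s.str.count(" ") + 1      # counts separators, not words
by_split = s.str.split().str.len()    # splits on any run of whitespace
by_regex = s.str.count(r"\S+")        # counts runs of non-whitespace characters

print(pd.DataFrame({"text": s, "by_spaces": by_spaces,
                    "by_split": by_split, "by_regex": by_regex}))
```

On this data the space count overestimates every row except the first, while the split and regex variants agree with each other.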
List Comprehension
This is often faster than the vectorized str methods; see the benchmark under Method 5.
df['totalwords'] = [len(x.split()) for x in df['col'].tolist()]
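The list comprehension calls split on each element directly, so it raises AttributeError as soon as the column contains something that is not a string, such as None or a number. Here is a hedged sketch of one way to guard against that. The helper name count_words and the decision to count non-strings as zero words are assumptions for illustration, not part of the original answer.

```python
import pandas as pd

def count_words(value):
    """Return the number of whitespace-separated words, or 0 for non-strings."""
    return len(value.split()) if isinstance(value, str) else 0

# Made-up data mixing strings with a missing value and a number.
df = pd.DataFrame({"col": ["This is one sentence", "and another", None, 42]})
df["totalwords"] = [count_words(x) for x in df["col"].tolist()]
print(df)
```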
Method 2
Here is a way using .apply():
df['number_of_words'] = df.col.apply(lambda x: len(x.split()))
Example:
Given this df:
>>> df
                    col
0  This is one sentence
1           and another
After applying .apply():
df['number_of_words'] = df.col.apply(lambda x: len(x.split()))
>>> df
                    col  number_of_words
0  This is one sentence                4
1           and another                2
Note: As pointed out in the comments and in this answer, .apply is not necessarily the fastest method. If speed is important, prefer one of @cᴏʟᴅsᴘᴇᴇᴅ’s methods.
Method 3
This is one way using pd.Series.str.split and pd.Series.map:
df['word_count'] = df['col'].str.split().map(len)
The above assumes that df['col'] is a series of strings.
Example:
df = pd.DataFrame({'col': ['This is an example', 'This is another', 'A third']})
df['word_count'] = df['col'].str.split().map(len)
print(df)
#                   col  word_count
# 0  This is an example           4
# 1     This is another           3
# 2             A third           2
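If the column can contain missing values, map(len) fails on them because a missing value has no length. A small sketch on made-up data that uses the na_action="ignore" option of Series.map to skip missing entries:

```python
import pandas as pd

# Made-up data with one missing entry.
df = pd.DataFrame({"col": ["This is an example", None, "A third"]})

# na_action="ignore" leaves missing values untouched instead of passing them to len.
df["word_count"] = df["col"].str.split().map(len, na_action="ignore")
print(df)
```

The missing row stays missing in word_count, so the column will not have an integer dtype unless it is filled afterwards.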
Method 4
Using list and map, with the sample data from cold’s answer:
list(map(lambda x : len(x.split()),df.col))
Out[343]: [4, 3, 2]
Method 5
You could also map split and len methods to the strings in the DataFrame column:
df['word_count'] = [*map(len, map(str.split, df['col'].tolist()))]
Here are some preliminary benchmarks of the answers given here. map seems to do well on very large Series:
df = pd.DataFrame(['one apple','banana','box of oranges','pile of fruits outside',
'one banana', 'fruits']*100000,
columns=['col'])
>>> df.shape
(600000, 1)
>>> %timeit df['word_count'] = df['col'].str.split().str.len()
761 ms ± 43.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit df['word_count'] = df['col'].str.count(' ').add(1)
691 ms ± 71.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit df['word_count'] = [len(x.split()) for x in df['col'].tolist()]
405 ms ± 13.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit df['word_count'] = df['col'].apply(lambda x: len(x.split()))
450 ms ± 22.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit df['word_count'] = df['col'].str.split().map(len)
657 ms ± 27.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit df['word_count'] = list(map(lambda x : len(x.split()), df['col'].tolist()))
435 ms ± 21.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit df['word_count'] = [*map(len, map(str.split, df['col'].tolist()))]
329 ms ± 20.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
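The %timeit lines above need IPython. As a rough sketch for rerunning the comparison in a plain Python script, the standard-library timeit module can be used instead. The smaller row count, the repeat count, and the labels in the methods dictionary are arbitrary choices made here, and timings will vary by machine and pandas version.

```python
import timeit

import pandas as pd

# Smaller than the 600,000-row frame above so the script finishes quickly;
# the row count and repeat count are arbitrary choices.
df = pd.DataFrame(
    ["one apple", "banana", "box of oranges", "pile of fruits outside",
     "one banana", "fruits"] * 1000,
    columns=["col"],
)

# Each candidate takes the Series and returns the word counts.
methods = {
    "str.split + str.len": lambda s: s.str.split().str.len(),
    "str.count + 1": lambda s: s.str.count(" ").add(1),
    "list comprehension": lambda s: [len(x.split()) for x in s.tolist()],
    "apply": lambda s: s.apply(lambda x: len(x.split())),
    "str.split + map(len)": lambda s: s.str.split().map(len),
    "list(map(lambda))": lambda s: list(map(lambda x: len(x.split()), s.tolist())),
    "nested map": lambda s: [*map(len, map(str.split, s.tolist()))],
}

# Sanity check first: collect every method's output so they can be compared.
results = {name: list(fn(df["col"])) for name, fn in methods.items()}

# Time each method and report the average per call in milliseconds.
for name, fn in methods.items():
    seconds = timeit.timeit(lambda: fn(df["col"]), number=5) / 5
    print(f"{name:22s} {seconds * 1000:8.2f} ms per call")
```

Comparing the entries of results before trusting the timings helps confirm that all methods agree on this single-space-separated data.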
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.