Python remove stop words from pandas dataframe

I want to remove the stop words from my column "tweet". How do I iterate over each row and each item?

import pandas as pd

pos_tweets = [('I love this car', 'positive'),
    ('This view is amazing', 'positive'),
    ('I feel great this morning', 'positive'),
    ('I am so excited about the concert', 'positive'),
    ('He is my best friend', 'positive')]

test = pd.DataFrame(pos_tweets)
test.columns = ["tweet","class"]
test["tweet"] = test["tweet"].str.lower().str.split()

from nltk.corpus import stopwords
stop = stopwords.words('english')

Answers:

Method 1

We can import stopwords from nltk.corpus as shown below, then exclude them with a Python list comprehension and pandas.Series.apply.

# Import pandas and the NLTK stopword list.
import pandas as pd
from nltk.corpus import stopwords
stop = stopwords.words('english')

pos_tweets = [('I love this car', 'positive'),
    ('This view is amazing', 'positive'),
    ('I feel great this morning', 'positive'),
    ('I am so excited about the concert', 'positive'),
    ('He is my best friend', 'positive')]

test = pd.DataFrame(pos_tweets)
test.columns = ["tweet","class"]

# Exclude stopwords with a list comprehension and pandas.Series.apply.
# The NLTK list is lowercase, so capitalized words such as "I" and "This" are kept.
test['tweet_without_stopwords'] = test['tweet'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))
print(test)
# Out[40]:
#                                tweet     class tweet_without_stopwords
# 0                    I love this car  positive              I love car
# 1               This view is amazing  positive       This view amazing
# 2          I feel great this morning  positive    I feel great morning
# 3  I am so excited about the concert  positive       I excited concert
# 4               He is my best friend  positive          He best friend
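
The output above keeps "I", "This" and "He" because the comparison is case-sensitive and the stopword list is lowercase. A minimal sketch of a case-insensitive variant follows. The short stop_words set is a hypothetical stand-in for stopwords.words('english') so the snippet runs without NLTK data, and remove_stopwords is an illustrative helper name, not part of any library.

```python
# Hypothetical stand-in for stopwords.words('english'); a set makes lookups fast.
stop_words = {"i", "this", "is", "am", "so", "about", "the", "he", "my"}

def remove_stopwords(text, stop_words=stop_words):
    """Drop tokens whose lowercase form is a stopword; keep the casing of the rest."""
    return " ".join(word for word in text.split() if word.lower() not in stop_words)

print(remove_stopwords("I love this car"))       # love car
print(remove_stopwords("This view is amazing"))  # view amazing
```

With the DataFrame above, this helper could be applied as test['tweet'].apply(remove_stopwords).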

Stopwords can also be removed with pandas.Series.str.replace and a regular expression.

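# \b matches a word boundary, so only whole words are removed.
pat = r'\b(?:{})\b'.format('|'.join(stop))
# regex=True is required in recent pandas versions, where the default is a literal match.
test['tweet_without_stopwords'] = test['tweet'].str.replace(pat, '', regex=True)
# Collapse the runs of whitespace left behind by the removed words.
test['tweet_without_stopwords'] = test['tweet_without_stopwords'].str.replace(r'\s+', ' ', regex=True)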
# Same results.
# 0              I love car
# 1       This view amazing
# 2    I feel great morning
# 3       I excited concert
# 4          He best friend
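
The same idea can be sketched with the standard re module, which makes the pattern easier to inspect. The assumptions here: the short stop list is a hypothetical stand-in for the NLTK list, re.escape is added in case a stopword contains regex metacharacters, and re.IGNORECASE is an optional extra that also removes capitalized stopwords (unlike the output above).

```python
import re

# Hypothetical stand-in for stopwords.words('english').
stop = ["i", "this", "is", "am", "so", "about", "the", "he", "my"]

# One alternation of escaped stopwords, anchored on word boundaries.
pattern = re.compile(
    r"\b(?:{})\b".format("|".join(map(re.escape, stop))),
    flags=re.IGNORECASE,
)

def strip_stopwords(text):
    """Remove whole-word stopwords, then collapse leftover whitespace."""
    without = pattern.sub("", text)
    return re.sub(r"\s+", " ", without).strip()

print(strip_stopwords("I am so excited about the concert"))  # excited concert
```

Passing pattern.pattern to Series.str.replace with regex=True should behave similarly, although the case flag would then need to be supplied separately.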

If you cannot import stopwords, download the corpus first as follows.

import nltk
nltk.download('stopwords')

Another option is to use text.ENGLISH_STOP_WORDS from sklearn.feature_extraction.

# Import stopwords with scikit-learn
from sklearn.feature_extraction import text
stop = text.ENGLISH_STOP_WORDS

Note that the scikit-learn and NLTK stopword lists contain different numbers of words.

Method 2

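Using a list comprehension (this assumes the tweet column already holds lists of lowercase words, as produced by .str.lower().str.split() in the question):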

test['tweet'].apply(lambda x: [item for item in x if item not in stop])

Returns:

0               [love, car]
1           [view, amazing]
2    [feel, great, morning]
3        [excited, concert]
4            [best, friend]
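
This comprehension iterates over whatever each cell contains, so it only filters words when the cell is a list of tokens. A small sketch of the difference follows, using a hypothetical stop set in place of the NLTK list and an illustrative helper name.

```python
# Hypothetical stand-in for stopwords.words('english').
stop = {"i", "this", "is", "the"}

def drop_stopwords(items):
    """Keep every element of `items` that is not a stopword."""
    return [item for item in items if item not in stop]

tokens = "i love this car".split()
from_tokens = drop_stopwords(tokens)             # elements are whole words
from_string = drop_stopwords("i love this car")  # elements are single characters

print(from_tokens)      # ['love', 'car']
print(from_string[:5])  # [' ', 'l', 'o', 'v', 'e']
```

If the column still holds raw strings, split them first (for example with .str.lower().str.split()) before applying the comprehension.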

Method 3

Check out pd.DataFrame.replace(); it might work for you:

In [42]: test.replace(to_replace='I', value="",regex=True)
Out[42]:
                              tweet     class
0                     love this car  positive
1              This view is amazing  positive
2           feel great this morning  positive
3   am so excited about the concert  positive
4              He is my best friend  positive

Edit: with regex=True, replace() matches substrings as well as whole words. For example, it would remove rk from work if rk were a stopword, which is usually not what you want.

Hence the use of a word-boundary regex here:

for i in stop:
    # \b restricts each match to a whole word.
    test = test.replace(to_replace=r'\b%s\b' % i, value="", regex=True)
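
The effect of the word-boundary anchors can be checked in isolation with the re module. This sketch reuses the rk example from the text; the sample sentence is made up for illustration.

```python
import re

word = "rk"                  # pretend stopword, taken from the example in the text
text = "rk makes work hard"  # made-up sample sentence

# Without boundaries, the substring inside "work" is removed as well.
naive = re.sub(word, "", text)

# With \b on both sides, only the standalone word is removed.
bounded = re.sub(r"\b%s\b" % re.escape(word), "", text)

print(repr(naive))    # ' makes wo hard'
print(repr(bounded))  # ' makes work hard'
```

Both results keep a leading space where the word was removed, so a final whitespace cleanup may still be wanted.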

Method 4

If you would like something simple that returns a string rather than a list of words:

# Lowercase before comparing, so capitalized stopwords such as "I" are removed too.
test["tweet"].apply(lambda words: ' '.join(word for word in words.lower().split() if word not in stop))

Here stop is defined as in the question:

from nltk.corpus import stopwords
stop = stopwords.words('english')

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
