Why is my NLTK function slow when processing the DataFrame?

I am trying to run a function over a dataset with about a million rows.

  1. I read the data from CSV in a dataframe
  2. I use a drop list to remove the columns I don't need
  3. I pass each row through an NLTK function in a for loop.

code:

def nlkt(val):
    val=repr(val)
    clean_txt = [word for word in val.split() if word.lower() not in stopwords.words('english')]
    nopunc = [char for char in str(clean_txt) if char not in string.punctuation]
    nonum = [char for char in nopunc if not char.isdigit()]
    words_string = ''.join(nonum)
    return words_string

I call the above function in a for loop over my million records. Even though I am on a heavyweight server with a 24-core CPU and 88 GB of RAM, the loop takes too long and does not use the available computational power.

I am calling the above function like this:

data = pd.read_excel(scrPath + "UserData_Full.xlsx", encoding='utf-8')
droplist = ['Submitter', 'Environment']
data.drop(droplist,axis=1,inplace=True)

#Merging the columns company and detailed description

data['Anylize_Text']= data['Company'].astype(str) + ' ' + data['Detailed_Description'].astype(str)

finallist =[]

for eachlist in data['Anylize_Text']:
    z = nlkt(eachlist)
    finallist.append(z)

The above code works correctly but is too slow once there are a few million records. This Excel file is only a sample; the actual data will live in a database and run to a few hundred million records. Is there any way to speed up the operation, so that the data passes through the function faster and uses more of the available computational power?

Answers:


Method 1

Your original nlkt() makes three passes over the text of each row: once over the words and twice over the characters.

def nlkt(val):
    val=repr(val)
    clean_txt = [word for word in val.split() if word.lower() not in stopwords.words('english')]
    nopunc = [char for char in str(clean_txt) if char not in string.punctuation]
    nonum = [char for char in nopunc if not char.isdigit()]
    words_string = ''.join(nonum)
    return words_string

Also, every call to nlkt() re-evaluates the following, and because they sit inside the comprehension conditions, they are evaluated again for every word or character:

  • stopwords.words('english')
  • string.punctuation

These should be built once, at module level:

stoplist = stopwords.words('english') + list(string.punctuation)
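To illustrate why this matters, here is a minimal sketch that compares rebuilding the stopword list for every word against a lookup in a set built once. It does not use NLTK: SAMPLE_STOPWORDS and load_stopwords() are hypothetical stand-ins for stopwords.words('english'), and the timings will vary by machine.

```python
import string
import timeit

# Hypothetical stand-in for stopwords.words('english'); the real NLTK list is longer.
SAMPLE_STOPWORDS = ['i', 'me', 'my', 'this', 'is', 'and', 'the', 'a', 'an', 'of']


def load_stopwords():
    """Stand-in for stopwords.words('english'): returns a fresh list on each call."""
    return list(SAMPLE_STOPWORDS)


# Built once at import time; a set gives average O(1) membership tests.
STOPSET = set(SAMPLE_STOPWORDS) | set(string.punctuation)


def filter_rebuilding(words):
    # Mirrors the original pattern: the stopword list is rebuilt for every word.
    return [w for w in words if w.lower() not in load_stopwords()]


def filter_precomputed(words):
    # Looks each word up in the set that was built once.
    return [w for w in words if w.lower() not in STOPSET]


words = 'This is foo bar and doh'.split() * 1000

if __name__ == '__main__':
    slow = timeit.timeit(lambda: filter_rebuilding(words), number=20)
    fast = timeit.timeit(lambda: filter_precomputed(words), number=20)
    print(f'rebuilt per word: {slow:.3f}s, precomputed set: {fast:.3f}s')
```

Both functions return the same tokens for this input; only the cost of the membership test differs.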

Going through things line by line:

val=repr(val)

I’m not sure why you need to do this, but you can easily cast a column to str type. This should be done outside of your preprocessing function.

Hopefully this is self-explanatory:

>>> import pandas as pd
>>> df = pd.DataFrame([[0, 1, 2], [2, 'xyz', 4], [5, 'abc', 'def']])
>>> df
   0    1    2
0  0    1    2
1  2  xyz    4
2  5  abc  def
>>> df[1]
0      1
1    xyz
2    abc
Name: 1, dtype: object
>>> df[1].astype(str)
0      1
1    xyz
2    abc
Name: 1, dtype: object
>>> list(df[1])
[1, 'xyz', 'abc']
>>> list(df[1].astype(str))
['1', 'xyz', 'abc']

Now going to the next line:

clean_txt = [word for word in val.split() if word.lower() not in stopwords.words('english')]

Using str.split() is awkward; you should use a proper tokenizer. Otherwise, punctuation can stay attached to the preceding word, e.g.

>>> from nltk.corpus import stopwords
>>> from nltk import word_tokenize
>>> import string
>>> stoplist = stopwords.words('english') + list(string.punctuation)
>>> stoplist = set(stoplist)

>>> text = 'This is foo, bar and doh.'

>>> [word for word in text.split() if word.lower() not in stoplist]
['foo,', 'bar', 'doh.']

>>> [word for word in word_tokenize(text) if word.lower() not in stoplist]
['foo', 'bar', 'doh']

The .isdigit() check can also be done in the same pass:

>>> text = 'This is foo, bar, 234, 567 and doh.'
>>> [word for word in word_tokenize(text) if word.lower() not in stoplist and not word.isdigit()]
['foo', 'bar', 'doh']

Putting it all together, your nlkt() should look like this:

def preprocess(text):
    return [word for word in word_tokenize(text) if word.lower() not in stoplist and not word.isdigit()]

And you can apply it to the column with Series.apply:

data['Anylize_Text'].apply(preprocess)
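Note that preprocess() returns a list of tokens, whereas the original nlkt() returned a single string. If a string is what you need downstream, the tokens can be joined back together. The sketch below shows the idea under some assumptions: STOPSET is a small hypothetical stopword set, and TOKEN_RE is a crude regex stand-in for word_tokenize so that the example runs without NLTK data. It is not equivalent to the NLTK tokenizer.

```python
import re
import string

# Hypothetical stand-in stopword set; with NLTK installed you would build it
# from stopwords.words('english') instead.
STOPSET = {'this', 'is', 'and', 'the', 'a'} | set(string.punctuation)

# Crude stand-in for nltk.word_tokenize: runs of word characters, or single
# non-space symbols. Not equivalent to the NLTK tokenizer.
TOKEN_RE = re.compile(r'\w+|[^\w\s]')


def preprocess_to_string(text):
    """Drop stopwords, punctuation and digit-only tokens in one pass, then join."""
    tokens = TOKEN_RE.findall(text)
    kept = [t for t in tokens if t.lower() not in STOPSET and not t.isdigit()]
    return ' '.join(kept)


rows = ['This is foo, bar, 234, 567 and doh.', 'The a 42']
finallist = [preprocess_to_string(row) for row in rows]
print(finallist)  # ['foo bar doh', '']
```

With pandas, the same function could be passed to apply on the string column in place of the explicit loop.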


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
