Removing punctuation and digits and lowercasing are not working when using NLTK.
My code
stopwords=nltk.corpus.stopwords.words('english')+ list(string.punctuation)
user_defined_stop_words=['st','rd','hong','kong']
new_stop_words=stopwords+user_defined_stop_words
def preprocess(text):
    return [word for word in word_tokenize(text) if word.lower() not in new_stop_words and not word.isdigit()]
miss_data['Clean_addr'] = miss_data['Adj_Addr'].apply(preprocess)
Sample Input
23FLOOR 9 DES VOEUX RD WEST HONG KONG PAG CONSULTING FLAT 15 AIA CENTRAL 1 CONNAUGHT RD CENTRAL C/O CITY LOST STUDIOS AND FLAT 4F 13-15 HILLIER ST SHEUNG HONG KONG
Expected Output
floor des voeux west pag consulting flat aia central connaught central co city lost studios flat f hillier sheung
Answers:
Method 1
Your function is slow and incomplete. First, the issues –
- You’re not lowercasing your data.
- You’re not getting rid of digits and punctuation properly.
- You’re not returning a string (you should join the list using str.join and return it).
- Furthermore, a list comprehension with text processing is a prime way to introduce readability issues, not to mention possible redundancies (you may call a function multiple times, once for each if condition it appears in).
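To see why the original filter lets digits and punctuation through, it helps to test individual tokens. This is a minimal sketch using only the standard library; the token list is hand-written to approximate what a tokenizer might produce for the sample input, so treat it as an assumption rather than actual word_tokenize output.

```python
import string

# Hand-written tokens approximating the sample address (an assumption,
# not real word_tokenize output).
tokens = ["23FLOOR", "9", "13-15", "C/O", "4F", "RD"]

punctuation = set(string.punctuation)

for tok in tokens:
    # str.isdigit() is True only when every character is a digit, and the
    # punctuation check only matches a token that is a single punctuation mark.
    print(tok, tok.isdigit(), tok in punctuation)

# Same kind of filter as the original code: only the bare "9" is removed.
kept = [t for t in tokens if not t.isdigit() and t not in punctuation]
print(kept)
```

Digits and punctuation embedded inside a token survive a per-token filter like this, which is why stripping unwanted characters from the whole string is the more reliable route.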
Next, there are a couple of glaring inefficiencies with your function, especially with the stopword removal code.
- Your stopwords structure is a list, and in checks on lists are slow. The first thing to do would be to convert that to a set, making the not in check constant time.
- You’re using nltk.word_tokenize, which is unnecessarily slow.
- Lastly, you shouldn’t always rely on apply, even if you are working with NLTK where there’s rarely any vectorised solution available. There are almost always other ways to do the exact same thing. Oftentimes, even a python loop is faster. But this isn’t set in stone.
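As a rough illustration of the list-versus-set point, the sketch below times a membership test against both structures. The stopword list is hypothetical filler data, and the absolute numbers will vary by machine.

```python
import timeit

# Hypothetical stopword list, large enough for the difference to be visible.
stop_list = [f"word{i}" for i in range(5000)]
stop_set = set(stop_list)

probe = "floor"  # not a stopword, so the list scan has to check every element

# A list membership test scans element by element; a set uses a hash lookup.
list_time = timeit.timeit(lambda: probe in stop_list, number=1000)
set_time = timeit.timeit(lambda: probe in stop_set, number=1000)

print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```

Both structures give the same answer for every lookup; only the cost per lookup differs.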
First, create your enhanced stopwords as a set –
import string
import nltk

user_defined_stop_words = ['st', 'rd', 'hong', 'kong']
i = nltk.corpus.stopwords.words('english')
j = list(string.punctuation) + user_defined_stop_words
stopwords = set(i).union(j)
The next fix is to get rid of the one-line list comprehension and convert this into a multi-line function. This makes things so much easier to work with. Each line of your function should be dedicated to solving a particular task (for example, getting rid of digits/punctuation, getting rid of stopwords, or lowercasing) –
import re

def preprocess(x):
    x = re.sub(r'[^a-z\s]', '', x.lower())              # get rid of noise (keep only letters and whitespace)
    x = [w for w in x.split() if w not in stopwords]    # remove stopwords (stopwords is already a set)
    return ' '.join(x)                                  # join the list
This would then be applied to your column –
df['Clean_addr'] = df['Adj_Addr'].apply(preprocess)
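To check the function against the sample input without pandas or the NLTK corpus, here is a small self-contained sketch. The stopword set is a hand-picked stand-in for the full NLTK set (an assumption made so the example runs anywhere), and the three sample strings are the rows of the question's input.

```python
import re

# Small stand-in for the NLTK-based stopword set (an assumption so the sketch
# runs without downloading the NLTK corpus).
stopwords = {"and", "st", "rd", "hong", "kong"}

def preprocess(x):
    x = re.sub(r"[^a-z\s]", "", x.lower())                 # keep only letters and whitespace
    words = [w for w in x.split() if w not in stopwords]   # drop stopwords
    return " ".join(words)                                 # return a string, not a list

samples = [
    "23FLOOR 9 DES VOEUX RD WEST HONG KONG",
    "PAG CONSULTING FLAT 15 AIA CENTRAL 1 CONNAUGHT RD CENTRAL",
    "C/O CITY LOST STUDIOS AND FLAT 4F 13-15 HILLIER ST SHEUNG HONG KONG",
]
cleaned = [preprocess(s) for s in samples]
for line in cleaned:
    print(line)
```

With this reduced stopword set, the three cleaned strings match the expected output given in the question.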
As an alternative, here’s an approach that doesn’t rely on apply. This should work well for short sentences.
Load your data into a series –
v = miss_data['Adj_Addr']
v

0                23FLOOR 9 DES VOEUX RD WEST HONG KONG
1    PAG CONSULTING FLAT 15 AIA CENTRAL 1 CONNAUGHT...
2    C/O CITY LOST STUDIOS AND FLAT 4F 13-15 HILLIE...
Name: Adj_Addr, dtype: object
Now comes the heavy lifting.
- Lowercase with str.lower
- Remove noise using str.replace
- Split words into separate cells using str.split
- Apply stopword removal using pd.DataFrame.isin + pd.DataFrame.where
- Finally, join the dataframe using agg.
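# regex=True is passed explicitly because recent pandas versions treat the pattern as a literal by default
v = v.str.lower().str.replace(r'[^a-z\s]', '', regex=True).str.split(expand=True)

(v.where(~v.isin(stopwords) & v.notnull(), '')
  .agg(' '.join, axis=1)
  .str.replace(r'\s+', ' ', regex=True)
  .str.strip())
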
0 floor des voeux west
1 pag consulting flat aia central connaught central
2 co city lost studios flat f hillier sheung
dtype: object
To use this on multiple columns, place this code in a function preprocess2 and call apply –
def preprocess2(v):
    v = v.str.lower().str.replace(r'[^a-z\s]', '', regex=True).str.split(expand=True)

    return (v.where(~v.isin(stopwords) & v.notnull(), '')
             .agg(' '.join, axis=1)
             .str.replace(r'\s+', ' ', regex=True)
             .str.strip())

c = ['Col1', 'Col2', ...]  # columns to operate on
df[c] = df[c].apply(preprocess2, axis=0)
You’ll still need an apply call, but with a small number of columns, it shouldn’t scale too badly. If you dislike apply, then here’s a loopy variant for you –
for _c in c:
    df[_c] = preprocess2(df[_c])
Let’s see the difference between our non-loopy version and the original –
s = pd.concat([s] * 100000, ignore_index=True)
s.size

300000
First, a sanity check –
preprocess2(s).eq(s.apply(preprocess)).all()

True
Now come the timings.
%timeit preprocess2(s)
1 loop, best of 3: 13.8 s per loop

%timeit s.apply(preprocess)
1 loop, best of 3: 9.72 s per loop
This is surprising, because apply is seldom faster than a non-loopy solution. But it makes sense in this case: we’ve optimised preprocess quite a bit, and although pandas string methods present a vectorised interface, they largely loop over the values internally, so the performance gain isn’t as much as you’d expect.
Let’s see if we can do better, bypassing the apply, using np.vectorize
preprocess3 = np.vectorize(preprocess)

%timeit preprocess3(s)
1 loop, best of 3: 9.65 s per loop
This does the same work as apply but happens to be a bit faster because of the reduced overhead around the “hidden” loop.
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.