I am writing a Python MapReduce word count program. The problem is that the data contains many non-alphabetic characters. I found the post "Stripping everything but alphanumeric chars from a string in Python", which shows a nice solution using regex, but I am not sure how to implement it:
def mapfn(k, v):
    print v
    import re, string
    pattern = re.compile('[\W_]+')
    v = pattern.match(v)
    print v
    for w in v.split():
        yield w, 1
I’m afraid I am not sure how to use the library re or even regex for that matter. I am not sure how to apply the regex pattern to the incoming string (line of a book) v properly to retrieve the new line without any non-alphanumeric chars.
Suggestions?
Answers:
Method 1
Use re.sub
import re
regex = re.compile('[^a-zA-Z]')
#First parameter is the replacement, second parameter is your input string
regex.sub('', 'ab3d*E')
#Out: 'abdE'
Alternatively, if you only want to remove a certain set of characters (as an apostrophe might be okay in your input…)
regex = re.compile('[,.!?]') #etc.
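Applied to the mapfn from the question, this could look roughly like the sketch below. It assumes the pattern [\W_]+ from the linked post and replaces each match with a space rather than an empty string, so that words separated only by punctuation are not glued together. The name NON_WORD is made up for this example.

```python
import re

# Runs of non-word characters or underscores (pattern from the linked post).
# NON_WORD is a hypothetical name chosen for this sketch.
NON_WORD = re.compile(r'[\W_]+')

def mapfn(k, v):
    """Yield (word, 1) for every word in the line v."""
    # Replace with a space, not '', so "end.Start" still splits into two words.
    cleaned = NON_WORD.sub(' ', v)
    for w in cleaned.split():
        yield w, 1

print(list(mapfn(None, "Hello, world! It's_me")))
```

Note that this keeps digits. To keep letters only, swap in a pattern such as [^a-zA-Z]+.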
Method 2
If you prefer not to use regex, you might try
''.join([i for i in s if i.isalpha()])
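One caveat for a word count: isalpha() is also False for spaces, so the one-liner above removes the whitespace between words. A minimal sketch that keeps whitespace so the line can still be split into words (letters_and_spaces is a made-up helper name):

```python
def letters_and_spaces(s):
    """Keep letters and whitespace; drop digits and punctuation."""
    # isspace() keeps the separators that split() relies on afterwards.
    return ''.join(c for c in s if c.isalpha() or c.isspace())

line = "It was the best of times, it was the worst of times (1859)."
print(letters_and_spaces(line).split())
```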
Method 3
You can use the re.sub() function to remove these characters:
>>> import re
>>> re.sub("[^a-zA-Z]+", "", "ABC12abc345def")
'ABCabcdef'
re.sub(MATCH PATTERN, REPLACE STRING, STRING TO SEARCH)
"[^a-zA-Z]+": look for any group of characters that are NOT a-zA-Z.
"": replace the matched characters with "" (nothing).
Method 4
Try:
s = ''.join(filter(str.isalnum, s))
This will take every char from the string, keep only alphanumeric ones and build a string back from them.
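Keep in mind that str.isalnum is Unicode-aware in Python 3, so it keeps accented and non-Latin letters that an ASCII character class like [^a-zA-Z0-9] would strip. A small illustration, using a sample string invented for this example:

```python
import re

s = 'Łódź 2024!'  # sample input, chosen for illustration

# str.isalnum accepts any Unicode letter or digit.
kept_unicode = ''.join(filter(str.isalnum, s))

# The ASCII class only keeps a-z, A-Z and 0-9.
kept_ascii = re.sub(r'[^a-zA-Z0-9]', '', s)

print(kept_unicode)  # Łódź2024
print(kept_ascii)    # d2024
```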
Method 5
In the timings below, a precompiled regex was the fastest of the three approaches tested:
import timeit

# Try with regex first
t0 = timeit.timeit("""
s = r2.sub('', st)
""", setup = """
import re
r2 = re.compile(r'[^a-zA-Z0-9]', re.MULTILINE)
st = 'abcdefghijklmnopqrstuvwxyz123456789!@#$%^&*()-=_+'
""", number = 1000000)
print(t0)

# Try with join method on filter
t0 = timeit.timeit("""
s = ''.join(filter(str.isalnum, st))
""", setup = """
st = 'abcdefghijklmnopqrstuvwxyz123456789!@#$%^&*()-=_+'
""", number = 1000000)
print(t0)

# Try with only join
t0 = timeit.timeit("""
s = ''.join(c for c in st if c.isalnum())
""", setup = """
st = 'abcdefghijklmnopqrstuvwxyz123456789!@#$%^&*()-=_+'
""", number = 1000000)
print(t0)
Results (total seconds for 1,000,000 runs):
2.6002226710006653  regex
5.739747313000407   filter + join
6.540099570000166   join with a generator expression
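A timing comparison is only meaningful if the candidates produce the same output. The sketch below checks that for an ASCII sample string chosen for illustration. On non-ASCII input the results can differ, because str.isalnum accepts any Unicode letter or digit while [^a-zA-Z0-9] does not.

```python
import re

# ASCII sample: lowercase letters, digits, then punctuation.
st = 'abcdefghijklmnopqrstuvwxyz123456789!@#$%^&*()-=_+'

by_regex = re.sub(r'[^a-zA-Z0-9]', '', st)
by_filter = ''.join(filter(str.isalnum, st))
by_genexp = ''.join(c for c in st if c.isalnum())

print(by_regex)
print(by_regex == by_filter == by_genexp)  # True for this ASCII input
```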
Method 6
It is advisable to use the PyPI regex module if you plan to match specific Unicode property classes. This library has also proven to be more stable, especially when handling large texts, and yields consistent results across various Python versions. All you need to do is keep it up to date.
If you install it (using pip install regex or pip3 install regex), you may use
import regex
print(regex.sub(r'\P{L}+', '', 'ABCŁąć1-2!Абв3§4“5def”'))
# => ABCŁąćАбвdef
to remove all chunks of 1 or more characters other than Unicode letters from text. You may also use "".join(regex.findall(r'\p{L}+', 'ABCŁąć1-2!Абв3§4“5def”')) to get the same result.
In Python re, to match any Unicode letter, one may use the [^\W\d_] construct (see "Match any unicode letter?").
So, to remove all non-letter characters, you may either match all letters and join the results:
result = "".join(re.findall(r'[^\W\d_]', text))
Or remove all chars matching the [\W\d_] pattern (the opposite of [^\W\d_]):
result = re.sub(r'[\W\d_]+', '', text)
However, you may get inconsistent results across Python versions because the Unicode standard is evolving, and the set of chars matched by \w depends on the Python version. Using the PyPI regex library is highly recommended to get consistent results.
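For reference, here is a self-contained sketch of the two standard-library variants, run on the same sample string as the regex-module example. It assumes Python 3, where str patterns are Unicode-aware by default. The variable names are chosen for this example only.

```python
import re

text = 'ABCŁąć1-2!Абв3§4“5def”'

# [^\W\d_] = word characters minus digits and underscore, i.e. letters.
joined = "".join(re.findall(r'[^\W\d_]', text))

# [\W\d_]+ = runs of anything that is not a letter.
removed = re.sub(r'[\W\d_]+', '', text)

print(joined)   # ABCŁąćАбвdef
print(removed)  # ABCŁąćАбвdef
```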
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.