Match any unicode letter?

In .NET you can use \p{L} to match any letter. How can I do the same in Python? Namely, I want to match any uppercase, lowercase, and accented letters.

Answers:


Method 1

Python’s re module doesn’t support Unicode properties yet. But you can compile your regex using the re.UNICODE flag, and then the character class shorthand \w will match Unicode letters, too.

Since \w will also match digits, you then need to subtract those from your character class, along with the underscore:

[^\W\d_]

will match any Unicode letter.

>>> import re
>>> r = re.compile(r'[^\W\d_]', re.U)
>>> r.match('x')
<_sre.SRE_Match object at 0x0000000001DBCF38>
>>> r.match(u'é')
<_sre.SRE_Match object at 0x0000000002253030>
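
On Python 3, str patterns are Unicode-aware by default, so the re.U flag is optional there. As a minimal sketch of the same character class used to pull out runs of letters (the helper name letter_runs is made up for illustration and is not part of the re module):

```python
import re

# [^\W\d_] means: a word character that is neither a digit nor an underscore.
LETTERS = re.compile(r'[^\W\d_]+')

def letter_runs(text):
    """Return the runs of consecutive letters found in text."""
    return LETTERS.findall(text)

print(letter_runs('Abc-++-Абв. 123_x'))  # => ['Abc', 'Абв', 'x']
```

The digits and the underscore are skipped because they are explicitly excluded from the class, while the Cyrillic letters are picked up just like the ASCII ones.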

Method 2

The PyPI regex module supports the \p{L} Unicode property class and many more; see the “Unicode codepoint properties, including scripts and blocks” section in its documentation and the full list at http://www.unicode.org/Public/UNIDATA/PropList.txt. Using the regex module is convenient because you get consistent results across Python versions (keep in mind that the Unicode standard is constantly evolving and the number of supported letters grows).

Install the library using pip install regex (or pip3 install regex) and use

\p{L}        # To match any Unicode letter
\p{Lu}       # To match any uppercase Unicode letter
\p{Ll}       # To match any lowercase Unicode letter
\p{L}\p{M}*  # To match any Unicode letter and any amount of diacritics after it

See some usage examples below:

import regex
text = r'Abc-++-Абв. It’s “Łąć”!'
# Removing letters:
print( regex.sub(r'\p{L}+', '', text) ) # => -++-. ’ “”!
# Extracting letter chunks:
print( regex.findall(r'\p{L}+', text) ) # => ['Abc', 'Абв', 'It', 's', 'Łąć']
# Removing all but letters:
print( regex.sub(r'\P{L}+', '', text) ) # => AbcАбвItsŁąć
# Removing all letters but ASCII letters:
print( regex.sub(r'[^\P{L}a-zA-Z]+', '', text) ) # => Abc-++-. It’s “”!
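
The L, Lu, and Ll names used above are Unicode general categories, which the standard library also exposes through unicodedata.category. As a rough sketch for cases where installing regex is not an option (the helper names is_letter and strip_non_letters are hypothetical, and this is a character-by-character filter rather than a regex):

```python
import unicodedata

def is_letter(ch):
    """True if ch has a general category starting with 'L' (Lu, Ll, Lt, Lm, Lo)."""
    return unicodedata.category(ch).startswith('L')

def strip_non_letters(text):
    # Similar in spirit to removing \P{L}+ with the regex module.
    return ''.join(ch for ch in text if is_letter(ch))

print(strip_non_letters('Abc-++-Абв. 42!'))                   # => AbcАбв
print(unicodedata.category('Б'), unicodedata.category('б'))  # => Lu Ll
```

Which characters count as letters here depends on the Unicode database bundled with the running Python version, so results for recently added characters may differ between versions.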



All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
