Remove Unicode emoji using re in Python

I tried to remove the emoji from a Unicode tweet text and print the result in Python 2.7 using

myre = re.compile(u'[\u1F300-\u1F5FF\u1F600-\u1F64F\u1F680-\u1F6FF\u2600-\u26FF\u2700-\u27BF]+', re.UNICODE)
print myre.sub('', text)

but it seems almost all the characters are removed from the text. I have checked several answers from other posts; unfortunately, none of them work here. Did I do anything wrong in re.compile()?

Here is an example of the output, with almost all the characters removed:

“   '   //./” ! # # # …

Answers:

Method 1

You are not using the correct notation for non-BMP Unicode code points; you want to use \U0001FFFF, a capital U and 8 digits:

myre = re.compile(u'['
    u'\U0001F300-\U0001F5FF'
    u'\U0001F600-\U0001F64F'
    u'\U0001F680-\U0001F6FF'
    u'\u2600-\u26FF\u2700-\u27BF]+',
    re.UNICODE)

This can be reduced to:

myre = re.compile(u'['
    u'\U0001F300-\U0001F64F'
    u'\U0001F680-\U0001F6FF'
    u'\u2600-\u26FF\u2700-\u27BF]+',
    re.UNICODE)

as your first two ranges are adjacent.

Your version was specifying (with added spaces for readability):

[\u1F30 0-\u1F5F F\u1F60 0-\u1F64 F\u1F68 0-\u1F6F F \u2600-\u26FF\u2700-\u27BF]+

That’s because the \uxxxx escape sequence always takes exactly 4 hex digits, not 5.

The largest of those ranges is 0-\u1F6F (so from the digit 0 through to Ὧ), which covers a very large swathe of the Unicode standard.
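To see the effect concretely, here is a minimal sketch. It assumes Python 3.3+ (it should also run on Python 2.7, since the u'' prefix is accepted by both). The names broken, sample and stripped are made up for illustration; the pattern is the one from the question.

```python
import re

# "\u" consumes exactly four hex digits, so u"\u1F300" is two characters:
# U+1F30 followed by the digit "0".
two_chars = u"\u1F300"
assert len(two_chars) == 2
assert two_chars[1] == u"0"

# The pattern from the question: it really contains ranges such as 0-\u1F6F.
broken = re.compile(
    u"[\u1F300-\u1F5FF\u1F600-\u1F64F\u1F680-\u1F6FF\u2600-\u26FF\u2700-\u27BF]+",
    re.UNICODE)

sample = u"Hello, world! #python"   # invented sample text, no emoji at all
stripped = broken.sub(u"", sample)  # digits and letters fall inside 0-\u1F6F
```

Only characters that sort before the digit 0 (spaces and punctuation such as , ! #) or after U+1F6F survive, which matches the output shown in the question.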

The corrected expression works, provided you use a UCS-4 wide Python executable:

>>> import re
>>> myre = re.compile(u'['
...     u'\U0001F300-\U0001F64F'
...     u'\U0001F680-\U0001F6FF'
...     u'\u2600-\u26FF\u2700-\u27BF]+',
...     re.UNICODE)
>>> myre.sub('', u'Some example text with a sleepy face: \U0001f62a')
u'Some example text with a sleepy face: '

The UCS-2 equivalent is:

myre = re.compile(u'('
    u'\ud83c[\udf00-\udfff]|'
    u'\ud83d[\udc00-\ude4f\ude80-\udeff]|'
    u'[\u2600-\u26FF\u2700-\u27BF])+',
    re.UNICODE)
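The surrogate values in the narrow-build pattern follow from the standard UTF-16 encoding of code points above U+FFFF. A small sketch of that arithmetic, where surrogate_pair is a hypothetical helper name and not part of re or the original answer:

```python
def surrogate_pair(code_point):
    """Return the (high, low) UTF-16 surrogates for a non-BMP code point."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError('not a supplementary-plane code point')
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

# Endpoints of the ranges used in the wide-build pattern.
for cp in (0x1F300, 0x1F3FF, 0x1F400, 0x1F64F, 0x1F680, 0x1F6FF):
    high, low = surrogate_pair(cp)
    print('U+%05X -> \\u%04x \\u%04x' % (cp, high, low))
```

U+1F300 through U+1F3FF share the high surrogate \ud83c, while U+1F400 through U+1F6FF share \ud83d, which is why the narrow-build pattern needs two alternatives for the non-BMP ranges.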

You can combine the two in your script with an exception handler:

try:
    # Wide UCS-4 build
    myre = re.compile(u'['
        u'\U0001F300-\U0001F64F'
        u'\U0001F680-\U0001F6FF'
        u'\u2600-\u26FF\u2700-\u27BF]+',
        re.UNICODE)
except re.error:
    # Narrow UCS-2 build
    myre = re.compile(u'('
        u'\ud83c[\udf00-\udfff]|'
        u'\ud83d[\udc00-\ude4f\ude80-\udeff]|'
        u'[\u2600-\u26FF\u2700-\u27BF])+',
        re.UNICODE)
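As an alternative to catching re.error, the build type can be detected up front. This is a sketch of a different technique from the one in the answer: it assumes that sys.maxunicode is 0xFFFF on narrow builds and 0x10FFFF on wide builds (and on every Python 3.3+). EMOJI_RE is an illustrative name.

```python
import re
import sys

if sys.maxunicode > 0xFFFF:
    # Wide (UCS-4) Python 2 build, or any Python 3.3+
    EMOJI_RE = re.compile(
        u'['
        u'\U0001F300-\U0001F64F'
        u'\U0001F680-\U0001F6FF'
        u'\u2600-\u26FF\u2700-\u27BF]+',
        re.UNICODE)
else:
    # Narrow (UCS-2) Python 2 build: match the surrogate pairs explicitly
    EMOJI_RE = re.compile(
        u'('
        u'\ud83c[\udf00-\udfff]|'
        u'\ud83d[\udc00-\ude4f\ude80-\udeff]|'
        u'[\u2600-\u26FF\u2700-\u27BF])+',
        re.UNICODE)

# Invented sample: U+1F62A SLEEPY FACE and U+2600 BLACK SUN WITH RAYS
cleaned = EMOJI_RE.sub(u'', u'sleepy \U0001F62A and sunny \u2600 text')
```

The explicit check states the intent directly, whereas the try/except version relies on the wide-build pattern failing to compile on a narrow build.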

Of course, the regex is already out of date, as it doesn’t cover Emoji defined in newer Unicode releases; it appears to cover Emoji defined up to Unicode 8.0 (since U+1F91D HANDSHAKE was added in Unicode 9.0).
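A quick check of that limitation, as a sketch assuming Python 3.3+ or a wide Python 2 build (on a narrow build the pattern below does not compile). The sample strings and variable names are invented for illustration.

```python
import re

# The corrected wide-build pattern from this answer.
myre = re.compile(
    u'['
    u'\U0001F300-\U0001F64F'
    u'\U0001F680-\U0001F6FF'
    u'\u2600-\u26FF\u2700-\u27BF]+',
    re.UNICODE)

sleepy = u'tired \U0001F62A'     # U+1F62A SLEEPY FACE, inside the first range
handshake = u'deal \U0001F91D'   # U+1F91D HANDSHAKE, outside every range

cleaned_sleepy = myre.sub(u'', sleepy)        # the emoji is removed
cleaned_handshake = myre.sub(u'', handshake)  # the string is left untouched
```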

If you need a more up-to-date regex, take one from a package that actively keeps up with new Emoji releases, such as the emoji package, which specifically supports generating such a regex:

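import emoji

def remove_emoji(text):
    # Note: get_emoji_regexp() was removed in emoji 2.0; newer releases
    # provide emoji.replace_emoji(text, replace=u'') instead.
    return emoji.get_emoji_regexp().sub(u'', text)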

At the time of writing, the package is up-to-date for Unicode 11.0 and has the infrastructure in place to update to future releases quickly. All your project has to do is upgrade when a new release comes out.

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
