How to replace Unicode characters in a string with something else in Python?

I have a string that I got from reading an HTML webpage. It contains symbols like “•” because of a bulleted list. Note that the text is the HTML source of a webpage, read with Python 2.7’s urllib2.urlopen(webaddress).read().

I know the bullet character is U+2022, but how do I actually replace that Unicode character with something else?

I tried doing
str.replace("•", "something")

but it does not appear to work… how do I do this?

Answers:

Method 1

  1. Decode the string to Unicode. Assuming it’s UTF-8-encoded:
    str.decode("utf-8")
  2. Call the replace method and be sure to pass it a Unicode string as its first argument:
    str.decode("utf-8").replace(u"\u2022", "*")
  3. Encode back to UTF-8, if needed:
    str.decode("utf-8").replace(u"\u2022", "*").encode("utf-8")

(Fortunately, Python 3 puts a stop to this mess. Step 3 should really only be performed just prior to I/O. Also, mind you that calling a string str shadows the built-in type str.)
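
In Python 3 the same three steps might look like the sketch below. It assumes the page body arrives as UTF-8 bytes; `raw_html` is a hypothetical stand-in for whatever the HTTP read returns.

```python
# Hypothetical stand-in for the bytes returned by reading the page (assumed UTF-8).
raw_html = "<li>\u2022 first</li><li>\u2022 second</li>".encode("utf-8")

# 1. Decode the bytes to a str (Python 3 strings are Unicode).
text = raw_html.decode("utf-8")

# 2. Replace the bullet (U+2022) with an asterisk.
cleaned = text.replace("\u2022", "*")

# 3. Encode back to UTF-8 only when bytes are needed, e.g. just before writing.
encoded = cleaned.encode("utf-8")

print(cleaned)
```

Keeping the text as `str` between steps 1 and 3 means every replacement in between works on characters rather than on raw bytes.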

Method 2

Work with the string as unicode (note the u prefix on both literals).

>>> special = u"\u2022"
>>> abc = u'ABC•def'
>>> abc.replace(special,'X')
u'ABCXdef'

Method 3

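import re
# Compile a pattern that matches the bullet character U+2022.
regex = re.compile(u"\u2022", re.UNICODE)
# something is the replacement text, yourstring is the unicode string to clean.
newstring = regex.sub(something, yourstring)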
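
A regular expression is mostly useful when several characters should be replaced in one pass. The sketch below is written for Python 3; the set of bullet-like characters and the names `BULLETS` and `replace_bullets` are illustrative assumptions, not part of the original answer.

```python
import re

# Illustrative character class: bullet (U+2022), triangular bullet (U+2023),
# and white bullet (U+25E6).
BULLETS = re.compile("[\u2022\u2023\u25e6]")

def replace_bullets(text, replacement="*"):
    """Return text with every character matched by BULLETS replaced."""
    return BULLETS.sub(replacement, text)

print(replace_bullets("\u2022 one \u25e6 two"))
```

For a single character, a plain `replace` call is simpler and does the same job.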

Method 4

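Try this one. It applies when the string contains literal escape sequences such as \u2022 (a backslash followed by u2022) rather than the characters themselves.

Decoding with unicode-escape turns those sequences into the actual characters, so you get a normal string (Python 3):

str.encode().decode('unicode-escape')

After that, you can perform any replacement:

str.replace('•','something')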
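
A small Python 3 sketch of this approach, assuming the text holds a literal escape sequence (a backslash followed by `u2022`) and is otherwise plain ASCII. The sample value of `escaped` is made up for illustration.

```python
# Made-up sample: the raw string holds the six characters \ u 2 0 2 2, not a bullet.
escaped = r"Item \u2022 one"

# Turn the escape sequence into the real character. Encoding as ASCII first makes
# the ASCII-only assumption explicit: it raises UnicodeEncodeError otherwise.
text = escaped.encode("ascii").decode("unicode-escape")

# Now the bullet is a single character and can be replaced normally.
result = text.replace("\u2022", "something")
print(result)
```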

Method 5

str1 = "This is Python\u500cPool"

Encode the string to ASCII, replacing every non-ASCII character with ‘?’.

str1 = str1.encode("ascii", "replace")

Decode the byte stream to string.

str1 = str1.decode(encoding="utf-8", errors="ignore")

Replace the question mark with the desired character. Note that this also replaces any question marks that were already in the text.

str1 = str1.replace("?"," ")
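
A possible variation, sketched for Python 3, replaces the non-ASCII characters directly so that question marks already present in the text are left alone. The function name `replace_non_ascii` is hypothetical.

```python
import re

def replace_non_ascii(text, replacement=" "):
    """Replace every character outside the ASCII range with replacement."""
    return re.sub(r"[^\x00-\x7f]", replacement, text)

print(replace_non_ascii("Is this Python\u500cPool?"))
```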

Method 6

Funny, the answer is hidden among the answers.

str.replace("•", "something")

would work if you use the right semantics.

str.replace(u"\u2022","something")

works wonders 😉, thanks to RParadox for the hint.

Method 7

If you want to remove all \u characters, the code below does that.

def replace_unicode_character(content: str) -> str:
    # Encode to ASCII, dropping every character that has no ASCII
    # representation, then decode the remaining bytes back to a string.
    return content.encode('ascii', 'ignore').decode('ascii')
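
The same goal can also be sketched at the character level in Python 3, keeping only code points below 128. The function name `strip_non_ascii` is hypothetical. Note that it drops every non-ASCII character, including accented letters.

```python
def strip_non_ascii(text):
    """Return text with every non-ASCII character removed."""
    return "".join(ch for ch in text if ord(ch) < 128)

print(strip_non_ascii("\u2022 caf\u00e9 menu"))
```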


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
