Replace non-ASCII characters with a single space

I need to replace all non-ASCII (x00-x7F) characters with a space. I’m surprised that this is not dead-easy in Python, unless I’m missing something. The following function simply removes all non-ASCII characters:

def remove_non_ascii_1(text):

    return ''.join(i for i in text if ord(i)<128)

And this one replaces non-ASCII characters with the amount of spaces as per the amount of bytes in the character code point (i.e. the character is replaced with 3 spaces):

def remove_non_ascii_2(text):

    return re.sub(r'[^x00-x7F]',' ', text)

How can I replace all non-ASCII characters with a single space?

Of the myriad of similar SO questions, none address character replacement as opposed to stripping, and additionally address all non-ascii characters not a specific character.

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

Your ''.join() expression is filtering, removing anything non-ASCII; you could use a conditional expression instead:

return ''.join([i if ord(i) < 128 else ' ' for i in text])

This handles characters one by one and would still use one space per character replaced.

Your regular expression should just replace consecutive non-ASCII characters with a space:

re.sub(r'[^x00-x7F]+',' ', text)

Note the + there.

Method 2

For you the get the most alike representation of your original string I recommend the unidecode module:

# python 2.x:
from unidecode import unidecode
def remove_non_ascii(text):
    return unidecode(unicode(text, encoding = "utf-8"))

Then you can use it in a string:

remove_non_ascii("Ceñía")
Cenia

Method 3

For character processing, use Unicode strings:

PythonWin 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:57:17) [MSC v.1600 64 bit (AMD64)] on win32.
>>> s='ABC马克def'
>>> import re
>>> re.sub(r'[^x00-x7f]',r' ',s)   # Each char is a Unicode codepoint.
'ABC  def'
>>> b = s.encode('utf8')
>>> re.sub(rb'[^x00-x7f]',rb' ',b) # Each char is a 3-byte UTF-8 sequence.
b'ABC      def'

But note you will still have a problem if your string contains decomposed Unicode characters (separate character and combining accent marks, for example):

>>> s = 'mañana'
>>> len(s)
6
>>> import unicodedata as ud
>>> n=ud.normalize('NFD',s)
>>> n
'mañana'
>>> len(n)
7
>>> re.sub(r'[^x00-x7f]',r' ',s) # single codepoint
'ma ana'
>>> re.sub(r'[^x00-x7f]',r' ',n) # only combining mark replaced
'man ana'

Method 4

If the replacement character can be ‘?’ instead of a space, then I’d suggest result = text.encode('ascii', 'replace').decode():

"""Test the performance of different non-ASCII replacement methods."""


import re
from timeit import timeit


# 10_000 is typical in the project that I'm working on and most of the text
# is going to be non-ASCII.
text = 'Æ' * 10_000


print(timeit(
    """
result = ''.join([c if ord(c) < 128 else '?' for c in text])
    """,
    number=1000,
    globals=globals(),
))

print(timeit(
    """
result = text.encode('ascii', 'replace').decode()
    """,
    number=1000,
    globals=globals(),
))

Results:

0.7208260721400134
0.009975979187503592

Method 5

What about this one?

def replace_trash(unicode_string):
     for i in range(0, len(unicode_string)):
         try:
             unicode_string[i].encode("ascii")
         except:
              #means it's non-ASCII
              unicode_string=unicode_string[i].replace(" ") #replacing it with a single space
     return unicode_string

Method 6

As a native and efficient approach, you don’t need to use ord or any loop over the characters. Just encode with ascii and ignore the errors.

The following will just remove the non-ascii characters:

new_string = old_string.encode('ascii',errors='ignore')

Now if you want to replace the deleted characters just do the following:

final_string = new_string + b' ' * (len(old_string) - len(new_string))

Method 7

When we use the ascii() it escapes the non-ascii characters and it doesn’t change ascii characters correctly. So my main thought is, it doesn’t change the ASCII characters, so I am iterating through the string and checking if the character is changed. If it changed then replacing it with the replacer, what you give.
For example: ‘ ‘(a single space) or ‘?’ (with a question mark).

def remove(x, replacer):

     for i in x:
        if f"'{i}'" == ascii(i):
            pass
        else:
            x=x.replace(i,replacer)
     return x
remove('hái',' ')

Result: “h i” (with single space between).

Syntax : remove(str,non_ascii_replacer)
str = Here you will give the string you want to work with.
non_ascii_replacer = Here you will give the replacer which you want to replace all the non ASCII characters with.

Method 8

My problem was that my string contained things like Belgià for België and &#x20AC for the € sign. And I didn’t want to replace them with spaces. But wth the right symbol itself.

my solution was string.encode('Latin1').decode('utf-8')

Method 9

Potentially for a different question, but I’m providing my version of @Alvero’s answer (using unidecode). I want to do a “regular” strip on my strings, i.e. the beginning and end of my string for whitespace characters, and then replace only other whitespace characters with a “regular” space, i.e.

"Ceñíaㅤmañanaㅤㅤㅤㅤ"

to

"Ceñía mañana"

,

def safely_stripped(s: str):
    return ' '.join(
        stripped for stripped in
        (bit.strip() for bit in
         ''.join((c if unidecode(c) else ' ') for c in s).strip().split())
        if stripped)

We first replace all non-unicode spaces with a regular space (and join it back again),

''.join((c if unidecode(c) else ' ') for c in s)

And then we split that again, with python’s normal split, and strip each “bit”,

(bit.strip() for bit in s.split())

And lastly join those back again, but only if the string passes an if test,

' '.join(stripped for stripped in s if stripped)

And with that, safely_stripped('ㅤㅤㅤㅤCeñíaㅤmañanaㅤㅤㅤㅤ') correctly returns 'Ceñía mañana'.

Method 10

To replace all non-ASCII (x00-x7F) characters with a space:

''.join(map(lambda x: x if ord(x) in range(0, 128) else ' ', text))

To replace all visible characters, try this:

import string

''.join(map(lambda x: x if x in string.printable and x not in string.whitespace else ' ', text))

This will give the same result:

''.join(map(lambda x: x if ord(x) in range(32, 128) else ' ', text))


All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes
Article Rating
Subscribe
Notify of
guest

0 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
0
Would love your thoughts, please comment.x
()
x