Remove non-ASCII characters from pandas column

I have been trying to work on this issue for a while.I am trying to remove non ASCII characters form DB_user column and trying to replace them with spaces. But I keep getting some errors. This is how my data frame looks:

+-----------------------------------------------------------
|      DB_user                            source   count  |                                             
+-----------------------------------------------------------
| ???/"Ò|Z?)?]??C %??J                      A        10   |                                       
| ?D$ZGU   ;@D??_???T(?)                    B         3   |                                       
| ?Q`H??M'?Y??KTK$?Ù‹???Ð©JL4??*?_??        C         2   |                                        
+-----------------------------------------------------------

I was using this function, which I had come across while researching the problem on SO.

def filter_func(string):
   for i in range(0,len(string)):


      if (ord(string[i])< 32 or ord(string[i])>126
           break

      return ''

And then using the apply function:

df['DB_user'] = df.apply(filter_func,axis=1)

I keep getting the error:

'ord() expected a character, but string of length 66 found', u'occurred at index 2'

However, I thought by using the loop in the filter_func function, I was dealing with this by inputing a char into ‘ord’. Therefore the moment it hits a non-ASCII character, it should be replaced by a space.

Could somebody help me out?

Thanks!

Contents hide

Answers:

Method 1

Method 2

Method 3

Method 4

Method 5

Method 6

Method 7

Method 8

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

you may try this:

df.DB_user.replace({r'[^x00-x7F]+':''}, regex=True, inplace=True)

Method 2

A common trick is to perform ASCII encoding with the errors="ignore" flag, then subsequently decoding it into ASCII:

df['DB_user'].str.encode('ascii', 'ignore').str.decode('ascii')

From python3.x and above, this is my recommended solution.

Minimal Code Sample

s = pd.Series(['Déjà vu', 'Ò|zz', ';test 123'])
s

0      Déjà vu
1         Ò|zz
2    ;test 123
dtype: object


s.str.encode('ascii', 'ignore').str.decode('ascii')

0        Dj vu
1          |zz
2    ;test 123
dtype: object

P.S.: This can also be extended to cases where you need to filter out characters that do not belong to any character encoding scheme (not just ASCII).

Method 3

You code fails as you are not applying it on each character, you are applying it per word and ord errors as it takes a single character, you would need:

  df['DB_user'] = df["DB_user"].apply(lambda x: ''.join([" " if ord(i) < 32 or ord(i) > 126 else i for i in x]))

You can also simplify the join using a chained comparison:

   ''.join([i if 32 < ord(i) < 126 else " " for i in x])

You could also use string.printable to filter the chars:

from string import printable
st = set(printable)
df["DB_user"] = df["DB_user"].apply(lambda x: ''.join([" " if  i not in  st else i for i in x]))

The fastest is to use translate:

from string import maketrans

del_chars =  " ".join(chr(i) for i in range(32) + range(127, 256))
trans = maketrans(t, " "*len(del_chars))

df['DB_user'] = df["DB_user"].apply(lambda s: s.translate(trans))

Interestingly that is faster than:

  df['DB_user'] = df["DB_user"].str.translate(trans)

Method 4

A couple of the answers given here aren’t correct. Simple validation:

s = pd.Series([chr(x) for x in range(256)])
s.loc[0]
>> 'x00'
s.replace({r'[^x00-x7F]+':''}, regex=True).loc[0]
>> 'x00'  # FAIL
s.str.encode('ascii', 'ignore').str.decode('ascii').loc[0]
>> 'x00'  # FAIL
s.apply(lambda x: ''.join([i if 32 < ord(i) < 126 else " " for i in x])).loc[0]
>> ' '  # Success!
import string
s.apply(lambda x: ''.join([" " if  i not in string.printable else i for i in x])).loc[0]
>> ' '  # Looks good, but...
s.apply(lambda x: ''.join([" " if  i not in string.printable else i for i in x])).loc[11]
>> 'x0b'  # FAIL
del_chars =  " ".join([chr(i) for i in list(range(32)) + list(range(127, 256))])
trans = str.maketrans(del_chars, " " * len(del_chars))
s.apply(lambda x: x.translate(trans)).loc[11]
>> ' '  # Success!

Conclusion: only the options in the accepted answer (from Padraic Cunningham) work reliably. There are some bizarre Python errors/typos in his second answer, amended here, but otherwise it should be the fastest.

Method 5

from string import printable

def printable_mapper(x): 
    return ''.join([_ if _ in printable else " " for _ in x])

df.DB_user = df.DB_user.map(printable_mapper)

Method 6

This worked for me. Given the series has some NaN values, it performs only on strings:

from string import printable

import pandas as pd

df["text_data"] = df["text_data"].str.split().str.join(' ')

df["text_data"] = df["text_data"].apply(lambda string_var: ''.join(filter(lambda y: y in printable, string_var)) if isinstance(string_var, str) else string_var)

Method 7

Here is a one liner that I use:

df = df.replace(to_replace="/[^ -~]+/g", value="", regex=True)

Using regex, it globally removes characters not in the range of ‘ ‘(space) and ~

Method 8

This worked for me:

import re
def replace_foreign_characters(s):
    return re.sub(r'[^x00-x7f]',r'', s)

df['column_name'] = df['column_name'].apply(lambda x: replace_foreign_characters(x))

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes

Article Rating