Remove punctuation from Unicode formatted strings

I have a function that removes punctuation from a list of strings:

def strip_punctuation(input):
    x = 0
    for word in input:
        input[x] = re.sub(r'[^A-Za-z0-9 ]', "", input[x])
        x += 1
    return input

I recently modified my script to use Unicode strings so I could handle other non-Western characters. This function breaks when it encounters these special characters and just returns empty Unicode strings. How can I reliably remove punctuation from Unicode formatted strings?

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

You could use unicode.translate() method:

import unicodedata
import sys

tbl = dict.fromkeys(i for i in xrange(sys.maxunicode)
                      if unicodedata.category(unichr(i)).startswith('P'))
def remove_punctuation(text):
    return text.translate(tbl)

You could also use r'p{P}' that is supported by regex module:

import regex as re

def remove_punctuation(text):
    return re.sub(ur"p{P}+", "", text)

Method 2

If you want to use J.F. Sebastian’s solution in Python 3:

import unicodedata
import sys

tbl = dict.fromkeys(i for i in range(sys.maxunicode)
                      if unicodedata.category(chr(i)).startswith('P'))
def remove_punctuation(text):
    return text.translate(tbl)

Method 3

You can iterate through the string using the unicodedata module’s category function to determine if the character is punctuation.

For possible outputs of category, see unicode.org’s doc on General Category Values

import unicodedata.category as cat
def strip_punctuation(word):
    return "".join(char for char in word if cat(char).startswith('P'))
filtered = [strip_punctuation(word) for word in input]

Additionally, make sure that you’re handling encodings and types correctly. This presentation is a good place to start: http://bit.ly/unipain

Method 4

A little shorter version based on Daenyth answer

import unicodedata

def strip_punctuation(text):
    """
    >>> strip_punctuation(u'something')
    u'something'

    >>> strip_punctuation(u'something.,:else really')
    u'somethingelse really'
    """
    punctutation_cats = set(['Pc', 'Pd', 'Ps', 'Pe', 'Pi', 'Pf', 'Po'])
    return ''.join(x for x in text
                   if unicodedata.category(x) not in punctutation_cats)

input_data = [u'somehting', u'something, else', u'nothing.']
without_punctuation = map(strip_punctuation, input_data)


All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes
Article Rating
Subscribe
Notify of
guest

0 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
0
Would love your thoughts, please comment.x
()
x