I needed to strip the Chinese out of a bunch of strings today and was looking for a simple Python regex. Any suggestions?
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
Python 2:
#!/usr/bin/env python
# -*- encoding: utf8 -*-
import re
sample = u'I am from 美国。We should be friends. 朋友。'
for n in re.findall(ur'[u4e00-u9fff]+',sample):
print n
Python 3:
sample = 'I am from 美国。We should be friends. 朋友。'
for n in re.findall(r'[u4e00-u9fff]+', sample):
print(n)
Output:
美国 朋友
About Unicode code blocks:
The 4E00—9FFF range covers CJK Unified Ideographs (CJK=Chinese, Japanese and Korean). There are a number of lower ranges that relate, to some degree, to CJK:
31C0—31EF CJK Strokes 31F0—31FF Katakana Phonetic Extensions 3200—32FF Enclosed CJK Letters and Months 3300—33FF CJK Compatibility 3400—4DBF CJK Unified Ideographs Extension A 4DC0—4DFF Yijing Hexagram Symbols 4E00—9FFF CJK Unified Ideographs
Method 2
The short, but relatively comprehensive answer for narrow Unicode builds of python (excluding ordinals > 65535 which can only be represented in narrow Unicode builds via surrogate pairs):
RE = re.compile(u'[⺀-⺙⺛-⻳⼀-⿕々〇〡-〩〸-〺〻㐀-䶵一-鿃豈-鶴侮-頻並-龎]', re.UNICODE)
nochinese = RE.sub('', mystring)
The code for building the RE, and if you need to detect Chinese characters in the supplementary plane for wide builds:
# -*- coding: utf-8 -*-
import re
LHan = [[0x2E80, 0x2E99], # Han # So [26] CJK RADICAL REPEAT, CJK RADICAL RAP
[0x2E9B, 0x2EF3], # Han # So [89] CJK RADICAL CHOKE, CJK RADICAL C-SIMPLIFIED TURTLE
[0x2F00, 0x2FD5], # Han # So [214] KANGXI RADICAL ONE, KANGXI RADICAL FLUTE
0x3005, # Han # Lm IDEOGRAPHIC ITERATION MARK
0x3007, # Han # Nl IDEOGRAPHIC NUMBER ZERO
[0x3021, 0x3029], # Han # Nl [9] HANGZHOU NUMERAL ONE, HANGZHOU NUMERAL NINE
[0x3038, 0x303A], # Han # Nl [3] HANGZHOU NUMERAL TEN, HANGZHOU NUMERAL THIRTY
0x303B, # Han # Lm VERTICAL IDEOGRAPHIC ITERATION MARK
[0x3400, 0x4DB5], # Han # Lo [6582] CJK UNIFIED IDEOGRAPH-3400, CJK UNIFIED IDEOGRAPH-4DB5
[0x4E00, 0x9FC3], # Han # Lo [20932] CJK UNIFIED IDEOGRAPH-4E00, CJK UNIFIED IDEOGRAPH-9FC3
[0xF900, 0xFA2D], # Han # Lo [302] CJK COMPATIBILITY IDEOGRAPH-F900, CJK COMPATIBILITY IDEOGRAPH-FA2D
[0xFA30, 0xFA6A], # Han # Lo [59] CJK COMPATIBILITY IDEOGRAPH-FA30, CJK COMPATIBILITY IDEOGRAPH-FA6A
[0xFA70, 0xFAD9], # Han # Lo [106] CJK COMPATIBILITY IDEOGRAPH-FA70, CJK COMPATIBILITY IDEOGRAPH-FAD9
[0x20000, 0x2A6D6], # Han # Lo [42711] CJK UNIFIED IDEOGRAPH-20000, CJK UNIFIED IDEOGRAPH-2A6D6
[0x2F800, 0x2FA1D]] # Han # Lo [542] CJK COMPATIBILITY IDEOGRAPH-2F800, CJK COMPATIBILITY IDEOGRAPH-2FA1D
def build_re():
L = []
for i in LHan:
if isinstance(i, list):
f, t = i
try:
f = unichr(f)
t = unichr(t)
L.append('%s-%s' % (f, t))
except:
pass # A narrow python build, so can't use chars > 65535 without surrogate pairs!
else:
try:
L.append(unichr(i))
except:
pass
RE = '[%s]' % ''.join(L)
print 'RE:', RE.encode('utf-8')
return re.compile(RE, re.UNICODE)
RE = build_re()
print RE.sub('', u'美国').encode('utf-8')
print RE.sub('', u'blah').encode('utf-8')
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0