I have some escaped strings that need to be unescaped. I’d like to do this in Python.
For example, in Python 2.7 I can do this:
>>> "\\123omething special".decode('string-escape')
'Something special'
>>>
How do I do it in Python 3? This doesn’t work:
>>> b"\\123omething special".decode('string-escape')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
LookupError: unknown encoding: string-escape
>>>
My goal is to be able to take a string like this:
s\000u\000p\000p\000o\000r\000t\000@\000p\000s\000i\000l\000o\000c\000.\000c\000o\000m\000
And turn it into:
"support@psiloc.com"
After I do the conversion, I’ll probe to see if the string I have is encoded in UTF-8 or UTF-16.
Answers:
Method 1
You’ll have to use unicode_escape instead:
>>> b"\\123omething special".decode('unicode_escape')
If you start with a str object instead (the equivalent of the Python 2.7 unicode type), you’ll need to encode to bytes first, then decode with unicode_escape.
If you need bytes as the end result, you’ll have to encode again to a suitable encoding (.encode('latin1'), for example, if you need to preserve literal byte values; the first 256 Unicode code points map one-to-one to byte values).
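As a rough illustration of the two cases just described (starting from a str, and needing bytes at the end), here is a minimal sketch. The helper names unescape_str and unescape_to_bytes are made up for this example, and it assumes the input only contains code points below 256 so that latin1 can represent it.

```python
def unescape_str(text):
    """Interpret backslash escapes in a str and return a str."""
    # str -> bytes (latin1 keeps code points 0-255 as single bytes),
    # then let unicode_escape interpret the backslash sequences.
    return text.encode('latin1').decode('unicode_escape')


def unescape_to_bytes(text):
    """Same as above, but return the literal byte values instead of a str."""
    return unescape_str(text).encode('latin1')


print(unescape_str(r'\123omething special'))   # Something special
print(unescape_to_bytes(r'a\000b\xff'))        # b'a\x00b\xff'
```

If the unescaped text can contain code points above 255, the final encode('latin1') raises UnicodeEncodeError, so this is only suitable when the escapes describe raw bytes.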
Your example is actually UTF-16 data with escapes. Decode from unicode_escape, back to latin1 to preserve the bytes, then from utf-16-le (UTF-16 little endian without BOM):
>>> value = b's\\000u\\000p\\000p\\000o\\000r\\000t\\000@\\000p\\000s\\000i\\000l\\000o\\000c\\000.\\000c\\000o\\000m\\000'
>>> value.decode('unicode_escape').encode('latin1')  # convert to bytes
b's\x00u\x00p\x00p\x00o\x00r\x00t\x00@\x00p\x00s\x00i\x00l\x00o\x00c\x00.\x00c\x00o\x00m\x00'
>>> _.decode('utf-16-le')  # decode from UTF-16-LE
'support@psiloc.com'
Method 2
The old “string-escape” codec maps bytestrings to bytestrings, and there’s been a lot of debate about what to do with such codecs, so it isn’t currently available through the standard encode/decode interfaces.
However, the code is still there in the C API (as PyBytes_En/DecodeEscape), and it is still exposed to Python via the undocumented codecs.escape_encode and codecs.escape_decode:
>>> import codecs
>>> codecs.escape_decode(b"ab\\xff")
(b'ab\xff', 6)
>>> codecs.escape_encode(b"ab\xff")
(b'ab\\xff', 3)
These functions return the transformed bytes object, plus a number indicating how many bytes were processed; you can just ignore the latter.
>>> value = b's\\000u\\000p\\000p\\000o\\000r\\000t\\000@\\000p\\000s\\000i\\000l\\000o\\000c\\000.\\000c\\000o\\000m\\000'
>>> codecs.escape_decode(value)[0]
b's\x00u\x00p\x00p\x00o\x00r\x00t\x00@\x00p\x00s\x00i\x00l\x00o\x00c\x00.\x00c\x00o\x00m\x00'
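If you use this in more than one place, it may be convenient to hide the tuple unpacking behind a small wrapper. The sketch below is one possible way to do that; bytes_unescape is a hypothetical name, and since codecs.escape_decode is undocumented its behavior could change between Python versions.

```python
import codecs


def bytes_unescape(data):
    """Interpret backslash escapes in a bytes object, returning bytes."""
    # escape_decode returns a tuple: (decoded bytes, input bytes consumed).
    decoded, _consumed = codecs.escape_decode(data)
    return decoded


# Raw bytes literal, so every backslash below is a literal backslash.
raw = rb's\000u\000p\000p\000o\000r\000t\000@\000p\000s\000i\000l\000o\000c\000.\000c\000o\000m\000'
print(bytes_unescape(raw).decode('utf-16-le'))  # support@psiloc.com
```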
Method 3
If you want str-to-str decoding of escape sequences, so both input and output are Unicode:
def string_escape(s, encoding='utf-8'):
    return (s.encode('latin1')         # To bytes, required by 'unicode-escape'
             .decode('unicode-escape') # Perform the actual octal-escaping decode
             .encode('latin1')         # 1:1 mapping back to bytes
             .decode(encoding))        # Decode original encoding
Testing:
>>> string_escape('\\123omething special')
'Something special'
>>> string_escape(r's\000u\000p\000p\000o\000r\000t\000@'
...               r'\000p\000s\000i\000l\000o\000c\000.\000c\000o\000m\000',
...               'utf-16-le')
'support@psiloc.com'
Method 4
You can’t use unicode_escape on byte strings (or rather, you can, but it doesn’t always return the same thing as string_escape does on Python 2) – beware!
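To make the caveat concrete, here is a small sketch of one way the two codecs differ. unicode_escape also interprets \u escapes, which Python 2's string-escape left untouched, so the usual decode-then-latin1 round trip can produce different bytes or fail outright. The roundtrip helper is a hypothetical name used only for this illustration.

```python
def roundtrip(data):
    # bytes -> str via unicode_escape, then back to bytes via latin1.
    return data.decode('unicode_escape').encode('latin1')


# Byte-level escapes behave the way string-escape did.
print(roundtrip(rb'\123\xff'))    # b'S\xff'

# A \u escape is interpreted here, so the output is no longer the
# literal six characters that were in the input.
print(roundtrip(rb'\u0041'))      # b'A'

# A \u escape above U+00FF cannot be mapped back to a single byte.
try:
    roundtrip(rb'\u20ac')
except UnicodeEncodeError as exc:
    print('latin1 cannot encode the result:', exc.reason)
```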
This function implements string_escape using a regular expression and custom replacement logic.
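import re

def unescape(text):
    # Match a backslash followed by one escape body: another backslash, up to
    # three octal digits, an x plus one or two characters (parsed as hex),
    # a quote or letter escape, any other character, or end of input.
    regex = re.compile(b'\\\\(\\\\|[0-7]{1,3}|x.[0-9a-f]?|[\'"abfnrt]|.|$)')
    def replace(m):
        b = m.group(1)
        if len(b) == 0:
            raise ValueError("Invalid character escape: '\\'.")
        i = b[0]
        if i == 120:                # 'x': hexadecimal escape
            v = int(b[1:], 16)
        elif 48 <= i <= 55:         # '0'-'7': octal escape
            v = int(b, 8)
        elif i == 34: return b'"'
        elif i == 39: return b"'"
        elif i == 92: return b'\\'
        elif i == 97: return b'\a'
        elif i == 98: return b'\b'
        elif i == 102: return b'\f'
        elif i == 110: return b'\n'
        elif i == 114: return b'\r'
        elif i == 116: return b'\t'
        else:
            s = b.decode('ascii')
            raise UnicodeDecodeError(
                'stringescape', text, m.start(), m.end(), "Invalid escape: %r" % s
            )
        return bytes((v, ))
    # Return the substituted bytes (the original snippet computed them but never returned them).
    return regex.sub(replace, text)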
Method 5
py2
"\\123omething special".decode('string-escape')
py3
"\\123omething special".encode('utf-8').decode('unicode-escape')
Method 6
At least in my case this was equivalent:
Py2: my_input.decode('string_escape')
Py3: bytes(my_input.decode('unicode_escape'), 'latin1')
convertutils.py:
def string_escape(my_bytes):
    return bytes(my_bytes.decode('unicode_escape'), 'latin1')
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.