I have a list of URLs that contain percent-escaped characters. The escapes were introduced by urllib2.urlopen when it retrieved the HTML page:
http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&action=edit
http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&action=history
http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&variant=zh
Is there a way to transform them back to their unescaped form in Python?
P.S.: The URLs are encoded in UTF-8.
Answers:
Method 1
Using the urllib package (import urllib):
Python 2.7
From the official documentation:
urllib.unquote(string)
Replace %xx escapes by their single-character equivalent.
Example: unquote('/%7Econnolly/') yields '/~connolly/'.
Python 3
From the official documentation:
urllib.parse.unquote(string, encoding='utf-8', errors='replace')
[…]
Example: unquote('/El%20Ni%C3%B1o/') yields '/El Niño/'.
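To illustrate what the encoding and errors parameters in that signature do, here is a minimal Python 3 sketch. The variable names are only for illustration, and the escaped string is the title parameter taken from the URLs in the question.

```python
from urllib.parse import unquote

# Title parameter from the question's URLs (UTF-8 bytes, percent-escaped).
escaped = "%E9%A6%96%E9%A1%B5"

# Default behaviour: the escaped bytes are decoded as UTF-8.
utf8_title = unquote(escaped)

# Decoding the same bytes as Latin-1 yields one character per byte (mojibake).
latin1_title = unquote(escaped, encoding="latin-1")

# With errors='replace' (the default), an invalid UTF-8 byte becomes U+FFFD.
broken = unquote("%FF")

print(utf8_title)
print(len(latin1_title))
print(repr(broken))
```

Since the URLs in the question are UTF-8, the default arguments should be enough there; the encoding parameter only matters for pages that were escaped from a different charset.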
Method 2
And if you are using Python 3, you could use:
import urllib.parse
urllib.parse.unquote(url)
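Applied to the list from the question, this could look like the sketch below. The list name urls is an assumption; the question does not show how the list is stored.

```python
from urllib.parse import unquote

# The three URLs from the question; the variable name is hypothetical.
urls = [
    "http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&action=edit",
    "http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&action=history",
    "http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&variant=zh",
]

# Unescape every URL; UTF-8 is the default encoding in Python 3.
decoded = [unquote(u) for u in urls]

for u in decoded:
    print(u)
```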
Method 3
Or use urllib.unquote_plus, which also converts plus signs to spaces:
>>> import urllib
>>> urllib.unquote('erythrocyte+membrane+protein+1%2C+PfEMP1+%28VAR%29')
'erythrocyte+membrane+protein+1,+PfEMP1+(VAR)'
>>> urllib.unquote_plus('erythrocyte+membrane+protein+1%2C+PfEMP1+%28VAR%29')
'erythrocyte membrane protein 1, PfEMP1 (VAR)'
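The session above is Python 2. In Python 3 both functions live in urllib.parse; a short sketch of the same comparison, reusing the string from the example:

```python
from urllib.parse import unquote, unquote_plus

s = "erythrocyte+membrane+protein+1%2C+PfEMP1+%28VAR%29"

# unquote only decodes %xx escapes and leaves '+' untouched.
plain = unquote(s)

# unquote_plus additionally turns '+' into a space, as used in form-encoded data.
plus = unquote_plus(s)

print(plain)
print(plus)
```

A plus sign stands for a space only in form-encoded query strings, so unquote_plus is probably the wrong choice for the path portion of a URL, where '+' is a literal character.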
Method 4
You can use urllib.unquote
Method 5
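import re

def unquote(url):
    # Replace each %XX escape with the character whose code is XX (hex).
    # Python 2 returns a byte string; in Python 3, chr() maps each byte to its own
    # code point, so multi-byte UTF-8 sequences will not decode correctly.
    return re.sub(r'%([0-9a-fA-F]{2})', lambda m: chr(int(m.group(1), 16)), url)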
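Because the URLs in the question carry multi-byte UTF-8 sequences, a regex-based approach on Python 3 needs to decode whole runs of escapes at once rather than byte by byte. The following is one possible sketch, not part of the original answer; the names unquote_utf8 and _ESCAPE_RUN are made up for illustration.

```python
import re

# Matches one or more consecutive %XX escapes, e.g. "%E9%A6%96".
_ESCAPE_RUN = re.compile(r"(?:%[0-9a-fA-F]{2})+")


def unquote_utf8(url):
    """Decode runs of %XX escapes as UTF-8 (hypothetical helper)."""

    def decode_run(match):
        # Strip the '%' signs, turn the hex digits into bytes, then decode.
        hex_digits = match.group(0).replace("%", "")
        return bytes.fromhex(hex_digits).decode("utf-8", errors="replace")

    return _ESCAPE_RUN.sub(decode_run, url)


print(unquote_utf8("/El%20Ni%C3%B1o/"))
```

A '%' that is not followed by two hex digits is left unchanged. For everyday use the standard library function is likely the safer choice; this sketch mainly shows why the bytes have to be grouped before decoding.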
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.