URL decode UTF-8 in Python

I am new to Python and have already spent a lot of time on this.
How can I decode a URL like this:

example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE%D0%B2%D0%B0%D1%8F+%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%B0

to this one in Python 2.7: example.com?title=правовая+защита

url=urllib.unquote(url.encode("utf8")) is returning something very ugly.

Still no solution, any help is appreciated.

Answers:

Method 1

The data is UTF-8 encoded bytes escaped with URL quoting, so you want to decode it with urllib.parse.unquote(), which transparently handles decoding from percent-encoded data to UTF-8 bytes and then to text:

from urllib.parse import unquote

url = unquote(url)

Demo:

>>> from urllib.parse import unquote
>>> url = 'example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE%D0%B2%D0%B0%D1%8F+%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%B0'
>>> unquote(url)
'example.com?title=правовая+защита'
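
To make the two steps described above visible, here is a minimal sketch that performs them separately. It assumes Python 3 and uses urllib.parse.unquote_to_bytes() for the first step; the variable names raw and text are only illustrative.

```python
from urllib.parse import unquote, unquote_to_bytes

url = 'example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE%D0%B2%D0%B0%D1%8F+%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%B0'

# Step 1: undo the percent-escapes, which yields raw bytes.
raw = unquote_to_bytes(url)

# Step 2: interpret those bytes as UTF-8 text.
text = raw.decode('utf-8')

# For valid UTF-8 input this matches what unquote() returns in one call.
print(text == unquote(url))
```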

The Python 2 equivalent is urllib.unquote(), but this returns a bytestring, so you’d have to decode manually:

from urllib import unquote

url = unquote(url).decode('utf8')

Method 2

If you are using Python 3, you can use urllib.parse.unquote():

url = """example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE%D0%B2%D0%B0%D1%8F+%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%B0"""

import urllib.parse
urllib.parse.unquote(url)

gives:

'example.com?title=правовая+защита'
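
Note that unquote() leaves the '+' in place. If the '+' stands for a space, as it does in form-encoded query strings, the following sketch may be closer to what you need. It assumes Python 3 and that the part after '?' is an ordinary query string; the names decoded, params and title are only illustrative.

```python
from urllib.parse import parse_qs, unquote_plus, urlsplit

url = 'example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE%D0%B2%D0%B0%D1%8F+%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%B0'

# unquote_plus() decodes percent-escapes and also turns '+' into a space.
decoded = unquote_plus(url)

# parse_qs() decodes each query value (including '+') and groups values by key.
params = parse_qs(urlsplit(url).query)
title = params['title'][0]

print(decoded)
print(title)
```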

Method 3

You can get the expected result with the requests library as well:

import requests

url = "http://www.mywebsite.org/Data%20Set.zip"

print(f"Before: {url}")
print(f"After:  {requests.utils.unquote(url)}")

Output:

$ python3 test_url_unquote.py

Before: http://www.mywebsite.org/Data%20Set.zip
After:  http://www.mywebsite.org/Data Set.zip

This might be handy if you are already using requests, since it saves you from importing another library for this job.

Method 4

URLs embedded in HTML can contain HTML entities.
This replaces them, too:

# from urllib import unquote  # Python 2
from urllib.parse import unquote  # Python 3
from html import unescape
unescape(unquote('https://v.w.xy/p1/p22?userId=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx&confirmationToken=7uAf%2fxJoxRTFAZdxslCn2uwVR9vV7cYrlHs%2fl9sU%2frix9f9CnVx8uUT%2bu8y1%2fWCs99INKDnfA2ayhGP1ZD0z%2bodXjK9xL5I4gjKR2xp7p8Sckvb04mddf%2fiG75QYiRevgqdMnvd9N5VZp2ksBc83lDg7%2fgxqIwktteSI9RA3Ux9VIiNxx%2fZLe9dZSHxRq9AA'))
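
The same idea on a shorter input may be easier to follow. This is a minimal sketch assuming Python 3; the URL and the names raw, decoded and alternative are hypothetical and chosen only for illustration.

```python
from html import unescape
from urllib.parse import unquote

# Hypothetical URL as it might appear inside an HTML attribute:
# '&' is written as the entity '&amp;' and '/' and '+' are percent-encoded.
raw = 'https://example.com/p?a=1&amp;token=x%2Fy%2Bz'

# Same order as above: percent-decode first, then replace HTML entities.
decoded = unescape(unquote(raw))

# Opposite order: replace HTML entities first, then percent-decode.
alternative = unquote(unescape(raw))

print(decoded)
print(decoded == alternative)
```

For this input both orders give the same string. They can differ when the percent-encoded data itself decodes to something that looks like an entity, so it is worth choosing the order that matches how the URL was escaped.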


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
