I’m trying to open a webpage using urllib.request.urlopen() and then search it with regular expressions, but that gives the following error:
TypeError: can’t use a string pattern on a bytes-like object
I understand why: urllib.request.urlopen() returns a byte stream, so re doesn’t know which encoding to use. What am I supposed to do in this situation? Is there a way to specify the encoding in the request, or will I need to decode the bytes myself? If so, what should I do? I assume I should read the encoding from the header info, or from the encoding declared in the HTML if there is one, and then decode using that.
Answers:
Method 1
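For me, the solution is as follows (Python 3):
resource = urllib.request.urlopen(an_url)
# get_content_charset() returns None if the header names no charset, so fall back to UTF-8
charset = resource.headers.get_content_charset() or 'utf-8'
content = resource.read().decode(charset)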
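To tie this back to the original question, here is a minimal sketch of the whole flow: fetch, decode with the declared charset, then search with a string pattern. The helper names `decode_body` and `fetch_text`, the UTF-8 fallback, and `errors="replace"` are assumptions made for illustration, not part of the answer above.

```python
import re
import urllib.request


def decode_body(raw, charset, default="utf-8"):
    """Decode raw bytes with the declared charset, or `default` if none was declared."""
    # errors="replace" is an assumption: it swaps undecodable bytes for U+FFFD
    # instead of raising UnicodeDecodeError.
    return raw.decode(charset or default, errors="replace")


def fetch_text(url):
    """Hypothetical helper: download a URL and return its body as str."""
    with urllib.request.urlopen(url) as resource:
        charset = resource.headers.get_content_charset()  # may be None
        return decode_body(resource.read(), charset)


# Once the body is a str, string patterns work without the TypeError.
sample = decode_body(b"<title>caf\xc3\xa9</title>", None)
match = re.search(r"<title>(.*?)</title>", sample)
```

Note that an unknown charset name in the header would still raise LookupError here; catching that is left out to keep the sketch short.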
Method 2
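You just need to decode the response using the charset given in the Content-Type header, which is typically the last value in that header. There is an example given in the tutorial too.
output = response.decode('utf-8')  # response is assumed to be the bytes returned by urlopen(...).read()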
Method 3
I had the same issues for the last two days. I finally have a solution.
I’m using the info() method of the object returned by urlopen():
req = urllib.request.urlopen(URL)
charset = req.info().get_content_charset()
content = req.read().decode(charset)
Method 4
With requests:
import requests
response = requests.get(URL).text
Method 5
Here is a simple example HTTP request (which I tested, and it works):
address = "http://stackoverflow.com"
urllib.request.urlopen(address).read().decode('utf-8')
Make sure to read the documentation.
If you want to make a more detailed GET/POST request:
import urllib.request

# HTTP request to some address
def REQUEST(address):
    req = urllib.request.Request(address)
    req.add_header('User-Agent', 'NAME (Linux/MacOS; FROM, USA)')
    response = urllib.request.urlopen(req)
    html = response.read().decode('utf-8')  # make sure it's all text, not binary
    print("REQUEST (ONLINE): " + address)
    return html
Method 6
# Python 2 API; in Python 3 use urllib.request.urlopen(url).headers.get('Content-Type')
urllib.urlopen(url).headers.getheader('Content-Type')
Will output something like this:
text/html; charset=utf-8
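If you only have the raw header value, as in the output above, the charset can be pulled out with the standard library's email parser rather than by hand. This is a sketch only; the function name `charset_from_content_type` is made up for illustration.

```python
from email.message import Message


def charset_from_content_type(value):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


charset = charset_from_content_type("text/html; charset=utf-8")
missing = charset_from_content_type("text/html")
```

Here `charset` is 'utf-8' and `missing` is None, so the caller still has to choose a fallback encoding when the server does not declare one.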
Method 7
After you make a request with req = urllib.request.urlopen(...), you have to read the response by calling req.read(). In Python 3 this returns bytes, not a string, so decode it, for example html_string = req.read().decode('utf-8'), to get the string response that you can then parse the way you want.
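In Python 3, read() returns bytes, so another option is to leave the data undecoded and search it with a bytes pattern, which also avoids the TypeError from the question. The sketch below uses a hard-coded sample body in place of a real response; the helper name `find_title` and the assumption that the title is UTF-8 are illustrative.

```python
import re


def find_title(raw):
    """Search raw bytes with a bytes pattern and return the title as str, or None."""
    match = re.search(rb"<title>(.*?)</title>", raw, re.DOTALL)
    if match is None:
        return None
    # Only the matched fragment is decoded; UTF-8 is assumed here.
    return match.group(1).decode("utf-8")


# Stand-in for the bytes that req.read() would return.
raw_body = b"<html><head><title>Example Domain</title></head></html>"
title = find_title(raw_body)
```

This works when the pattern itself is plain ASCII; for patterns containing non-ASCII text, decoding the whole body first is usually simpler.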
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.