I am trying to learn how to automatically fetch URLs from a page. In the following code I am trying to get the title of the webpage:
import urllib.request
import re

url = "http://www.google.com"
regex = r'<title>(,+?)</title>'
pattern = re.compile(regex)

with urllib.request.urlopen(url) as response:
    html = response.read()

title = re.findall(pattern, html)
print(title)
And I get this unexpected error:
Traceback (most recent call last):
  File "path\to\file\Crawler.py", line 11, in <module>
    title = re.findall(pattern, html)
  File "C:\Python33\lib\re.py", line 201, in findall
    return _compile(pattern, flags).findall(string)
TypeError: can't use a string pattern on a bytes-like object
What am I doing wrong?
Answers:
Method 1
You want to convert html (a bytes-like object) into a string using .decode(), e.g. html = response.read().decode('utf-8').
See Convert bytes to a Python String
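To see the error and the fix side by side without touching the network, here is a minimal sketch. The `html_bytes` sample is a made-up stand-in for what `response.read()` returns, and the pattern uses `.+?` on the assumption that the `,` in the question's regex was a typo for `.` (as written, `,+?` only matches commas).

```python
import re

# Hypothetical stand-in for the bytes returned by response.read().
html_bytes = b"<html><head><title>Example Page</title></head></html>"

# Assumption: the ',' in the question's pattern was meant to be '.'.
pattern = re.compile(r"<title>(.+?)</title>")

# A str pattern applied to bytes raises TypeError.
try:
    pattern.findall(html_bytes)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Decoding first turns the bytes into str, so the same pattern works.
html_text = html_bytes.decode("utf-8")
titles = pattern.findall(html_text)

print(error_message)
print(titles)
```

The exact wording of the error message varies slightly between Python versions, but the cause is the same: the pattern and the searched object must both be str or both be bytes.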
Method 2
The problem is that your regex is a string, but html is bytes:
>>> type(html)
<class 'bytes'>
Since Python doesn’t know how those bytes are encoded, it raises an exception when you try to use a string regex on them.
You can either decode the bytes to a string:
html = html.decode('ISO-8859-1') # encoding may vary!
title = re.findall(pattern, html) # no more error
Or use a bytes regex:
regex = rb'<title>(,+?)</title>'
#        ^ the b prefix makes this a bytes pattern
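One thing to keep in mind with the bytes-regex route is that the matches come back as bytes too, so you still have to decode them before treating them as text. A small sketch, using a made-up `html_bytes` sample and assuming `.+?` was the intended pattern rather than `,+?`:

```python
import re

# Hypothetical sample standing in for the raw response body.
html_bytes = b"<html><head><title>Example Page</title></head></html>"

# Bytes pattern (note the b prefix); '.+?' is an assumed fix for ',+?'.
bytes_pattern = re.compile(rb"<title>(.+?)</title>")

# Matching bytes against bytes works, and the results are bytes objects.
raw_titles = bytes_pattern.findall(html_bytes)

# Decode each match if you need str values (UTF-8 assumed here).
titles = [raw.decode("utf-8") for raw in raw_titles]

print(raw_titles)
print(titles)
```

This can be handy when you only need a small piece of a large or oddly encoded page, since only the matched fragment has to be decoded.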
In this particular context, you can get the encoding from the response headers:
with urllib.request.urlopen(url) as response:
    encoding = response.info().get_param('charset', 'utf8')
    html = response.read().decode(encoding)
See the urlopen documentation for more details.
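Putting the pieces together, a rough sketch might look like the following. The helper names `decode_body` and `fetch_title` are made up for illustration, the UTF-8 fallback is an assumption, and the pattern again assumes `.+?` was intended. It uses `get_content_charset()`, which is available on the message object that `response.info()` returns, as an alternative to `get_param('charset', ...)`. The offline check at the bottom builds the headers by hand so nothing is fetched.

```python
import re
import urllib.request
from email.message import Message
from typing import Optional

# Assumed pattern: '.+?' in place of the question's ',+?'.
TITLE_RE = re.compile(r"<title>(.+?)</title>", re.IGNORECASE | re.DOTALL)


def decode_body(raw: bytes, headers: Message, default: str = "utf-8") -> str:
    """Decode raw bytes using the charset from Content-Type, if present."""
    encoding = headers.get_content_charset(default)
    # errors="replace" avoids a UnicodeDecodeError on stray bytes;
    # an unknown charset name would still raise LookupError.
    return raw.decode(encoding, errors="replace")


def fetch_title(url: str) -> Optional[str]:
    """Hypothetical helper: download a page and return its <title> text."""
    with urllib.request.urlopen(url) as response:
        html = decode_body(response.read(), response.info())
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None


# Offline check with hand-built headers (no network access needed).
headers = Message()
headers["Content-Type"] = "text/html; charset=ISO-8859-1"
sample = "<title>Caf\u00e9</title>".encode("iso-8859-1")

decoded = decode_body(sample, headers)
sample_title = TITLE_RE.search(decoded).group(1)
print(sample_title)
```

Calling `fetch_title("http://www.google.com")` would then return the page title as a str, or None if no title tag is found. Keep in mind that the header charset can be missing or wrong, and that a regex is a fragile way to read HTML; for anything beyond a quick experiment an HTML parser is usually the safer choice.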
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.