I'm writing a script that visits a list of links and parses the information on each page.
It works for most sites, but it's choking on some with
"UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 13: ordinal not in range(128)"
It stops in client.py, which is part of urllib on Python 3.
The exact link is:
http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html
There are quite a few similar postings here, but none of the answers seem to work for me.
My code is:
import socket
from urllib import request
from urllib.error import HTTPError, URLError

def __request(link, debug=0):
    try:
        # Long timeout because many requests were timing out.
        html = request.urlopen(link, timeout=35).read()
        unicode_html = html.decode('utf-8', 'ignore')
    # NOTE the except HTTPError must come first, otherwise except URLError will also catch an HTTPError.
    except HTTPError as e:
        if debug:
            print("The server couldn't fulfill the request for " + link)
            print('Error code: ', e.code)
        return ''
    except URLError as e:
        if isinstance(e.reason, socket.timeout):
            print('timeout')
        return ''
    else:
        return unicode_html
This calls the request function:
link = 'http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html'
page = __request(link)
And the traceback is:
Traceback (most recent call last):
  File "<string>", line 250, in run_nodebug
  File "C:\reader\get_news.py", line 276, in <module>
    main()
  File "C:\reader\get_news.py", line 255, in main
    body = get_article_body(item['link'],debug=0)
  File "C:\reader\get_news.py", line 155, in get_article_body
    page = __request('na',url)
  File "C:\reader\get_news.py", line 50, in __request
    html = request.urlopen(link, timeout=35).read()
  File "C:\Python33\Lib\urllib\request.py", line 156, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python33\Lib\urllib\request.py", line 469, in open
    response = self._open(req, data)
  File "C:\Python33\Lib\urllib\request.py", line 487, in _open
    '_open', req)
  File "C:\Python33\Lib\urllib\request.py", line 447, in _call_chain
    result = func(*args)
  File "C:\Python33\Lib\urllib\request.py", line 1268, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "C:\Python33\Lib\urllib\request.py", line 1248, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "C:\Python33\Lib\http\client.py", line 1061, in request
    self._send_request(method, url, body, headers)
  File "C:\Python33\Lib\http\client.py", line 1089, in _send_request
    self.putrequest(method, url, **skips)
  File "C:\Python33\Lib\http\client.py", line 953, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 13: ordinal not in range(128)
Any help is appreciated; it's driving me crazy. I think I've tried every combination of x.decode and similar calls.
(Ignoring the offending characters would be fine, if that is possible.)
Answers:
Method 1
Use a percent-encoded URL:
link = 'http://finance.yahoo.com/news/caf%C3%A9s-growing-faster-than-fast-food-peers-144512056.html'
I found the above percent-encoded URL by pointing the browser at
http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html
going to the page, then copying and pasting the
encoded URL supplied by the browser back into the text editor. However, you can also generate a percent-encoded URL programmatically:
from urllib import parse

link = 'http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html'
scheme, netloc, path, query, fragment = parse.urlsplit(link)
path = parse.quote(path)
link = parse.urlunsplit((scheme, netloc, path, query, fragment))
which yields
http://finance.yahoo.com/news/caf%C3%A9s-growing-faster-than-fast-food-peers-144512056.html
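If the script walks a list of links, the same steps can be wrapped in a small helper and applied to each link before calling urlopen. The following is only a sketch: the name percent_encode_url and the choice of safe characters are assumptions, not part of the original answer. Keeping '%' in the safe set is meant to leave URLs that are already percent-encoded unchanged, and the host name is not touched.

```python
from urllib import parse


def percent_encode_url(link):
    """Percent-encode non-ASCII characters in the path and query of link.

    Hypothetical helper: the name and the safe-character sets are
    illustrative assumptions.
    """
    scheme, netloc, path, query, fragment = parse.urlsplit(link)
    # Keep '/' as the path separator and '%' so existing escapes
    # such as %C3%A9 are not encoded a second time.
    path = parse.quote(path, safe="/%")
    # Keep the characters that give a query string its structure.
    query = parse.quote(query, safe="=&%+")
    return parse.urlunsplit((scheme, netloc, path, query, fragment))


link = 'http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html'
safe_link = percent_encode_url(link)
```

Passing safe_link instead of link to urlopen should avoid the UnicodeEncodeError, because the request line then contains only ASCII characters.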
Method 2
Your URL contains characters that cannot be represented as ASCII characters.
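You'll have to ensure that all such characters are properly URL-encoded. For example, urllib.parse.quote_plus uses UTF-8 percent-escapes to represent any non-ASCII characters. Apply it to individual components rather than the whole URL, because with its default arguments it also escapes ':' and '/'.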
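As a rough illustration of how the quoting functions in urllib.parse differ, the snippet below encodes the same text with quote and quote_plus, and builds a query string with urlencode. The sample text and the variable names are made up for the example.

```python
from urllib import parse

text = "cafés growing/faster"  # hypothetical sample text

# quote() is suited to path segments: '/' is kept and a space becomes %20.
as_path = parse.quote(text)

# quote_plus() is suited to form/query values: a space becomes '+'
# and '/' is escaped as %2F.
as_form = parse.quote_plus(text)

# urlencode() applies quote_plus() to each key and value of a mapping.
as_query = parse.urlencode({"q": "cafés"})

print(as_path)   # caf%C3%A9s%20growing/faster
print(as_form)   # caf%C3%A9s+growing%2Ffaster
print(as_query)  # q=caf%C3%A9s
```

For the URL in the question the non-ASCII character is in the path, so quote applied to the path component is probably the closer fit; quote_plus is more useful when assembling query parameters.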
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.