UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' when using urllib.request in Python 3

I’m writing a script that goes to a list of links and parses the information.

It works for most sites, but it's choking on some with
"UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 13: ordinal not in range(128)"

It stops in client.py, which is part of urllib on Python 3.

The exact link is:
http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html

There are quite a few similar postings here but none of the answers seems to work for me.

My code is:

import socket
from urllib import request
from urllib.error import HTTPError, URLError

def __request(link, debug=0):
    try:
        # Long timeout because I was getting lots of timeouts.
        html = request.urlopen(link, timeout=35).read()
        unicode_html = html.decode('utf-8', 'ignore')
    # NOTE: except HTTPError must come first, otherwise except URLError will also catch an HTTPError.
    except HTTPError as e:
        if debug:
            print("The server couldn't fulfill the request for " + link)
            print('Error code: ', e.code)
        return ''
    except URLError as e:
        if isinstance(e.reason, socket.timeout):
            print('timeout')
        # Return an empty string for any URLError, not only timeouts.
        return ''
    else:
        return unicode_html

This calls the request function:

link = 'http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html'
page = __request(link)

And the traceback is:

Traceback (most recent call last):
  File "<string>", line 250, in run_nodebug
  File "C:\reader\get_news.py", line 276, in <module>
    main()
  File "C:\reader\get_news.py", line 255, in main
    body = get_article_body(item['link'],debug=0)
  File "C:\reader\get_news.py", line 155, in get_article_body
    page = __request('na',url)
  File "C:\reader\get_news.py", line 50, in __request
    html = request.urlopen(link, timeout=35).read()
  File "C:\Python33\Lib\urllib\request.py", line 156, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python33\Lib\urllib\request.py", line 469, in open
    response = self._open(req, data)
  File "C:\Python33\Lib\urllib\request.py", line 487, in _open
    '_open', req)
  File "C:\Python33\Lib\urllib\request.py", line 447, in _call_chain
    result = func(*args)
  File "C:\Python33\Lib\urllib\request.py", line 1268, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "C:\Python33\Lib\urllib\request.py", line 1248, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "C:\Python33\Lib\http\client.py", line 1061, in request
    self._send_request(method, url, body, headers)
  File "C:\Python33\Lib\http\client.py", line 1089, in _send_request
    self.putrequest(method, url, **skips)
  File "C:\Python33\Lib\http\client.py", line 953, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 13: ordinal not in range(128)

Any help is appreciated. It's driving me crazy; I think I've tried all combinations of x.decode and similar.

(I could ignore the offending characters if that is possible.)

Answers:


Method 1

Use a percent-encoded URL:

link = 'http://finance.yahoo.com/news/caf%C3%A9s-growing-faster-than-fast-food-peers-144512056.html'

I found the above percent-encoded URL by pointing the browser at

http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html

letting the page load, then copying and pasting the encoded URL shown by the browser back into the text editor. However, you can also generate a percent-encoded URL programmatically:

from urllib import parse

link = 'http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html'

scheme, netloc, path, query, fragment = parse.urlsplit(link)
path = parse.quote(path)
link = parse.urlunsplit((scheme, netloc, path, query, fragment))

which yields

http://finance.yahoo.com/news/caf%C3%A9s-growing-faster-than-fast-food-peers-144512056.html
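The same steps can be wrapped in a small reusable helper. This is only a sketch: the name percent_encode_url is hypothetical, and the safe= arguments are an assumption that links which are already percent-encoded should pass through unchanged rather than be double-encoded. Unlike the snippet above, it also quotes the query string. It does not handle non-ASCII host names.

```python
from urllib import parse

def percent_encode_url(link):
    """Percent-encode non-ASCII characters in the path and query as UTF-8."""
    scheme, netloc, path, query, fragment = parse.urlsplit(link)
    # Keep '/' and existing '%' escapes so an already-encoded URL is unchanged.
    path = parse.quote(path, safe='/%')
    # Keep the separators that give a query string its structure.
    query = parse.quote(query, safe='=&%+')
    return parse.urlunsplit((scheme, netloc, path, query, fragment))

link = 'http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html'
encoded = percent_encode_url(link)
print(encoded)
```

The result can be passed to request.urlopen in place of the raw link, since every character in it is now ASCII.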

Method 2

Your URL contains characters that cannot be represented as ASCII characters.

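You'll have to ensure that all characters have been properly URL encoded, for example with urllib.parse.quote_plus; it uses UTF-8 percent-encoding to represent any non-ASCII characters. Apply it to individual components such as a query value rather than to the whole URL, because it also escapes ':' and '/'; for a path, urllib.parse.quote is the better fit.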
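To illustrate the difference between the two quoting functions, here is a short sketch. The example.com URL and the sample strings are made up for illustration and do not come from the question.

```python
from urllib import parse

# quote() leaves '/' alone by default, so it suits a URL path.
path = parse.quote('/news/cafés')
print(path)   # /news/caf%C3%A9s

# quote_plus() escapes '&' and turns spaces into '+', so it suits a single query value.
value = parse.quote_plus('cafés & bars')
print(value)  # caf%C3%A9s+%26+bars

# Applied to a whole URL, quote_plus() escapes the scheme and path separators too,
# which produces a string that is no longer usable as a URL.
whole = parse.quote_plus('http://example.com/cafés')
print(whole)  # http%3A%2F%2Fexample.com%2Fcaf%C3%A9s
```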


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
