Here is my code:
for line in open('u.item'):
    # Read each line
Whenever I run this code it gives the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 2892: invalid continuation byte
I tried to solve this by adding an extra parameter to open(). The code now looks like:
for line in open('u.item', encoding='utf-8'):
    # Read each line
But again it gives the same error. What should I do then?
Answers:
Method 1
As suggested by Mark Ransom, I found the right encoding for that problem. The encoding was "ISO-8859-1", so replacing open("u.item", encoding="utf-8") with open('u.item', encoding = "ISO-8859-1") will solve the problem.
Method 2
The following also worked for me. ISO-8859-1 can save a lot of trouble, particularly when using Speech Recognition APIs.
Example:
file = open('../Resources/' + filename, 'r', encoding="ISO-8859-1")
Method 3
Your file doesn’t actually contain UTF-8 encoded data; it contains some other encoding. Figure out what that encoding is and use it in the open call.
In Windows-1252 encoding, for example, the 0xe9 would be the character é.
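To make this concrete, here is a minimal sketch that decodes the same bytes with three codecs. The sample bytes and the try_decode helper are made up for illustration; they are not taken from the u.item file.

```python
# Sample bytes: 0xe9 followed by a space, which is not a valid UTF-8 sequence
raw = b"caf\xe9 au lait"

def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

utf8_result = try_decode(raw, "utf-8")      # fails: invalid continuation byte
cp1252_result = try_decode(raw, "cp1252")   # 0xe9 maps to é in Windows-1252
latin1_result = try_decode(raw, "latin-1")  # 0xe9 maps to é in ISO-8859-1 too

print(utf8_result, cp1252_result, latin1_result)
```

Windows-1252 and ISO-8859-1 agree on 0xe9, which is why both appear as fixes in this thread. They differ for bytes in the 0x80 to 0x9f range.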
Method 4
Try this to read the file using pandas (m_cols is your list of column names):
import pandas as pd
pd.read_csv('u.item', sep='|', names=m_cols, encoding='latin-1')
Method 5
This works:
open('filename', encoding='latin-1')
Or:
open('filename', encoding="ISO-8859-1")
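Both spellings name the same codec in Python, so the two calls above behave identically. The sketch below checks this with codecs.lookup. It also shows why this codec never raises a decode error: every byte value from 0x00 to 0xff maps to a character. The flip side is that it will happily decode a file that is really in some other encoding and give you wrong characters without complaining.

```python
import codecs

# Different spellings that Python accepts for the same codec
names = ["latin-1", "ISO-8859-1", "latin1", "iso8859-1"]
canonical = {name: codecs.lookup(name).name for name in names}
same_codec = len(set(canonical.values())) == 1
print(canonical)

# Every possible byte value decodes and encodes back unchanged
all_bytes = bytes(range(256))
round_trip = all_bytes.decode("latin-1").encode("latin-1") == all_bytes
print(round_trip)
```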
Method 6
If you are using Python 2, the following is the solution:
import io
for line in io.open("u.item", encoding="ISO-8859-1"):
    # Do something
This is needed because the built-in open() in Python 2 does not accept an encoding parameter; passing one gives the following error:
TypeError: 'encoding' is an invalid keyword argument for this function
Method 7
You could resolve the problem with:
for line in open(your_file_path, 'rb'):
'rb' opens the file in binary mode, so each line is read as bytes and no decoding is attempted.
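Binary mode only postpones the problem: you still have to decode the bytes if you want text. Here is a minimal sketch, assuming the data is really ISO-8859-1. The temporary sample file and its contents are invented so the example can run on its own.

```python
import os
import tempfile

# Write a small sample file containing the non-UTF-8 byte 0xe9 (illustration only)
with tempfile.NamedTemporaryFile("wb", suffix=".item", delete=False) as tmp:
    tmp.write(b"1|Caf\xe9 Society (1995)\n2|Toy Story (1995)\n")
    sample_path = tmp.name

decoded_lines = []
with open(sample_path, "rb") as f:
    for raw_line in f:
        # raw_line is a bytes object; decode it explicitly with the assumed encoding
        decoded_lines.append(raw_line.decode("ISO-8859-1").rstrip("\n"))

os.remove(sample_path)
print(decoded_lines)
```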
Method 8
You can try this way:
open('u.item', encoding='utf8', errors='ignore')
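Be aware that errors='ignore' silently drops every byte that is not valid UTF-8, so you lose characters. The sketch below compares it with errors='replace' and with decoding in the encoding assumed elsewhere in this thread (ISO-8859-1). The sample bytes are made up for illustration.

```python
raw = b"Caf\xe9 Society"

# 'ignore' drops the offending byte entirely
ignored = raw.decode("utf-8", errors="ignore")

# 'replace' substitutes U+FFFD so the damage stays visible
replaced = raw.decode("utf-8", errors="replace")

# Decoding with the right encoding keeps the character
correct = raw.decode("ISO-8859-1")

print(repr(ignored), repr(replaced), repr(correct))
```

If the accented characters matter for your data, finding the real encoding is usually the safer route.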
Method 9
Based on another question on Stack Overflow and the previous answers in this post, I would like to add a way to find the right encoding.
If your script runs on Linux, you can get the encoding with the file command:
file --mime-encoding <filename>
Here is a Python script that does that for you:
import sys
import subprocess

if len(sys.argv) < 2:
    print("Usage: {} <filename>".format(sys.argv[0]))
    sys.exit(1)

def find_encoding(fname):
    """Find the encoding of a file using file command
    """
    # find fullname of file command
    which_run = subprocess.run(['which', 'file'], stdout=subprocess.PIPE)
    if which_run.returncode != 0:
        print("Unable to find 'file' command ({})".format(which_run.returncode))
        return None
    # strip the trailing newline from the path printed by which
    file_cmd = which_run.stdout.decode().replace('\n', '')
    # run file command to get MIME encoding
    file_run = subprocess.run([file_cmd, '--mime-encoding', fname],
                              stdout=subprocess.PIPE,
                              stderr=subprocess.PIPE)
    if file_run.returncode != 0:
        print(file_run.stderr.decode(), file=sys.stderr)
        return None
    # return encoding name only
    return file_run.stdout.decode().split()[1]

# test
print("Encoding of {}: {}".format(sys.argv[1], find_encoding(sys.argv[1])))
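If the file command is not available (for example on Windows), one rough alternative is to try a few candidate encodings in order and keep the first one that decodes without error. This is only a sketch: guess_encoding and its candidate list are hypothetical, not part of any library, and a successful decode does not prove the guess is right. latin-1 goes last because it never fails.

```python
import os
import tempfile

def guess_encoding(path, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes the whole file, or None."""
    with open(path, "rb") as f:
        data = f.read()
    for enc in candidates:
        try:
            data.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None

# Demo on two throwaway files (contents invented for illustration)
with tempfile.TemporaryDirectory() as tmp_dir:
    legacy_path = os.path.join(tmp_dir, "legacy.txt")
    utf8_path = os.path.join(tmp_dir, "utf8.txt")
    with open(legacy_path, "wb") as f:
        f.write(b"caf\xe9\n")                     # single-byte é, not valid UTF-8
    with open(utf8_path, "wb") as f:
        f.write("caf\u00e9\n".encode("utf-8"))    # é as a two-byte UTF-8 sequence
    legacy_guess = guess_encoding(legacy_path)
    utf8_guess = guess_encoding(utf8_path)

print(legacy_guess, utf8_guess)
```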
Method 10
This is an example for converting a CSV file in Python 3:
import csv
from sys import argv
try:
    inputReader = csv.reader(open(argv[1], encoding='ISO-8859-1'), delimiter=',', quotechar='"')
except IOError:
    pass
Method 11
Sometimes calling open(filepath) when filepath is not actually a file gives the same error, so first make sure the file you are trying to open exists:
import os
assert os.path.isfile(filepath)
Method 12
Open your file with Notepad++ and use the “Encoding” (or “Encodage”) menu to identify the encoding, or to convert from ANSI to UTF-8 or the ISO 8859-1 code page.
Method 13
I was using a dataset downloaded from Kaggle, and reading it threw this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 183: invalid continuation byte
This is how I fixed it:
import pandas as pd
pd.read_csv('top50.csv', encoding='ISO-8859-1')
Method 14
I am leaving my solution here so that others searching Google for a similar UTF-8 error can find this page faster.
I had a problem opening a .csv file, with this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 150: invalid continuation byte
I opened the file with Notepad and counted to the 150th position: it was a Cyrillic symbol.
I resaved the file with the ‘Save as..’ command using the encoding ‘UTF-8’, and my program started to work.
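The same resave can be done in Python if you already know the source encoding. Below is a minimal sketch: convert_to_utf8 is a hypothetical helper, the ISO-8859-1 default is only an assumption, and the sample file is invented. Pass the actual encoding of your file instead.

```python
import os
import tempfile

def convert_to_utf8(src_path, dst_path, src_encoding="ISO-8859-1"):
    """Re-encode a text file to UTF-8 (src_encoding must match the real file)."""
    # newline="" keeps the original line endings untouched
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)

# Demo on a throwaway file
with tempfile.TemporaryDirectory() as tmp_dir:
    src = os.path.join(tmp_dir, "legacy.csv")
    dst = os.path.join(tmp_dir, "utf8.csv")
    with open(src, "wb") as f:
        f.write(b"name\nJos\xe9\n")
    convert_to_utf8(src, dst)
    # The converted file now opens with the default UTF-8 handling
    with open(dst, encoding="utf-8") as f:
        converted_text = f.read()

print(converted_text)
```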
Method 15
Replace the encoding with encoding='ISO-8859-1':
for line in open('u.item', encoding='ISO-8859-1'):
    print(line)
Method 16
Use this if you are loading data directly from GitHub or Kaggle:
DF = pd.read_csv(file, encoding='ISO-8859-1')
Method 17
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 7044: invalid continuation byte
The above error occurs because of the file's encoding.
Solution: use encoding='latin-1'
Reference: https://pandas.pydata.org/docs/search.html?q=encoding
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.