'utf-8' codec can't decode byte 0xa0 in position 4276: invalid start byte

I am trying to read and print the following file: txt.tsv (from https://www.sec.gov/files/dera/data/financial-statement-and-notes-data-sets/2017q3_notes.zip)

According to the SEC, the data set is provided in a single encoding, as follows:

Tab Delimited Value (.txt): utf-8, tab-delimited, \n-terminated lines, with the first line containing the field names in lowercase.

My current code:

import csv

with open('txt.tsv') as tsvfile:
    reader = csv.DictReader(tsvfile, dialect='excel-tab')
    for row in reader:
        print(row)

All attempts ended with the following error message:

'utf-8' codec can't decode byte 0xa0 in position 4276: invalid start byte

I am a bit lost. Can anyone help me? Many thanks in advance.

Answers:


Method 1

The file is encoded in windows-1252, not UTF-8. Use:

open('txt.tsv', encoding='windows-1252')
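Applied to the code from the question, this could look like the sketch below. The helper name `read_tsv` and the small sample file are made up for illustration (the sample stands in for txt.tsv), and it assumes the file really is windows-1252.

```python
import csv
import os
import tempfile


def read_tsv(path, encoding="windows-1252"):
    """Return the rows of a tab-delimited file as a list of dicts."""
    # newline="" is what the csv module docs recommend for files given to a reader.
    with open(path, encoding=encoding, newline="") as tsvfile:
        return list(csv.DictReader(tsvfile, dialect="excel-tab"))


# Hypothetical stand-in for txt.tsv: a header line plus one row
# containing the raw byte 0xa0 that UTF-8 decoding chokes on.
sample = b"name\tvalue\nrow1\tfoo\xa0bar\n"

with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.tsv")
    with open(sample_path, "wb") as f:
        f.write(sample)
    rows = read_tsv(sample_path)

for row in rows:
    print(row)
```

For the real file, call `read_tsv('txt.tsv')` instead of building the sample.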

Method 2

If you are working with Turkish data, I suggest this line instead:

df = pd.read_csv("text.txt", encoding='windows-1254')

Method 3

ds = pd.read_csv('/Dataset/test.csv', encoding='windows-1252')

Works fine for me, thanks.

Method 4

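I got the same error message for a .csv file, and this worked for me (note that Python only recognizes 'ANSI' as an encoding name on Windows, where it is an alias for the 'mbcs' codec):

df = pd.read_csv('Text.csv', encoding='ANSI')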

Method 5

I ran into the same issue, and reading the file with the latin1 encoding worked. Give this a try if the solutions above don't work:

df=pd.read_csv("../CSV_FILE.csv",na_values=missing, encoding='latin1')
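One likely reason latin1 can succeed where windows-1252 does not: Python's latin1 codec maps every one of the 256 byte values to a character, while its windows-1252 codec leaves a few bytes undefined. The sketch below only shows that difference. It does not tell you which encoding is the right one for your data, since latin1 will happily decode anything, including garbage.

```python
# Every possible byte value, 0x00 through 0xff.
raw = bytes(range(256))

# latin1 never raises: each byte becomes the code point of the same number.
text = raw.decode("latin1")

# windows-1252 has a handful of unassigned bytes and raises on the first one.
try:
    raw.decode("windows-1252")
    cp1252_error = None
except UnicodeDecodeError as exc:
    cp1252_error = exc

print(len(text))
print(cp1252_error)
```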

Method 6

If the input has a stray '\xa0', then it's not in UTF-8, full stop.

Yes, you have to either recode it to UTF-8 (see the iconv or recode commands; many text editors and IDEs can do it too), or read it using an 8-bit encoding (as all the other answers suggest).
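If you would rather stay in Python than reach for iconv, the recoding step can be sketched as follows. The function name `recode_to_utf8` and the file names are hypothetical, and the source encoding is an assumption you have to supply yourself.

```python
import os
import tempfile


def recode_to_utf8(src_path, dst_path, source_encoding="windows-1252"):
    """Rewrite src_path as UTF-8, assuming it is really in source_encoding."""
    # newline="" on both sides leaves the original line endings untouched.
    with open(src_path, encoding=source_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)


# Demonstration on a throwaway file containing a raw 0xa0 byte.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "legacy.tsv")
    dst_path = os.path.join(tmp, "utf8.tsv")
    with open(src_path, "wb") as f:
        f.write(b"name\tvalue\nrow1\tfoo\xa0bar\n")
    recode_to_utf8(src_path, dst_path)
    with open(dst_path, "rb") as f:
        recoded = f.read()

print(recoded)
```

After recoding, the original code from the question can open the new file without an explicit 8-bit encoding, provided it opens it as UTF-8.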

What you should ask yourself is: what is this character after all (0xa0, or 160 in decimal)?
Well, in many 8-bit encodings it's a non-breaking space (like &nbsp; in HTML). In at least one DOS encoding it's an accented "a" character. That's why you need to look at the result of decoding it from the 8-bit encoding.
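To see this for yourself, decode the single byte under a few candidate encodings and compare. The list of candidates below is just an illustrative selection; cp437 is the DOS code page where the byte is an accented "a".

```python
raw = b"\xa0"

# A few 8-bit encodings mentioned in the answers, plus one DOS code page.
candidates = ["windows-1252", "windows-1254", "latin1", "cp437"]

decoded = {name: raw.decode(name) for name in candidates}

for name, char in decoded.items():
    # repr() makes the otherwise invisible non-breaking space visible.
    print(f"{name:>12}: {char!r}")
```

If the decoded text around the character looks right under one encoding and wrong under another, that is usually the strongest hint you will get.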

BTW, sometimes people say "UTF-8" when they mean "mostly ASCII, I guess". And if it was a non-breaking space, they weren't that far off:

In [1]: '\xa0'.encode()
Out[1]: b'\xc2\xa0'

One extra preceding '\xc2' byte would do the trick.
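A small sketch of both cases side by side: the lone byte reproduces the error from the question, while the two-byte sequence decodes cleanly to a non-breaking space.

```python
# A lone 0xa0 is a UTF-8 continuation byte, so it cannot start a character.
try:
    b"\xa0".decode("utf-8")
    reason = None
except UnicodeDecodeError as exc:
    reason = exc.reason

# With the leading 0xc2 byte the sequence is valid UTF-8 for U+00A0.
fixed = b"\xc2\xa0".decode("utf-8")

print(reason)
print(repr(fixed))
```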


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
