Write to UTF-8 file in Python

I’m really confused with the codecs.open function. When I do:

file = codecs.open("temp", "w", "utf-8")
file.write(codecs.BOM_UTF8)
file.close()

It gives me the error

UnicodeDecodeError: ‘ascii’ codec can’t decode byte 0xef in position
0: ordinal not in range(128)

If I do:

file = open("temp", "w")
file.write(codecs.BOM_UTF8)
file.close()

It works fine.

Question is why does the first method fail? And how do I insert the bom?

If the second method is the correct way of doing it, what the point of using codecs.open(filename, "w", "utf-8")?

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

I believe the problem is that codecs.BOM_UTF8 is a byte string, not a Unicode string. I suspect the file handler is trying to guess what you really mean based on “I’m meant to be writing Unicode as UTF-8-encoded text, but you’ve given me a byte string!”

Try writing the Unicode string for the byte order mark (i.e. Unicode U+FEFF) directly, so that the file just encodes that as UTF-8:

import codecs

file = codecs.open("lol", "w", "utf-8")
file.write(u'ufeff')
file.close()

(That seems to give the right answer – a file with bytes EF BB BF.)

EDIT: S. Lott’s suggestion of using “utf-8-sig” as the encoding is a better one than explicitly writing the BOM yourself, but I’ll leave this answer here as it explains what was going wrong before.

Method 2

Read the following: http://docs.python.org/library/codecs.html#module-encodings.utf_8_sig

Do this

with codecs.open("test_output", "w", "utf-8-sig") as temp:
    temp.write("hi momn")
    temp.write(u"This has ♭")

The resulting file is UTF-8 with the expected BOM.

Method 3

It is very simple just use this. Not any library needed.

with open('text.txt', 'w', encoding='utf-8') as f:
    f.write(text)

Method 4

@S-Lott gives the right procedure, but expanding on the Unicode issues, the Python interpreter can provide more insights.

Jon Skeet is right (unusual) about the codecs module – it contains byte strings:

>>> import codecs
>>> codecs.BOM
'xffxfe'
>>> codecs.BOM_UTF8
'xefxbbxbf'
>>>

Picking another nit, the BOM has a standard Unicode name, and it can be entered as:

>>> bom= u"N{ZERO WIDTH NO-BREAK SPACE}"
>>> bom
u'ufeff'

It is also accessible via unicodedata:

>>> import unicodedata
>>> unicodedata.lookup('ZERO WIDTH NO-BREAK SPACE')
u'ufeff'
>>>

Method 5

I use the file *nix command to convert a unknown charset file in a utf-8 file

# -*- encoding: utf-8 -*-

# converting a unknown formatting file in utf-8

import codecs
import commands

file_location = "jumper.sub"
file_encoding = commands.getoutput('file -b --mime-encoding %s' % file_location)

file_stream = codecs.open(file_location, 'r', file_encoding)
file_output = codecs.open(file_location+"b", 'w', 'utf-8')

for l in file_stream:
    file_output.write(l)

file_stream.close()
file_output.close()

Method 6

python 3.4 >= using pathlib:

import pathlib
pathlib.Path("text.txt").write_text(text, encoding='utf-8') #or utf-8-sig for BOM

Method 7

If you are using Pandas I/O methods like pandas.to_excel(), add an encoding parameter, e.g.

pd.to_excel("somefile.xlsx", sheet_name="export", encoding='utf-8')

This works for most international characters I believe.


All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes
Article Rating
Subscribe
Notify of
guest

0 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
0
Would love your thoughts, please comment.x
()
x