Python UTF-16 CSV reader

I have a UTF-16 CSV file which I have to read. The Python csv module does not seem to support UTF-16.

I am using Python 2.7.2. The CSV files I need to parse are huge, running into several GB of data.

Answers to John Machin's questions below:

print repr(open('test.csv', 'rb').read(100))

Output with test.csv having just abc as content

'\xff\xfea\x00b\x00c\x00'

I think the CSV file was created on a Windows machine in the USA. I am using Mac OS X Lion.

If I use the code provided by phihag with a test.csv containing one record, the record comes back split into two rows.

Sample test.csv content used. Below is the output of print repr(open('test.csv', 'rb').read(1000)):

'\xff\xfe1\x00,\x002\x00,\x00G\x00,\x00S\x00,\x00H\x00 \x00f\x00\xfc\x00r\x00 \x00e\x00 \x00\x96\x00 \x00m\x00 \x00\x85\x00,\x00,\x00I\x00\r\x00\n\x00'

Code by phihag

import codecs
import csv
with open('test.csv', 'rb') as f:
    sr = codecs.StreamRecoder(f,
                              codecs.getencoder('utf-8'),
                              codecs.getdecoder('utf-8'),
                              codecs.getreader('utf-16'),
                              codecs.getwriter('utf-16'))
    for row in csv.reader(sr):
        print row

Output of the above code

['1', '2', 'G', 'S', 'H f\xc3\xbcr e \xc2\x96 m \xc2\x85']
['', '', 'I']

Expected output:

['1', '2', 'G', 'S', 'H f\xc3\xbcr e \xc2\x96 m \xc2\x85', '', 'I']

Answers:

Method 1

At the moment, the csv module does not support UTF-16.

In Python 3.x, csv expects a text-mode file and you can simply use the encoding parameter of open to force another encoding:

# Python 3.x only
import csv
with open('utf16.csv', 'r', encoding='utf16', newline='') as csvf:
    for line in csv.reader(csvf):
        print(line) # do something with the line
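
As a self-contained check of the Python 3 approach, here is a minimal sketch that writes the question's sample record to a temporary UTF-16 file and reads it back. The temporary file is a throwaway for illustration only. The sketch passes newline='' to open, which the csv documentation recommends. Since text-mode files break lines only on \r, \n and \r\n, the record should come back as one row even though it contains U+0085.

```python
# Python 3.x only
import csv
import os
import tempfile

# Sample record from the question, including the C1 characters U+0096 and U+0085.
sample = '1,2,G,S,H f\xfcr e \x96 m \x85,,I\r\n'

# Write it to a temporary UTF-16 file (the path is a throwaway for this demo).
fd, path = tempfile.mkstemp(suffix='.csv')
with os.fdopen(fd, 'w', encoding='utf-16', newline='') as out:
    out.write(sample)

# newline='' leaves line endings untouched so the csv module can handle them.
with open(path, 'r', encoding='utf-16', newline='') as csvf:
    rows = list(csv.reader(csvf))
os.remove(path)

print(rows)
```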

In Python 2.x, you can recode the input:

# Python 2.x only
import codecs
import csv

class Recoder(object):
    def __init__(self, stream, decoder, encoder, eol='\r\n'):
        self._stream = stream
        self._decoder = decoder if isinstance(decoder, codecs.IncrementalDecoder) else codecs.getincrementaldecoder(decoder)()
        self._encoder = encoder if isinstance(encoder, codecs.IncrementalEncoder) else codecs.getincrementalencoder(encoder)()
        self._buf = ''
        self._eol = eol
        self._reachedEof = False

    def read(self, size=None):
        r = self._stream.read(size)
        raw = self._decoder.decode(r, size is None)
        return self._encoder.encode(raw)

    def __iter__(self):
        return self

    def __next__(self):
        if self._reachedEof:
            raise StopIteration()
        while True:
            line,eol,rest = self._buf.partition(self._eol)
            if eol == self._eol:
                self._buf = rest
                return self._encoder.encode(line + eol)
            raw = self._stream.read(1024)
            if raw == '':
                self._decoder.decode(b'', True)
                self._reachedEof = True
                return self._encoder.encode(self._buf)
            self._buf += self._decoder.decode(raw)
    next = __next__

    def close(self):
        return self._stream.close()

with open('test.csv','rb') as f:
    sr = Recoder(f, 'utf-16', 'utf-8')

    for row in csv.reader(sr):
        print (row)

open and codecs.open require the file to start with a BOM. If it doesn’t (or you’re on Python 2.x), you can still convert it in memory, like this:

try:
    from io import BytesIO
except ImportError: # Python < 2.6
    from StringIO import StringIO as BytesIO
import csv
with open('utf16.csv', 'rb') as binf:
    c = binf.read().decode('utf-16').encode('utf-8')
for line in csv.reader(BytesIO(c)):
    print(line) # do something with the line

Method 2

The Python 2.x csv module documentation example shows how to handle other encodings.

Method 3

I would strongly suggest that you recode your file(s) to UTF-8. Under the very likely condition that you don’t have any Unicode characters outside the BMP, you can take advantage of the fact that UTF-16 is a fixed-length encoding to read fixed-length blocks from your input file without worrying about straddling block boundaries.

Step 1: Determine what encoding you actually have. Examine the first few bytes of your file:

print repr(open('thefile.csv', 'rb').read(100))

Four possible ways of encoding u'abc'

\xfe\xff\x00a\x00b\x00c -> utf_16
\xff\xfea\x00b\x00c\x00 -> utf_16
\x00a\x00b\x00c -> utf_16_be
a\x00b\x00c\x00 -> utf_16_le

If you have any trouble with this step, edit your question to include the results of the above print repr()
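
The check in Step 1 can also be scripted. The following is only a rough sketch: guess_utf16_codec is a hypothetical helper name, and the no-BOM branches rest on the assumption that the file starts with an ASCII character, so one byte of the first 16-bit unit is zero. It is a heuristic, not a reliable detector.

```python
import codecs

def guess_utf16_codec(head):
    """Guess a codec name from the first bytes of a file (heuristic only)."""
    if head.startswith(codecs.BOM_UTF16_BE) or head.startswith(codecs.BOM_UTF16_LE):
        return 'utf_16'      # BOM present; the codec works out the byte order
    if head[0:1] == b'\x00' and head[1:2] != b'\x00':
        return 'utf_16_be'   # ASCII character stored with the zero byte first
    if head[1:2] == b'\x00' and head[0:1] != b'\x00':
        return 'utf_16_le'   # ASCII character stored with the zero byte second
    return None              # cannot tell from this sample

# The four encodings of u'abc' listed above.
for head in (b'\xfe\xff\x00a\x00b\x00c', b'\xff\xfea\x00b\x00c\x00',
             b'\x00a\x00b\x00c', b'a\x00b\x00c\x00'):
    print(repr(head), '->', guess_utf16_codec(head))
```

In practice the argument would be the result of open('thefile.csv', 'rb').read(100), as in the print statement above.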

Step 2: Here’s a Python 2.X recode-UTF-16*-to-UTF-8 script:

import sys
infname, outfname, enc = sys.argv[1:4]
fi = open(infname, 'rb')
fo = open(outfname, 'wb')
BUFSIZ = 64 * 1024 * 1024
first = True
while 1:
    buf = fi.read(BUFSIZ)
    if not buf: break
    if first and enc == 'utf_16':
        bom = buf[:2]
        buf = buf[2:]
        enc = {'\xfe\xff': 'utf_16_be', '\xff\xfe': 'utf_16_le'}[bom]
        # KeyError means file doesn't start with a valid BOM
    first = False
    fo.write(buf.decode(enc).encode('utf8'))
fi.close()
fo.close()
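
If characters outside the BMP cannot be ruled out, a variation on the script above is to let an incremental decoder carry partial characters across block boundaries. This is a sketch under my own naming (recode_to_utf8 and its parameters are not from the original answer); it uses only codecs.getincrementaldecoder from the standard library.

```python
import codecs

def recode_to_utf8(in_path, out_path, encoding='utf-16', block_size=1 << 20):
    """Stream-convert a file to UTF-8 without loading it all into memory."""
    # The incremental decoder buffers incomplete code units between calls,
    # so block boundaries may fall anywhere, even inside a surrogate pair.
    decoder = codecs.getincrementaldecoder(encoding)()
    with open(in_path, 'rb') as src, open(out_path, 'wb') as dst:
        while True:
            block = src.read(block_size)
            if not block:
                break
            dst.write(decoder.decode(block).encode('utf-8'))
        # final=True makes the decoder raise if the input ended mid-character.
        dst.write(decoder.decode(b'', final=True).encode('utf-8'))
```

With encoding='utf-16' the decoder consumes the BOM itself, so no manual BOM handling is needed; pass 'utf_16_le' or 'utf_16_be' for files without one.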

Other matters:

You say that your files are too big to read the whole file, recode and rewrite, yet you can open it in vi. Please explain.

The <85> being treated as end of record is a bit of a worry. Looks like 0x85 is being recognised as NEL (C1 control code, NEWLINE). There is a strong possibility that the data was originally encoded in some legacy single-byte encoding where 0x85 has a meaning but has been transcoded to UTF-16 under the false assumption that the original encoding was ISO-8859-1 aka latin1. Where did the file originate? An IBM mainframe? Windows/Unix/classic Mac? What country, locale, language? You obviously think that the <85> is not meant to be a newline; what do you think that it means?
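
One plausible way for the split to happen is that some layer divides decoded text into lines using Unicode rules, under which U+0085 counts as a line boundary. The short sketch below only illustrates that difference between unicode strings and byte strings; it does not claim to show exactly which call is responsible in the code from the question.

```python
# Tail of the sample record from the question, as decoded text.
record = u'H f\xfcr e \x96 m \x85,,I\r\n'

# Unicode strings treat U+0085 (NEL) as a line boundary...
text_pieces = record.splitlines()
# ...while byte strings only break on \r and \n.
byte_pieces = record.encode('utf-8').splitlines()

print(len(text_pieces))
print(len(byte_pieces))
```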

Please feel free to send a copy of a cut-down file (that includes some of the <85> stuff) to sjmachin at lexicon dot net

Update based on 1-line sample data provided.

This confirms my suspicions. Read this. Here’s a quote from it:

… the C1 control characters … are rarely used directly, except on
specific platforms such as OpenVMS. When they turn up in documents,
Web pages, e-mail messages, etc., which are ostensibly in an
ISO-8859-n encoding, their code positions generally refer instead to
the characters at that position in a proprietary, system-specific
encoding such as Windows-1252 or the Apple Macintosh (“MacRoman”)
character set that use the codes provided for representation of the C1
set with a single 8-bit byte to instead provide additional graphic
characters

This code:

s1 = '\xff\xfe1\x00,\x002\x00,\x00G\x00,\x00S\x00,\x00H\x00 \x00f\x00\xfc\x00r\x00 \x00e\x00 \x00\x96\x00 \x00m\x00 \x00\x85\x00,\x00,\x00I\x00\r\x00\n\x00'
s2 = s1.decode('utf16')
print 's2 repr:', repr(s2)
from unicodedata import name
from collections import Counter
non_ascii = Counter(c for c in s2 if c >= u'\x80')
print 'non_ascii:', non_ascii
for c in non_ascii:
    print "from: U+%04X %s" % (ord(c), name(c, "<no name>"))
    c2 = c.encode('latin1').decode('cp1252')
    print "to:   U+%04X %s" % (ord(c2), name(c2, "<no name>"))

s3 = u''.join(
    c.encode('latin1').decode('1252') if u'\x80' <= c < u'\xA0' else c
    for c in s2
    )
print 's3 repr:', repr(s3)
print 's3:', s3

produces the following (Python 2.7.2 IDLE, Windows 7):

s2 repr: u'1,2,G,S,H f\xfcr e \x96 m \x85,,I\r\n'
non_ascii: Counter({u'\x85': 1, u'\xfc': 1, u'\x96': 1})
from: U+0085 <no name>
to:   U+2026 HORIZONTAL ELLIPSIS
from: U+00FC LATIN SMALL LETTER U WITH DIAERESIS
to:   U+00FC LATIN SMALL LETTER U WITH DIAERESIS
from: U+0096 <no name>
to:   U+2013 EN DASH
s3 repr: u'1,2,G,S,H f\xfcr e \u2013 m \u2026,,I\r\n'
s3: 1,2,G,S,H für e – m …,,I

Which do you think is a more reasonable interpretation of \x96:

SPA i.e. Start of Protected Area (Used by block-oriented terminals.)
or
EN DASH
?

Looks like a thorough analysis of a much larger data sample is warranted. Happy to help.
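
For use on a larger sample, the per-character repair shown above could be wrapped in a helper along these lines. This is a sketch, not a definitive fix: repair_c1 is a hypothetical name, and it assumes (as hypothesised above) that the data was really Windows-1252 mislabelled as Latin-1. A few bytes in the 0x80-0x9F range have no cp1252 mapping in Python, so those characters are left unchanged.

```python
def repair_c1(text):
    """Map C1 control characters to the cp1252 characters at the same byte."""
    out = []
    for ch in text:
        if u'\x80' <= ch <= u'\x9f':
            try:
                ch = ch.encode('latin-1').decode('cp1252')
            except UnicodeDecodeError:
                pass  # byte is undefined in cp1252; keep the original character
        out.append(ch)
    return u''.join(out)

print(repr(repair_c1(u'H f\xfcr e \x96 m \x85')))
```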

Method 4

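Just open your file with codecs.open, like this:

import codecs, csv

stream = codecs.open('yourfile.csv', encoding="utf-16")
reader = csv.reader(stream)

Then work with unicode strings throughout your program, as you should anyway when processing text. Note that the Python 2 csv documentation states that the module does not support Unicode input, so on Python 2 this may fail on non-ASCII data.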


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
