Here is the situation:
-
I get gzipped xml documents from Amazon S3
import boto from boto.s3.connection import S3Connection from boto.s3.key import Key conn = S3Connection('access Id', 'secret access key') b = conn.get_bucket('mydev.myorg') k = Key(b) k.key('documents/document.xml.gz') -
I read them in file as
import gzip f = open('/tmp/p', 'w') k.get_file(f) f.close() r = gzip.open('/tmp/p', 'rb') file_content = r.read() r.close()
Question
How can I ungzip the streams directly and read the contents?
I do not want to create temp files, they don’t look good.
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
Yes, you can use the zlib module to decompress byte streams:
import zlib
def stream_gzip_decompress(stream):
dec = zlib.decompressobj(32 + zlib.MAX_WBITS) # offset 32 to skip the header
for chunk in stream:
rv = dec.decompress(chunk)
if rv:
yield rv
The offset of 32 signals to the zlib header that the gzip header is expected but skipped.
The S3 key object is an iterator, so you can do:
for data in stream_gzip_decompress(k):
# do something with the decompressed data
Method 2
I had to do the same thing and this is how I did it:
import gzip f = StringIO.StringIO() k.get_file(f) f.seek(0) #This is crucial gzf = gzip.GzipFile(fileobj=f) file_content = gzf.read()
Method 3
For Python3x and boto3-
So I used BytesIO to read the compressed file into a buffer object, then I used zipfile to open the decompressed stream as uncompressed data and I was able to get the datum line by line.
import io
import zipfile
import boto3
import sys
s3 = boto3.resource('s3', 'us-east-1')
def stream_zip_file():
count = 0
obj = s3.Object(
bucket_name='MonkeyBusiness',
key='/Daily/Business/Banana/{current-date}/banana.zip'
)
buffer = io.BytesIO(obj.get()["Body"].read())
print (buffer)
z = zipfile.ZipFile(buffer)
foo2 = z.open(z.infolist()[0])
print(sys.getsizeof(foo2))
line_counter = 0
for _ in foo2:
line_counter += 1
print (line_counter)
z.close()
if __name__ == '__main__':
stream_zip_file()
Method 4
You can try PIPE and read contents without downloading file
import subprocess
c = subprocess.Popen(['-c','zcat -c <gzip file name>'], shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
for row in c.stdout:
print row
In addition “/dev/fd/” + str(c.stdout.fileno()) will provide you FIFO file name (Named pipe) which can be passed to other program.
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0