hadoop Archives - Magenaut

How to turn off INFO logging in Spark?

August 17, 2022 by Magenaut

I installed Spark using the AWS EC2 guide. I can launch the program with the bin/pyspark script to get to the Spark prompt, and I can also complete the Quick Start guide successfully.

How can I include a Python package with a Hadoop streaming job?

August 14, 2022 by Magenaut

I am trying to include a Python package (NLTK) with a Hadoop streaming job, but I am not sure how to do this without including every file manually via the CLI argument “-file”.

Python read file as stream from HDFS

August 14, 2022 by Magenaut

Here is my problem: I have a file in HDFS that can potentially be huge (too large to fit entirely in memory)