How can I include a Python package with a Hadoop streaming job?

I am trying to include a Python package (NLTK) with a Hadoop streaming job, but I am not sure how to do this without listing every file manually via the CLI argument “-file”.

Edit: One solution would be to install this package on all the slaves, but I don’t have that option currently.

Answers:


Method 1

Just came across this gem of a solution: http://blog.cloudera.com/blog/2008/11/sending-files-to-remote-task-nodes-with-hadoop-mapreduce/

First, create a zip with the desired libraries and give it a .mod name next to your mapper:

zip -r nltkandyaml.zip nltk yaml
mv nltkandyaml.zip /path/to/where/your/mapper/will/be/nltkandyaml.mod
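
If the zip command is not at hand, roughly the same archive can be built from Python. This is a minimal sketch, assuming the package directories (nltk and yaml in this answer) sit in the current directory; build_module_archive is a hypothetical helper name, not part of Hadoop or NLTK.

```python
import os
import zipfile


def build_module_archive(package_dirs, archive_path):
    """Zip each package directory so it sits at the root of the archive."""
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for package_dir in package_dirs:
            package_dir = os.path.abspath(package_dir)
            parent = os.path.dirname(package_dir)
            for root, _dirs, files in os.walk(package_dir):
                for name in files:
                    if name.endswith(".pyc"):
                        continue  # skip byte-code caches; the sources are enough
                    full_path = os.path.join(root, name)
                    # Store paths relative to the parent so entries read "nltk/..."
                    archive.write(full_path, os.path.relpath(full_path, parent))
    return archive_path
```

Under those assumptions, build_module_archive(["nltk", "yaml"], "nltkandyaml.mod") would produce the same layout as the zip and mv commands.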

Next, ship the renamed archive with the Hadoop streaming “-file” argument (shown in shorthand; the rest of the streaming command is omitted):

hadoop -file nltkandyaml.mod

Finally, load the libraries in Python:

import zipimport
importer = zipimport.zipimporter('nltkandyaml.mod')
yaml = importer.load_module('yaml')
nltk = importer.load_module('nltk')
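
One caveat: zipimporter.load_module has been deprecated since Python 3.10 in favour of exec_module. The sketch below is one possible way to load from the archive on both older and newer interpreters; load_from_archive is a hypothetical helper name, and the archive and package names are whatever you shipped with the job.

```python
import importlib.util
import sys
import zipimport


def load_from_archive(archive_path, module_name):
    """Import module_name from a zip archive without unpacking it."""
    importer = zipimport.zipimporter(archive_path)
    if hasattr(importer, "find_spec"):  # available on Python 3.10+
        spec = importer.find_spec(module_name)
        if spec is None:
            raise ImportError("%s not found in %s" % (module_name, archive_path))
        module = importlib.util.module_from_spec(spec)
        # Register first so imports inside the package can find it.
        sys.modules[module_name] = module
        spec.loader.exec_module(module)
        return module
    # Older interpreters: fall back to the call used in this answer.
    return importer.load_module(module_name)
```

With the archive from this answer, the calls would look like nltk = load_from_archive('nltkandyaml.mod', 'nltk').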

Additionally, this page summarizes how to include a corpus: http://www.xcombinator.com/2009/11/18/how-to-use-cascading-with-hadoop-streaming/

Download and unzip the WordNet corpus, then zip its contents flat:

cd wordnet
zip -r ../wordnet-flat.zip *

In Python:

wn = WordNetCorpusReader(nltk.data.find('lib/wordnet-flat.zip'))

Method 2

I would zip up the package into a .tar.gz or a .zip and pass the entire tarball or archive in a -file option to your hadoop command. I’ve done this in the past with Perl but not Python.

That said, I would expect this to work for you with Python’s zipimport (http://docs.python.org/library/zipimport.html), which lets you import modules directly from a zip. Note that zipimport only reads ZIP archives, so use a .zip rather than a .tar.gz for this.

Method 3

You can put the zip archive on sys.path and import from it directly:

import sys
sys.path.insert(0, 'nltkandyaml.mod')
import nltk
import yaml
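
To see that this works for an archive with a non-.zip extension, here is a small self-contained sketch. It builds a throwaway archive containing a hypothetical package called zipped_demo (not a real library) and imports it through sys.path.

```python
import importlib
import os
import sys
import tempfile
import zipfile

# Build a throwaway archive holding a hypothetical one-file package.
workdir = tempfile.mkdtemp()
archive_path = os.path.join(workdir, "demo.mod")
with zipfile.ZipFile(archive_path, "w") as archive:
    archive.writestr("zipped_demo/__init__.py",
                     "GREETING = 'hello from the archive'\n")

# Any zip file on sys.path is searched for modules, whatever its extension.
sys.path.insert(0, archive_path)
importlib.invalidate_caches()
zipped_demo = importlib.import_module("zipped_demo")
print(zipped_demo.GREETING)  # prints: hello from the archive
```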

Method 4

An example of loading the external Python package nltk.

See also the answer to “Running external python lib like (NLTK) with hadoop streaming”.
I followed the approach below and ran the nltk package with Hadoop streaming successfully.

Assumption: the package (nltk in my case) is already installed on your own system.

First:

zip -r nltk.zip nltk
mv nltk.zip /place/it/anywhere/you/like/nltk.mod

Why will any location work?

Because the path to this zipped .mod file is passed on the command line, Hadoop copies it into the task’s working directory, so where it lives locally does not matter.

Second, change your mapper (or other .py file):

# Hadoop ships the archive as-is and does not unzip it, so import straight from the zip
import zipimport
importer = zipimport.zipimporter('nltk.mod')
nltk = importer.load_module('nltk')

# now import whatever you need from nltk
from nltk import tree
from nltk import load_parser
from nltk.corpus import stopwords
nltk.data.path += ["."]
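
For orientation, the rest of a mapper.py might look roughly like the sketch below. It is only an illustration of the streaming contract (read lines from stdin, write tab-separated key/value records to stdout); the whitespace tokenisation and the small STOPWORDS set are hypothetical stand-ins for whatever NLTK calls you actually use.

```python
import sys

# Hypothetical stand-in for a real stop-word list such as NLTK's.
STOPWORDS = {"a", "an", "and", "of", "the", "to"}


def map_lines(lines, stopwords=STOPWORDS):
    """Yield one 'word<TAB>1' record per token that is not a stop word."""
    for line in lines:
        for token in line.lower().split():
            word = token.strip(".,;:!?\"'()")
            if word and word not in stopwords:
                yield "%s\t1" % word


if __name__ == "__main__":
    # Hadoop streaming feeds the input split on stdin and reads stdout.
    for record in map_lines(sys.stdin):
        print(record)
```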

Third, the command line to run the MapReduce job:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-file /your/path/to/mapper/mapper.py \
-mapper '/usr/local/bin/python3.4 mapper.py' \
-file /your/path/to/reducer/reducer.py \
-reducer '/usr/local/bin/python3.4 reducer.py' \
-file /your/path/to/nltkzippedmodfile/nltk.mod \
-input /your/path/to/HDFS/input/check.txt -output /your/path/to/HDFS/output/

These steps solved my problem, and I think they should work for others as well.
Cheers.

Method 5

If you are using more complex libraries such as numpy or pandas, a virtualenv is a better approach.
You can use -archives to send the env to the cluster.

See this write-up:
https://henning.kropponline.de/2014/07/18/virtualenv-hadoop-streaming/

Updated:

I tried the virtualenv approach above in our production environment and ran into problems. On the cluster there were errors such as “Could not find platform independent libraries”. I then used conda to create the Python env, and it worked well.

If you read Chinese, see: https://blog.csdn.net/Jsin31/article/details/53495423

If not, here is a brief translation:

Create an env with conda:

conda create -n test python=2.7.12 numpy pandas

Go to the directory that holds the conda env. You can find its path with:

conda env list

Then pack the env:

tar cf test.tar test

Submit the job through Hadoop streaming:

hadoop jar /usr/lib/hadoop/hadoop-streaming.jar \
-archives test.tar \
-input /user/testfiles \
-output /user/result \
-mapper "test.tar/test/bin/python mapper.py" \
-file mapper.py \
-reducer "test.tar/test/bin/python reducer.py" \
-file reducer.py
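
When debugging which interpreter a task really runs, a throwaway mapper along these lines may help. It is a sketch, not part of the original answer: interpreter_report is a hypothetical helper that emits the interpreter path, version and prefix, so the job output shows whether the Python from the shipped archive was picked up.

```python
import sys


def interpreter_report():
    """Return a tab-separated record describing the running interpreter."""
    version = "%d.%d.%d" % sys.version_info[:3]
    return "\t".join([sys.executable or "unknown", version, sys.prefix])


if __name__ == "__main__":
    # Read the whole input split before emitting the single report line.
    for _line in sys.stdin:
        pass
    print(interpreter_report())
```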

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0.
