I am trying to include a Python package (NLTK) with a Hadoop streaming job, but I am not sure how to do this without including every file manually via the CLI argument “-file”.
Edit: One solution would be to install this package on all the slaves, but I don’t have that option currently.
Answers:
Method 1
Just came across this gem of a solution: http://blog.cloudera.com/blog/2008/11/sending-files-to-remote-task-nodes-with-hadoop-mapreduce/
First, create a zip with the desired libraries and move it next to your mapper under a .mod name:
zip -r nltkandyaml.zip nltk yaml
mv nltkandyaml.zip /path/to/where/your/mapper/will/be/nltkandyaml.mod
Next, ship the renamed archive via the Hadoop streaming “-file” argument:
hadoop -file nltkandyaml.mod
Finally, load the libraries in Python:
import zipimport
importer = zipimport.zipimporter('nltkandyaml.mod')
yaml = importer.load_module('yaml')
nltk = importer.load_module('nltk')
Additionally, this page summarizes how to include a corpus: http://www.xcombinator.com/2009/11/18/how-to-use-cascading-with-hadoop-streaming/
Download and unzip the wordnet corpus, then zip it from inside its directory:
cd wordnet
zip -r ../wordnet-flat.zip *
In Python:
wn = WordNetCorpusReader(nltk.data.find('lib/wordnet-flat.zip'))
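The "flat" layout is the point of running `zip` from inside the `wordnet` directory: the corpus files end up at the root of the archive rather than under a `wordnet/` folder. Where the `zip` CLI is not available, a rough Python equivalent might look like the sketch below. `make_flat_zip` is a hypothetical helper name and the paths are placeholders.

```python
import os
import zipfile


def make_flat_zip(source_dir, archive_path):
    """Zip the contents of source_dir so entries sit at the archive root."""
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(source_dir):
            for name in sorted(files):
                full_path = os.path.join(root, name)
                # Store paths relative to source_dir, i.e. without a leading
                # "wordnet/" component, mirroring `cd wordnet; zip -r ... *`.
                arcname = os.path.relpath(full_path, source_dir)
                zf.write(full_path, arcname)
    return archive_path
```

For example, `make_flat_zip("wordnet", "wordnet-flat.zip")` writes the archive next to the corpus directory. Keep the output file outside `source_dir` so the archive does not try to include itself. Unlike the shell glob `*`, this walk also picks up hidden files.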
Method 2
I would zip up the package into a .tar.gz or a .zip and pass the entire tarball or archive in a -file option to your hadoop command. I’ve done this in the past with Perl but not Python.
That said, I would think this would still work for you if you use Python’s zipimport at http://docs.python.org/library/zipimport.html, which allows you to import modules directly from a zip.
Method 3
You can also put the zipped library on sys.path and import from it directly:
import sys
sys.path.insert(0, 'nltkandyaml.mod')
import nltk
import yaml
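This works because Python's import system treats a zip archive on `sys.path` like a directory of modules, whatever the file extension. A small self-contained check of that behavior, using a throwaway package (`path_demo_pkg` and `path_demo.mod` are made-up names) rather than NLTK, could look like this:

```python
import importlib
import os
import sys
import tempfile
import zipfile


def build_path_demo(directory):
    """Create path_demo.mod holding a hypothetical one-file package."""
    archive_path = os.path.join(directory, "path_demo.mod")
    with zipfile.ZipFile(archive_path, "w") as zf:
        zf.writestr("path_demo_pkg/__init__.py", "VALUE = 42\n")
    return archive_path


def import_from_zip_on_path(archive_path, module_name):
    """Put the archive first on sys.path, then import module_name normally."""
    if archive_path not in sys.path:
        sys.path.insert(0, archive_path)
    # Let the import system notice an archive created after startup.
    importlib.invalidate_caches()
    return importlib.import_module(module_name)
```

Calling `import_from_zip_on_path(build_path_demo(some_dir), "path_demo_pkg")` returns the imported package. One caveat: packages that ship compiled extension modules generally cannot be imported from a zip this way.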
Method 4
For an example of loading the external Python package nltk, refer to the answer to “Running extrnal python lib like (NLTK) with hadoop streaming”.
I followed the approach below and ran the nltk package with Hadoop streaming successfully.
Assumption: you already have the package (nltk in my case) on your system.
first:
zip -r nltk.zip nltk
mv nltk.zip /place/it/anywhere/you/like/nltk.mod
Why will anywhere work?
Answer: because we pass the path to this zipped .mod file on the command line, its location on your machine does not matter.
second:
Changes in your mapper (or other .py file):
# Hadoop does not unpack the archive for you, so import straight from the zipped file
import zipimport
importer = zipimport.zipimporter('nltk.mod')
nltk = importer.load_module('nltk')
# now import whatever you like from nltk
from nltk import tree
from nltk import load_parser
from nltk.corpus import stopwords
nltk.data.path += ["."]
third:
command line argument to run map-reduce
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -file /your/path/to/mapper/mapper.py \
  -mapper '/usr/local/bin/python3.4 mapper.py' \
  -file /your/path/to/reducer/reducer.py \
  -reducer '/usr/local/bin/python3.4 reducer.py' \
  -file /your/path/to/nltkzippedmodfile/nltk.mod \
  -input /your/path/to/HDFS/input/check.txt \
  -output /your/path/to/HDFS/output/
These steps solved my problem, and I think they should work for others as well. Cheers.
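For context, a streaming mapper is just a script that reads lines from standard input and writes tab-separated key/value pairs to standard output. The sketch below shows one possible shape for `mapper.py`. The word-count logic and the hard-coded `STOPWORDS` set are illustrative stand-ins (in the setup above, the list would come from `nltk.corpus.stopwords` after loading `nltk.mod`); this is not the answer author's actual mapper.

```python
import sys

# Hypothetical stand-in for an NLTK stopword list, so the sketch runs anywhere.
STOPWORDS = frozenset({"a", "an", "and", "of", "the", "to"})


def map_lines(lines, stopwords=STOPWORDS):
    """Yield 'word<TAB>1' records for every non-stopword token."""
    for line in lines:
        for token in line.lower().split():
            word = token.strip(".,;:!?\"'()")
            if word and word not in stopwords:
                yield f"{word}\t1"


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Hadoop streaming feeds the input split on stdin and collects stdout."""
    for record in map_lines(stdin):
        stdout.write(record + "\n")
```

Ending the file with a call to `main()` turns it into a runnable mapper. Keeping the logic in `map_lines` makes it easy to test locally with a plain list of strings before submitting a job.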
Method 5
If you are using more complex libraries such as numpy or pandas, virtualenv is a better way.
You can add -archives to send the environment to the cluster.
See this write-up:
https://henning.kropponline.de/2014/07/18/virtualenv-hadoop-streaming/
Updated:
I tried the virtualenv approach above in our online environment and ran into problems: on the cluster there were errors such as “Could not find platform independent libraries”. I then used conda to create the Python environment, and it worked well.
If you read Chinese, see: https://blog.csdn.net/Jsin31/article/details/53495423
If not, here is a brief translation:
- Create an env with conda:
conda create -n test python=2.7.12 numpy pandas
- Go to the directory that holds the conda env. You can find it with the command:
conda env list
Then pack the env:
tar cf test.tar test
- Submit the job through Hadoop streaming:
hadoop jar /usr/lib/hadoop/hadoop-streaming.jar \
  -archives test.tar \
  -input /user/testfiles \
  -output /user/result \
  -mapper "test.tar/test/bin/python mapper.py" \
  -file mapper.py \
  -reducer "test.tar/test/bin/python reducer.py" \
  -file reducer.py
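Hadoop expects generic options such as `-archives` to come before streaming-specific options such as `-input` and `-mapper`, which is easy to get wrong when assembling a long command by hand. One way to keep the order straight is to build the argument list in a small script. The following is only a sketch: `build_streaming_command` is a hypothetical helper, and the jar path, archive name, and HDFS paths are simply the ones from the command above.

```python
import shlex


def build_streaming_command(jar, archive, env_name, input_path, output_path,
                            mapper="mapper.py", reducer="reducer.py"):
    """Return the argv list for a streaming job that ships a packed Python env."""
    # As the command above assumes, the archive is unpacked on each node under
    # a directory named after it, so Python is at <archive>/<env_name>/bin/python.
    python = f"{archive}/{env_name}/bin/python"
    return [
        "hadoop", "jar", jar,
        # Generic options first.
        "-archives", archive,
        # Streaming options afterwards.
        "-input", input_path,
        "-output", output_path,
        "-mapper", f"{python} {mapper}",
        "-file", mapper,
        "-reducer", f"{python} {reducer}",
        "-file", reducer,
    ]


if __name__ == "__main__":
    argv = build_streaming_command(
        jar="/usr/lib/hadoop/hadoop-streaming.jar",
        archive="test.tar",
        env_name="test",
        input_path="/user/testfiles",
        output_path="/user/result",
    )
    # Print a copy-pasteable command; pass argv to subprocess.run to execute it.
    print(shlex.join(argv))
```

Building the command as a list also avoids quoting mistakes around the mapper and reducer strings, since each one stays a single argument.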
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.