Bootstrapping dependencies on Amazon EMR with Python mrjob

I am trying to run a MapReduce job on Amazon EMR with Python mrjob, and I am having some trouble installing dependencies.
My mrjob code:
from mrjob.job import MRJob
import re
from normalize import *
from compute_features import *
#Some code
The normalize and compute_features files have many dependencies, including numpy, scipy, sklearn, fiona, ...
My mrjob.conf file:
runners:
  emr:
    aws_access_key_id: xxxx
    aws_secret_access_key: xxxx
    aws_region: eu-west-1
    ec2_key_pair: EMR
    ec2_key_pair_file: /Users/antoinerigoureau/Documents/emr.pem
    ssh_tunnel: true
    ec2_instance_type: m3.xlarge
    ec2_master_instance_type: m3.xlarge
    num_ec2_instances: 1
    cmdenv:
      TZ: Europe/Paris
    bootstrap_python: false
    bootstrap:
    - curl -s https://s3-eu-west-1.amazonaws.com/data-essence/utils/bootstrap.sh | sudo bash -s
    - source /usr/local/ripple/venv/bin/activate
    - sudo pip install -r req.txt#
    upload_archives:
    - /Users/antoinerigoureau/Documents/Essence/data/geoData/urba_france.zip#data
    upload_files:
    - /Users/antoinerigoureau/Documents/Essence/Source/venv_parallel/normalize.py
    - /Users/antoinerigoureau/Documents/Essence/Source/venv_parallel/compute_features.py
    python_bin: /usr/local/ripple/venv/bin/python3
    enable_emr_debugging: True
    setup:
    - source /usr/local/ripple/venv/bin/activate
  local:
    upload_archives:
    - /Users/antoinerigoureau/Documents/Essence/data/geoData/urba_france.zip#data
And my bootstrap.sh file is:
#!/bin/bash
set -e
set -x
yum update -y
# install yum packages
yum install -y gcc \
  geos-devel \
  gcc-c++ \
  atlas-sse3-devel \
  lapack-devel \
  libpng-devel \
  freetype-devel \
  zlib-devel \
  ncurses-devel \
  readline-devel \
  patch \
  make \
  libtool \
  curl \
  openssl-devel \
  screen
pushd $HOME
# install python
rm -rf Python-3.5.1.tgz
wget https://www.python.org/ftp/python/3.5.1/Python-3.5.1.tgz &&\
tar -xzvf Python-3.5.1.tgz
pushd Python-3.5.1
./configure
make -j 4
make install
popd
export PATH=/usr/local/bin:$PATH
echo export PATH=/usr/local/bin:\$PATH > /etc/profile.d/usr_local_path.sh
chmod +x /etc/profile.d/usr_local_path.sh
pip3.5 install --upgrade pip virtualenv
mkdir -p /usr/local/ripple/venv
virtualenv /usr/local/ripple/venv
source /usr/local/ripple/venv/bin/activate
# install gdal
rm -rf gdal191.zip
wget http://download.osgeo.org/gdal/gdal191.zip &&\
unzip gdal191.zip
#
# Here is the trick I had to add to get around the following -fPIC error
# /usr/bin/ld: /root/gdal-1.9.1/frmts/o/.libs/aaigriddataset.o: relocation R_X86_64_32S against `vtable for AAIGRasterBand' can not be used when making a shared object; recompile with -fPIC
#
pushd gdal-1.9.1
./configure
CC="gcc -fPIC" CXX="g++ -fPIC" make -j4
make install
popd
export LD_LIBRARY_PATH=/usr/local/lib:${LD_LIBRARY_PATH}
echo export LD_LIBRARY_PATH=/usr/local/lib:\$LD_LIBRARY_PATH > /etc/profile.d/gdal_library_path.sh
chmod +x /etc/profile.d/gdal_library_path.sh
However, my job fails, with the following output:
Created new cluster j-T8UUFEZILJYQ
Waiting for step 1 of 1 (s-3SOCF1ZPWJ575) to complete...
PENDING (cluster is STARTING)
PENDING (cluster is STARTING)
PENDING (cluster is STARTING)
PENDING (cluster is STARTING)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
Opening ssh tunnel to resource manager...
Connect to resource manager at: http://localhost:40199/cluster
RUNNING for 16.2s
Unable to connect to resource manager
RUNNING for 48.8s
FAILED
Cluster j-T8UUFEZILJYQ is TERMINATING: Shut down as step failed
Attempting to fetch counters from logs...
Looking for step log in /mnt/var/log/hadoop/steps/s-3SOCF1ZPWJ575 on ec2-54-194-248-128.eu-west-1.compute.amazonaws.com...
Parsing step log: ssh://ec2-54-194-248-128.eu-west-1.compute.amazonaws.com/mnt/var/log/hadoop/steps/s-3SOCF1ZPWJ575/syslog
Counters: 9
Job Counters
Data-local map tasks=1
Failed map tasks=4
Launched map tasks=4
Other local map tasks=3
Total megabyte-seconds taken by all map tasks=33988320
Total time spent by all map tasks (ms)=23603
Total time spent by all maps in occupied slots (ms)=1062135
Total time spent by all reduces in occupied slots (ms)=0
Total vcore-seconds taken by all map tasks=23603
Scanning logs for probable cause of failure...
Looking for task logs in /mnt/var/log/hadoop/userlogs/application_1463748945334_0001 on ec2-54-194-248-128.eu-west-1.compute.amazonaws.com and task/core nodes...
Parsing task syslog: ssh://ec2-54-194-248-128.eu-west-1.compute.amazonaws.com/mnt/var/log/hadoop/userlogs/application_1463748945334_0001/container_1463748945334_0001_01_000006/syslog
Parsing task stderr: ssh://ec2-54-194-248-128.eu-west-1.compute.amazonaws.com/mnt/var/log/hadoop/userlogs/application_1463748945334_0001/container_1463748945334_0001_01_000006/stderr
Probable cause of failure:
R/W/S=1749/0/0 in:NA [rec/s] out:NA [rec/s]
minRecWrittenToEnableSkip_=9223372036854775807 HOST=null
USER=hadoop
HADOOP_USER=null
last tool output: |null|
java.io.IOException: Broken pipe
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:345)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at org.apache.hadoop.streaming.io.TextInputWriter.writeUTF8(TextInputWriter.java:72)
at org.apache.hadoop.streaming.io.TextInputWriter.writeValue(TextInputWriter.java:51)
at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:106)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:65)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
(from lines 48-72 of ssh://ec2-54-194-248-128.eu-west-1.compute.amazonaws.com/mnt/var/log/hadoop/userlogs/application_1463748945334_0001/container_1463748945334_0001_01_000006/syslog)
caused by:
+ /usr/local/ripple/venv/bin/python3 test_mrjob.py --step-num=0 --mapper
Traceback (most recent call last):
File "test_mrjob.py", line 2, in <module>
import numpy as np
ImportError: No module named 'numpy'
(from lines 31-35 of ssh://ec2-54-194-248-128.eu-west-1.compute.amazonaws.com/mnt/var/log/hadoop/userlogs/application_1463748945334_0001/container_1463748945334_0001_01_000006/stderr)
while reading input from s3://data-essence/databerries-01/extract_essence_000000000001.gz
Step 1 of 1 failed
Killing our SSH tunnel (pid 1288)
Terminating cluster: j-T8UUFEZILJYQ
I tested all my bootstrap actions on a VM beforehand, and they seemed to work fine. Any hints as to what is happening?
UPDATE: I tried running the basic mrjob example code with an additional numpy import and the same install procedure. I get the same error: the job fails because it cannot import numpy.
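To narrow down whether the task is actually running under the interpreter I configured as `python_bin`, I put together a small stdlib-only check (the helper name `check_imports` is my own, not part of mrjob) that asks a given interpreter to import each module and reports the failures:

```python
import subprocess
import sys

def check_imports(python_bin, modules):
    """Return the subset of modules that python_bin fails to import."""
    missing = []
    for mod in modules:
        # Run: <python_bin> -c "import <mod>" and record a non-zero exit.
        result = subprocess.run([python_bin, "-c", "import " + mod],
                                stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE)
        if result.returncode != 0:
            missing.append(mod)
    return missing

if __name__ == "__main__":
    # Locally this checks the current interpreter; on the master node I
    # would pass /usr/local/ripple/venv/bin/python3 instead.
    print(check_imports(sys.executable, ["numpy", "scipy", "sklearn", "fiona"]))
```

Running this over SSH on the master node against `/usr/local/ripple/venv/bin/python3` should tell me whether the venv really has the packages, or whether the job is being launched with a different interpreter.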