Packaging code with PEX — a PySpark example

Fabian Höring
Jan 8, 2019 · 8 min read

Serializing the code that runs on the executors

>>> rdd = sc.parallelize([1, 3], numSlices=2)
>>> import math
>>> def sqrt(a):
...     return math.sqrt(a)
...
>>> def pyth_add(a, b):
...     return sqrt(a ** 2 + b ** 2)
...
>>> rdd.reduce(lambda x, y: pyth_add(x, y))
3.1622776601683795
>>> import numpy as np
>>> rdd = sc.parallelize([np.array([1,2,3]), np.array([1,2,3])], numSlices=2)
>>> rdd.reduce(lambda x,y: np.dot(x,y))
[Stage 0:> (0 + 2) / 2]
18/11/09 18:29:17 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, xx-xx-xx-xx-xx-xx.xxx.xxx.criteo.prod, executor 1): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
...
ModuleNotFoundError: No module named 'numpy'
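
The failure above comes down to how functions travel to the executors. PySpark pickles the closure on the driver, and anything that lives in an importable module is stored as a reference (module name plus attribute name) rather than as code, so the executor has to import that module itself. The sketch below reproduces the effect with the standard library `pickle` only. `fakelib` is a made-up module standing in for numpy, and no Spark is involved, so treat it as an illustration of the mechanism rather than of PySpark's exact serializer.

```python
import pickle
import sys
import types

# Hypothetical stand-in for a third-party package such as numpy.
fake = types.ModuleType("fakelib")
exec("def dot(x, y):\n    return sum(a * b for a, b in zip(x, y))", fake.__dict__)
sys.modules["fakelib"] = fake

# "Driver" side: the function is pickled by reference, so the payload
# holds the names 'fakelib' and 'dot', not the function body.
payload = pickle.dumps(fake.dot)
has_names = b"fakelib" in payload and b"dot" in payload

# "Executor" side: simulate a machine where the package is not installed.
del sys.modules["fakelib"]
try:
    pickle.loads(payload)
    error = None
except ModuleNotFoundError as exc:
    error = str(exc)

print(has_names)
print(error)
```

The second printed value is the same kind of `No module named ...` message that the executor reports for numpy.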

Different ways to ship code to the executors

Using py-files
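
With `--py-files` (or `sc.addPyFile`), Spark ships `.py`, `.zip` or `.egg` files and places them on the Python path of the executors. The sketch below shows the underlying mechanism, importing pure-Python code straight from a zip archive, using only the standard library. The module name `userlib_demo` and the archive name `deps.zip` are invented for the illustration. Packages with compiled extensions, numpy included, cannot be imported from a zip this way, which is the main limit of this option.

```python
import importlib
import os
import sys
import tempfile
import zipfile

# Build a zip that contains one pure-Python module (hypothetical name).
workdir = tempfile.mkdtemp()
archive = os.path.join(workdir, "deps.zip")
source = "def pyth_add(a, b):\n    return (a ** 2 + b ** 2) ** 0.5\n"
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("userlib_demo.py", source)

# Putting the archive on sys.path is enough for the import system
# to find modules inside it (zipimport).
sys.path.insert(0, archive)
importlib.invalidate_caches()
userlib_demo = importlib.import_module("userlib_demo")

result = userlib_demo.pyth_add(3, 4)
print(result)
```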

Install packages globally

Using virtual environments

Why do we need yet another tool at Criteo?

Using PEX to ship code

$ pex numpy -o myarchive.pex
$ export PYSPARK_DRIVER_PYTHON=`which python`
$ export PYSPARK_PYTHON=./myarchive.pex
$ pyspark \
--conf spark.executorEnv.PEX_ROOT=./.pex \
--master yarn \
--deploy-mode client \
--files myarchive.pex
>>> import numpy as np
>>> rdd = sc.parallelize([np.array([1,2,3]), np.array([1,2,3])], numSlices=2)
>>> rdd.reduce(lambda x,y: np.dot(x,y))
14
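
The same launch can be scripted. What follows is a rough sketch and not part of the original setup: a hypothetical helper, `pex_launch_spec`, that assembles the environment variables and the `pyspark` command line shown above so that a wrapper script could hand them to `subprocess.run`. It only builds the values and does not start Spark. Using `sys.executable` in place of `which python` is an assumption.

```python
import os
import shlex
import sys


def pex_launch_spec(pex_file, master="yarn", deploy_mode="client"):
    """Return (env, argv) mirroring the shell commands above.

    Hypothetical helper for illustration; it only assembles values.
    """
    env = {
        # The driver keeps using the local interpreter (assumption:
        # sys.executable plays the role of `which python`).
        "PYSPARK_DRIVER_PYTHON": sys.executable,
        # Executors run the PEX file, which --files copies into their
        # working directory under its base name.
        "PYSPARK_PYTHON": "./" + os.path.basename(pex_file),
    }
    argv = [
        "pyspark",
        "--conf", "spark.executorEnv.PEX_ROOT=./.pex",
        "--master", master,
        "--deploy-mode", deploy_mode,
        "--files", pex_file,
    ]
    return env, argv


env, argv = pex_launch_spec("myarchive.pex")
print(shlex.join(argv))
```

A caller would merge `env` into a copy of `os.environ` before launching `argv`.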

Developing PySpark applications

$ pip list --exclude-editable --format json
[{"name": "numpy", "version": "1.15.4"}, {"name": "pex", "version": "1.5.2"}, {"name": "pip", "version": "10.0.1"}, {"name": "setuptools", "version": "39.0.1"}, {"name": "wheel", "version": "0.31.1"}]
$ pip list -e --format json
[{"name": "userlib", "version": "0.0.1"}]
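
The two listings separate pinned third-party packages from the editable package under development. As a sketch of how that split could feed a PEX build, the hypothetical helper `to_requirements` below turns the JSON from `pip list --exclude-editable --format json` into `name==version` requirement strings. The sample input is the output shown above; whether build tools such as pip or wheel should be filtered out is left open here.

```python
import json

# Output of `pip list --exclude-editable --format json`, copied from above.
PIP_JSON = (
    '[{"name": "numpy", "version": "1.15.4"}, '
    '{"name": "pex", "version": "1.5.2"}, '
    '{"name": "pip", "version": "10.0.1"}, '
    '{"name": "setuptools", "version": "39.0.1"}, '
    '{"name": "wheel", "version": "0.31.1"}]'
)


def to_requirements(pip_json):
    """Convert pip's JSON listing into pinned requirement strings."""
    return ["{name}=={version}".format(**pkg) for pkg in json.loads(pip_json)]


requirements = to_requirements(PIP_JSON)
print(" ".join(requirements))
```

The resulting strings are in the form that `pex` accepts as positional requirements, as in `pex numpy -o myarchive.pex`.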

Running in Prod

$ pex pyspark==2.3.2 numpy userlib -o myarchive.pex
$ ./myarchive.pex -m userlib.startup
14

Wrap-up

Criteo R&D Blog

Tech stories from the R&D team
