Deploying R models in production with Apache Airflow and AWS Batch

Teodor Popescu
Published in BBC Data Science
10 min read · Apr 14, 2020

Productionizing machine learning models in R: A step-by-step guide

Disclaimer: We expect readers to be familiar with general data engineering concepts, Amazon Web Services and Apache Airflow.

One of our aims as data engineers is to provide an easy, scalable, technology- and language-agnostic way of leveraging the petabytes of data we hold at the BBC, so that data scientists can shape customized experiences for our users across our offering, i.e. BBC iPlayer, BBC Sounds, BBC News etc.

While Python-based machine learning models are very popular, thanks to the large scikit-learn community, ease of implementation and the increasing adoption of Python around the world, we were less familiar with the R equivalents.

Here at the BBC, we were tasked with productionizing an R-based XGBoost model needed by the BBC Sports Product team for highly granular user segmentation, allowing tailored content on the BBC Sports Home page.

Our use case involves a batch learning deployment through an Airflow DAG. Before running the R scripts containing the XGBoost model, our data pipeline runs an ETL (extract-transform-load) process with PySpark on AWS EMR (Elastic MapReduce) to extract data from multiple sources and perform data cleaning and feature extraction before storing the outputs in an S3 bucket. Once this data is stored in the S3 bucket, the Airflow worker moves to the next task, which runs the R-based model training, validation, testing and deployment.

R model deployment lifecycle

We tried several approaches before getting to the solution outlined in this article. We document our prior trials below to showcase our learning process and how our solution evolved. Suggestions for overcoming these challenges are welcome in the comments section. For the R model deployment step-by-step guide, jump to Trial #4.

Trial #1: Using a multi-container Apache Airflow Docker setup: running an R-based Docker container inside the Apache Airflow worker container

  • As Apache Airflow has no support for R (there is no R operator), a Docker Operator was used for running R code inside a Docker container with r-base installed. (See the Docker image)
  • This Docker container runs inside another Docker container, where the Apache Airflow worker runs.
  • Running a Docker container inside another Docker container is not a recommended technique, as suggested by Jérôme Petazzoni, author at the Docker blog.
  • This trial led to a connection error between the first- and second-level containers (the Apache Airflow worker container and the r-base container); after several hours of debugging, we decided not to pursue this approach further.

Trial #2: Using an Airflow BashOperator and having R installed on the Apache Airflow worker container

While this proved a good option for running R code inside the Apache Airflow worker container, we encountered two main problems that make it a limited option for the future:

  1. Lack of scalability: Even though one can scale Airflow workers to run tasks in parallel, processing the very large data volumes required in near real time (e.g. user data) means you would always have to resize workers for each use case, as well as absorb the extra time spent between task executions. For more details, please see this article.
  2. Technical debt: As audience data volumes grow over time, so do data processing requirements. This approach would require significant additional work to achieve maximum parallelism (i.e. the number of task instances running concurrently on Airflow), irrespective of Airflow executor type and configuration.

Even though Airflow worker capacity could be expanded, this option breaks modularity principles and forces Apache Airflow to be redeployed on cloud instances every time an R model is productionized.

Trial #3: Developing an R Operator

  • An R Operator has been developed here, but it is still at the pull request stage. Comments suggest the operator is runnable, with users reporting that they have integrated it into their DAGs, yet several operator-related unit tests still fail.
  • Testing this approach and contributing to the pull request would have been a valuable contribution to the open source community, but also a significant development cost that we cannot currently incur, given the impact of other data science initiatives we are looking to deploy.
  • Hence, we avoided this approach: running an Airflow DAG in production with an unstable operator lacks reliability and is too risky for our use cases.

Trial #4: Using an Apache Airflow Docker image, an r-base Docker image and AWS Batch for running batch jobs on an EC2 instance.

This is the solution that gives us reliability and scalability, along with the potential for high manageability, efficiency and availability; we provide more details of our solution below.

It leverages knowledge of Apache Airflow, Docker and AWS services, such as AWS Batch, AWS ECR, AWS EC2 and AWS S3.

Prerequisites:

  • Docker is installed on your machine
  • You have an AWS account
  • You run Apache Airflow locally

Step 1: Create a Dockerfile

  • A Dockerfile is used for building the Docker image we will use for running the R code. Several AWS command line interface commands will be needed as part of our process (e.g. copying code files from an S3 bucket into the Docker container that runs on an EC2 instance).
  • The Dockerfile will contain commands used for installing the binary packages required to run the R code successfully.
FROM rocker/tidyverse:3.6.2
RUN apt-get update
RUN apt-get install -y python3-pip
RUN apt-get install -y r-base
RUN apt-get install -y nano
RUN pip3 install awscli --upgrade --ignore-installed six

RUN R -e "install.packages('data.table', repos = 'http://cran.us.r-project.org')"
RUN R -e "install.packages('digest', repos = 'http://cran.us.r-project.org')"
RUN R -e "install.packages('xgboost', repos = 'http://cran.us.r-project.org')"
RUN R -e "install.packages('R.utils', repos = 'http://cran.us.r-project.org')"

COPY scripts/run_model_predictions.sh /usr/local/bin/
RUN chmod 755 /usr/local/bin/run_model_predictions.sh

The commands above add the AWS CLI, the pip package management system and the nano text editor (kept for ease of future debugging) on top of the rocker/tidyverse image, a Docker artefact containing a version-stable build of R, RStudio and R packages. The second part installs the R packages required by our code. (These install commands could also be moved inside the R code.) Lastly, we copy into the image a Shell script containing all the commands required for leveraging the model in production.

Open a Terminal (on macOS) or cmd (on Windows), navigate to the Dockerfile path and run the following:

docker build -t <image_name> .

This command will build an image with r-base installed for running R scripts and the AWS CLI used later on for moving files within AWS.

Step 2: Push the image to AWS Elastic Container Registry

  • Use the AWS console to go to ECR and create an ECR repository or use CloudFormation for this task.

Note: The repository name has to match the image name defined above.
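
This step can also be scripted with the AWS CLI. A minimal example, assuming the same repository name as the image we build and tag in the next steps:

aws ecr create-repository --repository-name r-base-aws-cli-image --region <region_name>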

  • Back in the terminal, log in from the command line by using your AWS login credentials
ACCOUNT="123456789100"
ECR_PW=$(aws ecr get-login --region <region_name> --registry-ids ${ACCOUNT} | cut -d' ' -f6)
docker login https://${ACCOUNT}.dkr.ecr.<region_name>.amazonaws.com -u AWS -p $ECR_PW
  • Tag the image
IMAGE="r-base-aws-cli-image"
TAG="0.0.1"
docker tag ${IMAGE} ${ACCOUNT}.dkr.ecr.<region_name>.amazonaws.com/${IMAGE}:${TAG}
  • Push the image to AWS ECR using the docker push command:
docker push ${ACCOUNT}.dkr.ecr.<region_name>.amazonaws.com/${IMAGE}:${TAG}

Step 3: Store your R code and your data in an S3 bucket.

This step can be completed manually via the AWS console, by using the AWS CLI or through a Python script by uploading artefacts into S3 with the boto3 library.

The fastest way is by using the AWS CLI as follows:

aws s3 cp <your directory path> s3://<your bucket name>/

Note: In our use case, only the R code needed copying, as the data was already stored in S3 by the previous task (the PySpark data processing on AWS EMR described above). This step is one of the first tasks defined in the Airflow pipeline (more on this in step 8).

Step 4: Create an AWS Batch job queue

This can be done via the AWS console or the AWS CLI. To do this programmatically, one can use an Airflow BashOperator containing an AWS CLI command for creating a job queue, as follows:

aws batch create-job-queue --job-queue-name <job_queue_name> --priority <integer> --compute-environment-order order=1,computeEnvironment=<compute_environment_name>

Example: aws batch create-job-queue --job-queue-name test-queue-from-cli --priority 1 --compute-environment-order order=1,computeEnvironment=first-run-compute-environment

You can otherwise set up a job queue manually via the AWS Console.

Step 5: Define one or multiple batch jobs

Go to the AWS Batch console and create a job from the job definition tab.

For our use case, we have only defined one job used for executing the Shell script.

Ensure that the following is true:

  1. The container image configuration value is the URI of the Docker image stored in ECR.
  2. The command is the one you define for running the R scripts: it can be a Shell script, an Rscript command or an AWS CLI command, provided the relevant tooling was added when building the Docker image. A scripted equivalent of such a job definition is sketched below.
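
If you prefer scripting the job definition over using the console, a hedged sketch of the equivalent AWS CLI call is shown below; the vCPU and memory values are illustrative assumptions, while the job definition name, image URI and command match the ones used elsewhere in this article:

aws batch register-job-definition \
  --job-definition-name sport-segments \
  --type container \
  --container-properties '{
    "image": "<account_id>.dkr.ecr.<region_name>.amazonaws.com/r-base-aws-cli-image:0.0.1",
    "vcpus": 2,
    "memory": 4096,
    "command": ["run_model_predictions.sh"]
  }'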

Step 6: Submit the jobs

This can be done via the AWS Console or by defining an Apache Airflow DAG task with the AWSBatchOperator. Examples for the latter follow in step 8.
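
Before wiring the submission into Airflow, it can be useful to submit a job once by hand from the CLI. A minimal example, assuming the job definition and job queue names used later in our DAG (the job name here is just an illustrative label):

aws batch submit-job \
  --job-name manual-test-run \
  --job-definition sport-segments \
  --job-queue first-run-job-queue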

At this point, we have stored in ECR a Docker image capable of running R code, hence our R model too.

It’s worth mentioning the running process: we submit a job to AWS Batch, and this job runs a Shell script inside a container started from the Docker image we stored on AWS ECR. Linking the AWS Batch job to the image stored on ECR is done via the AWS console at the job definition stage and is a very user-friendly, straightforward process.

The script runs the following commands (a minimal sketch follows this list):

  • One task to copy the R code and relevant data from S3 into the running Docker container
  • One task to run the R script (Rscript predictive_analytics.R)
  • One task to store the RDS models in an S3 bucket
  • One task to copy the prediction outputs from the running Docker container to an S3 bucket, where they can be further consumed by linking the data to Athena
  • If the model’s prediction accuracy is lower than in previous runs, the process is re-run with the stored model that has the highest accuracy (the logic behind this is beyond the scope of this article)
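
The sketch below illustrates what such a run_model_predictions.sh script can look like; the bucket names, paths and file names are illustrative assumptions:

#!/bin/bash
set -euo pipefail

# Copy the R code and the input data from S3 into the running container
aws s3 cp s3://<code-bucket>/predictive_analytics.R .
aws s3 cp s3://<data-bucket>/features/ ./data/ --recursive

# Train, validate and test the model, then generate predictions
Rscript predictive_analytics.R

# Persist the serialised RDS model and the prediction outputs back to S3,
# where Athena can pick the predictions up (see step 7)
aws s3 cp ./model.rds s3://<model-bucket>/models/
aws s3 cp ./prediction_output/ s3://some-s3-bucket/prediction_output/ --recursive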

Step 7: Create an Athena table based on S3 bucket contents

We store the prediction outputs in Parquet format. At this point, we only need to create a table in Athena through which the S3-stored data can be queried.

CREATE EXTERNAL TABLE IF NOT EXISTS medium_table.user_segmentation_features_and_target (
feature1 FLOAT,
feature2 FLOAT,
feature3 FLOAT,
feature4 FLOAT,
target1 FLOAT
)
STORED AS PARQUET
LOCATION 's3://some-s3-bucket/prediction_output'
tblproperties ("parquet.compression"="SNAPPY");
  • Additionally, we follow the same process to create tables that contain model prediction accuracy, along with the run timestamp.
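
As a quick sanity check, the new table can be queried from the AWS CLI as well as from the Athena console; the results location below is an illustrative assumption:

aws athena start-query-execution \
  --query-string "SELECT feature1, target1 FROM user_segmentation_features_and_target LIMIT 10" \
  --query-execution-context Database=medium_table \
  --result-configuration OutputLocation=s3://some-s3-bucket/athena-query-results/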

Step 8: Review your Airflow DAG

Before deployment, review your Airflow DAG. Our DAG consists of the PySpark-related tasks for data processing, one task to copy the R code from GitHub to S3 and one task to submit to AWS Batch the job that runs the Shell script described in steps 5 and 6.

The task definition looks as follows:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.contrib.operators.awsbatch_operator import AWSBatchOperator  # Airflow 1.10.x contrib path

# PIPELINE_NAME and task_slack_alert are defined elsewhere in our project

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(1815, 6, 18),
    'retries': 2,
    'retry_delay': timedelta(minutes=5),
    'on_failure_callback': task_slack_alert
}

dag = DAG(
    PIPELINE_NAME,
    schedule_interval='0 8 * * *',
    default_args=default_args,
    template_searchpath=["template-path/"],
    catchup=False,
    max_active_runs=1
)

command = {
    'command': [
        'run_model_predictions.sh',
    ]
}

aws_job_submission = AWSBatchOperator(
    job_name='airflow-job-submission-and-run-' + datetime.today().strftime('%Y-%m-%d'),
    job_definition='sport-segments',
    job_queue='first-run-job-queue',
    overrides=command,
    task_id='aws-batch-job-submission',
    dag=dag)

AWS Batch job submission requires a JSON-format command override, defined above the task as the command dictionary. To submit a job via Airflow, we used an AWSBatchOperator; documentation can be found here. It’s mandatory to define the job_name, job_definition and job_queue fields so that the AWSBatchOperator can successfully submit the job.

Note: The Airflow UI will update the task execution status as soon as the job has finished running, so there is no need for further development to check the AWS Batch job status.

Once your DAG and the job submission task are defined, you should be ready to trigger the DAG via Airflow.

At this point, the DAG will submit a job that runs the R machine learning code inside a container based on the Docker image stored on ECR, all done via AWS Batch.

Step 9: Integrate your code with a CI/CD tool.

Lastly, once we have a successfully running Airflow DAG, it’s time to deploy it to a live environment. Our CI/CD (continuous integration and continuous delivery/deployment) tool of choice is Jenkins, deployed on an EC2 instance.

Build automation is configured through the Jenkins UI, where we link a Jenkins build definition to the GitHub repository storing the project code (the R code and the Airflow pipeline definitions). Once this configuration is completed, Jenkins will build, test and deploy the data pipeline containing the R models.

While there are quite a few steps involved in getting to the first R model deployment, redeployment only requires pushing the RDS files to the master branch (the default branch for builds, as configured in Jenkins) of the project’s GitHub repository. As configured in Jenkins, a push to the master branch triggers a new build and redeployment of the project.

Step 10: Relax

Announce that your model predictions are ready for consumption by platforms in your organization.

Conclusions

This post described a way of deploying R-based machine learning models in a production cloud environment using Docker, Apache Airflow, AWS services and Jenkins. While it only covered a batch learning approach, we hope it gives the R community a useful method for leveraging data science work across organisational platforms. This analysis represents an exploration by our data engineers and is not necessarily representative of our large-scale practices.

If you enjoyed this post, feel free to clap & share! Any questions or suggestions are more than welcome in the comments section.

Interested in working with us? Click here and search ‘data engineer’ to find out more.


Teodor Popescu
BBC Data Science

Data Engineer at BBC. Teaching Data Engineering at University College London