Running a time-consuming R task on Google Cloud Run
Every data scientist has been in this situation: you have written some awesome R script that does some magic. It works well in RStudio, but you want to run it automatically every day.
In my case, this R script calculates data-driven attribution. It downloads data from Adobe Analytics and calculates many versions of a Markov model: for eight different customer segments, for many time frames, etc. The result is stored in BigQuery and visualized in Data Studio. It works well, but the whole processing takes about 40 minutes.
If you want to run this script every day, you have two options on the Google Cloud Platform:
- You can use some server and run Rscript using cron. This approach has a drawback: you have to maintain the server (security updates, etc.), and you pay for it all the time, even though you only need it for 40 minutes every morning. Such a waste of resources!
- You can use Google Cloud Run. It means you create a Docker container triggered by an HTTP request. Google Cloud Run fully manages this container and scales the number of instances from zero up to some defined maximum.
- (I know, there are some other options. You can refactor your code, split the task into smaller subtasks and use a task queue for execution. But I will ignore this for the moment.)
Cloud Run seems to be perfect: you pay only for the time you need, and it is 100% serverless. Nice. There are some traps, as Google Cloud Run is designed for web applications and not for longer tasks. But it can be configured, and you can run tasks up to 3600 seconds.
The whole ETL looks like this: Cloud Scheduler triggers the Cloud Run container every morning, the container runs the R script, and the results end up in BigQuery and Data Studio. Let's look at the whole process step by step.
Step 0: Check R code on Linux
I suppose you have already written some R script and tested it in RStudio. I call it calculateAttribution.R in this article. My recommendation: test your script on Linux. If you use Windows, use WSL. There may be small differences between Windows and Linux behavior.
Try:
Rscript calculateAttribution.R
In my case, there was a problem with Sys.timezone(). It was necessary to set the timezone via an environment variable like this:
TZ="Europe/Prague" RScript calculateAttribution.R
Step 1: Python wrapper application
Google Cloud Run expects the container to respond to HTTP requests on port 8080 (or whatever the PORT environment variable says), so we need to write something to do that. The fastest way is a very simple Python Flask app that runs my R script. It looks like this:
import os
import subprocess
from flask import Flask

app = Flask(__name__)

@app.route("/recalculate", methods=["GET"])
def recalculate():
    # shell=True because the command string sets TZ inline
    o = subprocess.run(
        'TZ="Europe/Prague" Rscript calculateAttribution.R',
        shell=True,
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    return {"results": o.stdout}

if __name__ == "__main__":
    app.run(
        debug=True,
        host="0.0.0.0",
        port=int(os.environ.get("PORT", 8080)))
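You can test the wrapper locally before building any image. Assuming app.py and calculateAttribution.R sit in the same directory and Flask is installed, something like this should return the R script's output as JSON:
python3 app.py &
curl http://localhost:8080/recalculate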
Step 2: Docker file
You have to create a Dockerfile. It should define the following steps:
- install Python and R
- install R packages (quite time-consuming)
- copy all necessary files
- run the Gunicorn web server on port 8080
# Use the Google Cloud SDK image.
FROM google/cloud-sdk

# Install Python 3 and pip3
RUN apt-get update && apt-get install -y python3-pip python3

# Install production dependencies.
RUN pip3 install Flask gunicorn

# Install R
RUN apt-get install -y dirmngr apt-transport-https ca-certificates software-properties-common gnupg2
RUN apt-get install -y libcurl4-openssl-dev libssl-dev libxml2-dev
RUN apt-get install -y r-base r-base-dev

# Install R packages
RUN R -e "install.packages('readr', repos='http://cran.rstudio.com/')"
RUN R -e "install.packages('dplyr', repos='http://cran.rstudio.com/')"
RUN R -e "install.packages('bigrquery', repos='http://cran.rstudio.com/')"
RUN R -e "install.packages('tidyr', repos='http://cran.rstudio.com/')"
RUN R -e "install.packages('reshape2', repos='http://cran.rstudio.com/')"
RUN R -e "install.packages('ChannelAttribution', repos='http://cran.rstudio.com/')"
RUN R -e "install.packages('lubridate', repos='http://cran.rstudio.com/')"

# Copy local code to the container image.
ENV APP_HOME /app
WORKDIR $APP_HOME
# main script
COPY ./calculateAttribution.R ./
# python wrapper
COPY ./app.py ./
# service account key used in the R script
COPY ./auth.json ./

# Run the web service on container startup
CMD exec gunicorn --timeout 3600 --bind :$PORT --workers 1 --threads 1 app:app
Please note the timeout parameter on the last line. It prevents the Gunicorn server from timing out before the R script finishes its job.
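If you have Docker installed locally, you can also sanity-check the container itself before sending anything to Google. The image tag and port mapping below are just examples:
docker build -t attribution-test .
docker run -p 8080:8080 -e PORT=8080 attribution-test
Then hit http://localhost:8080/recalculate from another terminal, exactly like in the local test above.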
Step 3: Build Docker image
Building a Docker image can be a time-consuming job, especially when you install R packages. So don't forget to increase the maximum build time:
gcloud builds submit \
--tag eu.gcr.io/YOURPROJECT/attribution \
--disk-size=100GB \
--timeout=1h
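When the build finishes, you can check that the image really landed in the Container Registry, for example with:
gcloud container images list --repository=eu.gcr.io/YOURPROJECT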
Step 4: Deploy Docker image to Cloud Run
When deploying, you have to specify machine parameters and the maximum request timeout. Normally, the maximum request timeout is 15 minutes, but you can specify up to one hour using gcloud beta.
gcloud beta run deploy attribution \
--image eu.gcr.io/YOURPROJECT/attribution \
--memory=4Gi \
--timeout=3600s \
--max-instances=1 \
--cpu=4 \
--platform managed \
--no-allow-unauthenticated \
--region europe-west1
I highly recommend setting max-instances to one. You probably don't want to run two separate instances of this R script. At best it would be useless, and at worst it could cause problems when two instances write to the same tables at once.
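After the deployment, gcloud prints the service URL. If you need it later (for curl or for the scheduler), you can read it back, for example like this:
gcloud run services describe attribution \
--platform managed \
--region europe-west1 \
--format='value(status.url)'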
Step 5: Test it
Use curl. You have to send an authorization header, and you have to specify a maximum timeout. Again ;-)
curl --max-time 3600 \
-H "Authorization: Bearer $(gcloud auth print-identity-token)" \
https://attribution-XXXXXXX-ew.a.run.app/recalculate
You should see something like this in the Google Cloud Logging console: a long-running HTTP request ending with status 200. Yes!
And after 15 minutes of inactivity, the Google Cloud Run instance shuts down and waits for the next request.
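If you prefer the terminal over the console, roughly the same information can be pulled with a log filter on the Cloud Run revision resource, something along these lines:
gcloud logging read \
'resource.type="cloud_run_revision" AND resource.labels.service_name="attribution"' \
--limit=20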
Step 6: Schedule it
Finally, you can use Google Cloud Scheduler to run this task every morning.
gcloud scheduler jobs create http attribution-job \
--schedule "0 2 * * *" \
--uri "https://attribution-XXXX.a.run.app" \
--http-method GET \
--attempt-deadline=3600s \
--time-zone="Europe/Prague" \
--oidc-service-account-email=OIDC_SERVICE_ACCOUNT_EMAIL
Please note two parameters:
- attempt-deadline, which has to be long enough for long tasks
- You have to specify oidc-service-account-email to authorize this request. This service account should have the roles/run.invoker IAM role (a sketch of how to grant it follows below).
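A minimal sketch of granting that role to the scheduler's service account (the service account e-mail and service name are placeholders here):
gcloud run services add-iam-policy-binding attribution \
--member="serviceAccount:OIDC_SERVICE_ACCOUNT_EMAIL" \
--role="roles/run.invoker" \
--platform managed \
--region europe-west1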
Summary
That's all. You have to be careful and set long enough timeouts at every level: Gunicorn, Cloud Run, curl, and Cloud Scheduler. Everything then works fine, even for very long jobs.