How to run Meltano in a container on Google Cloud Composer
Run Meltano pipelines with Airflow's KubernetesPodOperator
Meltano is an open source tool which can be used to extract data from sources and load it into destinations like your data warehouse. It uses extractors and loaders written in the Singer open source standard. Singer has been around for a long time, so there’s a huge list of available extractors written by numerous contributors. At ManyPets we began using Meltano when neither Fivetran, Stitch, nor any other ETL provider we could find supported our telephone system, called Purecloud. Luckily there’s a Singer extractor for it called tap-purecloud
which Meltano can run.
Cloud Composer is a managed service for running Airflow on GCP. Meltano can be run in a container and Airflow has operators for triggering containers so this is a good way to productionise your Meltano pipelines.
As someone new to running containers I struggled a lot to get my first pipeline working like this. Meltano is pretty new and doesn’t have that many how-tos out there so this post is for anyone else struggling! This post will cover:
- Getting your Meltano pipeline set up
- Running it in a container and how to handle variables and secrets
- Getting that container on GCP
- Running the container in an Airflow DAG with Cloud Composer
Setup
You need three basic things to run a pipeline with Meltano: a Meltano project, an extractor to pull data, and a loader to send it to your warehouse. This post won’t cover setting up a project or extractor as Meltano have a good getting started guide for that. I’ll be referring to the tap-purecloud
extractor as an example in this post but the same steps apply to any extractor.
For a loader, if you’re using Cloud Composer and GCP you’re also likely using BigQuery as your warehouse so this post will cover some setup for the target-bigquery
loader.
BigQuery loader and running a pipeline locally
When setting up the target-bigquery
loader I only set a few of the configs and will be using environment variables (env vars) to supply the rest at run time. More on that later.
Note: all terminal code shown here is run from the root of the Meltano project directory.
meltano add loader target-bigquery
meltano config target-bigquery set location EU
meltano config target-bigquery set add_metadata_columns true
target-bigquery
has 4 compulsory configs. One is location
which is set above as I know for my use case I won’t ever have to change this. There’s also project_id
and credentials_path
which we’ll set with env vars later as they will be variable. Finally there’s dataset_id
which, despite being compulsory, we actually won’t set as this will cause it to default to the namespace
config value of the extractor (so make sure you’ve set that if you haven’t already). This approach allows you to have data from different extractors uploaded to different BigQuery datasets (called schemas in other databases). If you do set the dataset_id
config then all extractors would upload to the same dataset. The extractor I’m using is tap-purecloud
and I’ve set its namespace
config to purecloud_landing_zone
.
With an extractor and the target-bigquery
loader set up you should be able to run your pipeline locally with a command like meltano elt tap-purecloud target-bigquery
.
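Put together, the relevant parts of meltano.yml end up looking something like the trimmed sketch below. This is illustrative, not a complete file; the keys shown match the config commands above.

```yaml
# Trimmed sketch of the relevant meltano.yml sections (illustrative)
plugins:
  extractors:
    - name: tap-purecloud
      namespace: purecloud_landing_zone   # becomes the BigQuery dataset
  loaders:
    - name: target-bigquery
      config:
        location: EU
        add_metadata_columns: true
        # project_id and credentials_path are supplied via env vars at run time
```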
Other odds and ends
To keep following this guide you’ll need to have Docker installed locally, so you can create containers, as well as the Google SDK, so you can use gcloud
commands. You can check if you have them already by running docker --version
and gcloud --version
and seeing if your terminal outputs a version or an error.
This guide was tested only on Composer version 1.x (note that Composer and Airflow have separate version numbers; this guide works with v1 or v2 of Airflow). If you’re using Composer v2.x then some modifications may be needed for the gcloud
commands and Airflow operator we’ll use. See the Composer documentation, which lets you switch the commands and examples between v1 and v2.
A value we’re going to use a lot is the name of the GCP project we’re working on. Let’s set it as an env var in our terminal so we don’t have to keep typing it.
export GCP_PROJECT=my-project-name
If you’re unfamiliar with env vars, they’re variables you can set in your terminal and reference with the $
sign, e.g. echo $GCP_PROJECT
will output our project name. We’ll add more useful env vars as we progress.
Containerising your Meltano project
If you’ve gotten a pipeline running locally then, before involving Airflow and GCP, the next thing to do is get it running locally in a Docker container. Meltano provide a command to create starter Dockerfile
and .dockerignore
files:
meltano add files docker
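For reference, the generated Dockerfile boils down to something like the following. This is a simplified sketch, not the exact file Meltano generates; check the one in your project.

```dockerfile
# Simplified sketch of what the generated Dockerfile does: start from
# the official Meltano image, copy the project in, install the plugins
# and make `meltano` the container's entrypoint.
FROM meltano/meltano:latest

WORKDIR /project
COPY . .

# Install the project's extractors and loaders into the image
RUN meltano install

ENTRYPOINT ["meltano"]
```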
Passing secrets to the container
When setting up a Meltano extractor or loader it’ll store any config values with a type of secret (a secret is any confidential value, eg a password) in a .env
file. This is handy and something I use when running outside a container, but we shouldn’t include the .env
file on the container. It’s considered bad practice to store secrets on a container image as you can’t be sure where an image may end up. The .env
file is included by default in the .dockerignore
file Meltano generates to help you avoid doing this.
Instead we’ll pass any secret configs to the container as env vars. For any secret files, like the service account keyfile target-bigquery
needs, we’ll “mount” these to the container when we run it. If you’re keeping your keyfile in your Meltano project directory make sure you’re excluding it in the .dockerignore
file.
Passing variables to the container
For config values that may change it’s also best to pass these as env vars to the container, rather than hardcode them in the meltano.yml
file. This allows you to define them in your Airflow DAG and then pass them to the task that calls the container. For example we do this for the project_id
variable for the target-bigquery
loader because we may run the pipeline on our development or production environments.
To see what Meltano expects the env var for a given config to be called run the below (replacing target-bigquery
with a different extractor or loader if you need).
meltano config target-bigquery list
From this we find the name of the env var for project_id
should be TARGET_BIGQUERY_PROJECT_ID
.
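The naming convention is predictable: join the plugin and setting names, uppercase everything, and turn dashes into underscores. A quick sketch of the convention (the convention itself is an observation from the docs, not an official API):

```shell
# Derive the env var name Meltano expects for a plugin setting.
# Convention: join plugin and setting with "_", uppercase it,
# and replace dashes with underscores.
plugin="target-bigquery"
setting="project_id"
env_var=$(echo "${plugin}_${setting}" | tr 'a-z-' 'A-Z_')
echo "$env_var"   # TARGET_BIGQUERY_PROJECT_ID
```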
Build the container
You shouldn’t need to edit the Dockerfile
, so you can now build a container image locally with a command like the one below. An image is a static, executable package containing everything needed to run the container. We used meltano_bbm
for the tag here but it can be anything.
docker build --tag meltano_bbm .
If you make any changes to your Dockerfile
or .dockerignore
then re-run this command to update the image.
Running the container locally
You can then run the container locally with a command like below. The env vars the container will have are defined with -e
and the mounting of the keyfile is done with --mount
. The reason for not passing the keyfile’s content as an env var is that the target-bigquery
loader can only use a file for credentials. I’ve used /var/keyfile.json
as where the keyfile will be stored on the container but you could put it anywhere.
The final line is the arguments that’ll be passed to the meltano
command when the container has started up. The reason they’re passed to the meltano
command is that it’s defined as the “entrypoint” for the container in the Dockerfile
.
docker run \
--mount type=bind,src=/absolute/path/to/service-account/keyfile.json,dst=/var/keyfile.json \
-e TAP_PURECLOUD_CLIENT_SECRET=$TAP_PURECLOUD_CLIENT_SECRET \
-e TAP_PURECLOUD_START_DATE=2021-12-16 \
-e TARGET_BIGQUERY_PROJECT_ID=$GCP_PROJECT \
-e GOOGLE_APPLICATION_CREDENTIALS=/var/keyfile.json \
meltano_bbm \
elt tap-purecloud target-bigquery --job_id=purecloud_to_bigquery
Note that the way the keyfile is mounted here is technically a little different to how KubernetesPodOperator
is going to do it in Airflow. Here a bind mount is used while KubernetesPodOperator
will use a volume mount. I don’t believe there’s any difference in effect for this use case. I used the bind mount here just because I couldn’t figure out how to get the volume mount working right.
Debug by launching a shell on the container
Initially I had a lot of issues understanding what env vars were on the container, so a top tip for debugging is to launch a shell on the container at start up instead of having it run Meltano.
You do this by overriding the defined entrypoint with the --entrypoint=bash
argument. Also using the -it
flag means an interactive container will launch instead of the shell just being started and immediately exiting. You can use these extra arguments while still setting all env vars as before. A difference to a normal run is that you won’t need to supply any arguments to Meltano to run a pipeline.
docker run \
--mount type=bind,src=/absolute/path/to/service-account/keyfile.json,dst=/var/keyfile.json \
-e TAP_PURECLOUD_CLIENT_SECRET=$TAP_PURECLOUD_CLIENT_SECRET \
-e TAP_PURECLOUD_START_DATE=2021-12-16 \
-e TARGET_BIGQUERY_PROJECT_ID=$GCP_PROJECT \
-e GOOGLE_APPLICATION_CREDENTIALS=/var/keyfile.json \
--entrypoint=bash \
-it \
meltano_bbm
Putting the container image and secrets on GCP
Now that we have a working container we need to make it and the secrets it needs accessible in our GCP project.
Artifact Registry
Artifact Registry is an expanded, newer version of the much better named Container Registry. It does everything Container Registry did and more. We’ll store our container image in it so that other GCP tools can download the image from it.
First create a repository for the image. In this case repository means a directory of container images. Here meltano-repo
will be the name of the repo. Change it or the location if you need.
gcloud artifacts repositories create meltano-repo --repository-format=docker --location=europe-west2
Next we need to authorise our locally running docker to be able to push to our new repo. If you used a different location to europe-west2
above you need to change that here too.
gcloud auth configure-docker europe-west2-docker.pkg.dev
We then need to tag our image so it gets pushed to the right repo. You need to do this every time you rebuild the image.
docker tag meltano_bbm europe-west2-docker.pkg.dev/$GCP_PROJECT/meltano-repo/meltano_bbm
Finally actually push the image to our Artifact Registry repo.
docker push europe-west2-docker.pkg.dev/$GCP_PROJECT/meltano-repo/meltano_bbm
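Since you need to rebuild, retag and push after every change, it can help to wrap the three steps in one small script. A sketch is below; the region, repo and tag names match the examples above, and the `push_meltano_image` function name is mine. Source the script and call `push_meltano_image` to run it.

```shell
#!/usr/bin/env bash
# Rebuild the Meltano image and push it to Artifact Registry in one go.
# Region, repo and tag names match the examples in this guide.
set -eu
GCP_PROJECT="${GCP_PROJECT:-my-project-name}"
REGION="europe-west2"
IMAGE="$REGION-docker.pkg.dev/$GCP_PROJECT/meltano-repo/meltano_bbm"

push_meltano_image() {
  docker build --tag meltano_bbm .
  docker tag meltano_bbm "$IMAGE"
  docker push "$IMAGE"
}
```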
Kubernetes setup
Kubernetes is a tool for managing running containers. GCP has a managed service for this called Google Kubernetes Engine (GKE). If you’re using Cloud Composer you won’t need to set up a cluster from scratch as Composer itself runs on GKE so one will already exist.
What we will do is create a new node pool within that cluster for our Meltano container to run on. When the Airflow DAG triggers the container, the DAG’s task takes up one pod and the Meltano container runs in a second. As this takes up at least two pods it’s recommended to create a new node pool within the Composer GKE cluster for the Meltano container, to avoid it competing for resources with Composer itself.
To create the node pool we first need the name and zone of our Composer Kubernetes cluster. We can find these by going to Kubernetes Engine in the Google Cloud UI and clicking into the cluster for Composer. Set these values as env vars in your terminal as we’ll have to reuse them.
export COMPOSER_GKE_NAME=europe-west2-data-warehouse-123abc-gke
export COMPOSER_GKE_ZONE=europe-west2-c
Then we can run a command like below to create a pool called meltano-pool
. We enable autoscaling to avoid this new node pool costing us money for when it’s not in use. For machine-type
you can use what you want. I chose e2-standard-2
as it was one of the cheaper options and my pipeline doesn’t need to import much data.
gcloud container node-pools create meltano-pool \
--project=$GCP_PROJECT \
--cluster=$COMPOSER_GKE_NAME \
--zone=$COMPOSER_GKE_ZONE \
--machine-type=e2-standard-2 \
--enable-autoupgrade \
--num-nodes=1 \
--enable-autoscaling \
--min-nodes=0 \
--max-nodes=3 \
--disk-size=20
The last thing to do here is put our secrets on the Composer Kubernetes cluster so we can easily pass them to the container when we trigger it in Airflow.
Get credentials which, in future commands, kubectl
will automatically use to connect to the Composer cluster.
gcloud container clusters get-credentials $COMPOSER_GKE_NAME \
--project=$GCP_PROJECT \
--zone=$COMPOSER_GKE_ZONE
When setting a secret I first delete any existing one of the same name to allow it to be updated.
export PURECLOUD_SECRET_NAME=purecloud-api-secret
kubectl delete secret $PURECLOUD_SECRET_NAME --ignore-not-found
Then actually set the secret. As I set the Purecloud client secret in the Meltano extractor’s config when setting it up, the value for it is saved in my .env
file under TAP_PURECLOUD_CLIENT_SECRET
. Due to the formatting of the .env
file running source .env
makes any values defined in it accessible in my current terminal session.
source .env
kubectl create secret generic $PURECLOUD_SECRET_NAME \
--from-literal purecloud_secret=$TAP_PURECLOUD_CLIENT_SECRET
For use with the target-bigquery
loader we need to make the value of a service account keyfile a secret in Kubernetes. The service account needs permission to create datasets and modify tables, so I gave it the BigQuery User role in IAM. You could be more restrictive than this though.
The commands to make a secret from a file are nearly the same as creating from a passed value except instead of kubectl create … --from-literal …
we use kubectl create … --from-file …
.
export SA_SECRET_NAME=meltano-service-account
kubectl delete secret $SA_SECRET_NAME --ignore-not-found
kubectl create secret generic $SA_SECRET_NAME \
--from-file meltano_serv_acc_keyfile.json=./path/to/serv_acc.json
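It’s worth sanity-checking that both secrets landed on the cluster. A small helper like the one below works if kubectl is still authenticated against the Composer cluster; the function name is mine, and the secret names match those set above. Call `check_meltano_secrets` to run it.

```shell
# Sanity check that both secrets exist on the cluster and that a value
# round-trips correctly. Assumes kubectl is authenticated against the
# Composer cluster (via gcloud container clusters get-credentials).
check_meltano_secrets() {
  kubectl get secret purecloud-api-secret meltano-service-account
  # Secret values are stored base64-encoded; decode one to inspect it.
  kubectl get secret purecloud-api-secret \
    -o jsonpath='{.data.purecloud_secret}' | base64 --decode
}
```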
Running on Composer
We’re nearly there! Now we just need to actually trigger the running of our container, which we’ll do with the KubernetesPodOperator
. Below is an example of the DAG we use to run our Purecloud to BigQuery pipeline. It shows how we pass normal env vars as well as a secret one. It also shows how the secret for the service account keyfile is handled differently, being mounted as a volume instead of passed as an env var.
A nice feature of KubernetesPodOperator
is that it doesn’t just start off the container and then finish the task. It waits for the container to finish and prints out any logs coming from it, which makes running a container just like running any other Airflow DAG step. It was for these reasons we chose this approach over using Cloud Run (GCP’s on-demand container service).
We’re growing fast here at ManyPets 🚀. This means there are a lot more data challenges to solve and we’d love your help solving them 😄. See our careers page to come join us!