Cloud-Native Advantages of Moving Your ETL Process to Cloud Run
Extract, Transform, and Load (ETL) describes the workflow of moving and processing data for later analysis. ETL processes help businesses make better decisions by bringing together information from across their systems for purposes such as auditing and monitoring, business analysis, or machine learning pipelines. These processes can range from simple scheduled scripts to services that continuously collect streaming data flowing through your system.
Google Cloud Platform (GCP) offers powerful data processing products like Cloud Dataflow and Cloud Composer (managed Airflow). However, you may not be able to justify migrating to these technologies or to dedicated VMs if your system is not overly complex or your data volume is not large enough. You can keep the complexity within your control (your SQL, shell, or Python scripts) by executing them in the simple, fully managed container environment of Cloud Run. This way you can still reap the benefits of cloud-native processes and prepare your system for the future.
Moving pieces of your ETL processes to Cloud Run allows you to gain the advantages of the cloud without the hassle of managing infrastructure.
- Flexible to ingest new data sources: Businesses store data in various systems and formats, and each type of data may need a different transformation process, such as batch processing or stream processing. The cloud allows you to 1) use consistent messaging or event formats and 2) quickly spin up new microservices for more granular transformations.
- Pay-per-use: Cloud Run only charges you while you are actually processing data, with billable time rounded up to the nearest 100ms.
- Scalable: Is your business growing? Are you looking to expand to new regions? Does your team want to analyze in-app events? The volume of your data is going to increase. You can build your system on infrastructure designed to scale.
- Fully managed: Using GCP’s managed services allows you to focus on your business logic and not your infrastructure.
- Secure: Create pipelines with built in security so you don’t have to worry about exposing your data or your processing services.
- Auditable: Consolidate your logs in order to quickly identify where and when your pipeline fails.
- Low latency: Process data in real time and let Google handle the networking and load balancing.
- Data transformation: Integrate with Cloud APIs or bring your own libraries.
- Fault tolerant: Cloud services with built-in retries automatically restart jobs that time out or fail, and make those failures visible so you can identify what needs fixing.
You can harness these advantages out of the box with Cloud Run. Individual services can be deployed and scaled independently of one another, which allows you to move your specific business logic to the cloud piece by piece. With Cloud Run you are ready to scale, but you won’t be paying for VMs waiting for data. When your ETL load grows, your service dynamically scales to match data volume, including down to zero when there is nothing to process. With Cloud Run there is no need to learn new technologies or languages. You can write each processing step in the language of your choice and bring your own runtime and libraries. All you have to do is focus on your data transformations.
Cloud Pub/Sub is a globally durable message-transport service with native connectivity to other Cloud Platform services, making it the glue between applications and downstream processing. Connect your services with Cloud Pub/Sub to standardize your event/data sources and provide fan-in, fan-out, and many-to-many communication with high throughput. Pub/Sub messages can be generated from applications hosted on Cloud Platform or on-premises. Delivery is fault tolerant: Pub/Sub retries delivery until the message is acknowledged or expires after the retention period (up to 7 days).
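When Pub/Sub pushes a message to a Cloud Run service, it arrives as an HTTP POST carrying a JSON envelope with a base64-encoded payload. Below is a minimal, pure-Python sketch of decoding that envelope; the handler name and the example payload are illustrative, not part of any Pub/Sub library.

```python
import base64
import json

def parse_pubsub_push(body: bytes) -> dict:
    """Decode the JSON envelope Pub/Sub POSTs to a push endpoint.

    Returns the decoded data payload plus message attributes.
    """
    envelope = json.loads(body)
    message = envelope["message"]
    data = base64.b64decode(message.get("data", "")).decode("utf-8")
    return {
        "data": data,
        "attributes": message.get("attributes", {}),
        "message_id": message.get("messageId"),
    }

# Example envelope, shaped like a Pub/Sub push delivery:
body = json.dumps({
    "message": {
        "data": base64.b64encode(b'{"user": "alice"}').decode(),
        "attributes": {"source": "checkout"},
        "messageId": "1234",
    },
    "subscription": "projects/my-project/subscriptions/my-sub",
}).encode()

print(parse_pubsub_push(body)["data"])  # → {"user": "alice"}
```

In a real service this function would be called from your web framework's request handler, and the endpoint would return a 2xx status to acknowledge the message.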
Cloud Tasks is another type of distributed messaging queue. It can be substituted for Pub/Sub when an ETL design needs features such as rate limiting, task deduplication, configurable retries, and scheduled tasks rather than fan-out and many-to-many communication. Subdividing your data into Cloud Tasks can improve the fault tolerance of this architecture: Cloud Tasks handles the retries if data processing fails and lets you identify which data is potentially missing. Using Cloud Tasks to trigger data processing also allows Cloud Run to scale to more instances or more concurrent requests so that data is processed in parallel.
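The subdividing step above can be sketched as a plain function that splits a work list into per-task payloads; each payload would become the body of one Cloud Tasks task, so only failed chunks are retried. `chunk_into_tasks` is an illustrative helper, not a Cloud Tasks API.

```python
import json

def chunk_into_tasks(object_names, chunk_size):
    """Split a list of objects into JSON task payloads, one per chunk.

    Each payload describes one chunk of work, so Cloud Tasks can
    dispatch the chunks to Cloud Run in parallel and retry only the
    chunks that fail.
    """
    for i in range(0, len(object_names), chunk_size):
        yield json.dumps({"objects": object_names[i:i + chunk_size]})

payloads = list(chunk_into_tasks([f"file-{n}.csv" for n in range(10)], 4))
print(len(payloads))  # 3 tasks: two chunks of 4 files and one of 2
```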
- Cloud Scheduler is a cloud native cron-as-a-service and an easy way to run your batch processes on a schedule.
- Cloud IAM allows for granular access control for your Cloud Run services with service accounts and policy bindings.
- Stackdriver Logging is automatically integrated with Cloud Run, consolidating your logs across GCP services.
This section describes streaming and batch style architectures on Cloud Run.
Continually pushing data to an endpoint is a common approach for ETL architectures that need to provide results with low latency. User-generated data such as clickstreams and transactions is commonly pushed to an endpoint or produces event notifications. Note, however, that streaming protocols are not yet supported on Cloud Run, so data must arrive as individual HTTP requests. If your data volume grows, Cloud Run can scale up to as many instances as needed. Each instance can handle up to 80 concurrent requests, allowing you to use compute resources more efficiently when the ETL logic involves I/O.
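That efficiency comes from overlapping I/O waits: while one request is blocked on a network call, the instance serves others. The same idea applies within a single request, as in this sketch, where `transform` stands in for an I/O-bound step such as calling a Cloud API or writing a row to a database.

```python
from concurrent.futures import ThreadPoolExecutor

def transform(record):
    # Stand-in for an I/O-bound step (API call, database write);
    # here it just normalizes the event name.
    return record.upper()

records = ["click", "purchase", "page_view"]

# Threads overlap the I/O waits, so the wall-clock time approaches
# that of the slowest single call rather than the sum of all calls.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(transform, records))

print(results)  # ['CLICK', 'PURCHASE', 'PAGE_VIEW']
```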
The Processing images from Cloud Storage tutorial demonstrates using Cloud Run to receive upload notifications from Cloud Storage in the form of Pub/Sub messages, analyze each image with the Vision API, and blur images that are offensive. The transformation logic and dependencies are packaged in a container and deployed to Cloud Run as a private service. Then, using Cloud IAM, a service account is created to authorize Pub/Sub requests to the Cloud Run service.
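The decision step in such a pipeline can be isolated as a small pure function. The Vision API's SafeSearch annotation reports likelihoods as enum values; the sketch below (not taken from the tutorial's code, and with an assumed threshold) shows one way to turn those likelihoods into a blur/no-blur decision.

```python
# Likelihood values in increasing order, as the Vision API reports them.
LIKELIHOOD = ["UNKNOWN", "VERY_UNLIKELY", "UNLIKELY",
              "POSSIBLE", "LIKELY", "VERY_LIKELY"]

def should_blur(safe_search, threshold="LIKELY"):
    """Decide whether to blur an image from its SafeSearch likelihoods.

    `safe_search` is a dict like {"adult": "...", "violence": "..."};
    the categories and threshold here are assumptions for illustration.
    """
    limit = LIKELIHOOD.index(threshold)
    return any(
        LIKELIHOOD.index(safe_search.get(category, "UNKNOWN")) >= limit
        for category in ("adult", "violence")
    )

print(should_blur({"adult": "VERY_LIKELY", "violence": "UNLIKELY"}))  # True
print(should_blur({"adult": "POSSIBLE", "violence": "UNLIKELY"}))     # False
```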
Other use cases for streaming inputs:
- Collecting data and logs from multiple services.
- Tracking website activity by streaming events into BigQuery.
Batch processes are common in scenarios with large amounts of historical data or with requirements to avoid duplicating or changing the data in unexpected ways. A batch architecture may be the only way to pull data from read-only sources or backups. Data sources to be processed can be located on-premises or on Cloud Platform, such as Cloud Storage, Cloud Pub/Sub, Cloud SQL, or Cloud Datastore.
Timeouts can be a limitation for cloud-native services. Cloud Run has a maximum request timeout of 15 minutes, with a default of 5 minutes. Therefore, you must be cognizant of the processing time of your transformations and add your own logic to set batch sizes appropriately or add checkpoints. A map-like distribution (as in MapReduce) across Cloud Run instances can be accomplished by using Cloud Tasks to represent separate chunks of data; the tasks are then dispatched to be processed in parallel. Transformations can also be chained together using Cloud Pub/Sub or Cloud Tasks, which additionally provide retry logic.
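One simple checkpointing pattern is to stop work before the request deadline and return the index to resume from, so the caller (for example, a Cloud Tasks handler) can re-enqueue a task for the remainder. This is an illustrative sketch, not a library API; the safety margin is an assumed tuning value.

```python
import time

REQUEST_DEADLINE_SECONDS = 15 * 60  # Cloud Run's maximum request timeout
SAFETY_MARGIN_SECONDS = 60          # assumed headroom to finish cleanly

def process_with_checkpoint(items, start_index, process_one,
                            deadline=REQUEST_DEADLINE_SECONDS):
    """Process items until the deadline nears, then return a checkpoint.

    Returns the index to resume from in the next request, or None
    when every item has been processed.
    """
    started = time.monotonic()
    for i in range(start_index, len(items)):
        if time.monotonic() - started > deadline - SAFETY_MARGIN_SECONDS:
            return i  # checkpoint: re-enqueue a task starting here
        process_one(items[i])
    return None  # all items processed

processed = []
checkpoint = process_with_checkpoint(list(range(5)), 0, processed.append)
print(checkpoint, processed)  # None [0, 1, 2, 3, 4]
```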
A nightly batch process could use Cloud Scheduler to trigger a Cloud Run service that pulls data, for example files from Cloud Storage, and creates separate tasks. Another Cloud Run service can then process the tasks' data and add the transformed data into a database. With Cloud IAM service accounts, an OIDC authorization token can be added to both Cloud Tasks and Cloud Scheduler requests in order to invoke these private services.
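The OIDC piece of that flow amounts to attaching a service account and audience to each task's HTTP request. The sketch below builds the task body in the shape Cloud Tasks expects for an HTTP target; the project, URL, and service account names are placeholders, and in a real pipeline this dict would be passed to the Cloud Tasks client's create-task call.

```python
import json

def build_oidc_task(service_url, service_account, payload):
    """Build a Cloud Tasks HTTP-target task body with an OIDC token.

    The names below are placeholders; Cloud Tasks mints the OIDC token
    at dispatch time using the given service account.
    """
    return {
        "http_request": {
            "http_method": "POST",
            "url": service_url,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps(payload).encode(),
            "oidc_token": {
                "service_account_email": service_account,
                # Cloud Run verifies the token's audience against the
                # invoked service; the URL is a common choice.
                "audience": service_url,
            },
        }
    }

task = build_oidc_task(
    "https://transform-abc123-uc.a.run.app/process",
    "etl-invoker@my-project.iam.gserviceaccount.com",
    {"objects": ["file-1.csv", "file-2.csv"]},
)
```

The same service account needs the Cloud Run Invoker role on the private service for the request to be accepted.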
Other batch input use cases:
- Migrating on-prem databases to the cloud.
- Scraping marketing sources or APIs to create dated snapshots.
- Running mapping transformations on big files.
- Integrating data from 3rd party applications.