Photo by Veri Ivanova on Unsplash

Often the first version of our Apache Beam pipelines does not perform as well as we would like, and it is not always obvious where we could optimize: sometimes the culprit is a function parsing JSON, sometimes the bottleneck is an external source or sink, and sometimes we have a very hot key in a group-by-key. On Google Cloud Platform, these are the kinds of situations that we can easily detect by combining Dataflow with Cloud Profiler (formerly known as Stackdriver Profiler) for pipelines written in Java. …
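To make the setup concrete, here is a minimal sketch, assuming a recent Beam SDK on the Dataflow runner; the project, region and class name are placeholders rather than anything taken from the post:

```java
// Minimal sketch, assuming a recent Beam SDK and the Dataflow runner.
// The profiler is enabled through a Dataflow service option passed on the
// command line; project and region values below are placeholders.
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class ProfiledPipeline {
  public static void main(String[] args) {
    // Typical launch arguments:
    //   --runner=DataflowRunner --project=my-project --region=us-central1 \
    //   --dataflowServiceOptions=enable_google_cloud_profiler
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation()
            .as(DataflowPipelineOptions.class);

    Pipeline pipeline = Pipeline.create(options);
    // ... build the transforms to profile here ...
    pipeline.run();
  }
}
```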


Photo by SpaceX on Unsplash

Among the many methods you can use to trigger Dataflow jobs, Terraform is a popular one. Terraform uses the Dataflow API to trigger jobs, which means that it can only launch templates. But thanks to Dataflow Flex Templates, this does not actually limit the kind of job you can launch: if your code is not a template, just wrap it in a launch container and use Flex Templates.

However, the picture is not entirely rosy. If you compare the options to deploy a Dataflow pipeline with what the Terraform resources offer (for regular…


Photo by Raul Cacho Oses on Unsplash

The Google Cloud operators for Apache Airflow offer a convenient way to connect to services such as BigQuery, Dataflow, and Dataproc from your DAG. But if you are not on the latest Airflow version, you might be stuck with a set of operators that doesn't include the latest available functionality.

With the fast pace of evolution of cloud services, this can become an issue that prevents you from leveraging the latest Google Cloud features in your data pipelines. …


A DAG that runs a “goodbye” task only after two upstream DAGs have successfully finished. This post explains how to create such a DAG in Apache Airflow.

In Apache Airflow we can have very complex DAGs with several tasks, and dependencies between the tasks.

But what if we have cross-DAG dependencies, and we want to make a DAG of DAGs? Normally, we would try to put all tasks that depend on each other in the same DAG. But sometimes you cannot modify the DAGs, and you may still want to add dependencies between them.

For that, we can use the ExternalTaskSensor.

This sensor will look up past executions of DAGs and tasks, and will match those DAGs that share the same execution_date as our DAG. However, the name execution_date…


Photo by Ben Hershey on Unsplash

When reading data from BigQuery using BigQueryIO in an Apache Beam pipeline written in Java, you can read the records as TableRow (convenient, but with lower performance) or as GenericRecord (better performance, but some extra work may be required in your code).

In a previous post, I showed that when using GenericRecord, the BigQuery types are more or less easily mapped to Java types — with the exception of NUMERIC, which requires some additional work.

But that's actually only for BigQuery columns that are required.
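To illustrate the two read modes, here is a rough sketch in which the table and the name column are placeholder names, not taken from the post:

```java
// Minimal sketch, assuming beam-sdks-java-io-google-cloud-platform is on the
// classpath. "my-project:my_dataset.my_table" and the "name" column are
// placeholder names.
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.SchemaAndRecord;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class ReadModes {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Option 1: convenient TableRow objects.
    PCollection<TableRow> rows =
        pipeline.apply("ReadAsTableRow",
            BigQueryIO.readTableRows().from("my-project:my_dataset.my_table"));

    // Option 2: Avro GenericRecord, converted to the type we need as we read.
    PCollection<String> names =
        pipeline.apply("ReadAsGenericRecord",
            BigQueryIO.read(
                    (SchemaAndRecord elem) -> elem.getRecord().get("name").toString())
                .from("my-project:my_dataset.my_table")
                .withCoder(StringUtf8Coder.of()));

    pipeline.run();
  }
}
```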


Reading NUMERIC fields from BigQuery using Apache Beam's BigQueryIO

Among all the number types in BigQuery, NUMERIC offers the highest precision (38 digits) and scale (9 digits), which makes it the preferred option for applications that require very high precision. Handling NUMERIC values in BigQuery is straightforward, but not so much in an Apache Beam pipeline that reads data from BigQuery. In this post, we will see how we can read this NUMERIC type from BigQuery with BigQueryIO, so we can use decimal values with the highest possible precision in our pipeline.

When reading data from BigQuery using Apache Beam and BigQueryIO, there are two main ways to read the…
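A hedged sketch of the idea, assuming that BigQuery exports the NUMERIC column as decimal bytes with a fixed scale of 9 (the table and the amount column are placeholders): the GenericRecord field can be converted into a java.math.BigDecimal while reading.

```java
// Minimal sketch, assuming BigQuery exports the NUMERIC column as an Avro
// decimal (bytes holding the unscaled value, with a fixed scale of 9).
// Table and column names are placeholders.
import java.math.BigDecimal;
import java.math.BigInteger;
import java.nio.ByteBuffer;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.BigDecimalCoder;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.SchemaAndRecord;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class ReadNumeric {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    PCollection<BigDecimal> amounts =
        pipeline.apply("ReadNumericColumn",
            BigQueryIO.read(
                    (SchemaAndRecord elem) -> {
                      // The NUMERIC value arrives as the unscaled integer bytes.
                      ByteBuffer buffer = (ByteBuffer) elem.getRecord().get("amount");
                      byte[] bytes = new byte[buffer.remaining()];
                      buffer.get(bytes);
                      // Re-attach the fixed scale of 9 to recover the decimal value.
                      return new BigDecimal(new BigInteger(bytes), 9);
                    })
                .from("my-project:my_dataset.my_table")
                .withCoder(BigDecimalCoder.of()));

    pipeline.run();
  }
}
```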


Photo by Christian Englmeier on Unsplash

Cloud Bigtable is a high-performance distributed NoSQL database that can store petabytes of data and respond to queries with latencies below 10 ms. However, to achieve that level of performance, it is important to choose the right key for your table. The kind of queries that you will be able to make also depends on the key that you choose.

Bigtable comes with the Key Visualizer tool to diagnose how our key is performing. Key Visualizer needs at least 30 GB of data, and some workload, before it starts producing results. …
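A tiny, hypothetical illustration of the trade-off (the field names below are made up, not taken from the post): a composite row key can keep related rows contiguous while avoiding hotspots from monotonically increasing timestamps.

```java
// Minimal sketch of a composite row key; deviceId and eventTimeMillis are
// hypothetical fields used only for illustration.
public class RowKeys {

  /** Groups rows by device and orders each device's events newest-first. */
  static String eventRowKey(String deviceId, long eventTimeMillis) {
    // Leading with the device id keeps all events of a device in one contiguous
    // range, so prefix scans like "deviceId#" are efficient. Reversing the
    // timestamp avoids piling sequential writes onto the end of the key space,
    // and zero-padding makes lexicographic order match numeric order.
    return deviceId + "#" + String.format("%019d", Long.MAX_VALUE - eventTimeMillis);
  }

  public static void main(String[] args) {
    System.out.println(eventRowKey("sensor-42", System.currentTimeMillis()));
  }
}
```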


Using Beam + Bigtable locally, with no cloud resources

Cloud Bigtable is a great high-performance distributed NoSQL database that can store petabytes of data, but sometimes you might need to test it with just a tiny amount of data.

Spawning a remote instance just for that is cumbersome: you would be burning money and cloud resources just for a small test.

Fortunately, Bigtable comes with an emulator that is very handy for simple tests with tiny amounts of data, because you can run it on your local computer.

The emulator works in memory, exposing the same interface as Bigtable, so any system that is able to connect…
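For example, a minimal sketch of a Java client talking to the emulator could look like the following; 8086 is the emulator's usual default port, and all the ids are arbitrary, since the emulator does not validate them against any real cloud resource.

```java
// Minimal sketch, assuming the emulator is listening on localhost:8086 (its
// usual default, e.g. after `gcloud beta emulators bigtable start`) and the
// google-cloud-bigtable client library is available. Project, instance, table
// and column family ids are arbitrary.
import com.google.cloud.bigtable.admin.v2.BigtableTableAdminClient;
import com.google.cloud.bigtable.admin.v2.BigtableTableAdminSettings;
import com.google.cloud.bigtable.admin.v2.models.CreateTableRequest;
import com.google.cloud.bigtable.data.v2.BigtableDataClient;
import com.google.cloud.bigtable.data.v2.BigtableDataSettings;
import com.google.cloud.bigtable.data.v2.models.Row;
import com.google.cloud.bigtable.data.v2.models.RowMutation;

public class EmulatorExample {
  public static void main(String[] args) throws Exception {
    BigtableTableAdminSettings adminSettings =
        BigtableTableAdminSettings.newBuilderForEmulator(8086)
            .setProjectId("fake-project")
            .setInstanceId("fake-instance")
            .build();
    BigtableDataSettings dataSettings =
        BigtableDataSettings.newBuilderForEmulator(8086)
            .setProjectId("fake-project")
            .setInstanceId("fake-instance")
            .build();

    try (BigtableTableAdminClient admin = BigtableTableAdminClient.create(adminSettings);
        BigtableDataClient data = BigtableDataClient.create(dataSettings)) {
      // The emulator starts empty, so create the table and a column family first.
      admin.createTable(CreateTableRequest.of("test-table").addFamily("cf"));

      data.mutateRow(
          RowMutation.create("test-table", "row-1").setCell("cf", "greeting", "hello"));

      Row row = data.readRow("test-table", "row-1");
      System.out.println(row.getCells().get(0).getValue().toStringUtf8());
    }
  }
}
```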


The pipelines that we will be launching in this blog post

Google Cloud Platform (GCP) offers many options to run streaming analytics pipelines. The cornerstone message queue system in GCP is Cloud Pub/Sub. However, Kafka is probably the most popular alternative to Pub/Sub, and there is not much information out there about how to integrate it with other GCP products.

For stream processing, Dataflow is one of the most advanced systems out there. To run a pipeline on Dataflow, we have to use Apache Beam, an open source SDK whose pipelines can run not only on Dataflow but also on many other systems, such as Apache Flink and Apache Spark…
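A minimal sketch of the Beam side, with a placeholder broker address and topic, assuming the beam-sdks-java-io-kafka module is on the classpath:

```java
// Minimal sketch, assuming the beam-sdks-java-io-kafka module is available.
// The broker address and topic name are placeholders.
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Values;
import org.apache.beam.sdk.values.PCollection;
import org.apache.kafka.common.serialization.StringDeserializer;

public class KafkaToBeam {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    PCollection<String> messages =
        pipeline
            .apply("ReadFromKafka",
                KafkaIO.<String, String>read()
                    .withBootstrapServers("my-kafka-broker:9092")
                    .withTopic("my-topic")
                    .withKeyDeserializer(StringDeserializer.class)
                    .withValueDeserializer(StringDeserializer.class)
                    .withoutMetadata())  // drop offsets and partitions, keep KV pairs
            .apply("KeepValues", Values.create());

    // ... apply windowing and the rest of the streaming logic here ...
    pipeline.run();
  }
}
```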


Let's play with Kafka!

In many situations, we may want to have a Kafka server on Google Cloud Platform (GCP) to do some testing, for instance to test how a streaming Dataflow pipeline would work with Kafka.

There are several options in the GCP Marketplace, some of them offering full-featured deployments of Confluent's Kafka on a cluster. However, it is likely that we don't want to set up a scalable Kafka cluster and pay license fees just for some small testing. At the same time, deploying Kafka by hand on a VM can also be cumbersome. …
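And as a small, hypothetical smoke test, assuming the kafka-clients library and a broker reachable on port 9092 (for example through an SSH tunnel to the VM), a plain Java producer can push a few messages to verify the installation:

```java
// Minimal sketch, assuming the kafka-clients library and a broker reachable at
// the placeholder address below (for example via an SSH tunnel to the VM).
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TestProducer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "my-kafka-vm:9092");  // placeholder address
    props.put("key.serializer", StringSerializer.class.getName());
    props.put("value.serializer", StringSerializer.class.getName());

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      for (int i = 0; i < 10; i++) {
        // Send a handful of test messages to a placeholder topic.
        producer.send(new ProducerRecord<>("my-topic", "key-" + i, "message " + i));
      }
      producer.flush();
    }
  }
}
```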

Israel Herraiz

Strategic Cloud Engineer @ Google
