Go: Fan-out Pattern in Our ETL

Vincent Blanchon
Jul 2, 2019 · 3 min read
Illustration created for “A Journey With Go”, made from the original Go Gopher, created by Renee French.

Go is very popular for building microservices, especially with gRPC, but it is also a powerful choice for command line applications. To study the fan-out pattern, I will use an ETL we built at my company as an example.

ETL

An ETL (Extract, Transform, Load) often has to deal with a huge amount of data. In that situation, a good concurrency strategy can make a big difference in performance.

The two heavy parts of an ETL are usually the extraction and the loading, since both are often tied to databases and the issues they bring: network latency, query performance, etc. Depending on the kind of data we deal with and where the bottleneck is, two patterns can help by letting us multiplex or demultiplex the work between the input side and the data processing side.

Fan-in, fan-out pattern

Fan-in and fan-out are two patterns that take advantage of concurrency. Here is a review of each of them:

  • fan-out, as defined by the Go blog:

Multiple functions can read from the same channel until that channel is closed

This pattern suits a fast input provider that feeds data processing distributed across several workers:

fan-out pattern with distributed work

  • fan-in, as defined by the Go blog:

A function can read from multiple inputs and proceed until all are closed by multiplexing the input channels onto a single channel that’s closed when all the inputs are closed.

This pattern suits many input providers that feed a single, fast data processing step:

fan-in pattern with multiple inputs
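To make the two definitions concrete, here is a minimal sketch that combines them, in the spirit of the Go blog's pipeline examples. The names `generate`, `square`, `merge` and `sumOfSquares` are illustrative and not part of our ETL; the sketch assumes at least one worker.

```go
package main

import (
	"fmt"
	"sync"
)

// generate sends the given numbers on a channel and closes it when done.
func generate(nums ...int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for _, n := range nums {
			out <- n
		}
	}()
	return out
}

// square reads from in until it is closed and emits each value squared.
// Starting several square goroutines on the same channel is the fan-out.
func square(in <-chan int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for n := range in {
			out <- n * n
		}
	}()
	return out
}

// merge is the fan-in: it multiplexes all inputs onto one channel,
// which is closed once every input has been closed.
func merge(inputs ...<-chan int) <-chan int {
	out := make(chan int)
	var wg sync.WaitGroup
	wg.Add(len(inputs))
	for _, in := range inputs {
		go func(c <-chan int) {
			defer wg.Done()
			for v := range c {
				out <- v
			}
		}(in)
	}
	go func() {
		wg.Wait()
		close(out)
	}()
	return out
}

// sumOfSquares fans the input out to the given number of workers,
// fans their results back in, and adds them up.
func sumOfSquares(workers int, nums ...int) int {
	in := generate(nums...)
	outs := make([]<-chan int, workers)
	for i := range outs {
		outs[i] = square(in) // every worker reads from the same channel
	}
	total := 0
	for v := range merge(outs...) {
		total += v
	}
	return total
}

func main() {
	fmt.Println(sumOfSquares(3, 1, 2, 3, 4)) // 1 + 4 + 9 + 16 = 30
}
```

The order in which the workers pick up values is not deterministic, which is why the example only relies on an order-independent result, the sum.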

In our case, we have a huge number of files to parse and load into a database. The input is generated quite quickly, while the loading takes more time, network latency included. The fan-out pattern looks like the right fit for this case.

Fan-out in action

In our project, we deal with a lot of data stored in CSV files that must be loaded into Elasticsearch. Parsing the input is fast, while the loading is slower. We therefore need more workers on the loading side than on the input side, which is what the fan-out pattern provides. Here is the diagram of our workflow:

Here is our algorithm:


data chan
// a goroutine parses the CSV and sends each record to the channel
ParseCSV(data<-)
// one goroutine is started per worker; the number of workers is a command line argument
For each worker in workers
    Start goroutine
        For each value in <-data
            Insert the values in the database by chunks of 50
Wait for the workers
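The pseudocode above could translate into Go roughly as follows. This is a sketch under assumptions, not our production code: `parseCSV`, `run`, `loadAndCount` and `sampleCSV` are illustrative names, and the `insert` callback is a hypothetical stand-in for the real bulk load into the database, which is not shown here.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"strings"
	"sync"
)

const chunkSize = 50 // chunk size from the algorithm above

// parseCSV reads records from r, sends each one on data, and closes
// the channel at the end so the workers know when to stop.
func parseCSV(r io.Reader, data chan<- []string) error {
	defer close(data)
	reader := csv.NewReader(r)
	for {
		record, err := reader.Read()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		data <- record
	}
}

// run starts the parser and the workers, then waits for the workers.
// insert is a hypothetical stand-in for the real bulk load.
func run(r io.Reader, workers int, insert func(chunk [][]string)) error {
	if workers < 1 {
		workers = 1 // without a reader the parser would block forever
	}
	data := make(chan []string)
	errc := make(chan error, 1)
	go func() { errc <- parseCSV(r, data) }()

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			chunk := make([][]string, 0, chunkSize)
			for record := range data {
				chunk = append(chunk, record)
				if len(chunk) == chunkSize {
					insert(chunk)
					chunk = make([][]string, 0, chunkSize)
				}
			}
			if len(chunk) > 0 {
				insert(chunk) // flush the last, partial chunk
			}
		}()
	}
	wg.Wait()
	return <-errc
}

// sampleCSV builds n identical two-column rows for the demo.
func sampleCSV(n int) string {
	return strings.Repeat("1,alice\n", n)
}

// loadAndCount runs the pipeline with an in-memory "database" that only
// counts rows, which is enough to check that nothing is lost.
func loadAndCount(input string, workers int) (int, error) {
	var mu sync.Mutex
	rows := 0
	err := run(strings.NewReader(input), workers, func(chunk [][]string) {
		mu.Lock()
		rows += len(chunk)
		mu.Unlock()
	})
	return rows, err
}

func main() {
	rows, err := loadAndCount(sampleCSV(120), 4)
	fmt.Println(rows, err) // 120 <nil>
}
```

Each worker keeps its own chunk, so no locking is needed around the batching itself; only the shared counter in the demo is protected by a mutex. Closing the channel in the parser is what lets every worker leave its `range` loop and flush its last chunk.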

The input and loading processing are executed concurrently: we do not need to wait for the parsing to finish before starting to load the data.

This pattern allows us to take advantage of concurrency while keeping our pieces of business logic separate. Because all the workers read from the same channel, the workload is naturally distributed among them, which helped us absorb a peak load during the process.
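The natural distribution of the workload can be observed with a small simulation. The sketch below is an illustration only: `distribute` and `demo` are made-up names, and `time.Sleep` stands in for the latency of a real insert. Each worker takes a new job only when it is done with the previous one, so slower workers simply end up taking fewer jobs.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// distribute sends total jobs on one channel and starts one worker per
// entry in delays; each worker sleeps for its delay to simulate a load
// of a different speed. It returns how many jobs each worker handled.
func distribute(total int, delays []time.Duration) []int {
	jobs := make(chan int)
	go func() {
		defer close(jobs)
		for i := 0; i < total; i++ {
			jobs <- i
		}
	}()

	handled := make([]int, len(delays))
	var wg sync.WaitGroup
	for w, d := range delays {
		wg.Add(1)
		go func(w int, d time.Duration) {
			defer wg.Done()
			for range jobs {
				time.Sleep(d) // simulated insert latency
				handled[w]++  // each worker only writes its own slot
			}
		}(w, d)
	}
	wg.Wait()
	return handled
}

// demo runs 60 jobs over three workers of different speeds.
func demo() []int {
	return distribute(60, []time.Duration{
		time.Millisecond,
		5 * time.Millisecond,
		20 * time.Millisecond,
	})
}

func main() {
	// The exact split varies from run to run; only the total is fixed.
	fmt.Println(demo())
}
```

No scheduler or explicit balancing logic is involved: the unbuffered channel hands each job to whichever worker is ready to receive it.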

A Journey With Go

A Journey With Go Language Programming

Thanks to Jim Geoshua Bactad
