Go is very popular for building microservices, especially with gRPC, but it is also extremely powerful for building command-line applications. To study the fan-out pattern, I will use the example of an ETL we built at my company based on this pattern.
The ETL (Extract, Transform, Load) often needs to deal with a huge amount of data. In this situation, having a good concurrency strategy for the ETL could make a big difference in terms of performance.
The two heavy parts of the ETL are the extracting and the loading, since both often involve databases and the issues they bring: network latency, query performance, etc. Depending on the kind of data we deal with and where the bottleneck is, two patterns can help: they let us multiplex or demultiplex our workers on the input processing or the data processing side.
Fan-in, fan-out pattern
Fan-in and fan-out are two patterns that take advantage of a concurrent environment. Here is a review of each of them:
- fan-out, as defined by the Go blog:
Multiple functions can read from the same channel until that channel is closed
This pattern takes advantage of a fast input provider feeding a distributed data process:
- fan-in, as defined by Google:
A function can read from multiple inputs and proceed until all are closed by multiplexing the input channels onto a single channel that’s closed when all the inputs are closed.
This pattern takes advantage of having many input providers feeding fast data processing:
In our case, we have a huge number of files to parse and load into a database. The input is generated quite quickly, while the loading takes more time, network latency included. The fan-out pattern looks like the perfect solution for this case.
Fan-out in action
In our project, we deal with a lot of data stored in CSV files that should be loaded into Elasticsearch. The input processing is quite fast, while the loading can get slow, so we need more loading workers than the single input parser: the fan-out pattern fits perfectly here. Here is the diagram of our workflow:
Here is our algorithm:
// a goroutine parses the CSV and sends each record to the channel
ParseCSV(data<-)
// a goroutine is started for each worker, the count being a command-line argument
For each worker in workers
    For each value in <-data
        Insert the value in the database in chunks of 50
Wait for the workers
Stop
The input parsing and the loading are executed concurrently; we do not need to wait for the parsing to finish before starting to process the data.
This pattern allows us to take advantage of concurrency while keeping our business logic separated. The natural distribution of the workload among the workers also helped us easily absorb peak loads during the process.