Embulk data bus service

Clarence Chen
2 min read · Oct 13, 2021


Embulk is a bulk data loader. It helps transfer data between different types of databases, storage systems, file formats, cloud services, and more.

Ref: Embulk official

Embulk serves as a lightweight, pluggable data bus that helps with data transfer, migration, and transformation between different data sources.

Thanks to its pluggability, different extensions can be installed to adapt to different data sources, for example: JDBC, AWS S3, and GCP Cloud Storage (GCS).

Other extensions handle data processing; Embulk calls these filters. Examples: adding a time column, or expanding JSON into table columns.

You can find all available plugins at: https://plugins.embulk.org/

The typical usage of Embulk is:

  • Set up the data source and the data destination
  • Install the needed plugins with the embulk command
  • Set up a config.yml that defines the data source, the data destination, and the data filters
  • Run crontab or a cron command to invoke Embulk and start the data transfer

The complete guide can be found in the Official use case.
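The manual workflow above can be sketched as a handful of command lines. The helper names, the file paths, and the cron schedule below are placeholders chosen for illustration; `embulk gem install` and `embulk run` are the CLI subcommands as I understand them for Embulk 0.9-style releases, so check them against the version you run.

```typescript
// Hypothetical helpers that spell out the manual workflow as shell command lines.
// Assumption: the Embulk CLI offers `embulk gem install <gem>` and `embulk run <config>`.

/** One install command per plugin gem. */
function installCommands(pluginGems: string[]): string[] {
  return pluginGems.map((gem) => `embulk gem install ${gem}`);
}

/** The command that executes the transfer described by a config file. */
function runCommand(configPath: string): string {
  return `embulk run ${configPath}`;
}

/** A crontab entry that runs the transfer on the given cron schedule. */
function crontabEntry(schedule: string, configPath: string): string {
  return `${schedule} ${runCommand(configPath)}`;
}

const steps = [
  ...installCommands(["embulk-input-mongodb", "embulk-output-bigquery"]),
  runCommand("config.yml"),
];

// Placeholder schedule: every day at 02:00.
const nightly = crontabEntry("0 2 * * *", "/path/to/config.yml");
```

Every step here is done by hand once per pipeline, which is the part a service can automate.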

In this article, we extend it into a service that accepts a POST request containing information about the data source, the destination, and the related schemas. The service automatically sets up config.yml and executes Embulk at the desired time. In this example, we support data transfer from MongoDB to GCP BigQuery.

We build the service with Express.js framework.

Below is the flow of how the Embulk service works

Prerequisites

  • Make sure the Java 8 runtime is installed on your host
  • Make sure a Node.js environment is installed

Set up typescript

Set up dependencies

Build wrapper for Embulk in Node.js

Embulk plugin abstract class
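A minimal sketch of what such a base class could look like, assuming each plugin only needs to know which config.yml section it belongs to and which options it contributes. The names `EmbulkPlugin`, `PluginKind`, and `toConfig` are hypothetical and are not taken from the author's repository.

```typescript
// Hypothetical base class; all names here are illustrative.
type PluginKind = "in" | "filter" | "out";
type PluginConfig = { type: string; [option: string]: unknown };

abstract class EmbulkPlugin {
  readonly kind: PluginKind; // the config.yml section the plugin is written to
  readonly type: string;     // the plugin name Embulk resolves, e.g. "mongodb"

  constructor(kind: PluginKind, type: string) {
    this.kind = kind;
    this.type = type;
  }

  /** Plugin-specific options such as connection settings. */
  protected abstract options(): Record<string, unknown>;

  /** The mapping written under `in:`, under `out:`, or as one item of `filters:`. */
  toConfig(): PluginConfig {
    return { type: this.type, ...this.options() };
  }
}

// A trivial subclass for Embulk's built-in `stdout` output, which needs no options.
class StdoutOutput extends EmbulkPlugin {
  constructor() {
    super("out", "stdout");
  }

  protected options(): Record<string, unknown> {
    return {};
  }
}

const stdoutConfig = new StdoutOutput().toConfig();
```

Keeping `type` in the base class means a subclass only has to describe its own arguments.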

An example of plugin arguments (configuration) is the AWS S3 Input Plugin

MongoDB input plugin
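As a rough sketch, an input plugin class could translate typed arguments into the mapping that goes under `in:`. The class and interface names are hypothetical. The option names (`uri`, `collection`, `query`, `projection`) follow the embulk-input-mongodb README as I understand it, where `query` and `projection` are passed as JSON strings; verify them against the plugin version you install.

```typescript
// Hypothetical wrapper around the embulk-input-mongodb options.
interface MongoInputOptions {
  uri: string;                           // e.g. mongodb://host:27017/dbname
  collection: string;
  query?: Record<string, unknown>;       // MongoDB filter document
  projection?: Record<string, unknown>;  // fields to include or exclude
}

class MongoInputPlugin {
  readonly kind = "in";
  readonly type = "mongodb";
  private readonly opts: MongoInputOptions;

  constructor(opts: MongoInputOptions) {
    this.opts = opts;
  }

  /** The mapping written under `in:` in config.yml. */
  toConfig(): Record<string, unknown> {
    const { uri, collection, query, projection } = this.opts;
    const config: Record<string, unknown> = { type: this.type, uri, collection };
    // Assumption: the plugin expects these two options as JSON strings.
    if (query) config.query = JSON.stringify(query);
    if (projection) config.projection = JSON.stringify(projection);
    return config;
  }
}

// Placeholder connection details.
const mongoConfig = new MongoInputPlugin({
  uri: "mongodb://localhost:27017/shop",
  collection: "orders",
  query: { status: "paid" },
}).toConfig();
```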

BigQuery output plugin
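The output side can be sketched the same way. The class name and the camelCase argument names are hypothetical. The emitted keys (`mode`, `auth_method`, `json_keyfile`, `project`, `dataset`, `table`, `auto_create_table`) reflect my reading of the embulk-output-bigquery README, and the accepted `auth_method` values have changed between plugin versions, so treat them as assumptions to check.

```typescript
// Hypothetical wrapper around the embulk-output-bigquery options.
interface BigQueryOutputOptions {
  project: string;
  dataset: string;
  table: string;
  jsonKeyfile: string;          // path to a service-account key file on the host
  mode?: "append" | "replace";  // a subset of the modes the plugin offers
}

class BigQueryOutputPlugin {
  readonly kind = "out";
  readonly type = "bigquery";
  private readonly opts: BigQueryOutputOptions;

  constructor(opts: BigQueryOutputOptions) {
    this.opts = opts;
  }

  /** The mapping written under `out:` in config.yml. */
  toConfig(): Record<string, unknown> {
    const { project, dataset, table, jsonKeyfile, mode } = this.opts;
    return {
      type: this.type,
      mode: mode ?? "append",
      auth_method: "service_account", // assumption: value name used by recent plugin versions
      json_keyfile: jsonKeyfile,
      project,
      dataset,
      table,
      auto_create_table: true,        // a choice made for this sketch
    };
  }
}

// Placeholder project, dataset, table, and key path.
const bigQueryConfig = new BigQueryOutputPlugin({
  project: "my-project",
  dataset: "analytics",
  table: "orders",
  jsonKeyfile: "/path/to/key.json",
}).toConfig();
```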

A factory that creates plugins based on the client request payload.

Client request payload
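One possible shape for the payload, given that the request carries the source, the destination, the related schema, and a desired run time. Every field name here is hypothetical; only the column type names are Embulk's own.

```typescript
// Hypothetical payload shape for the POST request.
interface ColumnSchema {
  name: string;
  type: "boolean" | "long" | "double" | "string" | "timestamp" | "json"; // Embulk column types
}

interface TransferRequest {
  source: { type: "mongodb"; uri: string; collection: string };
  destination: { type: "bigquery"; project: string; dataset: string; table: string };
  schema: ColumnSchema[];
  schedule?: string; // cron expression; when omitted, run immediately
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

/** Returns a list of problems; an empty list means the body looks usable. */
function validateTransferRequest(body: unknown): string[] {
  if (!isRecord(body)) return ["body must be a JSON object"];
  const errors: string[] = [];
  const { source, destination, schema } = body;
  if (!isRecord(source) || source.type !== "mongodb") {
    errors.push("source.type must be 'mongodb'");
  }
  if (!isRecord(destination) || destination.type !== "bigquery") {
    errors.push("destination.type must be 'bigquery'");
  }
  if (!Array.isArray(schema) || schema.length === 0) {
    errors.push("schema must be a non-empty array");
  }
  return errors;
}

// Placeholder values throughout.
const examplePayload: TransferRequest = {
  source: { type: "mongodb", uri: "mongodb://localhost:27017/shop", collection: "orders" },
  destination: { type: "bigquery", project: "my-project", dataset: "analytics", table: "orders" },
  schema: [
    { name: "order_id", type: "string" },
    { name: "amount", type: "double" },
  ],
  schedule: "0 2 * * *",
};
```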

PluginFactory
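A minimal sketch of a registry-style factory, assuming the payload names each endpoint with a `type` string. To keep the example self-contained, the builders return plain config mappings instead of plugin class instances; the `register` and `create` method names are hypothetical.

```typescript
// Hypothetical factory keyed by config section and plugin type.
type PluginConfig = { type: string; [option: string]: unknown };
type Endpoint = { type: string; [option: string]: unknown };
type Builder = (endpoint: Endpoint) => PluginConfig;
type Section = "in" | "out";

class PluginFactory {
  private readonly builders = new Map<string, Builder>();

  /** Registers a builder for one (section, plugin type) pair. */
  register(section: Section, type: string, builder: Builder): this {
    this.builders.set(`${section}:${type}`, builder);
    return this;
  }

  /** Builds the plugin config for an endpoint, or throws if the type is unknown. */
  create(section: Section, endpoint: Endpoint): PluginConfig {
    const builder = this.builders.get(`${section}:${endpoint.type}`);
    if (!builder) {
      throw new Error(`Unsupported ${section} plugin type: ${endpoint.type}`);
    }
    return builder(endpoint);
  }
}

const factory = new PluginFactory()
  .register("in", "mongodb", (e) => ({
    type: "mongodb",
    uri: e.uri,
    collection: e.collection,
  }))
  .register("out", "bigquery", (e) => ({
    type: "bigquery",
    project: e.project,
    dataset: e.dataset,
    table: e.table,
  }));

const inputConfig = factory.create("in", {
  type: "mongodb",
  uri: "mongodb://localhost:27017/shop",
  collection: "orders",
});
```

Supporting another source or destination then only takes one more `register` call.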

Main class for embulk execution
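A sketch of the execution step under two assumptions: the config is rendered by a deliberately tiny YAML emitter (a real service would more likely use a YAML library), and file writing and process spawning are injected so the class can be exercised without Embulk installed. `EmbulkRunner`, `RunnerDeps`, and `toYaml` are hypothetical names.

```typescript
type Yaml = string | number | boolean | Yaml[] | { [key: string]: Yaml };

/** Minimal block-style YAML emitter; strings are double-quoted via JSON escaping. */
function toYaml(value: Yaml, indent = 0): string {
  const pad = "  ".repeat(indent);
  if (typeof value !== "object") {
    return typeof value === "string" ? JSON.stringify(value) : String(value);
  }
  if (Array.isArray(value)) {
    return value
      .map((item) => `${pad}- ${toYaml(item, indent + 1).trimStart()}`)
      .join("\n");
  }
  return Object.entries(value)
    .map(([key, v]) => {
      const nested = typeof v === "object" && Object.keys(v).length > 0;
      if (nested) return `${pad}${key}:\n${toYaml(v, indent + 1)}`;
      // Scalars stay on the key's line; empty containers use flow style ([] or {}).
      return `${pad}${key}: ${typeof v === "object" ? JSON.stringify(v) : toYaml(v)}`;
    })
    .join("\n");
}

type EmbulkJob = { in: Yaml; filters: Yaml[]; out: Yaml };

interface RunnerDeps {
  writeFile(path: string, content: string): Promise<void>;
  run(command: string, args: string[]): Promise<void>;
}

class EmbulkRunner {
  private readonly deps: RunnerDeps;

  constructor(deps: RunnerDeps) {
    this.deps = deps;
  }

  /** Writes config.yml for the job, then invokes `embulk run` on it. */
  async execute(job: EmbulkJob, configPath: string): Promise<void> {
    await this.deps.writeFile(configPath, toYaml(job) + "\n");
    await this.deps.run("embulk", ["run", configPath]);
  }
}

// Placeholder job: MongoDB in, Embulk's built-in stdout plugin out.
const exampleJob: EmbulkJob = {
  in: { type: "mongodb", uri: "mongodb://localhost:27017/shop", collection: "orders" },
  filters: [],
  out: { type: "stdout" },
};
const exampleYaml = toYaml(exampleJob);
```

In Node.js, `writeFile` could be backed by `fs.promises.writeFile` and `run` by a promisified `child_process.execFile`.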

Controller for embulk service
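A minimal sketch of a handler for the POST endpoint. To stay self-contained it declares small structural types that Express's request and response objects should satisfy, and it takes the execution step as an injected function. `makeTransferHandler` and the response bodies are hypothetical.

```typescript
// Structural stand-ins for the parts of Express's Request and Response used here.
interface Req {
  body: unknown;
}
interface Res {
  status(code: number): Res;
  json(payload: unknown): unknown;
}

type Execute = (payload: Record<string, unknown>) => Promise<void>;

/** Builds a handler that validates the body, runs the transfer, and reports the result. */
function makeTransferHandler(execute: Execute) {
  return async (req: Req, res: Res): Promise<void> => {
    const body = req.body;
    if (
      typeof body !== "object" ||
      body === null ||
      !("source" in body) ||
      !("destination" in body)
    ) {
      res.status(400).json({ error: "source and destination are required" });
      return;
    }
    try {
      await execute(body as Record<string, unknown>);
      res.status(200).json({ status: "completed" });
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      res.status(500).json({ error: message });
    }
  };
}
```

With Express, this could be mounted with something like `app.post("/transfers", makeTransferHandler(run))` after the `express.json()` middleware; the route path is a placeholder.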

The complete source code can be found on GitHub.
