Embulk data bus service
Embulk is a bulk data loader. It helps data transfer between types of databases, storages, file formats, cloud services, and more.
Ref: Embulk official
Embulk serves as a lightweight, pluggable data bus that helps with data transfer, migration, and transformation between different data sources.
Thanks to its pluggability, different extensions can be installed to adapt to different data sources, for example: JDBC, AWS S3, GCP GCS.
Meanwhile, other extensions can be used for data processing; these are called filters in Embulk, for example: Add time column, Expand JSON to table.
You can find all available plugins at: https://plugins.embulk.org/
The typical usage of Embulk is:
- Set up the data source and data destination
- Install the needed plugins with the embulk command
- Set up a config.yml that defines the data source, data destination, and data filters
- Run the crontab or cron command to invoke embulk and start the data transfer
The complete guide can be found in the Official use case.
In this article, we would like to extend it to a service that accepts a POST request with information about the data source, the destination, and the related schemas. The service is capable of automatically setting up config.yml and executing embulk at the desired time. In this example, we will support data transfer from MongoDB to GCP BigQuery. We build the service with the Express.js framework.
Below is the flow of how the Embulk service works:
Prerequisites
- Make sure a Java 8 runtime is installed on your host
- Make sure a Node.js environment is installed
Set up TypeScript
Set up dependencies
Build wrapper for Embulk in Node.js
Embulk plugin abstract class
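As a rough illustration of what such a base class could look like, here is a minimal sketch. The names EmbulkPlugin, PluginKind, args, and toConfig are hypothetical and chosen for this example only. The one fact borrowed from Embulk is that every plugin section in config.yml carries a type key and that stdout is a built-in output.

```typescript
// Hypothetical base class for illustration; not the article's original code.
// `kind` is the config.yml section the plugin belongs to.
type PluginKind = "in" | "out" | "filters";

abstract class EmbulkPlugin {
  readonly kind: PluginKind;
  readonly type: string; // plugin name that Embulk resolves, e.g. "mongodb"

  constructor(kind: PluginKind, type: string) {
    this.kind = kind;
    this.type = type;
  }

  // Each concrete plugin returns its own arguments (the keys under its section).
  protected abstract args(): Record<string, unknown>;

  // Merge the plugin type with its arguments into one config section.
  toConfig(): Record<string, unknown> {
    return { type: this.type, ...this.args() };
  }
}

// Smallest possible subclass: the built-in stdout output needs no arguments.
class StdoutOutput extends EmbulkPlugin {
  constructor() {
    super("out", "stdout");
  }

  protected args(): Record<string, unknown> {
    return {};
  }
}
```

Keeping the type key in the base class means a subclass only has to describe its own arguments.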
An example of plugin arguments (configuration) is the AWS S3 Input Plugin
MongoDB input plugin
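The sketch below shows one way to turn request data into the input section for embulk-input-mongodb. The interface and function names are hypothetical. The keys uri, collection, and query are taken from that plugin's README as we understand it, so please verify them against the plugin version you install.

```typescript
// Hypothetical argument shape for this sketch.
interface MongoInputArgs {
  uri: string;        // connection string, e.g. mongodb://user:password@host:27017/database
  collection: string; // collection to read from
  query?: string;     // optional JSON document (as a string) used to filter documents
}

// Build the `in:` section of config.yml as a plain object.
function mongoInputConfig(args: MongoInputArgs): Record<string, unknown> {
  if (args.uri === "" || args.collection === "") {
    throw new Error("uri and collection must not be empty");
  }
  const section: Record<string, unknown> = {
    type: "mongodb",
    uri: args.uri,
    collection: args.collection,
  };
  // Only emit optional keys that the client actually provided.
  if (args.query !== undefined) {
    section.query = args.query;
  }
  return section;
}
```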
BigQuery output plugin
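A comparable sketch for the output side follows. The interface and function names are hypothetical. The keys mode, auth_method, json_keyfile, project, dataset, and table follow the embulk-output-bigquery README as we recall it; accepted auth_method values have changed between plugin versions, so treat the value used here as an assumption.

```typescript
// Hypothetical argument shape for this sketch.
interface BigQueryOutputArgs {
  project: string;
  dataset: string;
  table: string;
  jsonKeyfile: string;          // path to a service account key file
  mode?: "append" | "replace";  // assumed subset of the plugin's modes
}

// Build the `out:` section of config.yml as a plain object.
function bigQueryOutputConfig(args: BigQueryOutputArgs): Record<string, unknown> {
  return {
    type: "bigquery",
    mode: args.mode ?? "append",
    auth_method: "service_account", // assumption: check your plugin version
    json_keyfile: args.jsonKeyfile,
    project: args.project,
    dataset: args.dataset,
    table: args.table,
  };
}
```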
A factory to create plugins based on the client request payload.
Client request payload
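One possible shape for the payload is sketched below, together with a type guard that rejects malformed bodies before any configuration is generated. All field names (source, destination, schedule) are assumptions made for this example.

```typescript
// Hypothetical payload for the POST request.
interface TransferRequest {
  source: { type: "mongodb"; uri: string; collection: string };
  destination: { type: "bigquery"; project: string; dataset: string; table: string };
  schedule?: string; // optional cron expression
}

// True when `obj` is an object whose listed keys are all non-empty strings.
function hasStrings(obj: unknown, keys: string[]): boolean {
  if (typeof obj !== "object" || obj === null) return false;
  const rec = obj as Record<string, unknown>;
  return keys.every((k) => typeof rec[k] === "string" && rec[k] !== "");
}

// Runtime check that narrows an unknown request body to TransferRequest.
function isTransferRequest(body: unknown): body is TransferRequest {
  if (typeof body !== "object" || body === null) return false;
  const b = body as Record<string, unknown>;
  const src = b.source as Record<string, unknown>;
  const dst = b.destination as Record<string, unknown>;
  return (
    hasStrings(src, ["type", "uri", "collection"]) &&
    src.type === "mongodb" &&
    hasStrings(dst, ["type", "project", "dataset", "table"]) &&
    dst.type === "bigquery" &&
    (b.schedule === undefined || typeof b.schedule === "string")
  );
}
```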
PluginFactory
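The factory idea can be sketched as a registry that maps the type field of the payload to a builder function. This is an illustrative design, not necessarily the article's implementation; the register and create method names and the builder signature are assumptions.

```typescript
type PluginConfig = Record<string, unknown>;
type Builder = (args: Record<string, string>) => PluginConfig;

// Hypothetical registry-based factory.
class PluginFactory {
  private readonly builders = new Map<string, Builder>();

  // Associate a plugin type with the function that builds its config section.
  register(type: string, builder: Builder): this {
    this.builders.set(type, builder);
    return this;
  }

  // Look up the builder for `spec.type` and produce the config section.
  create(spec: { type: string } & Record<string, string>): PluginConfig {
    const builder = this.builders.get(spec.type);
    if (!builder) {
      throw new Error(`Unsupported plugin type: ${spec.type}`);
    }
    const { type, ...args } = spec;
    return { type, ...builder(args) };
  }
}

// Register the two plugin types this article supports.
const factory = new PluginFactory()
  .register("mongodb", (a) => ({ uri: a.uri, collection: a.collection }))
  .register("bigquery", (a) => ({ project: a.project, dataset: a.dataset, table: a.table }));
```

Supporting another data source then only requires one more register call, and unknown types fail early with a clear error.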
Main class for embulk execution
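A minimal sketch of the execution step follows: serialize the config, write it to a file, then invoke embulk run on that file. To keep the example free of dependencies, file writing and process spawning are injected as functions; the Writer and Runner types, the class name, and the tiny YAML emitter are all assumptions made for this example, and a real service would more likely use a YAML library.

```typescript
type Config = { [key: string]: string | number | boolean | Config };

// Minimal YAML emitter for nested maps of scalars.
// Strings are JSON-quoted, which is also valid YAML double-quoted syntax.
function toYaml(config: Config, pad = ""): string {
  return Object.keys(config)
    .map((key) => {
      const value = config[key];
      return typeof value === "object"
        ? `${pad}${key}:\n${toYaml(value, pad + "  ")}`
        : `${pad}${key}: ${JSON.stringify(value)}`;
    })
    .join("\n");
}

type Writer = (path: string, content: string) => Promise<void>;
type Runner = (command: string, args: string[]) => Promise<number>; // resolves to the exit code

// Hypothetical executor: writes config.yml, then runs `embulk run <path>`.
class EmbulkExecutor {
  private readonly write: Writer;
  private readonly run: Runner;

  constructor(write: Writer, run: Runner) {
    this.write = write;
    this.run = run;
  }

  async execute(config: Config, path = "config.yml"): Promise<void> {
    await this.write(path, toYaml(config) + "\n");
    const exitCode = await this.run("embulk", ["run", path]);
    if (exitCode !== 0) {
      throw new Error(`embulk exited with code ${exitCode}`);
    }
  }
}
```

In a Node.js service, the Writer could wrap fs.promises.writeFile and the Runner could wrap child_process.spawn, resolving with the exit code from the child's close event.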
Controller for embulk service
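The controller can be sketched as a handler factory. To keep the example self-contained, Req and Res below are minimal structural stand-ins for the parts of the Express request and response objects that the handler uses (body, status, json). The function name, the Execute type, and the response shapes are assumptions.

```typescript
// Minimal stand-ins for the Express Request/Response members used here.
interface Req {
  body: unknown;
}
interface Res {
  status(code: number): Res;
  json(payload: unknown): Res;
}

// Hypothetical callback that builds config.yml and runs embulk.
type Execute = (payload: Record<string, unknown>) => Promise<void>;

function makeTransferController(execute: Execute) {
  return async (req: Req, res: Res): Promise<void> => {
    const body = req.body;
    // Reject requests that do not describe both ends of the transfer.
    if (typeof body !== "object" || body === null || !("source" in body) || !("destination" in body)) {
      res.status(400).json({ error: "source and destination are required" });
      return;
    }
    try {
      await execute(body as Record<string, unknown>);
      res.status(200).json({ status: "completed" });
    } catch (err) {
      // Surface embulk failures to the client as a server error.
      res.status(500).json({ error: err instanceof Error ? err.message : String(err) });
    }
  };
}
```

With Express, this handler would typically be mounted after the JSON body parser, for example app.use(express.json()) followed by app.post("/transfers", makeTransferController(execute)), where the /transfers path is a hypothetical route name.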
The complete source code can be found on GitHub.