Writing a good ETL flow with NodeJs

Daniele
Data Tech Tips
Published in
4 min readJul 27, 2019

Last story was about if it’s better to write your own ETL flow instead of using any of the tools available nowadays. in this story we will cover how to write a good quality ETL flow with javascript and Node js.

Requirements

  • Node Js (use the version you’re most familiar)
  • Vs Code
  • A Linux machine (Ubuntu or Debian should be fine)
  • A database (We will use PostgreSQL 11 use whatever you like most)
  • Pino an advanced logging script
  • Sequelize this node module helps us for querying any database
  • PM2 a NodeJs scheduler

Folder structure

A good folder structure is needed in order to give to each part of your software its importance and to have a mental organisation where to find the piece of code you need.

This is the usual folder structure i use for my projects.

  • config → Here goes all the configurations from database connections and node environments
  • deploy → Here are all the deploy scripts, usually using ssh2 and git should do the work of deploying any changes
  • logs → This folder will help you a lot in debugging
  • src → Here is the heart of the ETL
  • src/businessLayer → Here are all the business related behaviour of your software.
  • src/dataLayer →This folder contains the mapping of your database column to Sequelize
  • src/helper → All the scripts useful to obtain transformations should be here
  • src/webServer → Needed for starting the ETL process
  • ssh → We don’t use password, we like keys
  • utils → Here are all the utilities (Testing some code)
Photo by Pankaj Patel on Unsplash

Let’s code!

Our starting point is index.js.
In this file you should put all the logics needed to avoid any incongruence for example:

  • Check if all the required schemas are created
  • Check if all the tables are created
  • Check if all the stored procedures and functions are created

Once you’ve done those checks let’s start with the first part of our loading.

In dataLayer you should put the database structure you have for example:

For each table you have to create a raw import of your data without transforming it at first (maybe only a rename should be made at this point) with a simple select * query importing only the difference from the last import. A good check should be the lastInserted data and load only the data produced after that period.

Two are the ways to do this:

  • Creating a SQL file containing the instructions, read the file and execute it
  • Writing from code with Sequelize it should be faster but harder to explain to a non developer.

From Staging to DWH

This is the funnier part of the work, as long as you have created the database schema you should write the import code with some transformations (eg. dataTypes) and some simple computations (Eg. counting and summing)

Don’t forget the keys

Every time you do an insert don’t forget to drop the primaryKeys, every constraint and every index, this allows a faster insert and avoids errors.

Battling with existing data
Each time you start to import your data into dwh delete, before any insert, all the existing data matching the ids that are into staging, this allows to implicitly update the data instead to use huge updates that are slow.
So do something like this:

It is obvious that you need to delete every data in staging environment before importing from source in order to make this logic work.

The last part

The last but not the least here you create your data marts using SQL scripts and Sequelize to read the file and execute it.

PM2

With PM2 you can create a cron-like job that starts the nodeJs scripts at the time of your preference.

With ubuntu just use:

pm2 start etl.js — name “my-etl”

and with PM2 list you can list all the jobs created in PM2.

Checking ETL status

An another good approach to check if the job is running well is to write each step into a loggingSql table structured like this for example:

This si very useful for checking in real time what is doing the job:

And at the start of each step update the table inserting the duration of previous step by doing a simple (endTime - startTime).

This is obvious an example, let me now in the comment what are your best practices and let’s start a discussion.

--

--