Fantastic Data Pipelines and Where to Find Them

Ilma Alifia Mahardika
tiket.com
6 min read · Oct 9, 2020

— A Sneak Peek of Data Team at tiket.com

By @ilmaalifia, Data Engineer at tiket.com


They are nowhere to be found, actually. I am so sorry to disappoint you.

✌️

But, why though? You might be wondering.

Because Data Pipeline is Something We Build

You can’t randomly google “Fantastic Data Pipelines” and stumble upon a ready-to-use one that matches your business’ needs. Business’ needs. That is actually the keyword. A data pipeline is an abstract concept of how a company collects its data and turns it into useful analysis. It is fully customisable.

Why is the data pipeline fully customisable?

Let me quote the most common background statement of Informatics students’ scientific reports. Nowadays, technology grows so insanely fast that it is hard to catch up with everything, because there is just too much of it! You might use a specific tool today, but the next day, it is very possible that a brand-new tool is released and going to replace yours. Your current tool will slowly become obsolete (I bet you often hear a similar background statement). One purpose can be served by various technologies.

For example, saving data. You can choose whether to use SQL or NoSQL technology, and within each of them you have a tremendous number of tech stack options. For SQL, you might use MySQL or PostgreSQL. For NoSQL, there are MongoDB, Firestore, and so on.
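
As a rough illustration (the table, collection, and connection details here are made up, and I am assuming PostgreSQL via psycopg2 and MongoDB via pymongo purely as examples), saving the same booking record could look like this in either world:

```python
# Illustrative only: hypothetical table/collection names and connections,
# showing the same record stored via SQL (PostgreSQL) and NoSQL (MongoDB).
import psycopg2
from pymongo import MongoClient

booking = {"booking_id": "B-001", "user_id": 42, "amount": 150000}

# SQL flavour: rows in a table with a fixed schema
conn = psycopg2.connect("dbname=tiket user=demo")
with conn, conn.cursor() as cur:
    cur.execute(
        "INSERT INTO bookings (booking_id, user_id, amount) VALUES (%s, %s, %s)",
        (booking["booking_id"], booking["user_id"], booking["amount"]),
    )

# NoSQL flavour: the same data stored as a document
mongo = MongoClient("mongodb://localhost:27017")
mongo["tiket"]["bookings"].insert_one(booking)
```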

Every company makes its own decision on which technology to use for its business, especially in the startup ecosystem, where agility and flexibility are the main guidelines. I mean, like, everyone is free to propose a tech stack that might be useful for the business. This condition leads to different data solution needs for each enterprise.

Let us say tiket.com uses Firestore as the database for saving app data. Then, the management a.k.a. C-level (you know, CEO, CTO, etc.) wants the data to be delivered in the form of an informative, customisable, yet catchy dashboard. So, the Data Team of tiket.com must build a data pipeline which enables us to collect all the data in Firestore, put it all together inside an analytical tool, and finally display it on the dashboard. Another company that doesn’t use Firestore doesn’t need to build this. This is why “business’ needs” is the keyword, as I said earlier.
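
To make that concrete, here is a minimal sketch of what one hop of such a pipeline could look like, assuming the google-cloud-firestore and google-cloud-bigquery client libraries and a hypothetical `bookings` collection and analytics table (just an illustration of the idea, not how we actually build it):

```python
# Minimal sketch: copy documents from a Firestore collection into an
# analytical table that a dashboard can query. Names are hypothetical.
from google.cloud import bigquery, firestore

fs = firestore.Client()
bq = bigquery.Client()

rows = []
for doc in fs.collection("bookings").stream():
    record = doc.to_dict()
    record["document_id"] = doc.id  # keep the source key for traceability
    rows.append(record)

errors = bq.insert_rows_json("my-project.analytics.bookings", rows)
if errors:
    raise RuntimeError(f"Failed to load some rows: {errors}")
```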

Right now, the Data Team of tiket.com provides data analysis, artificial intelligence/machine learning services, and monitoring dashboards for the management. Besides that, the Data Team also provides data feeds for other systems.

Fantastic Data Pipelines

The Data Engineer is the first actor who builds a data pipeline through the data engineering process, and hi, that’s me on that team! A Data Engineer is responsible for ingesting data from data sources into the destination analytical tool. In other words, integrating data into one single source of truth.

So, what does fantastic, in terms of a data pipeline, mean? Based on my almost two-month journey as a newbie Data Engineer (and still counting, of course), these are some points that make a data pipeline fantastic.

Fulfill Business’ Needs

Remember why we started building the data pipeline, right? Because of the business’ needs. It is the fundamental point to consider, even before starting to build the data pipeline.

Provide Reliable Data

Reliable data means that the data truly resembles the real events happening in the transactional database or data source. If you have 1,000 rows of data in the data source, the data pipeline must transfer exactly 1,000 rows of data into the destination platform. No less, no more.
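
A simple way to enforce this is to reconcile row counts between source and destination after each load. A minimal sketch, assuming `source_conn` and `dest_conn` are already-opened database connections and the table names are hypothetical:

```python
# Reconciliation check: the destination must hold exactly as many rows
# as the source. Connections and table names are assumed for illustration.
def count_rows(conn, table: str) -> int:
    with conn.cursor() as cur:
        cur.execute(f"SELECT COUNT(*) FROM {table}")
        return cur.fetchone()[0]

source_count = count_rows(source_conn, "bookings")
dest_count = count_rows(dest_conn, "analytics.bookings")

if source_count != dest_count:
    # No less, no more: anything else means rows were dropped or duplicated
    raise ValueError(
        f"Row count mismatch: source={source_count}, destination={dest_count}"
    )
```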

This is Patrick Star and he knows how to provide reliable data. Be like Patrick Star!

Reliable Data and Patrick Meme

If the data needs transformation, it must be transformed correctly. For example, there is a `createdDate` column which contains the timestamp of the record creation in the UTC time zone. If the user wants `createdDate` in the WIB time zone, the data must be transformed by adding exactly 7 hours to that column.
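
In Python, for instance, that shift is a small, explicit conversion (a sketch assuming the column arrives as a timezone-aware UTC datetime):

```python
from datetime import datetime, timedelta, timezone

# WIB (Western Indonesia Time) is UTC+7 with no daylight saving
WIB = timezone(timedelta(hours=7), "WIB")

created_date_utc = datetime(2020, 8, 3, 2, 15, 0, tzinfo=timezone.utc)
created_date_wib = created_date_utc.astimezone(WIB)

print(created_date_wib)  # 2020-08-03 09:15:00+07:00
```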

The point is, the resulting data must represent the exact same thing as the source data.

Provide Readable Data

Did you know that a timestamp can be represented in a readable format and an unreadable format? The readable one looks like this: `2020-08-03 02:15:00 UTC`, while the unreadable one looks like this: `1596420900`. The latter is called the epoch or UNIX timestamp. It is the number of seconds that have elapsed since January 1st, 1970 (midnight UTC/GMT).

Readable Data Illustration (Epoch/Unix Timestamp to Readable Timestamp)
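
Converting between the two is a one-liner in most languages; in Python, for instance:

```python
from datetime import datetime, timezone

epoch_seconds = 1596420900

# Epoch/UNIX timestamp -> readable UTC timestamp
readable = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
print(readable.strftime("%Y-%m-%d %H:%M:%S %Z"))  # 2020-08-03 02:15:00 UTC

# ... and back again
print(int(readable.timestamp()))  # 1596420900
```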

Pssst, you don’t want to hear complaints from the Business Intelligence team or other users, right? Then provide the readable one! I once got a complaint from the Business Intelligence team because I provided date data in epoch format. LOL. I was unaware at that time.

Yeah, it is understandable that it is hard to do analytical work if the data is shown in an unreadable format. Therefore, a data transformation is needed.

But don’t forget to make it reliable, as explained in point 2! 😉

Having a Certain Standard

A standard is so important because, let’s be honest, the company will not stay the same. There will be dynamic changes in the team. If the team doesn’t have a certain standard for developing the pipeline, there will surely be a huge mess all around. Everyone will just code and build a pipeline in their own style. It will be hard to keep track of everything. It will be hard for new joiners to understand the whole work. They will face a steep learning curve, which slows down the work.

Developer Confusion Everywhere Meme

Provide Proper Error Alerting

I am not gonna lie: a data pipeline is something very fragile. Most of the time, it causes overtime work for the Data Engineering Team when something is not right. LOL. For example, late data ingestion can cause unreliable data, or worse, it can lead to wrong company decisions. Alerting is a must. And proper alerting is important.

What does proper mean?

The alert must provide all the important information, such as in which part the error happens, what the error type is, what the priority is, and what the impacts are. The alert also needs to be put in each section of the data pipeline; this is what we call end-to-end monitoring.
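
As a rough sketch of what such an alert could carry (the field names and the `send_alert` function are made up for illustration; in practice it would post to whatever channel the team uses, such as Slack or email):

```python
# Hypothetical alert payload: field names and the delivery channel are
# illustrative only, not an actual tiket.com implementation.
def send_alert(payload: dict) -> None:
    # In practice: post to Slack, PagerDuty, email, etc.
    print(payload)

send_alert({
    "pipeline": "firestore_to_warehouse",   # which pipeline failed
    "stage": "load_to_analytics_table",     # in which part the error happened
    "error_type": "RowCountMismatch",       # what kind of error it is
    "priority": "P1",                       # how urgent it is
    "impact": "Today's booking dashboard may show incomplete numbers",
})
```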

With proper alerting, the Data Engineering Team can fix the error in an effective and efficient way, so we can focus on other things, such as researching new technologies and methodologies to improve the existing data pipeline.

Oh, don’t forget, proper alerting will also decrease the overtime work, of course! Hahaha.

Data Engineer and Error Meme

Building a fantastic data pipeline is definitely not a one-time job. We gotta learn, build, and review. Then learn again, build again, and review again. We will always learn along the way. And that’s how we move towards the fantastic one.

A Spoiler for Data Engineering at tiket.com

Speaking of Data Engineering at tiket.com, the main actor is Apache NiFi. How does the Data Engineer Team make use of this tech stack? It is a very long story; you’d better wait for the next episode because I blurred out everything below. Hahaha.

Data pipelines using Apache NiFi

Cheers! 😌
