Orchestrating Data Collection with Airbyte

To get anything out of big data, you first need to ingest data from multiple sources. Big data collection is notoriously difficult. But does it have to be?

Michael C. Reyes
humanmanaged
4 min read · Aug 10, 2022

Imagine having a smartphone that can only handle one mobile application at a time. It would be manageable if you only used one or two applications, but what if you had 50? One smartphone for Google News, another for Facebook, another for Instagram, and so on. Sounds awful, right?!

Image: Hands of people using mobile apps on smartphones (phone illustration vector created by pch.vector on FreePik)

That is one of the main problems we encountered while developing the data ingestion for our Collect microservice at Human Managed. Our platform generates intelligence and action from customers’ data, which means we work with many data sources as input for analysis (currently 30+ data sources and counting!).

Initially, we developed numerous Bash and Python scripts, run via cron jobs or AWS Lambda, just to automatically fetch data from multiple APIs and data sources. However, this became difficult to manage once we tried to scale up and replicate the scripts for each data source and customer.

We had to find a way to address this with something that we can manage, maintain, scale, and modify for our use cases.

Airbyte Open-Source ELT Platform

Introducing Airbyte, an open-source data integration platform that syncs data from applications, APIs, and databases and transfers it to the destination you want, such as a database or object storage. Remember the analogy I gave earlier? Airbyte is the smartphone that can handle multiple mobile applications, which makes data integration simple, secure, and extensible.

We used existing Airbyte connectors that fit our use cases and developed our own connectors when necessary. Since Airbyte is an open-source platform, the available connectors are developed by the community, and you can request new ones or contribute your own. (We plan to contribute our custom connectors and our improvements to existing ones so that we can give back to the community, so stay tuned!)

Once we converted our automation scripts into Airbyte connectors, we were able to easily implement and organize the batch collection that happens within our Collect microservice. To replicate the same connection for another customer, we just set the configuration in the existing connector, set the destination, and that’s it! We were able to integrate another data source into our platform in just minutes.
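To make the "same connector, different customer" idea concrete, here is a minimal sketch in plain Python. It is not Airbyte's actual configuration schema or API: the template fields, the `build_connection` helper, the connector name, and the URL are all hypothetical and only illustrate how one shared connector definition can be reused with per-customer settings.

```python
from copy import deepcopy

# Hypothetical template for one source connector. The field names are
# illustrative assumptions, not Airbyte's real configuration schema.
BASE_SOURCE_CONFIG = {
    "connector": "example-api-source",
    "api_url": "https://api.example.com/v1/events",
    "sync_frequency_hours": 24,
}


def build_connection(customer: str, api_key: str, destination_prefix: str) -> dict:
    """Derive a per-customer connection definition from the shared template."""
    source = deepcopy(BASE_SOURCE_CONFIG)  # never mutate the shared template
    source["api_key"] = api_key
    return {
        "name": f"{customer}-{source['connector']}",
        "source": source,
        "destination": {
            "type": "object-storage",
            "path": f"{destination_prefix}/{customer}/",
        },
    }


# Onboarding a new customer becomes one more call with different settings.
connections = [
    build_connection("customer-a", "KEY_A", "raw"),
    build_connection("customer-b", "KEY_B", "raw"),
]
```

The point of the sketch is that only the configuration changes between customers; the connector logic itself is written once.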

Airbyte ELT Overview

Going back to the analogy of having multiple smartphones: with Airbyte, we now have one powerful smartphone that can handle and receive data from multiple mobile applications (see Airbyte's list of available connectors). This solved the problem of maintaining and organizing our data ingestion automation, which can also be customized, scaled, and managed to suit our needs.

Drawbacks and Limitations

Every tool has drawbacks, and Airbyte is no exception. The first that comes to mind is the learning curve required to develop and maintain custom Airbyte connectors. If you are not familiar with Airbyte, there are a number of essential concepts to grasp before you can properly understand and implement a connector. Once you have, most of the development is easier than before, thanks to the Airbyte CDK (Connector Development Kit) built by the Airbyte team.
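One of those concepts is that a source connector's read step emits records wrapped in a message envelope. The sketch below shows that idea in plain Python and deliberately does not use the Airbyte CDK. The envelope shape (`type`, `record.stream`, `record.data`, `record.emitted_at`) follows the Airbyte protocol's RECORD message as we understand it; the function names, the `events` stream, and the sample rows are hypothetical.

```python
import json
import time
from typing import Any, Dict, Iterable, Iterator


def to_record_message(stream: str, row: Dict[str, Any]) -> Dict[str, Any]:
    """Wrap one source row in a RECORD-style envelope."""
    return {
        "type": "RECORD",
        "record": {
            "stream": stream,
            "data": row,
            "emitted_at": int(time.time() * 1000),  # epoch milliseconds
        },
    }


def read(rows: Iterable[Dict[str, Any]], stream: str = "events") -> Iterator[str]:
    """Yield one JSON-serialized message per row, as a connector would print them."""
    for row in rows:
        yield json.dumps(to_record_message(stream, row))


# Hypothetical rows standing in for an API response.
sample_rows = [{"id": 1, "severity": "high"}, {"id": 2, "severity": "low"}]
messages = list(read(sample_rows))
```

A real connector also has to describe its configuration, check the connection, and list its streams; the CDK provides the scaffolding for those steps.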

As for limitations, Airbyte does not yet support backing up the raw data between the Extract (E) and Load (L) steps. This conflicts with our architecture, which stores the raw logs without the metadata that Airbyte adds.
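One possible workaround is to strip the wrapper after loading. The sketch below assumes the destination writes JSON Lines in which each line carries the original payload under an `_airbyte_data` key next to metadata fields such as `_airbyte_ab_id` and `_airbyte_emitted_at`. Those field names are an assumption that depends on the destination and the Airbyte version, and `unwrap_raw` is a hypothetical helper, not part of Airbyte.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def unwrap_raw(lines: Iterable[str], payload_key: str = "_airbyte_data") -> Iterator[Dict[str, Any]]:
    """Yield only the original payload from each wrapped JSON line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # If the assumed payload key is missing, pass the record through unchanged.
        yield record.get(payload_key, record)


# Hypothetical wrapped output, one JSON document per line.
wrapped = [
    '{"_airbyte_ab_id": "a1", "_airbyte_emitted_at": 1660000000000, "_airbyte_data": {"src_ip": "10.0.0.1"}}',
    "",
    '{"_airbyte_ab_id": "a2", "_airbyte_emitted_at": 1660000001000, "_airbyte_data": {"src_ip": "10.0.0.2"}}',
]
raw_logs = list(unwrap_raw(wrapped))
```

This recovers the payload only after the load step, so it does not replace a true backup taken between extract and load.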

Conclusion

Even though I’ve mentioned the cons of implementing Airbyte, the benefits far outweigh the disadvantages. Airbyte gives you the following capabilities:

  • Open source and free for all to contribute
  • Cloud Agnostic
  • Customizable, Modular, Scalable, Secured, and Compliant
  • Set Connection Syncs and Incremental Stream State
  • Unlimited Sync Frequency [No Tier]
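The "incremental stream state" item in the list above is easiest to see with a small example. The following is a minimal sketch of cursor-based incremental syncing in plain Python, not Airbyte's implementation: the `incremental_read` function, the `updated_at` cursor field, and the sample rows are hypothetical. The idea is that each sync only emits rows newer than the saved cursor, then saves the newest cursor value for the next run.

```python
from typing import Any, Dict, List, Tuple


def incremental_read(
    rows: List[Dict[str, Any]],
    state: Dict[str, Any],
    cursor_field: str = "updated_at",
) -> Tuple[List[Dict[str, Any]], Dict[str, Any]]:
    """Return rows newer than the saved cursor, plus the updated state."""
    last_seen = state.get(cursor_field)
    new_rows = [r for r in rows if last_seen is None or r[cursor_field] > last_seen]
    if new_rows:
        new_state = {cursor_field: max(r[cursor_field] for r in new_rows)}
    else:
        new_state = dict(state)  # nothing new: keep the previous cursor
    return new_rows, new_state


# ISO 8601 date strings compare correctly as plain strings.
source_rows = [
    {"id": 1, "updated_at": "2022-08-01"},
    {"id": 2, "updated_at": "2022-08-05"},
]

first_batch, state = incremental_read(source_rows, {})      # first sync: everything
second_batch, state = incremental_read(source_rows, state)  # nothing changed: empty

source_rows.append({"id": 3, "updated_at": "2022-08-09"})
third_batch, state = incremental_read(source_rows, state)   # only the new row
```

Because the state is persisted between syncs, each run avoids re-fetching data it has already collected.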

Airbyte's ability to unify data integration pipelines under one fully managed platform makes it a good fit for our Collect microservice: it lets us integrate data sources easily, develop custom connectors when necessary, and manage these connections without worrying too much about security and compliance.

With a fast-growing community of contributors, more connectors will be built and improved over time, which will continue to push Airbyte's data ingestion capabilities forward.

You can learn more about Airbyte and get involved in the open-source platform by visiting the official Airbyte website.

About Human Managed

Human Managed is a data company whose purpose is to empower responsible decisions. Founded in 2018, we are headquartered in Singapore and operate across Hong Kong, the Philippines, and India. As a self-funded ASEAN startup, we currently have more than 40 employees and a growing gig community, serving global customers in the essential services sectors to improve their cyber, risk, and digital maturity.

Today, organizational success is driven by quick and effective decisions from an abundance of data. To enable this consistently, our platform finds answers and provides recommendations based on intuitive models and collective intelligence. Our products are built for people who process, analyze, triage, communicate, and make decisions from a high volume of information each day.

We are always keen to explore new ways to build, co-create and solve problems. Come say hello@humanmanaged.com and follow us on LinkedIn, Instagram and Twitter.

Michael C. Reyes
Threat Hunter | SOC Analyst | Python Developer