Ingesting Terabytes of Data Every Day Using a Microservices Architecture at MiQ

Nagaraj Tantri
MiQ Tech and Analytics
8 min read · May 18, 2021

Data-driven solutions require fast access to data, and this is especially true for an analytics business that uses data at its core. At MiQ we treat data as a first-class citizen, continually improving our ability to ingest data at scale and provide a seamless journey into analytics. On average, we onboard 10+ TB of compressed adtech data into our infrastructure every day.

Our journey in building a Batch Data Ingestion solution at MiQ has gone through a few iterations of identifying the right set of tools and procedures, and over the years we have settled on an approach that scales well. Before we talk about the current setup and the future vision, let us look at what we started with.

Before — why it was not scalable

Our initial approach was purely driven by business needs and focused on getting data ingested for analysis. When we first pursued data ingestion, we took a monolithic approach: one service running on an EC2 server, using Apache Camel to connect sources and sinks.

MiQ Legacy Feed Service (the legacy system)

In simple terms:

  • A standalone Java console application, named “Feed Service”, read its inputs from a properties file.
  • It launched Apache Camel routes, initializing all source and sink connectors and executing the pipelines (see the sketch after this list).
  • These routes connected to some of the important data sources MiQ uses for data ingestion, such as AWS S3, Google Cloud Storage (GCS) and SFTP.
  • Once connected, the pipelines were triggered asynchronously via schedulers to transfer data to targets such as MiQ’s own AWS S3 buckets or AWS Redshift.
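
To make that concrete, here is a minimal sketch of the kind of Camel route the legacy Feed Service would set up, assuming Camel 3.x with the camel-main, camel-ftp and camel-aws2-s3 components. The endpoint URIs, property placeholders and class names are illustrative, not our actual configuration.

```java
// A minimal, hypothetical sketch of a legacy-style Camel route (assuming Camel 3.x
// with camel-main, camel-ftp and camel-aws2-s3). Property placeholders such as
// {{feed.sftp.host}} stand in for entries in the ever-growing properties file;
// none of these names reflect MiQ's real configuration.
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.main.Main;

public class LegacyFeedServiceSketch {
    public static void main(String[] args) throws Exception {
        Main main = new Main();
        main.configure().addRoutesBuilder(new RouteBuilder() {
            @Override
            public void configure() {
                // One route per pipeline: poll an SFTP source on a fixed schedule
                // and copy each file into an S3 bucket.
                from("sftp://{{feed.sftp.host}}/{{feed.sftp.dir}}"
                        + "?username={{feed.sftp.user}}&password={{feed.sftp.password}}"
                        + "&delay=3600000")
                    .routeId("partner-feed-to-s3")
                    // Use the incoming file name as the S3 object key.
                    .setHeader("CamelAwsS3Key", simple("${header.CamelFileName}"))
                    .to("aws2-s3://{{feed.s3.bucket}}");
            }
        });
        main.run(args);
    }
}
```

Every new pipeline meant another route like this plus more entries in the properties file, which is a large part of why the service became hard to maintain and restart.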

Limitations we faced

  • Standalone service — a single point of failure.
  • Non-maintainable code — every pipeline was customized for its specific use case, which resulted in a lot of unmanageable custom code.
  • Restarts were a pain — the Java properties file kept growing and restarting the service was not easy.
  • No shutdown hook — the code had no shutdown hook, so whenever we wanted to restart the service we had to wait for hours until the running pipelines finished.
  • Difficult debugging — the logs were bare-bones, with a single log file for all pipelines, so it was hard to pin down and debug a specific pipeline, and we had no tool to filter logs by time window or by pipeline.
  • Slow time to market — every new request to onboard data raised issues around designing or reusing pipelines, and any code change meant a lot of maintenance on the older pipelines.
  • Ingestion and processing logic were combined: onboarding data involved several nuances of how data files were accessed and formatted. We ended up with custom processing for some pipelines, such as column transforms and JSON-to-CSV format changes, with no common functionality.

We had focused on business deliverables so we could serve our clients faster, but this caused scaling problems as we grew into new markets and took on new clients. For instance, when we expanded into the US and APAC markets, we had to customize our data ingestion per region, which was very cumbersome. We had to change our strategy for onboarding data and build a scalable solution.

Now — how we modeled it

While microservices were the buzzword everywhere, we focused on getting there gradually. The legacy Feed Service was still operational, and we had to formalize a design that would help us achieve faster data ingestion. Completing the transition from the legacy service to the new scalable one took time and has been a rollercoaster ride, since building microservices that actually follow best practices brings its own challenges. We currently have 300+ pipelines running in production, scheduled hourly, daily, weekly or monthly.

Current setup diagram

NOTE: The Processing Service with Qubole shown in the diagram above is not strictly part of the data ingestion requirement. But onboarding data in the true sense involves data wrangling (such as converting formats to Parquet and applying data governance), so we needed a processing layer that supports the big data ecosystem; the Processing Service is our in-house microservice for that.
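
As a rough illustration of the kind of wrangling this layer handles, the sketch below converts newly ingested JSON files to Parquet and drops a couple of columns as a stand-in for governance rules, using Spark's Java API (the engine behind platforms like Qubole and Databricks). The buckets, paths and column names are made up for the example.

```java
// Hypothetical sketch of a format-conversion step (JSON -> Parquet) on a Spark-based
// platform such as Qubole or Databricks. Buckets, paths and column names are made up.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class JsonToParquetJobSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("ingested-feed-json-to-parquet")
                .getOrCreate();

        // Read the raw JSON files dropped by the ingestion pipeline.
        Dataset<Row> raw = spark.read().json("s3://example-raw-bucket/feed/dt=2021-05-18/");

        // Stand-in for governance rules: drop columns flagged as PII
        // (assumes the feed actually carries these columns).
        Dataset<Row> governed = raw.drop("user_email", "ip_address");

        // Write the governed data back out as Parquet, partitioned by date
        // (assumes an event_date column exists in the feed).
        governed.write()
                .mode(SaveMode.Overwrite)
                .partitionBy("event_date")
                .parquet("s3://example-curated-bucket/feed/");

        spark.stop();
    }
}
```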

A brief explanation of the above high-level design:

  • Microservice approach — We have moved to Domain-Driven Design and created a couple of microservices (Data Ingestion Service and Data Catalog Service) to provide batch ingestion capabilities. They follow an event-driven architecture to communicate with other services.
  • Abstract the tools, or wrap them — The Data Ingestion Service abstracts the tools used for transferring data across multiple source and target systems. Currently we use two tools, StreamSets Data Collector and Apache NiFi. We abstract them by using their APIs and creating pre-determined templates for pipelines such as S3-to-S3 and SFTP-to-S3 (see the sketch after this list). The benefit is that we don't have to reinvent the wheel and write the connectors ourselves. It also keeps us agnostic to any tool provider, so we can plug in newer tools, such as Google Cloud Data Fusion or AWS Data Pipeline, whenever we find better options.
  • Configurable and extensible — Everything is configurable via the Pipeline and Catalog entities, so pipeline data transfer scales to new use cases. Even StreamSets Data Collector (SDC) can be replaced with Apache NiFi when a required feature is not supported in SDC.
  • Highly available — Everything runs in our AWS EKS Kubernetes cluster. Configurations are stored in MongoDB as a set of entities, i.e. we use industry-standard domain entities for pipelines, datasets and schemas across all our data pipelines.
  • Privacy and data governance at scale — This structure takes care of data governance and data security while transferring data. For instance, the moment we onboard data we run it through our Processing Service to remove PII before storing it in our infrastructure.
  • Separation of concerns — data wrangling: We have moved the heavy lifting of reshaping data into the required format and applying governance to our in-house Processing Service, which internally uses big data platforms such as Qubole and Databricks.
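
To make the "abstract the tools" idea more tangible, here is a hedged sketch of what such an abstraction could look like: a pipeline configuration entity plus a tool-agnostic interface with one adapter per tool. None of these types are the actual Data Ingestion Service API; they only illustrate the shape of the design.

```java
// Illustrative sketch of a tool-agnostic ingestion abstraction. None of these types
// are MiQ's real classes; they only show the shape of the design.
import java.util.Map;

/** Configuration for one batch pipeline, persisted as a document (e.g. in MongoDB). */
record PipelineConfig(
        String pipelineId,
        String template,                  // e.g. "SFTP_TO_S3", "S3_TO_S3"
        Map<String, String> sourceParams,
        Map<String, String> targetParams,
        String schedule) {                // e.g. a cron expression for hourly/daily runs
}

/** Contract every tool adapter (StreamSets, NiFi, ...) has to satisfy. */
interface IngestionTool {
    /** Instantiate a pre-built template with this pipeline's parameters. */
    String createPipeline(PipelineConfig config);

    /** Trigger a run of a previously created pipeline. */
    void start(String toolPipelineId);

    /** Ask the tool for the current run status. */
    String status(String toolPipelineId);
}

/** Adapter that would talk to the StreamSets Data Collector REST API. */
class StreamSetsAdapter implements IngestionTool {
    @Override
    public String createPipeline(PipelineConfig config) {
        // Would call SDC's API to import the matching template and set its parameters.
        return "sdc-" + config.pipelineId();
    }

    @Override
    public void start(String toolPipelineId) { /* would call the SDC start endpoint */ }

    @Override
    public String status(String toolPipelineId) { return "RUNNING"; }
}

class IngestionSketch {
    public static void main(String[] args) {
        IngestionTool tool = new StreamSetsAdapter();
        PipelineConfig config = new PipelineConfig(
                "partner-feed", "SFTP_TO_S3",
                Map.of("host", "sftp.example.com", "path", "/outbound"),
                Map.of("bucket", "example-raw-bucket"),
                "0 * * * *");
        String id = tool.createPipeline(config);
        tool.start(id);
        System.out.println(id + " -> " + tool.status(id));
    }
}
```

Because the service only ever talks to the interface, swapping StreamSets for NiFi, or plugging in a new provider, becomes a matter of adding another adapter rather than rewriting pipelines.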

Limitations faced

  • Not user friendly — since the service is predominantly API- and event-driven, stakeholders at MiQ such as Business Analysts find it difficult to configure and run pipelines.
  • Data lake vision: to build a platform that streamlines onboarding data as a first-class citizen, we need data, once onboarded, to be seamlessly integrated with a data preparation tool before anyone performs analytics on it.
  • Integrated data health: today data health checks are not part of the ingestion pipeline; to maintain data integrity we run them as a separate process.

Future — how we intend to shape it

The focus moving forward is to make the platform more accessible for users running any kind of workload and to make data onboarding easier, including applying data governance, integrating data health checks and preparing data while it is in transit.

We are looking at our internal tool, Studio, which would help us apply any data ingestion seamlessly.

Future ahead

Why abstract the underlying tools and provide a UI?

  • Ease of accessibility: we are focusing on ease of access by providing a user interface that lets all stakeholders build and onboard their own pipelines.
  • Less technical knowledge required from stakeholders: our stakeholders are a mix of Business Analysts, Solution Engineers and sometimes Product Analysts, who need data from a range of data sources and vendors. They need a simplified view and a minimalist way to connect source and target. For instance, the target storage for big data defaults to Parquet after data governance is applied; stakeholders should not have to worry about choosing the right data format, but should be guided towards this best practice for storing data.
  • Data lake vision: the UI is also being modelled to give access to other modules such as Data Preparation and common processing templates. To make onboarded data easily accessible for analytics, we focus on preparing data on arrival.

Limitations we may face

  • Toolset change management: every change to the underlying toolsets' source and target connectors can force modifications to our stack and fixes to the UI. There are ways to mitigate this, such as checking backward compatibility.
  • Getting new data connectors into the UI faster: the backend ingestion service exposes an API for setting up pipelines between different connectors, and the UI is modelled to accept inputs per the API spec. This can become a bottleneck if the API is not designed with the UI in mind. The best way to tackle this is to render the UI from the backend service's contract, i.e. generate config-driven dynamic UI components (see the sketch below).
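
One way to picture the config-driven approach mentioned above: the backend describes each connector's fields, and the UI renders its forms from that description instead of hard-coding them. The descriptor below is hypothetical and only shows the shape such a contract could take.

```java
// Hypothetical descriptor a backend could expose (e.g. as JSON over REST) so the UI
// can render connector forms dynamically. All names and fields are illustrative.
import java.util.List;

record FieldSpec(
        String name,          // e.g. "bucket", "host", "path"
        String type,          // e.g. "string", "secret", "number"
        boolean required,
        String helpText) {
}

record ConnectorSpec(
        String connectorId,   // e.g. "sftp-source", "s3-target"
        String displayName,
        List<FieldSpec> fields) {
}

class ConnectorCatalogSketch {
    // The UI fetches this list and generates one form per connector, so adding a
    // new connector needs no UI code change.
    static List<ConnectorSpec> connectors() {
        return List.of(
                new ConnectorSpec("sftp-source", "SFTP source", List.of(
                        new FieldSpec("host", "string", true, "SFTP host name"),
                        new FieldSpec("path", "string", true, "Remote directory"),
                        new FieldSpec("password", "secret", true, "Login secret"))),
                new ConnectorSpec("s3-target", "S3 target", List.of(
                        new FieldSpec("bucket", "string", true, "Destination bucket"),
                        new FieldSpec("prefix", "string", false, "Key prefix"))));
    }

    public static void main(String[] args) {
        connectors().forEach(c -> System.out.println(c.displayName() + ": " + c.fields()));
    }
}
```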

In Summary

Our ability to scale to each market's requirements has become more agile, and we aim to build a platform that treats data as its core. The journey from a legacy monolithic service to a more scalable ingestion platform has also pushed us to think about a better self-serve platform. There have been many challenges; for instance, in the monolithic approach we processed data inside the ingestion pipelines, and we couldn't migrate such pipelines easily, so we ended up splitting them into a multi-step process: ingest first, then process separately. In retrospect, building a platform that addresses so many different requirements is a daunting task, but it has been a great experience and shows that every stage of software development is a learning phase.
