MFine Data Platform

Sangna Maniar
Published in mfine-technology
Mar 7, 2022

Background:

At MFine, we aim to provide quality healthcare services to all. MFine partners with top hospitals and leading healthcare institutions across the country to ensure that expert doctors are just a tap away. On our platform, we have 50K+ doctors spanning 58 specialities and have served 8+ million cases so far.

We continue to innovate and provide a better customer experience by leveraging big data, machine learning, and artificial intelligence. As a result, over the last five years, the scale of our data platform has grown from a few MBs to TBs, and each day we continue to ingest additional GBs.

In this blog post, we will discuss:

  • The need for building a data platform
  • How we built a large-scale data platform at MFine
  • How we used technology as an opportunity to improve our data ecosystem
  • The learnings from our experience

Need for a Data Platform:

  • We wanted to build a scalable architecture with a single source of truth.
  • Data must be as accurate and as timely as possible to support services on our platform.
  • We needed a data platform that can provide fast, real-time analytical dashboards for business and day-to-day operations.
  • It enables self-service for a diverse range of users to discover and analyse data within the platform.
  • The Data Science team consumes data to train and build AI/ML models.
  • Data/Business Analysts build interactive analytics dashboards that can surface critical business insights in seconds.
  • The data platform helps our engineering teams build better products, and finally it helps drive our daily operational needs.

How did we solve it here @MFine?

The entire MFine system is built upon a microservice architecture. Each service implements a set of business functional units and communicates with the others via a messaging platform for asynchronous communication and RESTful APIs for synchronous communication.

Our infrastructure is built on top of big data technologies: we leverage the Amazon Web Services (AWS) messaging services (SNS/SQS) for event streaming and AWS S3 for the data layer. This separates the compute and storage layers.
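To illustrate the asynchronous path, the snippet below is a minimal sketch (not our production code) of how a microservice might publish a domain event to an SNS topic with boto3; the topic ARN, event type, and payload fields are hypothetical.

```python
import json
import boto3

sns = boto3.client("sns", region_name="ap-south-1")

def publish_event(event_type: str, payload: dict) -> None:
    """Publish a domain event to a (hypothetical) SNS topic after a basic sanity check."""
    if not payload.get("entityId"):
        raise ValueError("event failed sanity check: missing entityId")
    sns.publish(
        TopicArn="arn:aws:sns:ap-south-1:123456789012:appointment-events",  # hypothetical ARN
        Message=json.dumps(payload),
        MessageAttributes={
            "eventType": {"DataType": "String", "StringValue": event_type},
        },
    )

publish_event("APPOINTMENT_CREATED", {"entityId": "apt-101", "doctorId": "doc-42"})
```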

Ingestion Mechanism:

We have 130+ microservices that generate about half a million events per minute, all of which undergo strict sanity checks before they are published to AWS SNS. An in-house ETL tool consumes messages from 50 topics covering 800 different event types, performs on-the-fly transformations, and writes to AWS RDS and an Elasticsearch index, in addition to sending messages to the data lake. It is also responsible for sending data to the AppsFlyer analytics platform. Nearly 5 lakh (500,000) messages are processed per minute, and the tool has an inbuilt mechanism to replay failed events from a Dead Letter Queue (DLQ) to ensure no data loss.

In the background, daily cron jobs scheduled with Airflow, together with AWS Glue scripts, move daily aggregated snapshot-based data from the data lake into our big data analytics platform and perform file format conversions for better performance.
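To make the consume-and-replay mechanism concrete, here is a minimal sketch assuming an SQS queue subscribed to the SNS topics plus a companion DLQ; the queue URLs, the transformation, and the sink writer are placeholders, not our actual ETL tool.

```python
import json
import boto3

sqs = boto3.client("sqs", region_name="ap-south-1")

# Hypothetical queue URLs; the main queue is assumed to be subscribed to the
# SNS topics with raw message delivery enabled.
MAIN_QUEUE = "https://sqs.ap-south-1.amazonaws.com/123456789012/etl-events"
DLQ_QUEUE = "https://sqs.ap-south-1.amazonaws.com/123456789012/etl-events-dlq"

def write_to_sinks(event: dict) -> None:
    """Stand-in for the real sink writers (RDS, Elasticsearch, data lake)."""
    print("writing", event)

def consume(queue_url: str = MAIN_QUEUE) -> None:
    """Poll the queue, transform each event, and delete it only after a successful write."""
    resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        event = json.loads(msg["Body"])
        event["processedAt"] = "2022-03-07T00:00:00Z"  # placeholder for a real transformation
        write_to_sinks(event)
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])

def replay_dlq() -> None:
    """Move failed events from the DLQ back to the main queue so no data is lost."""
    resp = sqs.receive_message(QueueUrl=DLQ_QUEUE, MaxNumberOfMessages=10)
    for msg in resp.get("Messages", []):
        sqs.send_message(QueueUrl=MAIN_QUEUE, MessageBody=msg["Body"])
        sqs.delete_message(QueueUrl=DLQ_QUEUE, ReceiptHandle=msg["ReceiptHandle"])
```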

Storing Layer:

We have built our own data lake ecosystem to handle GBs of daily data ingestion using the following components:

  • AWS RDS: Our transactional data resides in AWS RDS with a storage capacity of 3.5 TB and 35,000 provisioned IOPS. The data life cycle in RDS is only three months; after that, data is moved to Presto for historical data analysis.
  • AWS S3 + Glue Data Catalog: The data lake is on S3, but since S3 holds just the data, we need someplace to store metadata about what those S3 locations contain. The Glue Data Catalog comes to the rescue: a scheduled AWS Glue Crawler updates table metadata automatically, which enables Athena to query our data.
  • Presto on Amazon EMR: On top of that, we have a Presto cluster running on Amazon EMR, which provides data warehousing capabilities with a data latency of 24 hours.
  • Tenant data: For tenant data we have an in-house application that ingests all the events, carries out de-identification/anonymisation of the data, and stores it in time-partitioned tables that are consumed by different Data Shores. This data includes only click streams and analytical storage.
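As a rough sketch of how this catalog-driven setup can be exercised with boto3 (the crawler, database, table, and output bucket names below are hypothetical, not our actual resources):

```python
import boto3

glue = boto3.client("glue", region_name="ap-south-1")
athena = boto3.client("athena", region_name="ap-south-1")

# Refresh table metadata so newly landed S3 partitions become queryable.
glue.start_crawler(Name="datalake-daily-crawler")  # hypothetical crawler name

# Query the catalogued table through Athena; results land in an S3 output location.
athena.start_query_execution(
    QueryString="SELECT event_type, count(*) FROM events GROUP BY event_type",
    QueryExecutionContext={"Database": "datalake"},                          # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},  # hypothetical bucket
)
```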

Consumption Layer:

With nearly 15K ad-hoc/scheduled queries running on our platform, around 20 GB of data is scanned per day, and we place no limits on the amount of data scanned. A Metabase (open source) UI layer is used on top of RDS and the Presto engine to build interactive dashboards. We also use the ELK stack to build real-time, high-performance dashboards for operational needs.
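The ELK-backed operational dashboards are ultimately driven by aggregations over an index. A minimal sketch using a recent version of the official Elasticsearch Python client (the endpoint, index, and field names are assumptions) might look like this:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical endpoint

# Count events per speciality over the last hour -- the kind of aggregation
# that backs a real-time operational dashboard panel.
resp = es.search(
    index="consultation-events",  # hypothetical index
    size=0,
    query={"range": {"timestamp": {"gte": "now-1h"}}},
    aggs={"by_speciality": {"terms": {"field": "speciality.keyword"}}},
)
print(resp["aggregations"]["by_speciality"]["buckets"])
```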

High Level Design:

Why did we choose what we chose?

We wanted to build a lightweight, low-cost architecture with virtually unlimited data capacity and zero data loss: a centralised, consolidated data platform with minimum complexity and the highest business value. Our data architecture should also support continuous evolution.

EMR: Amazon EMR is a cloud big data platform for running large-scale distributed data processing jobs and interactive SQL queries using open-source analytics frameworks such as Apache Spark, Apache Hive, and Presto. We chose Presto as our system’s SQL engine because of its scalability, high performance, and fast query processing. These properties make it a good fit for many of our teams.
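For interactive access to the Presto cluster from Python, something like the presto-python-client can be used. This is a minimal sketch; the coordinator host, catalog, schema, and table names are assumptions rather than our actual setup.

```python
import prestodb

# Connect to the Presto coordinator running on the EMR cluster (hypothetical host).
conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.internal",
    port=8080,
    user="analyst",
    catalog="hive",     # tables registered in the Glue/Hive metastore
    schema="datalake",  # hypothetical schema
)

cur = conn.cursor()
cur.execute("SELECT dt, count(*) FROM events GROUP BY dt ORDER BY dt DESC LIMIT 7")
for row in cur.fetchall():
    print(row)
```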

RDS: Amazon RDS makes it easy to set up, operate, and scale relational database deployments in the cloud. With Amazon RDS, you can focus on application development instead of time-consuming database administration tasks such as backups, software patching, monitoring, scaling, and replication.

S3: Virtually unlimited scalability and high durability. We can seamlessly and non-disruptively grow storage from gigabytes to petabytes of content, paying only for what we use. S3 offers scalable performance, ease-of-use features, and native encryption and access control capabilities that are ideal for building our data lake. It also integrates with a broad portfolio of data processing tools.

Some Optimisation Techniques Used:

Partitioning: As mentioned earlier, we partition the data based on time. AWS Glue crawlers automatically identify partitions in your Amazon S3 data, so if you write the query correctly, you can skip scanning the thousands of partitions present in the data lake and point the engine to the exact partitions of interest.
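For example, assuming a time-partitioned table with a hypothetical dt partition column, only the second query below benefits from partition pruning:

```python
# Without a partition filter, the engine scans every partition of the table.
full_scan = "SELECT count(*) FROM events"

# Filtering on the (hypothetical) dt partition column lets Athena/Presto
# prune the scan down to a single day's worth of files in S3.
pruned_scan = "SELECT count(*) FROM events WHERE dt = '2022-03-01'"
```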

ORC Conversion: Conversion to ORC reduced the data size by over 95% and the query time on the Presto cluster by over 75%. Using ORC files improves performance when Athena/Presto reads, writes, and processes data.
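Here is a minimal PySpark sketch of such a conversion; the S3 paths and the partition column are hypothetical, and our actual conversions run as scheduled Glue scripts rather than this exact job.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-to-orc").getOrCreate()

# Read raw JSON events from the data lake and rewrite them as
# time-partitioned ORC, which is far smaller and faster to scan.
events = spark.read.json("s3://example-datalake/raw/events/")  # hypothetical path
(
    events.write.format("orc")
    .partitionBy("dt")                                          # time-based partition column
    .mode("overwrite")
    .save("s3://example-datalake/curated/events_orc/")          # hypothetical path
)
```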

Data Anonymisation: This is a type of information sanitisation whose intent is privacy protection. It is the process of either encrypting or removing personally identifiable information (PII) from data sets, so that the people whom the data describes remain anonymous.
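A toy sketch of the idea, with hypothetical field names: direct identifiers are dropped entirely, while quasi-identifiers are replaced with salted hashes so records remain joinable without exposing the person.

```python
import hashlib

PII_DROP = {"name", "phone", "address"}  # direct identifiers: remove entirely
PII_HASH = {"patient_id"}                # quasi-identifiers: keep a stable pseudonym
SALT = "rotate-me-regularly"             # placeholder; a real salt lives in a secret store

def anonymise(event: dict) -> dict:
    """Return a copy of the event with PII removed or pseudonymised."""
    clean = {}
    for key, value in event.items():
        if key in PII_DROP:
            continue
        if key in PII_HASH:
            clean[key] = hashlib.sha256((SALT + str(value)).encode()).hexdigest()
        else:
            clean[key] = value
    return clean

print(anonymise({"patient_id": "p-1", "name": "A. Patient", "speciality": "cardiology"}))
```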

Data Reduction: We compress files larger than 128 MB. We saved hundreds of TBs of S3 storage by removing old object versions, cleaning up incomplete multipart uploads, and keeping global (inter-region) replication off, which also reduced international data transfer costs. For all historical analyses of data more than three months old we rely on the Presto engine, which keeps the RDS data size small and provides faster query results for daily dashboards. It helps reduce RDS costs as well.
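Part of this housekeeping can be expressed as an S3 lifecycle configuration. The sketch below uses boto3; the bucket name and retention windows are assumptions, not our exact policy.

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-datalake",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {   # Reclaim space held by uploads that never completed.
                "ID": "abort-stale-multipart-uploads",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            },
            {   # Expire superseded object versions after a grace period.
                "ID": "expire-old-noncurrent-versions",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
            },
        ]
    },
)
```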

Future Goals:

MFine’s busy 2021 included 3 million users, with more than 45% coming from non-metro cities. As we grow, big data will play a crucial role in analysing aggregated business trends and in investigating where our services can run more smoothly.

Over the past few years we have worked hard to build infrastructure that scales to our operational needs, and we look forward to further optimising it and making MFine’s platform services run better than ever.

We have identified a few promising areas that can help us achieve this:

  • On-the-fly data enrichment for real-time data warehousing capabilities.
  • Building an anomaly detection engine and improving data quality.
  • Achieving lower latency between streaming and stationary data.
  • Adopting, as the business grows, a more mutable framework that provides easy upserts.
  • Self-servicing data pipelines via configuration.
  • Developing tools for data discoverability, usage visibility, and change management.

We will have follow-up blog posts on these topics in the future. Please stay tuned!
