Data Engineering Pipeline — Microsoft Azure Managed Services

Ravinder Singh Sengar
6 min read · Aug 20, 2021


Over the last few years, data has become an important business asset, as many organizations collect huge amounts of data from the various channels they have opened to interact with customers. Now, with all that data being collected in one form or another, every organization is looking to generate value and revenue from it.
We will discuss a few of the artifacts and designs that can help build scalable, efficient data pipelines and analytics systems.

As a business case, let’s take the example of an insurance firm that collects various types of data from sources such as:

  1. Detailed user information captured at registration.
  2. Claim details captured when a claim is filed.
  3. Information from social media, analytics providers and government data.
  4. The demography from which the claim was filed and the reason for the claim.
  5. Information about the current state of the user against whom the claim was filed.
  6. Medical history, income category, financial statements, credit score, travel habits and eating habits of the user against whom the claim was filed (these come from the different types of insurance the user has taken, such as mortgage, car, travel, health, life, property, shipment, etc.).
  7. Agent–customer interactions.
  8. Customer and call-center/support-center interactions.

By collecting this information, the insurance company can use the data to derive value for purposes such as:

  1. Fraud detection and risk mitigation.
  2. Targeted advertisement and marketing for new offers.
  3. Product recommendations based on the user’s activity, habits, income category and medical history.
  4. Customer categorization based on behavior, habits, financial assets, medical history or personal information.
  5. Underwriting and price correction or optimization for a specific group of policies or users.
  6. Identifying new markets and their specific insurance needs.
  7. Call-center and customer-support optimization.

Before jumping into the high-level design, let’s understand what Data Engineering, Data Pipelines, Data Science and Machine Learning are, and where they fit in the overall system.

  1. Data Science: You have a huge data store where you have collected data from different sources in different formats. The data might be messy, incorrect and, to some extent, unrelated. Data science is about understanding this data, creating meaningful data models that you can analyze to pull value out of it, and producing smaller, meaningful subsets. There are various BI and DS tools to understand and visualize this data.
  2. Machine Learning and AI: Machine learning is the ability to continuously learn and improve query results with better correlation and prediction. We keep enhancing the ML models and ingesting them into pipelines to get more accurate results; AI is the broader discipline through which these ML models are improved. All public cloud platforms have their own set of tools and services categorized under MLaaS (Machine Learning as a Service), with support for frameworks and services such as TensorFlow, PyTorch, Cognitive Services, scikit-learn, Keras, etc.
  3. Data Engineering: For the ML models and BI tools mentioned above to use this data, some set of components has to collect the data from different sources, ingest it into a data lake, and move it from the data lake into various transformation and map-reduce systems. This is what we call data engineering.
  4. Data Pipelines: Data engineering is achieved by creating data pipelines with various stages and input/output sources, depending on the requirement.
    A data pipeline has different stages, and there are different mechanisms by which these stages are designed. Component selection is crucial while designing the pipelines, so that the pipeline is scalable, efficient and as cost-effective as possible. We are following a cloud-first approach and going with Azure managed services to keep our pipelines scalable and to avoid unnecessary maintenance overhead. This might not hold in every case, though, as cost is also an important factor and can increase depending on the size of the data we are storing and the type and frequency of the queries we are running against the system.

Data Pipeline

A data pipeline, as we have already discussed, is a combination of different stages that move data from multiple sources to a destination where, at the last stage, BI can be applied for visualization. Different routes can be prepared, skipping stages depending on the nature and usage of the data.

Let’s understand what’s happening in each stage:

  1. Collection and Ingestion: Data is ingested into the data lake or an intermediate DB via various devices and sources, such as smartphones (app), the web (internet website), events (IoT or integrated services in an EDA), an API gateway (for third parties), application logs and batch file uploads. Data ingestion can be configured to process either in real time (events, API requests and logs) or in batches (file uploads). A minimal ingestion sketch is shown after this list.
  2. Data Lake and Intermediary Storage: The data ingested in stage 1 will either be stored in different databases in different formats and then pushed into the pipeline, or be aggregated into a single data lake such as ADLS and then pushed into the map-reduce pipeline via Azure Data Factory (see the ADLS upload sketch after this list).
    Also, some data can be pushed for compute while other data can be used directly for analytics in stage 3, depending on the nature of the data and the business requirement (events, for example).
  3. Compute, Enrich and Transform: A Databricks notebook based on Apache Spark is used to clean, transform and analyze the streaming data, and to combine it with structured data from operational databases or data warehouses. Data can move back and forth between different map-reduce and analytics services to derive deeper insights; a PySpark transformation sketch follows this list.
  4. Data Warehouse Storage: The enriched and transformed data is stored in either a SQL DB (e.g., ClickHouse) or Cosmos DB (in document, graph or columnar format) to run BI queries and provide faster results. To achieve high throughput, choosing the right database for your needs is important, considering factors such as how much data a query will return, what kind of relationships the data holds, the expected TPS and more. A Cosmos DB write sketch is included after this list.
  5. Consumption or Visualization: Once the data is in the warehouse, you can use it to get insights and make predictions using machine learning models, build analytics dashboards with Power BI or Tableau, and use Azure Analysis Services to serve this data to thousands of users. Users can build self-service reports based on the ML models that were ingested into the data lake to get more customized reporting.
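
As a minimal sketch of the real-time ingestion path in stage 1, the snippet below publishes a claim event to Azure Event Hubs using the azure-eventhub Python SDK. The event hub name, connection string and payload fields are illustrative assumptions, not part of the original design.

```python
import json
from azure.eventhub import EventHubProducerClient, EventData

# Assumed connection details -- replace with your own Event Hubs namespace.
CONNECTION_STR = "Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=...;SharedAccessKey=..."
EVENTHUB_NAME = "claims-events"  # hypothetical event hub for claim submissions

def publish_claim_event(claim: dict) -> None:
    """Send a single claim event to Event Hubs for real-time ingestion."""
    producer = EventHubProducerClient.from_connection_string(
        conn_str=CONNECTION_STR, eventhub_name=EVENTHUB_NAME
    )
    with producer:
        batch = producer.create_batch()
        batch.add(EventData(json.dumps(claim)))
        producer.send_batch(batch)

publish_claim_event({"claimId": "C-1001", "policyId": "P-42", "amount": 1800.0})
```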
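For the batch path in stage 2, a file can be landed in ADLS Gen2 with the azure-storage-file-datalake SDK, after which Data Factory can pick it up for the downstream pipeline. The storage account, filesystem and paths below are assumptions for illustration.

```python
from azure.storage.filedatalake import DataLakeServiceClient

# Assumed storage account and filesystem (container) names.
ACCOUNT_URL = "https://<storageaccount>.dfs.core.windows.net"
FILESYSTEM = "raw"  # hypothetical landing zone for batch uploads

def upload_batch_file(local_path: str, remote_path: str, credential: str) -> None:
    """Upload a local batch file (e.g., a CSV of claims) into the data lake."""
    service = DataLakeServiceClient(account_url=ACCOUNT_URL, credential=credential)
    file_system = service.get_file_system_client(FILESYSTEM)
    file_client = file_system.get_file_client(remote_path)
    with open(local_path, "rb") as data:
        file_client.upload_data(data, overwrite=True)

# Example: upload_batch_file("claims_2021-08-20.csv", "claims/2021/08/20/claims.csv", credential="<account-key>")
```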
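The compute stage (stage 3) can be sketched as a small PySpark job of the kind you would run in a Databricks notebook: read the raw claim records from the lake, clean them, and enrich them with customer data from an operational extract. Paths, column names and the join key are assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("claims-enrichment").getOrCreate()

# Raw claims landed in the data lake (hypothetical path and schema).
claims = spark.read.json("abfss://raw@<storageaccount>.dfs.core.windows.net/claims/2021/08/")

# Customer reference data exported from an operational database (assumed Parquet extract).
customers = spark.read.parquet("abfss://curated@<storageaccount>.dfs.core.windows.net/customers/")

# Clean: drop malformed rows and normalize the claim amount.
cleaned = (
    claims
    .dropna(subset=["claimId", "policyId", "amount"])
    .withColumn("amount", F.col("amount").cast("double"))
)

# Enrich: join claims with customer attributes on the policy identifier.
enriched = cleaned.join(customers, on="policyId", how="left")

# Write the enriched dataset back to the lake for the warehouse-loading step.
enriched.write.mode("overwrite").parquet(
    "abfss://enriched@<storageaccount>.dfs.core.windows.net/claims_enriched/"
)
```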
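For the warehouse/serving stage (stage 4), here is a sketch of writing enriched records into Cosmos DB with the azure-cosmos SDK; the database, container and item shape are assumptions.

```python
from azure.cosmos import CosmosClient

# Assumed Cosmos DB account, database and container names.
client = CosmosClient(url="https://<cosmos-account>.documents.azure.com:443/", credential="<account-key>")
database = client.get_database_client("insurance")
container = database.get_container_client("enriched_claims")

# Upsert an enriched claim document so BI queries can read it with low latency.
container.upsert_item({
    "id": "C-1001",                 # Cosmos DB requires a string 'id'
    "policyId": "P-42",
    "amount": 1800.0,
    "customerSegment": "standard",  # hypothetical enrichment attribute
    "riskScore": 0.12,              # hypothetical output of the transform stage
})
```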

Interestingly, just like the Google AI Platform, Azure also provides one unified platform, Azure Synapse, with all the components (pipeline, data lake, analytics, SQL pool, Spark map-reduce jobs, etc.) integrated; it can be used to build the complete system from ingestion to consumption with high throughput, efficiency and scalability. It is also the platform Azure recommends for analytics. You can use Azure Synapse Link to replicate data into Synapse Analytics from transactional and non-transactional data stores, either in the cloud or on-prem. Of course, cost is one of the major factors you have to look at before going for Azure Synapse, although it can run in an “on-demand serverless” model (which allows it to scale up or down so that you pay only for what you need when you need it) or on pre-provisioned resources, whichever is better for your budget and use case. A rough sketch of querying the lake through the serverless model follows.
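As a rough illustration of the on-demand serverless model, a Synapse serverless SQL pool can query Parquet files directly in the data lake with OPENROWSET, billed by the data the query processes. The endpoint, credentials and storage path below are assumptions; the query is submitted here via pyodbc.

```python
import pyodbc

# Assumed Synapse serverless SQL endpoint and credentials.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<workspace>-ondemand.sql.azuresynapse.net;"
    "DATABASE=master;UID=<user>;PWD=<password>"
)

# Query the enriched Parquet files in the lake directly, without loading them first.
query = """
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://<storageaccount>.dfs.core.windows.net/enriched/claims_enriched/*.parquet',
    FORMAT = 'PARQUET'
) AS claims
"""

cursor = conn.cursor()
cursor.execute(query)
for row in cursor.fetchall():
    print(row)
cursor.close()
conn.close()
```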

Now that we have understood what a pipeline is, what the different stages of a pipeline are, and the significance and components involved in each stage, let’s look at the high-level design of how data flows from one component to another and how these components are integrated with each other.

Microsoft Azure Data Analytics Design
