Data Analytics and Insights

Manas Pant
Pasarpolis Product + Tech
9 min read · Mar 17, 2020
[Figure: data pipeline]

As one of the prominent players in the fintech domain in South-east Asia, we are custodians of our wide-ranging customer base’s data. Its defining characteristics are its huge size, its rapid rate of growth, and its sheer variety, ranging from transactional data captured in JSON fields to pictures and videos.

Pasarpolis aims to stay a data-driven organization, one in which the movement of the data tells us what we ought to do and what we can avoid.

As the organization grows, we want to eliminate all barriers to getting useful data. Regardless of the type of role within Pasarpolis, tech or non-tech, the first natural step for an individual should be to easily and accurately understand the current state of affairs. Further, one should be able to study historic data, draw useful patterns from it, and use those patterns to create projections for the near future.

Goals

  1. Empower our Product Managers, Operations, Partners, and Providers by giving them access to the right data at the right time about how their products are used. This helps with: a) Claims payout/product performance, b) Claim servicing performance, c) New product development, and d) Product improvement insights
  2. Enable Product and Operations teams to do self-service Business Intelligence
  3. Innovate over existing products by reviewing and modeling risk
  4. Enhance the end-user experience by leveraging Machine Learning technology
  5. Scale operations by automating/speeding up the claim review and validation process

With the above objectives set, we looked at what was already being used within the organization. Our intent was -

  1. To continue to use what was working for us to solve existing problems
  2. Replace infrastructure and tooling that was not helping us solve current and upcoming problems. We split these requirements into two parts -
  • Short-term: Consider requirements that needed an immediate solution
  • Long-term: Park those requirements which are important but not urgent

Across every objective we pursue, three goals apply to the data itself -

  1. Authentication — Validated users get access to data. Each user has a G Suite account created when they join Pasarpolis. Users access the various data services available using Google Authentication.
  2. Authorization — Users get access only to the data that they are privy to. As the gatekeepers of the various data services, we have defined roles at the application level and allocate those roles to individuals to control access, e.g. restricting access to tables and fields that capture end-customers’ Personally Identifiable Information, such as their names, email IDs, telephone numbers, etc.
  3. Availability — Users get access to data when they want it, with minimal to no delay.

Considering the overarching goals listed above, here are our objectives.

Objectives

We realized 3 deliverables would form the backbone of data in Pasarpolis -

  1. A performant analytical database
  2. A platform for creating, scheduling, and managing data pipelines that Extract, Transform, and Load data
  3. A querying platform that allows users to build queries visually

Setup for Analytics

Database

Criteria -

  1. Well documented
  2. Proven performance
  3. Horizontally and vertically scalable
  4. Available as a managed service that requires minimal maintenance effort
  5. Ability to integrate with our existing set of services, all hosted on Google Cloud Platform, to ingest and export data
  6. Minimal restrictions when doing data modeling for our use cases
  7. Minimal cost considering all the above

We chose to go with BigQuery. It competes with Redshift and Snowflake. Here is a peek into the comparative interest over time as presented by Google Trends -

We might have considered Snowflake if it had been available as a service on GCP in our region. We did not want to move data between different Infrastructure-as-a-Service vendors or across regions, to avoid introducing latency and data movement costs. Thus, we decided to go ahead with the best option available within GCP for our use-cases. BigQuery is a proven data warehousing solution with a good amount of documentation available online. It is available as a managed service, so we do not have to worry about creating indexes, setting up replicas, or tweaking cluster configuration to keep our analytical jobs performant over time.

It is vertically scalable, meaning it executes queries faster if we throw more hardware at it; this is governed by the number of slots allocated to the project, and allocating more slots increases the rate of computation. It also scales horizontally through partitioning: a table can be partitioned based on ingestion time, a date/timestamp field, or an integer field.
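
For illustration, here is a minimal sketch of creating a day-partitioned table with the google-cloud-bigquery Python client; the project, dataset, table, and field names are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client(project="example-project")  # hypothetical project ID

schema = [
    bigquery.SchemaField("policy_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("premium", "NUMERIC"),
    bigquery.SchemaField("created_at", "TIMESTAMP"),
]

table = bigquery.Table("example-project.analytics.policies", schema=schema)

# Partition by day on created_at so queries that filter on date scan fewer bytes
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="created_at",
)

client.create_table(table)
```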

Like other Google services such as Cloud Storage, Pub/Sub, and Cloud SQL, BigQuery exposes a range of APIs that allow us to build interconnected data pipelines.
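
As a small, hedged example of that interconnectivity, the sketch below loads newline-delimited JSON exports from Cloud Storage into a BigQuery table using the Python client; the bucket, path, and table names are made up.

```python
from google.cloud import bigquery

client = bigquery.Client(project="example-project")  # hypothetical project ID

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    autodetect=True,  # let BigQuery infer the schema for this sketch
)

# Load exported JSON files from Cloud Storage into a raw analytics table
load_job = client.load_table_from_uri(
    "gs://example-bucket/exports/policies/*.json",
    "example-project.analytics.policies_raw",
    job_config=job_config,
)
load_job.result()  # block until the load job completes
```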

As far as data modeling is concerned, there are no strict rules. We keep the data denormalized to the extent that it serves 2 purposes (a schema sketch follows the list) -

  1. Simplicity for the downstream users who build reports
  2. Minimal intervention to keep the data sources relevant
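
As a hypothetical illustration of what that denormalization looks like, a policy table might carry its claims as a nested, repeated record instead of a separate joined table; all field names below are made up.

```python
from google.cloud import bigquery

# A denormalized policy table: each row embeds its claims as a repeated record,
# so report builders can query a single table without writing joins.
policy_schema = [
    bigquery.SchemaField("policy_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("partner", "STRING"),
    bigquery.SchemaField("premium", "NUMERIC"),
    bigquery.SchemaField("created_at", "TIMESTAMP"),
    bigquery.SchemaField(
        "claims",
        "RECORD",
        mode="REPEATED",
        fields=[
            bigquery.SchemaField("claim_id", "STRING"),
            bigquery.SchemaField("status", "STRING"),
            bigquery.SchemaField("payout_amount", "NUMERIC"),
        ],
    ),
]
```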

There are 2 types of costs associated with BigQuery — Storage and Compute. Storage cost is a flat rate per region based on how actively the data source is updated or read, and is determined at the table level. Compute cost is based on the size of the data processed by each query. This utilization-based cost structure is working well for us.
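
Because the compute cost tracks the bytes a query scans, it can be worth dry-running a query before executing it; here is a small sketch with the Python client and a hypothetical table.

```python
from google.cloud import bigquery

client = bigquery.Client(project="example-project")  # hypothetical project ID

sql = """
SELECT partner, COUNT(*) AS policies
FROM `example-project.analytics.policies`
WHERE DATE(created_at) >= '2020-02-01'
GROUP BY partner
"""

# A dry run validates the query and reports how many bytes it would scan,
# without actually executing it (and therefore without incurring compute cost).
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(sql, job_config=job_config)

print("This query would process {:.2f} GB".format(job.total_bytes_processed / 1e9))
```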

ETL Platform

Criteria -

  1. Ease of use when it comes to creating pipelines
  2. Visibility into scheduled jobs and pipelines — Historic runs and current progress
  3. Ability to integrate Python code for performing transformations
  4. Connectivity with our existing systems

We chose to go with Cloud Composer. It is a Google-managed workflow orchestration service built on top of Apache Airflow.

We had evaluated Google Dataflow, Stitch, and Fivetran, all of them managed services that integrate with Google Cloud Platform. We quickly realized that Dataflow can be used in conjunction with Composer but not as a standalone product to solve all our data pipelining needs. Stitch and Fivetran looked promising, but we did not go ahead with either of them for two reasons: first, the potentially high cost, and second, comparatively less chatter about them on various forums versus Airflow.

Apache Airflow is an open-source project with a huge community behind it. Numerous plugins are available that allow interconnectivity between a wide range of services, both open-source and managed, across different cloud platforms.

We are a bunch of Python developers, so the concept of a Directed Acyclic Graph (DAG) as the representation of a data pipeline made sense to us. We chain Operators to form a graph, and each operator performs a unique task in the pipeline. Information can be exchanged between operators, and across graphs, which keeps our code both clean and flexible. DAG execution is visible within the admin panel: Airflow’s internal database stores information about all historic runs, and that information can be accessed anytime through the GUI, which also presents logs, schedules, and flow visualizations that make pipeline monitoring easy.
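
As a rough sketch of how this looks in code (the DAG name, schedule, and tasks below are hypothetical, not one of our production pipelines):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def extract(**context):
    # Placeholder: pull raw rows from a source system
    return [{"policy_id": "P-001", "premium": 120000}]


def transform(**context):
    # XCom lets a downstream task read what an upstream task returned
    rows = context["ti"].xcom_pull(task_ids="extract")
    return [{**row, "premium_in_idr": row["premium"]} for row in rows]


default_args = {
    "owner": "data-engineering",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="daily_policy_summary",
    default_args=default_args,
    start_date=datetime(2020, 1, 1),
    schedule_interval="0 2 * * *",  # run daily at 02:00
    catchup=False,
) as dag:
    extract_task = PythonOperator(
        task_id="extract", python_callable=extract, provide_context=True
    )
    transform_task = PythonOperator(
        task_id="transform", python_callable=transform, provide_context=True
    )

    extract_task >> transform_task  # chain operators into a directed acyclic graph
```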

Given that it is widely used and open-source, we found many of the hooks and operators applicable to us ready-made, which made it straightforward to establish connectivity to Google services like Cloud SQL, Cloud Storage, and BigQuery.
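
For instance (the SQL and table names here are illustrative, not our actual schema), the ready-made BigQuery operator lets a task materialize a summary table without custom glue code:

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

with DAG(
    dag_id="claims_daily_summary",  # hypothetical pipeline name
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Run a standard-SQL query and write the result into a summary table
    summarize_claims = BigQueryOperator(
        task_id="summarize_claims",
        sql="""
            SELECT partner, DATE(created_at) AS day, COUNT(*) AS claims
            FROM `example-project.analytics.claims_raw`
            GROUP BY partner, day
        """,
        destination_dataset_table="example-project.analytics.claims_daily_summary",
        write_disposition="WRITE_TRUNCATE",
        use_legacy_sql=False,
    )
```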

Setup — We run a 3-node Cloud Composer cluster. The virtual machines are of the n1-standard-2 type.

Reporting

Criteria -

  1. Ability for regular users to build queries visually without having to write SQL
  2. Draw visualizations easily
  3. Export data into Excel or CSV files
  4. Performant
  5. Connect to any database
  6. Ability to write SQL for advanced users
  7. Build dashboards
  8. Share insights between users

Metabase

We use Metabase, an open-source reporting platform. Users log into Metabase using Google Single Sign-On, and the roles assigned to each individual define their access limits. At its core, it is an interface for firing SQL queries against any database connected to it.

Pros -

  1. It is free
  2. We can connect to any database it has a connector for, which covers most of them. It was extremely easy for us to use databases and tables from both MySQL and BigQuery.
  3. The visual query builder is quite useful, allowing a non-technical user to perform actions like — a) Choosing a subset of columns to review, b) Performing joins between tables, c) Doing aggregations on numerical fields
  4. Easy to draw simple visualizations like bar/line/pie charts
  5. Ability to save analyses as questions, which can be shared between users
  6. Export to Excel/CSV allows a user to download files for local analysis
  7. Reports can be scheduled as ‘pulses’, which can be configured to send emails or Slack notifications with or without attachments

Known limitations -

  1. There is a learning curve associated with the exposed data models/data sources, so non-technical users often have to reach out to the Analytics team to get their requirements published as ‘questions’
  2. Exporting data to Excel or CSV is unreliable when the database is BigQuery and the result set exceeds 16,000 rows
  3. A lot of our data, unless explicitly modeled, is enclosed within JSON fields. We couldn’t find any mechanism to extract that information using the GUI elements (see the sketch after this list).
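
Where the GUI falls short, one workaround is to fall back to SQL, either in Metabase’s native query editor or in a script; below is a minimal sketch against BigQuery with hypothetical table and field names, where payload is assumed to hold a JSON string.

```python
from google.cloud import bigquery

client = bigquery.Client(project="example-project")  # hypothetical project ID

# JSON_EXTRACT_SCALAR pulls a single value out of a JSON string column
sql = """
SELECT
  policy_id,
  JSON_EXTRACT_SCALAR(payload, '$.vehicle.plate_number') AS plate_number
FROM `example-project.analytics.policies_raw`
WHERE DATE(created_at) = '2020-03-01'
"""

for row in client.query(sql).result():
    print(row.policy_id, row.plate_number)
```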

Setup — We have a dedicated VM with 4 vCPUs and 16 GB RAM. Metabase uses our Cloud SQL (MySQL) instance to host its internal metadata database. It works well for us, with one exception: at the beginning of each month, operational users want to create Excel files with the previous month’s data. When many such requests are fired in parallel, the computation for those files happens on the server, and it starts crawling as the CPU hits 100% utilization across all cores.

Tableau

This is something we are exploring. We will share our detailed findings as part of a separate blog post.

Our goal is to become a data-driven organization. We want to aim down the sights instead of shooting from the hip, especially as we are expanding.

This end-to-end system helps meet the first 2 goals from the beginning of this article -

  1. Empower our Product Managers, Operations, Partners, and Providers by giving them access to the right data at the right time
  2. Enable Product and Operations teams to do self-service Business Intelligence

They also form the backbone for the remaining 3 goals -

  1. Innovate over existing products by reviewing and modeling risk
  2. Enhance the end-user experience by leveraging Machine Learning technology
  3. Scale operations by automating the claim review and validation process

Credits

We are a 6-member team. Each of us wears one of these hats at any point in time, depending on where an objective sits within the urgency-importance matrix.

  1. Data Engineer — a) Design, track and maintain ETLs, b) Design apt table schemas to fulfill downstream analysis requirements
  2. Data Analyst — a) Perform product performance analysis, b) Draw visualization reports and dashboards, c) Communicate insights to product stakeholders
  3. Data Scientist — a) Build, deploy and maintain computer vision models to process claims faster, b) Do exploratory data analysis to come up with strategies for new product development, c) Build, deploy and maintain risk analysis and forecasting models for the existing products
  4. App Developer — a) Build a Django REST API to share data with another microservice, b) Implement an app that ingests voluminous data to create and manage summarized transactional data

Shout out to my team-mates who put these building blocks in place -

  1. Ashish Nayan
  2. Pramod Bisht
  3. Ayush Dosajh
  4. Aashish Gupta
  5. Shourya Sengupta
