Data Analytics Stack at Docon

Navneet Gupta
Docon
Mar 22, 2019 · 10 min read

Setting up Analytics for an early- to mid-stage startup

Image via [http://www.iamwire.com]

Background

Having the right context at the start is always good. I joined Docon Technologies in Aug ’18 with the intent to set up and grow the Data Science function at the Organisation. Docon is a health-tech startup aimed at providing technology solutions to improve the Doctor-Patient experience in an Out-Patient Department (OPD). One of our major product offerings is an EMR Product (an iPad-based App) that doctors can use to write Prescriptions in a digital format (no more relying on pen and paper). This not only gives the patient better readability and digital record keeping for their health data, but also gives the doctor access to important health information like Vaccination/Medical History that can prove quite helpful in the longer run.

Throughout the article I’ve tried to use as little jargon as possible and have mostly skipped the technical details. This has been done to ensure that Startup Founders/Leaders can also consume the information.

Importance of Data Analytics

With the recent boom in Cloud Computing (read: cheaper data storage) and billions of devices generating petabytes of data, using this data to make smart business decisions has become unavoidable for any Organisation. Making data-informed decisions brings competitive advantage, drives operational efficiencies and enables business optimisations. Much has already been written on this point; a simple Google search can help find an answer.

At Docon, our Leadership went through the same decision-making process around setting up the Data Science Team, in order to empower data-driven decision making across various teams.

Data Maturity & Literacy

To have data is not enough

While it is super-important to assess the maturity and data literacy of the people who would consume the data, a lot of companies tend to ignore this in haste at the very start. The reason this holds a place of significance is that almost every decision, right from hiring the right set of people to setting up the right set of tools to training the existing talent, really depends on how mature the Organisation is in terms of data literacy.

Common pitfalls could be aiming either too high or too low. For instance, if the Organisation is setting up the entire data stack for the very first time, then it is quite possible that technologies like Hadoop/Hive or Kafka are not the best option. If these happen to be the best fit, then either the Organisation is setting up Analytics too late or is directly shooting for the moon. Similarly, if the Organisation is a consumer-facing Internet company and already has more than 100k Monthly Active Users, then not having a Distributed Computing solution might be overly conservative. That being said, these are high-level guidelines and can have exceptions in a few cases, like overnight growth or a consumer base with a very high DAU to MAU ratio. The key point here is to have a solution or tech stack that fits in well with the current volume of data flow and factors in up to 10x business growth.

Ok, I get it, but where do I start?

Data Engineering is the ignition key

Image via HealthCatalyst (Image is representative only)

A lot of companies, especially start-ups, mess this up, thanks to a “super-execution” strategy at play. Leadership teams generally set up functions expecting results to start pouring in within the first 4 to 8 weeks, and Analytics is no different in this expectation. This sounds like a fair ask, but the ground reality is that in the absence of Data Engineering folks, Data Analysts generally struggle. The core competencies of Data Analysts do not equip them to build Data Warehousing solutions, maintain a Data Dictionary and check for Data Quality at each step.

At Docon, we were fortunate enough to set the right expectations at the very start and buy time to set up the Data Engineering stack first and then power business use cases using Data. We started out by putting a Data Warehouse in place, for which we chose Amazon Redshift. The logic for the decision was quite straightforward: our entire technology stack was on AWS and we did not want to juggle multiple Cloud Platforms. Within AWS as well, there are quite a few solutions like Amazon Athena and Redshift Spectrum, but Amazon Redshift fit our requirements really well in terms of functionality and prospective billing.

Data at Source

Common sources of data are one or more structured databases like MySQL, PostgreSQL, Microsoft SQL Server, Oracle Database etc; these can be categorised as RDBMS (Relational DataBase Management Systems). There are also unstructured data stores, which have really picked up worldwide recently; a few of them are MongoDB, Cassandra, DynamoDB and graph databases, and these are categorised as NoSQL databases.

The primary database at Docon, which stores all the data generated in the Doctor/Patient interaction, is MongoDB. Being an unstructured source, it does not have a defined schema. Other sources of data include Hubspot, which is used as a Sales CRM; Intercom, which is used for in-app chat with the user; Google Sheets, which various teams use as an input source for process management; JIRA, which is used for issue/bug reporting; Crashlytics, which is used for crash reporting; and Segment, which is used for collecting the clickstream data of users. Apart from these, there are a few other sources that aren’t used for any Analytics purposes yet but that we will use in the future.

Data Processing

Enter Data Transformation at work

Image via [https://hevodata.com]

We’ve already talked about the Data Sources and the destination Data Warehouse; the only missing piece is how to move the data between them. In order to understand how the data transfer works, it’s essential to understand the business use cases at a broad level. This again is a very common pitfall, wherein Data Engineering folks don’t see the importance of understanding the business requirements and start building the ETL (Extract-Transform-Load) framework right away. I feel that a high-level understanding of what is finally expected from the data at hand helps answer critical questions like the priority for data transfer, the refresh frequency and selective syncing where possible.
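To make this concrete, the sketch below shows one way such decisions might be captured as a simple per-source configuration before any pipeline code is written. The source names, frequencies and fields here are purely illustrative assumptions, not our actual settings.

```python
# A minimal sketch of a per-source sync configuration, capturing answers to
# "what to sync, how often, and with what priority" before any ETL code is
# written. All source names and values are illustrative only.
SYNC_CONFIG = {
    "mongodb_prescriptions": {
        "priority": 1,             # business-critical, load first
        "refresh_hours": 3,        # near-real-time is not required
        "selective_fields": ["doctor_id", "patient_id", "created_at", "rx_items"],
    },
    "hubspot_deals": {
        "priority": 2,
        "refresh_hours": 24,       # a daily snapshot is enough for sales reporting
        "selective_fields": None,  # sync everything the API returns
    },
    "jira_issues": {
        "priority": 3,
        "refresh_hours": 24,
        "selective_fields": ["key", "status", "created", "resolved"],
    },
}
```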

At Docon, one of our Data Sources is NoSQL based while the destination is a columnar, standard-SQL Data Warehouse, so the transformation from unstructured to structured happens while the data is in transit. To enable this we decided to use Hevo, which is an enterprise Data Integration tool. Using Hevo helped us gain speed of execution; the only thing we had to write was the transformation code, while the integration tool takes care of sync frequency, schema definition and re-queuing of events in case of failure. Apart from MongoDB, most of the other sources are transferred to the Data Warehouse using custom code written in Python 3.6. As a rule of thumb, we always prefer using APIs to access third-party data, with a sync frequency of once every 3 to 4 hours (and 24 hours in some cases) unless there is a relevant use case to transfer data more frequently. Needless to say, the data syncing is an “upsert” operation, which means that records updated at source are re-synced and new records are synced to the destination.
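As an illustration of the “upsert” operation mentioned above, here is a minimal sketch of the staging-table pattern commonly used with Amazon Redshift. It assumes psycopg2 for connectivity; the cluster endpoint, table and column names (crm_contacts, contact_id) are hypothetical, and the staging table is assumed to already contain the freshly pulled records.

```python
import psycopg2

# Minimal upsert sketch for Amazon Redshift using a staging table:
# delete the target rows that were updated at source, then insert the
# fresh copies (which also covers brand-new records). All connection
# details and table/column names are illustrative.
conn = psycopg2.connect(
    host="redshift-cluster.example.ap-south-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="etl_user", password="***",
)

UPSERT_SQL = """
-- crm_contacts_staging is assumed to hold the records pulled from the
-- source API since the last successful sync.
DELETE FROM crm_contacts
USING crm_contacts_staging
WHERE crm_contacts.contact_id = crm_contacts_staging.contact_id;

INSERT INTO crm_contacts
SELECT * FROM crm_contacts_staging;
"""

with conn.cursor() as cur:
    cur.execute(UPSERT_SQL)
conn.commit()
conn.close()
```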

Data Warehouse

Image via [https://www.sisense.com]

Once the data is present in the Data Warehouse, it may not be directly usable and may need more standardised formats and some data cleaning as well. This is generally where the Data Analysts and Data Engineering folks work closely to handle data processing. One common approach is to clean the data using custom SQL queries and then build “fact” and “dimension” tables from the base tables. To understand “fact” and “dimension” tables, imagine having to create a limited set of standardised tables that could power ~80% (just an indicative number) of the Analytics use cases. Fact tables generally contain transactional data, mostly transaction and product/service IDs along with the most commonly used and calculated attributes of the transaction. Dimension tables act like data stores that contain all the attributes of the product/service that can be offered in a transaction. The rationale for building these standard tables is that these data models serve as the source of truth for all kinds of data reporting and basic descriptive analytics. In the absence of such tables there can be repercussions like extended time in addressing data requests and mismatches in KPI values, which do not help in building trust with business stakeholders.

At Docon, we started using fact tables from the very first day and have kept them as our de facto way of doing Data Analysis ever since. The approach to which “fact” tables to build is, however, not so standard and is mostly an iterative process. We use standard SQL queries on Amazon Redshift to build the data models. The queries are then embedded into Python scripts, which can loosely be compared to stored procedures. The frequency for refreshing the data models is once in 24 hours; any use case that requires intra-day data is powered outside of the data models. The Python scripts are scheduled on an EC2 Linux machine, again in AWS.
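For illustration, a stripped-down version of such a refresh script might look like the sketch below: a standard SQL query embedded in Python and run once a day. The fact table name (fact_prescriptions), the base tables and the columns are hypothetical, not Docon’s actual schema.

```python
import psycopg2

# Sketch of a data-model refresh script: SQL embedded in Python, rebuilding
# a fact table from base tables once every 24 hours. All names are illustrative.
FACT_PRESCRIPTIONS_SQL = """
DROP TABLE IF EXISTS fact_prescriptions;

CREATE TABLE fact_prescriptions AS
SELECT
    p.prescription_id,
    p.doctor_id,
    p.patient_id,
    p.created_at::date  AS prescription_date,
    COUNT(pi.item_id)   AS medicine_count  -- a commonly used, pre-calculated attribute
FROM base_prescriptions p
LEFT JOIN base_prescription_items pi
       ON pi.prescription_id = p.prescription_id
GROUP BY 1, 2, 3, 4;
"""

def refresh_fact_tables():
    conn = psycopg2.connect(
        host="redshift-cluster.example.ap-south-1.redshift.amazonaws.com",
        port=5439, dbname="analytics", user="etl_user", password="***",
    )
    try:
        with conn.cursor() as cur:
            cur.execute(FACT_PRESCRIPTIONS_SQL)
        conn.commit()
    finally:
        conn.close()

if __name__ == "__main__":
    # Scheduled once every 24 hours via cron on the EC2 machine, e.g.:
    # 0 2 * * * /usr/bin/python3 /opt/etl/refresh_fact_tables.py
    refresh_fact_tables()
```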

Consumption of Data

Data Visualisation & BI Reporting

Now this is one of the most critical pieces when it comes to the success or failure of the entire Data Analytics effort. All that has been done so far mostly remains in the background, and the business stakeholders interact only with the Data Visualisation and/or Business Intelligence tool. This is another probable point of failure, wherein Analytics teams lack empathy for how usable the BI tool actually is. Change management for any Organisation, big or small, is difficult, and a lot has been written by domain experts on this over the years; one key thing that Data Science teams should pick up is to never overwhelm the user.

What this also means is to start with a very simple solution, build trust over time and always keep the overall experience in mind. This usually gets ignored when the Analyst comes up with a super-efficient Machine Learning solution that brings in 90%+ accuracy but fails to realise that the business decision maker doesn’t have the slightest idea of what ML is and what it can solve. From my personal experience, I would recommend Data Science teams level up the game quarter by quarter. If the consumers of your data are data literate, then and only then is it wise to put the industry’s best solutions in place. It is better to work with a non-optimised solution for a while than to have the business decision maker ignore Data Analytics completely.

At Docon, we took a conscious call to do just Descriptive Analytics for the first 6 months; we’re expecting to complete this phase in a few more months from now. For BI reporting we chose Metabase, which is an open-source Data Visualisation tool. There were a bunch of other tools in evaluation as well, like Chartio, Redash and Mode Analytics, but we went ahead with Metabase because it offered basic Visualisation capability along with querying on data and drag-and-drop functionality, which makes it highly self-serve. Apart from this, it also offers the ability to schedule alerts to be received over email or Slack. For clickstream data we use Amplitude, which offers good solutions for Product Event Analytics and has ready-to-use templates for quite a lot of use cases.

GIF via [https://aws.amazon.com/]

Ethics at Work

Data Science work is not done until Ethics gets its due. Ethics and Data Security should play a large role right from the start in any Data Science Team. As owners of data, we should understand that it is our responsibility to make sure that the data does not land in the wrong hands and that security gets its due importance irrespective of project execution timelines.

Since at Docon we deal with sensitive health data, we follow a few standards in terms of Data Security. Having said that, we have a long way to go to comply with global standards, but as of now we have made sure that the basic sanity checks are in place. Data through the entire transit from source to destination flows in an encrypted format, which means only people/systems with the decryption key can read it. The Personally Identifiable Information (PII) of patients is NOT AT ALL available in any form on the Data Warehouse, which means no one can read or access the patient’s name, contact number or address. This is super critical: firstly, for any kind of analysis we don’t have to know or contact the patient, and secondly, anything less would be seen as a breach of the trust that doctors and patients place in our services. All the data on the Warehouse flows in and out of the Mumbai region of AWS; this was deliberately done so as to comply with the draft Data Privacy Law that is expected to come into effect in a few months from now.
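To show what keeping PII out of the warehouse can look like in practice, here is a minimal sketch of a transformation step that drops direct identifiers and replaces the patient identifier with a one-way hash. This is an illustrative pattern under assumed field names, not necessarily the exact mechanism used at Docon.

```python
import hashlib

# Illustrative sketch: drop direct identifiers during the in-transit
# transformation and keep only an opaque, salted hash so records can still
# be joined for analysis. Field names and salt handling are hypothetical.
PII_FIELDS = {"patient_name", "contact_number", "address", "email"}
HASH_SALT = "load-from-a-secret-store-not-source-code"

def strip_pii(record: dict) -> dict:
    """Return a copy of the record that is safe to load into the Data Warehouse."""
    clean = {k: v for k, v in record.items() if k not in PII_FIELDS}
    # Replace the raw patient identifier with a one-way hash.
    raw_id = str(record.get("patient_id", ""))
    clean["patient_key"] = hashlib.sha256((HASH_SALT + raw_id).encode()).hexdigest()
    clean.pop("patient_id", None)
    return clean

# Example: the warehouse only ever sees the hashed key and non-PII attributes.
print(strip_pii({
    "patient_id": "P-1021",
    "patient_name": "REDACTED",
    "contact_number": "REDACTED",
    "age": 34,
    "city_tier": 2,
}))
```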

All said and done, Docon as an Organisation has a long way ahead, and we still have a large number of Data Science projects and use cases to solve, which will further empower data-driven decision making across all possible teams.

I really appreciate you reading the article this far. If you found it useful for yourself, a colleague or a friend, please feel free to share it and support with a few claps 👏. I would love to hear feedback on the article as well 😃.

PS: Anyone who wants to get in touch can reach me over LinkedIn.
