Improving Analytics Engineering at mPharma — PART 1

Samuel Baafi Boakye
mPharma Product & Tech Blog
7 min read · May 20, 2021


Photo Courtesy of Adam Nowakowski

If you have ever worked on a data science-related project, you might be familiar with the process of data cleaning and feature engineering. Cleaning usually involves removing null values, renaming fields, correcting spelling mistakes, and dealing with missing data, while feature engineering usually involves generating new fields for the dataset, merging the dataset with other datasets, and generating even more fields. Together, data cleaning and feature engineering can be summed up into one process called analytics engineering.
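To make that concrete, here is a minimal SQL sketch, using hypothetical raw_sales and raw_products tables, of what a bit of cleaning and feature engineering can look like:

```sql
-- Cleaning: handle nulls, rename a badly named field, drop rows with missing keys.
-- Feature engineering: derive a gross_margin field by joining in product costs.
SELECT
    s.sale_id,
    COALESCE(s.qty, 0)        AS quantity,      -- treat missing quantities as zero
    s.unit_prce               AS unit_price,    -- rename a misspelled source column
    p.product_name,
    (s.unit_prce - p.unit_cost) * COALESCE(s.qty, 0) AS gross_margin  -- new engineered field
FROM raw_sales s
JOIN raw_products p ON p.product_id = s.product_id
WHERE s.sale_id IS NOT NULL;                    -- drop rows with a missing key
```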

In a data team, analytics engineering can be defined as the work that bridges the gap between the data engineer and the data analyst.

In a modern data team, you would probably have an analytics engineer to “provide clean data sets to end users, modelling data in a way that empowers end users to answer their own questions”. But traditional data teams do not have an analytics engineer. A traditional data team consists of data engineers and data analysts.

This article covers what traditional data teams lack and how to bridge the gap between the data engineer and the data analyst.

The Role of a Data Engineer

A data engineer’s sole responsibility is to move data from one point to another. The engineer ensures that data is transported in the safest way possible from point A to point B and disseminated to the people who need it in a timely manner.

Ever wondered what happens when the data gets to its destination? This is where the data analyst plays a role.

The Role of a Data Analyst

A data analyst generates business insights from data provided by the data engineer. Once the analyst receives the data, they need to understand it, transform it (if necessary), and extract insights from it.

Hence, in a traditional data team, the workflow is: the data engineer moves data from Point A to Point B, and the data analyst uses the data at Point B to generate insights.

With the workflow of a traditional data team in mind, alongside the definition of analytics engineering, you might be wondering whether a gap really exists between the data engineer and the data analyst. The crux of the matter is that both roles can co-exist, but the way they co-exist in a traditional team leads to some serious implications.

Implications of the Traditional Data Team’s Workflow

A traditional data team’s workflow suggests that when the data engineer moves data from its original source (Point A) to a destination (Point B), his job is done. But why is this a problem?

  • Data in its original source is in its rawest form — when data is moved through a pipeline to a data warehouse or data lake, it is still pretty much in its raw form although probably not as raw as it was in the source. This is problematic for the data analyst because the analyst will have to generate insight from the raw data. To illustrate, this could be likened to someone inviting you to a buffet, but you are served uncooked food and asked to use your discretion to prepare your own food. This leads us to the next problem.
  • Since the analyst has been served uncooked food, the analyst has to take the pains and effort to prepare their own food, which is not what the role suggests. Because of the structure of the traditional data team, you will find analysts writing long lines of SQL just to create materialized views (MVs) or reports for a particular department in the business, as illustrated in the sketch below. Although the analysts do their best, the MVs they create end up acting as dependencies for other MVs and reports, and the whole process of preparing their own food could have been avoided. How so?
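To illustrate (with made-up table and field names), this is a sketch of the kind of materialized view an analyst ends up hand-writing when only raw data is available; every cleaning rule has to live inside the report query itself:

```sql
-- Hypothetical example: a department report built directly on raw data,
-- so the cleaning logic is repeated inside every view that needs it.
CREATE MATERIALIZED VIEW finance_monthly_revenue AS
SELECT
    DATE_TRUNC('month', COALESCE(o.created_at, o.updated_at)) AS revenue_month,
    UPPER(TRIM(o.facility_name))                               AS facility,       -- source casing is inconsistent
    SUM(COALESCE(o.total_amount, 0))                           AS total_revenue
FROM raw_orders o
WHERE o.order_id IS NOT NULL                                   -- drop malformed rows
  AND LOWER(o.status) <> 'cancelled'                           -- status values are not standardized
GROUP BY 1, 2;
```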

The Gap

Clearly, a gap exists between the role of the data engineer and that of the data analyst in traditional data teams. The gap exists because neither the data engineer nor the data analyst is a cook. The former provides raw materials like vegetables, a bag of rice, oil, and the like, while the latter needs finished products like sauce or stew, cooked rice, or cooked meat to taste, make sense of, and combine into dishes that suit others.

The gap is the cook, which is rightly named analytics engineering.

With analytics engineering, when the data gets to Point B, it undergoes a process in which it is cleaned and transformed to prepare it for the data analyst. Analytics engineering involves a number of processes:

  • Raw data is transformed into meaningful data. Instead of the transformation happening in the reports given to the business, it is done at this stage, which makes it easier for the analyst to generate insights and create reports later on (a sketch of this follows the list).
  • Raw data is decoupled into various data marts to serve various business functions.
  • Raw data and processed data are properly tested for discrepancies based on predefined metrics or expectations at regular time intervals.
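As a minimal sketch of the first two steps above (again with made-up names): the cleaning is done once in a staging view, and a department-specific data mart is then built on the cleaned data rather than on the raw table:

```sql
-- Step 1: clean the raw data once, in a staging view.
CREATE VIEW stg_orders AS
SELECT
    order_id,
    UPPER(TRIM(facility_name))       AS facility,
    LOWER(status)                    AS status,
    COALESCE(total_amount, 0)        AS total_amount,
    COALESCE(created_at, updated_at) AS ordered_at
FROM raw_orders
WHERE order_id IS NOT NULL;

-- Step 2: build a finance-specific mart on the cleaned view, not on the
-- raw table, so the cleaning logic is written exactly once.
CREATE MATERIALIZED VIEW mart_finance_monthly_revenue AS
SELECT
    DATE_TRUNC('month', ordered_at) AS revenue_month,
    facility,
    SUM(total_amount)               AS total_revenue
FROM stg_orders
WHERE status <> 'cancelled'
GROUP BY 1, 2;
```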

At mPharma, the data team’s structure sits somewhere between that of a traditional data team and a modern data team. This means we do a little bit of analytics engineering, but the function is not properly defined. Unlike a modern data team, which would have a dedicated analytics engineer, we plan to have every member of the data team function as an analytics engineer as well. This means data engineers and data analysts are all going to learn to cook.

There are a number of ways to approach Analytics Engineering. I would like to summarize them into two main ways — the hard way and the easy way.

The Hard Way

Currently, we approach analytics engineering the hard way.

The hard way is when the data analyst combines a set of partially automated tools, each handling a particular part of the analytics engineering process.

At mPharma, we have a platform to automate our scripts for analytics engineering. We version control our scripts using GitLab and run tests separately. We manually document our scripts and database tables using GitLab wikis. The point I am making here is that each part of the analytics engineering workflow is decoupled from the others, which makes the workflow difficult to maintain and prevents analytics engineering from scaling, for the following reasons:

  • Tests are only run when data is being cleaned and transformed for the first time. This means that in subsequent runs of our custom scripts, there is no testing, which leads to a series of problems. What if bad data has entered a materialized view? What if the row count of a materialized view has dropped when it should be increasing? What if the refresh rate of a particular table is off? What if vital metrics are off? We would never know unless we manually checked or ran tests. The solution is to run automated tests as part of the analytics engineering workflow (a sketch of such a test follows this list).
  • Another issue with the hard way is that it is not easy to create dynamic and custom queries with a templating engine. In our current workflow, each query is structured for a particular materialized view or report. If two scripts share a similar code structure, you cannot apply the DRY (Don't Repeat Yourself) principle; you have to deploy both scripts independently. The workaround is to create a materialized view from the shared part of the code and have both scripts use that materialized view.
  • It is also worth mentioning that there are libraries and packages that make analytics engineering a little easier. Libraries like Great Expectations can seriously improve our workflow, and there are packages tuned to specific data warehouses like Amazon Redshift that could significantly improve analytics engineering at mPharma. Unfortunately, the hard way makes it difficult to use some of these libraries and packages.
  • The final point is that when we are done creating scripts and tables, we have to manually create documentation for each of our tables or views. Would it not be nice to have this generated automatically and also embedded in the workflow?
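As a sketch of the automated checks mentioned in the first point of this list (names are hypothetical), a data test can be as simple as a query that returns the rows violating an expectation; an empty result means the check passes, and anything else fails the run:

```sql
-- Expectation: every row in the revenue mart has a facility and a
-- non-negative revenue figure. Any rows returned here are failures.
SELECT
    revenue_month,
    facility,
    total_revenue
FROM mart_finance_monthly_revenue
WHERE facility IS NULL
   OR total_revenue < 0;
```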

The Easy Way

The easy way is to manage the analytics engineering workflow with a tool like dbt.

Why do I call it the easy way? Well, it caters to all the problems mentioned in the hard way:

  • dbt Cloud gives you a beautiful UI to manage your SQL scripts, with version control integrated. You can create jobs to automate your scripts and your tests with the same tool, without having to decouple your workflow.
  • You can install packages in dbt, and fortunately for us there is a dbt package for Great Expectations and a package for Amazon Redshift. Since we use Redshift as our data warehouse, this is a huge plus.
  • You can create dynamic and custom queries, which makes it easy to apply the DRY principle and to use macros.
  • dbt gives you some simple test cases to apply to your data and allows you to create custom test cases as well.
  • When you create your materialized views or tables using dbt, you can generate documentation and get a graph-like view of how your tables are connected, which is great for auditing and debugging.
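As a rough sketch of what a dbt model looks like (the model and column names below are made up for illustration): a model is just a SELECT statement saved as a .sql file, and the {{ ref() }} function wires up dependencies, which is also what powers the lineage graph and documentation:

```sql
-- models/marts/finance/fct_monthly_revenue.sql (hypothetical model)
{{ config(materialized='table') }}

SELECT
    DATE_TRUNC('month', ordered_at) AS revenue_month,
    facility,
    SUM(total_amount)               AS total_revenue
FROM {{ ref('stg_orders') }}  -- dbt resolves this reference and records the dependency
WHERE status <> 'cancelled'
GROUP BY 1, 2
```

Generic tests such as not_null and unique, along with column descriptions, would live in a YAML file alongside the model, and dbt builds the documentation site and dependency graph from the same project.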

The beautiful thing about dbt is that your analytics engineering workflow happens in the same place with the same tool.

Conclusion

I hope I have given enough reasons why we want to improve our analytics engineering workflow and why we chose dbt to help us do it. Ultimately, I believe you cannot build an efficient and effective data team without considering the role analytics engineering plays. At mPharma, we always strive to be the best at what we do. We are a data-driven organization, leveraging the right tool for the right job to build a modern data team. And we chose dbt.

Look out for the next part of this series, where we will share what we have learned from using dbt.
