Common data engineering challenges and their solutions

Vlad Rișcuția
Jul 28, 2020

Last year — before COVID-19 put a stop to conferences — I attended the Strata Data Conference in San Francisco. The speakers came from across the globe, representing multiple companies and a variety of industries. But many of their sessions focused on a common set of problems, the same ones that we faced when we built our data platform. I was struck by the universal nature of these problems in the data science and data engineering domains, but fortunately they are increasingly easy to solve.

Not long ago, simply storing and processing data at scale were major challenges. Both have since been commoditized by the emergence of cloud providers, allowing teams to focus on higher-level problems such as how to best handle metadata management, how to integrate multiple data systems, how to implement DevOps practices, and how to track data quality. I cover each of these below.

Metadata management

As data volume grows, organizing it becomes a big challenge. There is a need for an additional layer of data about the data, or metadata. There are several important pieces of information that are not readily available in the data itself and that a big data platform must provide.

First, there are simply the descriptions of various datasets: What the various columns are, what the purpose of a table is, and so on. This metadata needs to be searchable, making it easy for users to identify and find relevant data in the system. This type of solution is known as a data dictionary.

Next is data lineage. Where did the data come from and how was it sourced? This has compliance implications: For example, if end users agree to share telemetry data for the purpose of improving the product, that data should not be used for other purposes. Also for compliance purposes, various datasets and columns have to be tagged as containing personally identifiable information or other sensitive information so the system can automatically lock down or purge this type of sensitive data when needed (for example, when a GDPR “right to be forgotten” request comes in).
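
To make this concrete, here is a minimal sketch (the names and fields are illustrative, not our actual schema) of the kind of record a data dictionary might keep for a dataset, covering descriptions, lineage, and PII tagging:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ColumnMetadata:
    name: str
    description: str
    contains_pii: bool = False  # flagged so the platform can lock down or purge it on request

@dataclass
class DatasetMetadata:
    name: str
    description: str             # searchable description of what the table contains
    source: str                  # lineage: where the data came from
    allowed_purposes: List[str]  # compliance: what the data may be used for
    columns: List[ColumnMetadata] = field(default_factory=list)

# A hypothetical telemetry table: only the user identifier is tagged as PII.
telemetry = DatasetMetadata(
    name="ProductTelemetry",
    description="Daily product usage events",
    source="Client telemetry pipeline",
    allowed_purposes=["product improvement"],
    columns=[
        ColumnMetadata("UserId", "Pseudonymous user identifier", contains_pii=True),
        ColumnMetadata("EventName", "Name of the usage event"),
    ],
)
```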

Organizing big data also requires some amount of information architecture. One aspect of this is controlled taxonomies: Clear definitions of what various business terms and data pieces mean, and the canonical queries associated with them. This ensures everyone working in the space shares the same understanding. For example, Azure Consumed Revenue, or ACR, is a key metric that my team tracks. ACR has a standard definition, produced by a canonical query, and we need to ensure everybody on the team uses that same definition and that same query anytime we talk about ACR. Solutions for tracking these types of definitions are known as data glossaries.
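
A glossary entry can be sketched the same way; the ACR definition and query below are purely illustrative and not the real canonical ones:

```python
from dataclasses import dataclass

@dataclass
class GlossaryEntry:
    term: str
    definition: str
    canonical_query: str  # the one query everyone should use for this metric

# Illustrative only: the real ACR definition and canonical query are more involved.
acr = GlossaryEntry(
    term="Azure Consumed Revenue (ACR)",
    definition="Revenue attributed to customers' consumption of Azure services.",
    canonical_query="SELECT SUM(ConsumedRevenue) FROM Billing.DailyUsage",
)
```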

The data layer consists of tables (and other datasets) and queries. The metadata layer consists of schema documenting the tables and a glossary of business terms linking to canonical queries.

Our team is working to evolve our in-house solution combining data dictionary and glossary capabilities. We will have more to share on this soon.

Integrating multiple data systems

There is no one-size-fits-all data fabric. Each storage solution has some workflows it was optimized for, and others that it might struggle with. It’s very unlikely that an entire platform will be running on one single storage solution, be it SQL, Azure Data Explorer (ADX), Databricks, or something else. Some workflows need massive scale (processing terabytes of data), some workflows need fast reads (serving a website), and some need interactive-query support (enabling exploratory analysis). And because a data platform often needs to integrate data from multiple sources — and provide data to multiple destinations — we often don’t get to decide: Teams that are upstream in the workflow expose data from solutions that make sense for their needs, while teams that are downstream want it in formats that make sense for their needs.

Standardizing on a single storage solution is often infeasible, so the next best thing to do is to standardize on the tooling used to move the data around and ensure that it is easy to operate; in other words, make it easy to create a new data movement pipeline and provide monitoring, alerting, and so on. Because data movement is a necessity, it must be as reliable as possible.
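
What standardized data movement tooling buys us can be sketched in a few lines; this is an illustration of the principle rather than of how ADF works, with hypothetical names, showing every pipeline getting the same retry, logging, and alerting path regardless of which fabrics it connects:

```python
import logging
from typing import Callable

logger = logging.getLogger("data_movement")

def run_pipeline(name: str, copy: Callable[[], None], alert: Callable[[str], None],
                 max_retries: int = 3) -> None:
    """Run one data movement step with uniform retry, logging, and alerting.

    `copy` encapsulates the fabric-specific work (SQL to ADLS, ADX to SQL, and so on);
    everything around it is the same for every pipeline.
    """
    for attempt in range(1, max_retries + 1):
        try:
            copy()
            logger.info("Pipeline %s succeeded on attempt %d", name, attempt)
            return
        except Exception as exc:
            logger.warning("Pipeline %s failed on attempt %d: %s", name, attempt, exc)
    alert(f"Pipeline {name} failed after {max_retries} attempts")

# Hypothetical usage: run_pipeline("sales_to_adls", copy=copy_sales_to_adls, alert=page_on_call)
```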

Our team uses Azure Data Factory (ADF) for orchestrating data movement at scale. I covered this in the “data movement” section of a previous article, “How we built self-serve data environment tools with Azure.” Azure Data Factory makes it easy to set up and orchestrate ETL pipelines. We use it for all our data ingress and egress, running hundreds of pipelines.

We are leveraging ADF’s integration with Azure DevOps, such that all our pipelines are tracked by source control, deployed automatically, and monitored using Azure Monitor.

DevOps

Another major work area is bringing engineering rigor to workflows supported by other disciplines such as data science and machine learning engineering. With a small team, it is relatively easy to create ad hoc reports and run ad hoc machine learning (ML), but this approach doesn't scale. Once production systems depend on the output, these workflows must run reliably.

This is a solved problem in the software engineering field, with source control, code reviews, continuous integration, and so on. But non-engineering disciplines are not accustomed to this type of workflow, so there is a need to educate and support these teams and to create similar DevOps workflows for them. Analytics and ML ultimately reduce to code (whether SQL queries, Python, R, and so on), so they should be handled just like production code.
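
As a simplified, hypothetical example of what treating analytics as production code can mean in practice, a canonical query can live in source control next to a test that continuous integration runs before anything is deployed (the table and column names here are made up):

```python
import sqlite3

# A query that would normally live in source control alongside the report it feeds.
ACR_QUERY = """
SELECT CustomerId, SUM(ConsumedRevenue) AS Acr
FROM DailyUsage
GROUP BY CustomerId
"""

def test_acr_query_returns_expected_columns():
    """CI check: the canonical query still produces the schema the report expects."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE DailyUsage (CustomerId TEXT, ConsumedRevenue REAL)")
    conn.execute("INSERT INTO DailyUsage VALUES ('c1', 10.0), ('c1', 5.0)")
    cursor = conn.execute(ACR_QUERY)
    columns = [description[0] for description in cursor.description]
    assert columns == ["CustomerId", "Acr"]
    assert cursor.fetchall() == [("c1", 15.0)]

# pytest would normally discover this; calling it directly keeps the sketch self-contained.
test_acr_query_returns_expected_columns()
```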

ADF pipelines deployed from Git orchestrate data movement and run ML code, also deployed from Git, on Azure Machine Learning. Data is distributed through Azure Data Lake Storage (ADLS). The system is monitored using Azure Monitor.

Our team supports these types of workflows using Azure DevOps, with pipelines that can deploy ML and analytics from Git to our production environment. Everything is deployed from source control using Azure Pipelines, and Azure Monitor alerts us if anything goes wrong. Bringing DevOps to data science is a big topic. I covered some of the aspects in previous articles, one pertaining to self-serve analytics, the other to MLOps.

Data quality

The quality of all analytics and machine learning outputs depends on the quality of the underlying data.

There are multiple aspects to data quality. One set of definitions is given by the article “Data done right: Six dimensions of data quality”:

  • Completeness: The dataset is not missing any required data.
  • Consistency: The data is consistent across multiple datasets.
  • Conformity: All data is in the right format, within the right value ranges, and so on.
  • Accuracy: The data accurately represents the domain being modeled.
  • Integrity: The data is valid across all relationships and datasets.
  • Timeliness: The data is available when expected and datasets are not delayed.

A reliable data platform can run various types of data quality tests on the managed datasets, both at scheduled times and during ingress. Issues must be reported, and the overall state of data quality must be made visible through a dashboard so stakeholders can easily see which datasets have problems and what the potential implications are.
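
As a rough illustration (not our production implementation), some of these dimensions translate directly into simple checks that can run on a schedule or at ingress; the sketch below expresses completeness and timeliness over a pandas DataFrame with made-up column names:

```python
from datetime import datetime, timedelta
from typing import List

import pandas as pd

def check_completeness(df: pd.DataFrame, required_columns: List[str]) -> bool:
    """Completeness: the required columns exist and contain no missing values."""
    return all(col in df.columns and not df[col].isna().any() for col in required_columns)

def check_timeliness(latest_data: datetime, max_delay: timedelta) -> bool:
    """Timeliness: the newest data is no older than the agreed-upon delay."""
    return datetime.utcnow() - latest_data <= max_delay

usage = pd.DataFrame({"CustomerId": ["c1", "c2"], "ConsumedRevenue": [10.0, None]})
print(check_completeness(usage, ["CustomerId", "ConsumedRevenue"]))  # False: a revenue value is missing
print(check_timeliness(datetime.utcnow() - timedelta(hours=2), timedelta(hours=24)))  # True
```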

There is no “out-of-the-box” service to do this in Azure, but we collaborated with another data engineering team within Microsoft to deploy a cloud-native solution for data quality testing.

A Cosmos DB database stores test definitions and results. An orchestrator schedules test runs. For each data fabric (ADX, SQL, ADLS), a web job handles test execution and data fabric–specific details. Another web job handles test results: Writing them to the Cosmos DB and opening incidents on test failures. A Power BI report provides a view on test status.

The solution uses a Cosmos DB backend to store test definitions and an orchestrator web job to handle test scheduling. Messaging between components is done via Service Bus queues. We have a web job for each supported data fabric, which handles running tests on that fabric; supported fabrics include Azure Data Explorer, Azure SQL, Azure Data Lake Storage, and others. This architecture allows us to have a standard, abstract test definition, with plug-in data fabric support.
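
A minimal sketch of what such an abstract test definition with plug-in fabric support could look like is below; the class and method names are hypothetical, and the real system is considerably richer:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class DataTest:
    """Fabric-agnostic test definition, similar in spirit to what is stored in Cosmos DB."""
    name: str
    dataset: str
    query: str  # written in the target fabric's query language

class FabricRunner(ABC):
    """Each supported data fabric (ADX, SQL, ADLS, ...) plugs in its own runner."""

    @abstractmethod
    def run(self, test: DataTest) -> bool:
        """Execute the test against the fabric and return True if it passed."""

class SqlRunner(FabricRunner):
    def run(self, test: DataTest) -> bool:
        # In the real system this would execute test.query against Azure SQL;
        # here we simply pretend the test passed.
        return True

def execute(test: DataTest, runner: FabricRunner, open_incident) -> None:
    """Run one test and open an incident (an IcM ticket in our case) when it fails."""
    if not runner.run(test):
        open_incident(f"Data quality test '{test.name}' failed on {test.dataset}")

execute(DataTest("orders_not_empty", "Sales.Orders", "SELECT COUNT(*) FROM Orders"),
        SqlRunner(), open_incident=print)
```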

Test results get written back to Cosmos DB and, in the case of test failures, the system opens tickets in IcM, the Microsoft-wide incident management system. This alerts data engineers about data quality issues in real time so they can mitigate them as quickly as possible and keep things running smoothly.

We also have a Power BI report built on top of the Cosmos DB, which shows stakeholders the overall health of the data in our platform.

During the past few months, we authored hundreds of data tests, which continuously check the health of our platform.

Summary

In this article we looked at some of the existing challenges in data engineering and how our team handles them on Azure.

  • We are evolving our custom-built metadata management solution, which includes a data dictionary and a glossary.
  • We use ADF to integrate multiple data systems at scale, with solid DevOps practices around it, including Git integration and active monitoring.
  • DevOps is foundational for a robust data system. We use Azure DevOps and Azure Pipelines to deploy from Git, and Azure Monitor to monitor both analytical and ML workloads.
  • Ensuring data quality is another critical component of a data platform. We collaborated with another team at Microsoft to deploy a cloud-native data testing solution that allows us to run hundreds of tests across our platform.

In the coming years, I expect we will have both better tooling for some of these problems and better-defined industry-wide standards. As I mentioned at the beginning of this article, not long ago, just storing and processing huge amounts of data was a hard problem. Today, the main challenges are more about organizing and managing data. I fully expect that soon we will have out-of-the-box solutions for all these problems, and a new set of challenges will emerge.
