Improve your Data Lifecycle with Metadata-Driven Pipelines

Slalom DC · Published in Slalom Data & AI · Apr 2, 2020

Written by Arjun Jeyapaalan, Justin Erickson, & Olya Solomon — March 30, 2020

Photo by Martin Adams on Unsplash

No digital transformation program is complete without a data-based initiative. With some speculating that artificial intelligence will create a “winner takes all” game in certain industries, the need to capture and process data that will feed into artificial intelligence algorithms has never been more urgent. In fact, a recent Gartner survey found that leading organizations are expected to triple the number of artificial intelligence projects they have in place by the year 2022.

While artificial intelligence’s potential is exciting, the outlook for big data initiatives — one of the cornerstones of any artificial intelligence process — is less positive. According to Gartner analyst Nick Heudecker, approximately 85% of big data projects are slated to fail. The primary causes of failure cited include the difficulty of integrating big data with existing business processes and applications, management resistance, and security and governance challenges.

Consider the following scenario:

A data analyst working with the sales team would like to track new metrics in the daily sales reports. However, the analyst realizes that the underlying data for these metrics does not currently exist in the database that feeds the reports. The analyst then contacts the data engineers who own the data warehouse to request that the new metrics be added to the reporting tables.

Tasked with this new request, the data engineers check the database schema and realize that the data elements required to calculate these new metrics are not in the data warehouse. In this scenario, the data engineers must:

  1. Identify and request access to the system that currently stores the required data elements
  2. Redesign the data warehouse schema to add the new data elements
  3. Pre-process the data elements to ensure they satisfy the data warehouse schema and the table requirements from the sales team

This entire process could take weeks before the data analyst is able to report the new metrics in the daily sales reports. Meanwhile, the sales team has lost the opportunity to act on potentially valuable insights earlier. In addition, the long operational lead time from data acquisition to insight creates a high marginal cost for asking the next business question.

The challenges that plague traditional data analytics call for an improvement on existing methodologies to increase project success rates. Fortunately, there is a modern approach to data analytics that aims to accelerate the ability of analytics teams to deliver insights to users.

DataOps 101

DataOps is an approach to data analytics that unites data experts and IT operations. It simplifies and shortens the end-to-end data analytics lifecycle, from the origin of ideas to the creation of reports and models that add value. While DataOps aims to achieve with data analytics what DevOps has achieved with software development, it also accounts for the diverse set of people, processes, and tools that exist in data environments. DataOps does this by leveraging agile principles, the DevOps practices of continuous integration and delivery, and lean manufacturing. These tenets minimize redundancy, foster a culture of continuous improvement, and, where possible, reduce the number of steps in the data lifecycle through automation.

Metadata-Driven Pipelines and How They Enable DataOps

At a manufacturing plant, assembly managers coordinate the activities of the assembly line workers to ensure that work is being done efficiently while still producing high-quality products. Within DataOps, data pipelines play a similar role. Specifically, they manage the full lifecycle of data: scheduling jobs, executing workflows, and coordinating dependencies to process data from source systems to downstream applications. With countless data sources available to enterprises and even more downstream endpoints for data consumption, data pipelines need to be robust and scalable enough to route that data correctly without losing integrity. For example, within an enterprise, customer data may need to be filtered and sent through a rigorous security scanning process while, at the same time, streaming location data is combined with sales data and used downstream to train a machine learning model. Modern data pipelines minimize the code changes required to handle these different scenarios, adapting to changes in incoming data at the speed and scale required. The most efficient way to enable this flexibility and scalability is to use parameterized, metadata-driven pipelines.

Parameterization allows users to run the same process with different inputs. Following up on our manufacturing example, it would mean a clothing factory could switch from producing red scarves to blue scarves simply by changing the dye it uses. For data pipelines, it means that pipeline code must be reusable and elastic enough to accommodate all kinds of data inputs, with few constraints on schema or type. In data-driven organizations, these inputs are always changing. Any new data source, schema change, or other analytics improvement requires an update to the data pipeline, and having to re-engineer a data model every time a new requirement comes in takes up valuable resource time that could be better spent elsewhere. Instead, by parameterizing workflows within the pipeline, a set of metadata inputs and conditions (i.e., information about the data source) determines how a new data source is sent through the pipeline. These metadata inputs and conditions enable teams to quickly bring together siloed data sources without major code changes or resource requirements, which is crucial to lowering cycle times and accelerating analytics delivery. By emphasizing parameterization, the time taken to get data from point A to point B is minimized.
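To make the idea concrete, here is a minimal sketch of a parameterized, metadata-driven ingestion loop. The configuration fields and the ingest_source() routine are illustrative assumptions rather than any specific product’s API: one generic function handles every source, and its behavior is driven entirely by metadata.

```python
# Minimal sketch of a parameterized, metadata-driven pipeline step.
# The metadata fields and ingest_source() are illustrative, not a specific product API.
from dataclasses import dataclass


@dataclass
class SourceConfig:
    name: str         # logical name of the data source
    connection: str   # connection string or endpoint
    load_type: str    # e.g. "full" or "incremental"
    target_path: str  # landing location in the data lake


def ingest_source(cfg: SourceConfig) -> None:
    """One generic routine handles every source; its behavior is driven by metadata."""
    print(f"Loading {cfg.name} ({cfg.load_type}) from {cfg.connection} into {cfg.target_path}")
    # extraction, data quality checks, and landing logic would go here


# Adding a new source means adding one more entry; the pipeline code never changes.
SOURCES = [
    SourceConfig("crm_accounts", "https://crm.example.com/api", "incremental", "/lake/raw/crm"),
    SourceConfig("sales_orders", "sqlserver://sales-db/orders", "full", "/lake/raw/sales"),
]

for cfg in SOURCES:
    ingest_source(cfg)
```

The point is that the loop and the ingest routine stay fixed; only the metadata changes as new sources are onboarded.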

Practical Approach with Azure Data Factory

Building a metadata-driven pipeline that can handle all analytics scenarios in a large organization is daunting. Even with a robust data governance policy, poor data quality can creep in. Fortunately, there are tools and services available to make this process easier. Open-source tools such as Apache Airflow, Luigi, and Azkaban can help organizations accelerate the data lifecycle at a low cost. There are also managed services that handle the back-end infrastructure and minimize the upfront work required. Azure Data Factory (ADF) is a managed service within Microsoft Azure that we have used recently for our clients. ADF is a data pipeline orchestrator with drag-and-drop capabilities, a huge variety of connectors to bring in any number of datasets, and a large library of built-in “activities”. Its intuitive user interface and integration with other Azure services make it a great option for organizations hosted in Azure.
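For teams going the open-source route, a rough sketch of the same metadata-driven pattern in Apache Airflow (mentioned above) might look like the following, assuming a recent Airflow 2.x release; the DAG id, metadata file path, and ingest stub are hypothetical. One task is generated per entry in the metadata file, so onboarding a new source requires no new DAG code.

```python
# Illustrative Airflow DAG that generates one ingestion task per metadata entry.
# The file path, DAG id, and ingest() stub are hypothetical.
import json
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical metadata file describing the sources to ingest.
with open("/opt/airflow/config/sources.json") as f:
    SOURCES = json.load(f)


def ingest(name: str, connection: str, load_type: str) -> None:
    """Placeholder for a generic, source-agnostic ingestion routine."""
    print(f"Ingesting {name} from {connection} ({load_type} load)")


with DAG(
    dag_id="metadata_driven_ingest",
    start_date=datetime(2020, 3, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # One task per source; adding a source to sources.json adds a task automatically.
    for src in SOURCES:
        PythonOperator(
            task_id=f"ingest_{src['name']}",
            python_callable=ingest,
            op_kwargs={
                "name": src["name"],
                "connection": src["connection"],
                "load_type": src.get("load_type", "full"),
            },
        )
```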

Azure Data Factory user interface

Recently, we used ADF to build robust pipelines that bring data from key sources across the organization into a centralized data lake, which then feeds curated, consolidated downstream analytics. Following the DataOps principle of code reusability, we parameterized the entire pipeline so that the same pipeline could be used for any future data source our clients might decide to bring in. We developed JSON files to store information about the data sources (Salesforce, Azure SQL Databases, Marketing Platforms, etc.) and used ADF to orchestrate and process that data through the pipeline. Now, when the organization wants to bring in a new data source, they can simply add that source to the metadata JSON files. During the next scheduled pipeline run, the new data source is automatically run through the pipeline, where it is processed, checked for data quality, and loaded into the data lake, ready for analytical use. Instead of taking weeks to determine requirements, modify code, and obtain approvals, their data engineers can now add a new source in minutes using already-approved code and incorporate it into the analytics lifecycle.

Sample metadata JSON
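As a rough illustration only (the field names below are hypothetical and not the exact schema used on this engagement), onboarding a new source can be as small a change as appending one entry to the metadata file:

```python
# Hypothetical example of adding a new source to a metadata JSON file.
# Field names are illustrative; the real schema would match the pipeline's lookup logic.
import json

new_source = {
    "name": "marketing_campaigns",
    "type": "salesforce",
    "connection": "marketing-sf-connection",   # name of a stored connection / linked service
    "load_type": "incremental",
    "target_path": "raw/marketing/campaigns",  # landing folder in the data lake
}

with open("sources.json") as f:
    sources = json.load(f)          # existing list of source definitions

sources.append(new_source)          # the only change needed to onboard the source

with open("sources.json", "w") as f:
    json.dump(sources, f, indent=2)
```

On the next scheduled run, the pipeline’s lookup step picks up the new entry and routes it through the same parameterized activities as every other source.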

Conclusion

Speed to market is crucial to the success of big data initiatives within organizations, so finding ways to improve business processes and the data lifecycle is essential. While changes to existing processes may face the inertia that stems from an organization’s established practices, building metadata-driven pipelines is a key step toward reducing the time spent deploying analytics to production and planting the seeds of a DataOps-enabled, modern culture of data.

If this article was of interest to you, or if you would like to learn more about our perspectives on the modern data culture, please let us know! We love to talk data!
