2017: The Year of DataOps

Data analytics has become increasingly important over the past several years as organizations find that data is the key to creating and sustaining a competitive advantage. The single most important innovation in data analytics this past year was DataOps.

2017 was the year that DataOps reached critical mass.

If you are unfamiliar with the term, DataOps is a new approach to the end-to-end data lifecycle that applies proven processes and methodologies to data analytics. Agile software development helps deliver new analytics faster and with higher quality. DevOps automates the deployment of new analytics and data. Statistical process control, borrowed from lean manufacturing, tests and monitors the quality of data flowing through the data-analytics pipeline.
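
To make the statistical process control idea concrete, here is a minimal sketch of a control-chart check on a single pipeline metric. It is illustrative only; the row counts and the three-sigma limit are hypothetical, not taken from any particular product.

```python
# Minimal sketch of a statistical process control check on a pipeline metric:
# flag today's row count if it falls outside three standard deviations of the
# trailing history. All numbers below are hypothetical.
import statistics

history = [10230, 10180, 10310, 10275, 10190, 10260]  # recent daily row counts
today = 7400                                          # today's row count

mean = statistics.mean(history)
sd = statistics.stdev(history)

if abs(today - mean) > 3 * sd:
    print(f"ALERT: {today} rows is outside control limits "
          f"({mean:.0f} +/- {3 * sd:.0f}); investigate before publishing")
```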

People are talking about DataOps. Companies are marketing DataOps products and services, and organizations are adopting DataOps to improve the efficiency, quality, and cycle time of their data analytics. Below are some references that summarize the state of DataOps and show the breadth of activity in the DataOps ecosystem this past year.

DataOps Ecosystem

Growing enterprise interest in DataOps has spawned a robust ecosystem of vendors who market a wide array of products and services:

Four key software components of a DataOps Platform

  1. Data Pipeline Orchestration: DataOps needs a workflow based on a directed acyclic graph (DAG) that contains all the data access, integration, model, and visualization steps in the data-analytics production process (a minimal orchestration sketch follows the list below).
  • Airflow — an open-source platform to programmatically author, schedule and monitor data pipelines.
  • Apache Oozie — an open-source workflow scheduler system to manage Apache Hadoop jobs.
  • DBT (Data Build Tool) — a command-line tool that enables data analysts and engineers to transform data in their warehouse more effectively.
  • BMC Control-M — a digital business automation solution that simplifies and automates diverse batch application workloads.
  • Composable Analytics — a DataOps Enterprise Platform with built-in services for data orchestration, automation, and analytics.
  • DataKitchen — a DataOps Platform that reduces analytics cycle time by monitoring data quality and providing automated support for the deployment of data and new analytics.
  • Reflow — a system for incremental data processing in the cloud that enables scientists and engineers to compose existing tools (packaged in Docker images) using ordinary programming constructs. (added April 2018)
  • Mara — A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow. (added May 2018)
  • ElementL — a currently stealth company founded by ex-Facebook director and GraphQL co-creator Nick Schrock; its Dagster project is now open source. (added June 2018)
  • Astronomer.io — Astronomer recently re-focused on Airflow support. They make it easy to deploy and manage your own Apache Airflow webserver, so you can get straight to writing workflows. (added June 2018)
  • Piperr.io — pre-built data pipelines for use across enterprise stakeholders, from IT and analytics to data science and lines of business. (added Dec 2018)
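
To ground the orchestration idea, here is a minimal sketch of an Airflow DAG (the first tool above). The three steps and their function bodies are hypothetical placeholders, not a real production pipeline.

```python
# A minimal, illustrative Airflow DAG: three placeholder steps chained into a
# directed acyclic graph that Airflow can schedule and monitor daily.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def extract():
    print("pull source data")        # placeholder for data access


def transform():
    print("integrate and model")     # placeholder for integration/modeling


def publish():
    print("refresh visualizations")  # placeholder for visualization refresh


dag = DAG(
    dag_id="analytics_pipeline",
    start_date=datetime(2017, 1, 1),
    schedule_interval="@daily",
)

t1 = PythonOperator(task_id="extract", python_callable=extract, dag=dag)
t2 = PythonOperator(task_id="transform", python_callable=transform, dag=dag)
t3 = PythonOperator(task_id="publish", python_callable=publish, dag=dag)

t1 >> t2 >> t3  # extract, then transform, then publish
```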

2. Testing and Production Quality: DataOps automatically tests and monitors the production quality of all data and artifacts in the data-analytics production process, as well as code changes during deployment (a minimal pipeline-test example follows the list below).

  • ICEDQ — software that automates the testing of ETL, data warehouse, and data migration projects.
  • Naveego — A simple, cloud-based platform that allows you to deliver accurate dashboards by taking a bottom-up approach to data quality and exception management.
  • DataKitchen — a DataOps Platform that improves data quality by providing lean manufacturing controls to test and monitor data.
  • FirstEigen — automatic data quality rule discovery and continuous data monitoring. (added August 2018)
  • Great Expectations — a framework that helps teams save time and promote analytic integrity with a new twist on automated testing: pipeline tests. Pipeline tests are applied to data (instead of code) and at batch time (instead of compile or deploy time).
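
Here is a minimal sketch of a pipeline test using Great Expectations' pandas-style API, as described in the last bullet above. The file name, column names, and bounds are hypothetical.

```python
# Minimal sketch of pipeline tests with Great Expectations: expectations are
# declared against a batch of data, then validated before the batch moves on.
import great_expectations as ge

orders = ge.read_csv("orders.csv")  # hypothetical batch of pipeline data
orders.expect_column_values_to_not_be_null("order_id")
orders.expect_column_values_to_be_between("amount", min_value=0, max_value=100000)

results = orders.validate()  # run every declared expectation on this batch
if not results["success"]:
    raise ValueError("data quality checks failed; halting the pipeline")
```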

3. Deployment Automation: DataOps continuously moves code and configuration from development environments into production (a minimal deployment sketch follows the list below).

  • Jenkins — a CI/CD tool used by software development teams to deploy code from development into production.
  • DataKitchen — a DataOps Platform that supports the deployment of all data analytics code and configuration.
  • Amaterasu — a deployment tool for data pipelines that allows developers to write and easily deploy pipelines while clusters manage their configuration and dependencies.
  • Meltano — A new project from GitLab. Meltano aims to be a complete solution for data teams — the name stands for model, extract, load, transform, analyze, notebook, orchestrate — in other words, the data science lifecycle. (added August 2018)
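
A deployment gate can be as simple as the following Python sketch, which promotes analytics code only when the automated tests pass. The commands, remote, and branch names are hypothetical; in practice a CI/CD tool such as Jenkins runs steps like these on every commit.

```python
# Minimal sketch of a deployment gate: run the test suite, and promote the
# analytics code to production only if it passes. Commands are hypothetical.
import subprocess
import sys


def run(cmd):
    print("running:", " ".join(cmd))
    return subprocess.run(cmd).returncode


if run(["pytest", "tests/"]) != 0:                   # test the code change
    sys.exit("tests failed; deployment aborted")

if run(["git", "push", "production", "main"]) != 0:  # promote to production
    sys.exit("push to production failed")

print("analytics deployed to production")
```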

4. Data Science Model Deployment and Sandbox Management: DataOps-driven data science teams create reproducible development environments and move models into production (a minimal tracking sketch follows the list below).

  • Domino — accelerates the development and delivery of models with infrastructure automation, seamless collaboration, and automated reproducibility.
  • Hydrosphere.io — deploys batch Spark functions, machine-learning models, and assures the quality of end-to-end pipelines.
  • Open Data Group — a software solution that facilitates the deployment of analytics using models.
  • ParallelM — moves machine learning into production, automates orchestration, and manages the ML pipeline.
  • Seldon — streamlines the data science workflow, with audit trails, advanced experiments, continuous integration, and deployment.
  • Metis Machine — Enterprise-scale Machine Learning and Deep Learning deployment and automation platform for rapid deployment of models into existing infrastructure and applications.
  • Datatron — automates deployment and monitoring of AI models.
  • DataKitchen — a DataOps Platform that supports the testing and deployment of data science models and the creation of sandbox data science environments.
  • DSFlow — go from data extraction to business value in days, not months, building on open-source tech and Silicon Valley best practices. (added April 2018)
  • Datmo — tools that help you seamlessly deploy and manage models in a scalable, reliable, and cost-optimized way. (added May 2018)
  • MLflow — an open-source platform for the complete machine-learning lifecycle, from Databricks. (added June 2018)
  • Studio.ML — a model management framework written in Python to help simplify and expedite your model-building experience. (added August 2018)
  • Comet.ML — allows data science teams and individuals to automagically track their datasets, code changes, experimentation history, and production models, creating efficiency, transparency, and reproducibility. (added Sept 2018)
  • Polyaxon — An open source platform for reproducible machine learning at scale. (added Oct. 2018)
  • Missinglink.ai — MissingLink helps data engineers streamline and automate the entire deep-learning lifecycle. (added Dec 2018)
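
As a sketch of reproducible model management, the following uses MLflow (listed above) to record a run's parameters, metric, and fitted model so the run can be reproduced and the model promoted later. The toy data and parameter names are hypothetical.

```python
# Minimal sketch of experiment tracking with MLflow: log the parameters,
# metric, and fitted model of a run. The data and names are hypothetical.
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LinearRegression

X = [[1.0], [2.0], [3.0]]  # toy features
y = [2.0, 4.0, 6.0]        # toy targets

with mlflow.start_run():
    model = LinearRegression()
    model.fit(X, y)
    mlflow.log_param("model_type", "linear_regression")
    mlflow.log_metric("train_score", model.score(X, y))
    mlflow.sklearn.log_model(model, "model")  # stored artifact, ready to deploy
```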

Other supporting functions in DataOps

  1. Code and artifact storage (e.g., Git, Docker Hub)
  2. Parametrization and secure key storage (e.g., Vault, Jinja2; a minimal Jinja2 sketch follows this section)
  3. Distributed computing (e.g., Mesos, Kubernetes)
  4. Data Virtualization, Versioning, and Test Data Management:
  • Delphix — a software platform that enables teams to virtualize, secure, and manage data.
  • Redgate — SQL tools to help users implement DataOps, monitor database performance, and provision new databases.
  • Pachyderm — version control for data, similar to what Git does with code.
  • Quilt Data — versions and deploys data, like Docker for data. (added July 2018)
  • Privitar — more data-driven decisions without compromising on privacy; get more business value from sensitive data while enhancing privacy protection. (added Sept 2018)
  • DVC — an open-source version control system for machine-learning projects. (added Dec 2018)

5. Data Integration and Unification

  • Nexla — Scalable and secure Data Operations platform that allows business users to send, receive, transform, and monitor data.
  • Switchboard Software — fully managed, cloud-hosted data operations solution that integrates, cleans, transforms and monitors data.
  • Tamr — enterprise data unification solution that uses a bottom-up, machine-learning-based approach.
  • StreamSets — The industry’s first data operations platform for full life-cycle management of data in motion. (started messaging DataOps in 2018, added Oct. 2018)

6. Big Data Performance Management

  • SelectStar — database monitoring solution with alerting and relationship mapping.
  • Unravel — manages performance and utilization of big data applications and platforms.
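
To illustrate item 2 above (parametrization), here is a minimal Jinja2 sketch that renders the same SQL template for development and production, so the two environments differ only in parameters. The schema and table names are hypothetical.

```python
# Minimal sketch of pipeline parametrization with Jinja2: one SQL template,
# rendered per environment. Schema and table names are hypothetical.
from jinja2 import Template

query_template = Template(
    "SELECT * FROM {{ schema }}.orders WHERE order_date >= '{{ start_date }}'"
)

dev_query = query_template.render(schema="dev", start_date="2017-01-01")
prod_query = query_template.render(schema="prod", start_date="2017-01-01")

print(dev_query)   # runs against the development schema
print(prod_query)  # identical logic against the production schema
```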

Other Vendors

  1. All-in-One Cloud Platforms
  • MapR — provides a Converged Data Platform that enables customers to harness the power of big data by combining real-time analytics with operational applications to improve business outcomes.
  • Qubole — big-data-as-a-service company with a cloud-based platform that extracts value from huge volumes of structured and unstructured data.
  • John Snow Labs — the Data Lab is an enterprise platform featuring data integration, no-code interactive data discovery and analysis, a collaborative data science notebook environment, and productized models served as APIs at scale.

2. Service and Consulting Organizations

  • John Snow Labs — data curation, data science, data engineering, and data operations services, specializing in healthcare and life sciences.
  • XenonStack — DataOps, DevOps, decision support, big-data analytics, and IoT services.
  • Locke Data — data science services.
  • Cognizant — services that help define and deliver a big data strategy.
  • Silicon Valley Data Science — data science consulting company.
  • Kinaesis — works with financial-services clients to turn investment in data solutions into real value. (added October 2018)

DataOps Analyst Coverage

Toph Whitmore at Blue Hill Research began to cover DataOps this year and produced four reports examining how enterprise leaders use DataOps to break down organizational, architectural, or process-related silos.

Keyword Searches

Searches for the keyword DataOps were up sharply in 2017, reflecting a surge of interest in the topic.

Google Searches for “DataOps” 2013–2017. Note: 2017 searches include YTD searches through Nov 30th.

DataOps Surveys

Both Nexla and Qubole conducted surveys to shed light on the challenges in data analytics that can be addressed by DataOps:

  • Nexla “Definitive Data Operations Report 2017” — Key findings: Companies need to elevate DataOps into a core function if they want to maximize data value. Inter-company data collaboration is growing and will become the norm. Data executives do not have the support they need to maintain their company’s DataOps.
  • Qubole “State of DataOps” Report — The survey indicated that data teams have high confidence and that demand for big data analysis is growing, but big data processes are still in the earliest stages of maturity. Only 8 percent of respondents consider their big data initiatives to be fully mature.

Book

In Creating a Data-Driven Enterprise with DataOps (O'Reilly), the team that founded Qubole conveys how data pioneers at Facebook, Uber, LinkedIn, Twitter, and eBay create data-driven cultures and self-service data infrastructures for their organizations using DataOps principles.

DataOps Manifestos

Key players and supporters coalescing around DataOps have produced a DataOps manifesto consisting of 18 principles that summarize the mission, values, philosophies, goals, and best practices of DataOps practitioners.

The State of Connecticut published a guide to applying DataOps principles for government enterprises.

Open Source Projects

There are multiple active open source projects providing DataOps-related functionality, including Airflow, Apache Oozie, TensorFlow Serving, and others.

Top Articles

Dozens of DataOps articles, white papers and blogs were published in 2017. Here are some of the best:

DataKitchen Blogs

At DataKitchen, we’ve written extensively on the DataOps blog this year. Here are some of our most widely read posts:

Case Studies

As DataOps gains traction with enterprises, product and solutions vendors have published case studies to demonstrate the value that DataOps adds. Below is a small selection of the numerous case studies available.

  • MapR DataOps Case Study — TransUnion has launched a new self-service analytics platform to provide its customers with market insights and historical perspectives to inform their risk strategies.
  • Qubole DataOps Case Study — Pinterest builds big data applications quickly by turning a single-cluster Hadoop infrastructure into a ubiquitous self-service platform.
  • Tamr DataOps Case Study — Toyota Motor Europe connects and cleans datasets through the use of DataOps automation.
  • DataKitchen DataOps Case Study — Celgene speeds delivery and improves the quality of data analytics for their product launches.

Videos

The DataOps YouTube channel features videos from DataKitchen, Open Data Science, PyData, Xentaurs Academy, and many others.

The Future of DataOps

As DataOps matures and evolves, it promises to radically reshape how data analytics is created, improved, maintained, and monitored. DataOps will allow enterprises to reduce the cycle time from data to value while delivering accurate and robust analytics. Organizations that get this right will derive a lasting competitive advantage from their data and outpace their competitors.

Update: Read our blog on the DataOps ecosystem in 2019.


Like this story? Download the 140-page DataOps Cookbook!