People are talking about DataOps. Companies are marketing DataOps products and services, and organizations are adopting DataOps to improve the efficiency, quality and cycle time of their data analytics.
If you are unfamiliar with the term, DataOps is a new approach to the end-to-end data lifecycle, which applies new processes and methodologies to data analytics. Agile software development helps deliver new analytics faster and with higher quality. DevOps automates the deployment of new analytics and data. Statistical process control, used in lean manufacturing, tests and monitors the quality of data flowing through the data-analytics pipeline.
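To make the statistical process control idea concrete, here is a minimal sketch in Python. It assumes a hypothetical metric (daily row counts for one table) and the common three-sigma rule; the function names and the sample numbers are illustrative and do not come from any particular DataOps product.

```python
from statistics import mean, stdev


def control_limits(history, sigmas=3.0):
    """Return (lower, upper) control limits computed from past measurements."""
    center = mean(history)
    spread = stdev(history)  # sample standard deviation
    return center - sigmas * spread, center + sigmas * spread


def in_control(value, history, sigmas=3.0):
    """True if a new measurement falls inside the control limits."""
    lower, upper = control_limits(history, sigmas)
    return lower <= value <= upper


# Hypothetical history: row counts from the last seven daily loads.
daily_row_counts = [1000, 1020, 980, 1010, 990, 1005, 995]

if __name__ == "__main__":
    for todays_count in (1015, 400):
        status = "ok" if in_control(todays_count, daily_row_counts) else "ALERT"
        print(f"row count {todays_count}: {status}")
```

A real pipeline would likely track several such metrics per step and raise an alert, or halt the run, when any of them drifts outside its limits.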
Growing enterprise interest in DataOps has spawned a robust ecosystem of vendors. To date, over $50M has been invested in companies that market a wide array of DataOps products and services.
Please email us if we forgot anyone or if you have any comments.
Key Components of a DataOps Platform
There are four key software components of a DataOps Platform: data pipeline orchestration, testing and production quality, deployment automation, and data science model deployment / sandbox management. Below is our running list of the vendors in each group.
1. Data Pipeline Orchestration: DataOps needs a directed graph-based workflow that contains all the data access, integration, model and visualization steps in the data analytic production process.
- Airflow — an open-source platform to programmatically author, schedule and monitor data pipelines.
- Apache Oozie — an open-source workflow scheduler system to manage Apache Hadoop jobs.
- DBT (Data Build Tool) — a command line tool that enables data analysts and engineers to transform data in their warehouse more effectively.
- BMC Control-M — a digital business automation solution that simplifies and automates diverse batch application workloads.
- Composable Analytics — a DataOps Enterprise Platform with built-in services for data orchestration, automation, and analytics.
- DataKitchen — a DataOps Platform that reduces analytics cycle time by monitoring data quality and providing automated support for the deployment of data and new analytics.
- Reflow — Reflow is a system for incremental data processing in the cloud. Reflow enables scientists and engineers to compose existing tools (packaged in Docker images) using ordinary programming constructs.
- ElementL — currently a stealth-mode company founded by former Facebook director and GraphQL co-creator Nick Schrock. Open-source project: Dagster.
- Astronomer.io — Astronomer recently re-focused on Airflow support. They make it easy to deploy and manage your own Apache Airflow webserver, so you can get straight to writing workflows.
- Piperr.io — Use Piperr’s pre-built data pipelines across enterprise stakeholders: From IT to Analytics, From Tech, Data Science to LoBs.
- Prefect Technologies — Open-source data engineering platform that builds, tests, and runs data workflows.
- Genie — Distributed Big Data Orchestration Service by Netflix
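The "directed graph-based workflow" idea behind the orchestration tools above can be sketched with Python's standard-library `graphlib` module (Python 3.9+). The step names and the `run_pipeline` helper are hypothetical, and this is not how any of the listed tools is implemented. It only illustrates that each step runs after the steps it depends on.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps it depends on.
pipeline = {
    "access_data": set(),
    "integrate": {"access_data"},
    "train_model": {"integrate"},
    "build_dashboard": {"integrate", "train_model"},
}


def run_pipeline(steps):
    """Run steps in dependency order and return the order that was used."""
    order = list(TopologicalSorter(steps).static_order())
    for step in order:
        # A real orchestrator would invoke the step's code here.
        print(f"running {step}")
    return order


if __name__ == "__main__":
    run_pipeline(pipeline)
```

Production orchestrators add scheduling, retries, and monitoring on top of this ordering, but the dependency graph is the core abstraction.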
2. Automated Testing and Production Quality and Alerts: DataOps automatically tests and monitors the production quality of all data and artifacts in the data analytic production process as well as testing the code changes during the deployment process.
- ICEDQ — software used to automate the testing of ETL/Data Warehouse and Data Migration.
- Naveego — A simple, cloud-based platform that allows you to deliver accurate dashboards by taking a bottom-up approach to data quality and exception management.
- DataKitchen — a DataOps Platform that improves data quality by providing lean manufacturing controls to test and monitor data.
- FirstEigen — Automatic Data Quality Rule Discovery and Continuous Data Monitoring
- Great Expectations — Great Expectations is a framework that helps teams save time and promote analytic integrity with a new twist on automated testing: pipeline tests. Pipeline tests are applied to data (instead of code) and at batch time (instead of compile or deploy time).
- Enterprise Data Foundation — Open-source enterprise data toolkit providing efficient unit testing, automated refreshes, and automated deployment.
- RightData — a self-service suite of applications that help you achieve Data Quality Assurance, Data Integrity Audit and Continuous Data Quality Control with automated validation and reconciliation capabilities.
- QuerySurge — Continuous Testing with QuerySurge for DevOps. QuerySurge is the smart Data Testing solution that automates the data validation & testing of Big Data, Data Warehouses, and Business Intelligence Reports.
- CompactBI — TestDrive is a testing framework for your data and the processes behind them. (added July 2019)
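As an illustration of tests that are applied to data at batch time rather than to code, the sketch below checks one batch of records with plain Python. The field names (`order_id`, `amount`) and the `check_batch` helper are assumptions made for this example; the code does not use the API of any tool listed above.

```python
def check_batch(rows):
    """Return a list of failure messages for one batch of records."""
    failures = []
    if not rows:
        failures.append("batch is empty")
    for i, row in enumerate(rows):
        if row.get("order_id") is None:
            failures.append(f"row {i}: order_id is missing")
        amount = row.get("amount")
        if amount is None or amount < 0:
            failures.append(f"row {i}: amount must be non-negative")
    # Uniqueness check across the whole batch.
    ids = [r["order_id"] for r in rows if r.get("order_id") is not None]
    if len(ids) != len(set(ids)):
        failures.append("order_id values are not unique")
    return failures


if __name__ == "__main__":
    # Hypothetical batch with one bad record.
    batch = [
        {"order_id": 1, "amount": 19.99},
        {"order_id": 2, "amount": -5.00},
    ]
    for message in check_batch(batch):
        print("FAILED:", message)
```

In production, a non-empty failure list would typically stop the pipeline or trigger an alert before bad data reaches downstream reports.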
3. Deployment Automation and Development Sandbox Creation: DataOps continuously moves code and configuration from development environments into production.
- Jenkins — a ‘CI/CD’ tool used by software development teams to deploy code from development into production
- DataKitchen — a DataOps Platform that supports the deployment of all data analytics code and configuration.
- Amaterasu — a deployment tool for data pipelines. Amaterasu allows developers to write and easily deploy data pipelines, and clusters manage their configuration and dependencies.
- Meltano — aims to be a complete solution for data teams — the name stands for model, extract, load, transform, analyze, notebook, orchestrate — in other words, the data science lifecycle.
- Lentiq — Lentiq is the data science environment that brings your projects to life. (added July 2019)
4. Data Science Model Deployment: DataOps-driven data science teams make reproducible development environments and move models into production. Some have called this ‘MLOps’.
- Domino — accelerates the development and delivery of models with infrastructure automation, seamless collaboration, and automated reproducibility.
- Hydrosphere.io — deploys batch Spark functions and machine-learning models, and assures the quality of end-to-end pipelines.
- Open Data Group — a software solution that facilitates the deployment of analytics using models.
- ParallelM — moves machine learning into production, automates orchestration, and manages the ML pipeline. (acquired by DataRobot June 2019)
- Seldon — streamlines the data science workflow, with audit trails, advanced experiments, continuous integration, and deployment.
- Metis Machine — Enterprise-scale Machine Learning and Deep Learning deployment and automation platform for rapid deployment of models into existing infrastructure and applications.
- Datatron — Automate deployment and monitoring of AI Models
- DataKitchen — a DataOps Platform that supports the testing and deployment of data science models and the creation of sandbox data science environments.
- DSFlow — Go from data extraction to business value in days, not months. Built on top of open source tech, using Silicon Valley’s best practices.
- Datmo — Datmo tools help you seamlessly deploy and manage models in a scalable, reliable, and cost-optimized way.
- MLflow — an open source platform for the complete machine learning lifecycle, from Databricks
- Studio.ML — Studio is a model management framework written in Python to help simplify and expedite your model building experience.
- Comet.ML — Comet.ml allows data science teams and individuals to automagically track their datasets, code changes, experimentation history and production models creating efficiency, transparency, and reproducibility.
- Polyaxon — An open source platform for reproducible machine learning at scale.
- Missinglink.ai — MissingLink helps data engineers streamline and automate the entire deep learning lifecycle.
- Kubeflow — The Machine Learning Toolkit for Kubernetes
- Vert.ai — Models are the new code!
- Omega | ML — Python AI/ML analytics deployment & collaboration for humans (added July 2019)
DataOps Supporting Functions
In addition to the foundational tools above, there are many software components that play a critical supporting role in the DataOps ecosystem.
- Code and artifact storage (e.g. Git, Docker Hub)
- Parametrization and secure key storage (e.g. Vault, Jinja2)
- Distributed computing (e.g. Mesos, Kubernetes)
- Data Virtualization, Versioning, and Test Data Management:
- Delphix — A software platform that enables teams to virtualize, secure and manage data.
- Redgate — SQL tools to help users implement DataOps, monitor database performance, and provision new databases.
- Pachyderm — version control for data, similar to what Git does with code.
- Quilt Data — Quilt versions and deploys data: like Docker for Data
- Privitar — More data-driven decisions without compromising on privacy. Get more business value from sensitive data — while enhancing privacy protection.
- DVC — Open-source Version Control System for Machine Learning Projects … data version control
- Instabase — a platform for data management and version control
- Datical — Database release automation for software development teams
- DBMaestro — Automate & govern database releases to accelerate time-to-market while preventing downtime & data-loss.
2. Big Data Performance Management
- SelectStar — database monitoring solution with alerts, monitoring, and relationship mapping.
- Unravel — manages the performance and utilization of big data applications and platforms.
Other Vendors Talking DataOps
In addition to the tools above, there are many vendors whose marketing messages reference DataOps.
1. Data Integration and Unification with a DataOps Message
- Nexla — Scalable and secure Data Operations platform that allows business users to send, receive, transform, and monitor data.
- Switchboard Software — fully managed, cloud-hosted data operations solution that integrates, cleans, transforms and monitors data.
- Tamr — enterprise data unification solution that uses a bottom-up, machine-learning-based approach.
- StreamSets — The industry’s first data operations platform for full life-cycle management of data in motion.
- Trifacta — end-user data prep.
- Infoworks — Use Big Data Automation to Simplify Data Engineering and DataOps
- Landoop — The enterprise overlay for Apache Kafka® & Kubernetes
- Devo — Devo delivers real-time operational and business insights from analytics on streaming and historical data to operations, IT, security and business teams at the world’s largest organizations.
2. All-in-One Cloud Platforms talking DataOps
- MapR — provides a Converged Data Platform that enables customers to harness the power of big data by combining analytics in real-time with operational applications to improve business outcomes.
- Qubole — big-data-as-a-service company with a cloud-based platform that extracts value from huge volumes of structured and unstructured data.
- John Snow Labs — The Data Lab is an enterprise platform featuring data integration, no-code interactive data discovery & analysis, a collaborative data science notebooks environment, and productizing models as APIs at scale.
- Saagie — Saagie Data Fabric seamlessly orchestrates big data technologies to automate analytics workflows and deploy business apps anywhere.
3. Service and Consulting Organizations with a DataOps slant
- Kinaesis — We work with our clients within Financial Services to leverage investment into Data Solutions and generate real value.
- Capgemini — Capgemini is building a practice area around DataOps
- John Snow Labs — Data curation, data science, data engineering, and data operations services, specializing in healthcare and life science.
- XenonStack — DataOps, DevOps, decision support, big-data analytics, and IoT services
- Locke Data — Data science services
- Cognizant — services that help define and deliver a big data strategy