Machine Learning platform — Dataiku and H2O Driverless AI & MLOps

Marcel Moldenhauer
7 min read · Dec 17, 2021

Welcome back to our article series on the art of assessing ML platforms. In our previous articles we defined the term ML platform [Part 1], explained why you need an ML platform, and showed how to assess one [Part 2]. This time we want to get our hands dirty and take a close look at our first ML platform contenders. The goal is to give you a glimpse of the insights we gained while completing the assessment with our framework, which considers the coverage and maturity of each functional area and component.

The first two ML platforms picked for this article are Dataiku and H2O Driverless AI + MLOps. For H2O we consider two separate products, H2O Driverless AI and H2O MLOps, which can be used separately but in combination cover the end-to-end (E2E) ML lifecycle. Without further ado, let us dig into the functional areas defined in the previous article. As you read today's article and look more closely at the distinct functionalities of the two platforms, you will notice that Dataiku and H2O Driverless AI + MLOps differ considerably in their philosophy and ways of working. We hope you enjoy our first exhibit as much as we enjoyed assessing these platforms.

1. Data Ingestion and Storage

Dataiku provides out-of-the-box connectors for batch and streaming data from over 25 leading data sources, both cloud and on-premises. These include the storage services of prominent hyperscalers such as Amazon S3, Azure Blob Storage, and Google Cloud Storage, as well as Snowflake, SQL and NoSQL databases, HDFS, and more. The Dataiku Visual Flow lets coders and non-coders collaborate on the same project: it integrates no-code and code-based building blocks seamlessly and makes it easy to build and monitor data pipelines. The platform offers built-in and customizable recipes to cleanse, prepare, and analyze data. Built-in data transformers cover the most common data manipulation tasks, such as find and replace or data normalization, and users can quickly extend the functionality with their own code, for example in Python, R, or Scala, for more sophisticated and specialized transformations. Monitoring of data pipelines is only possible as a log of all actions; Dataiku does not provide metrics for dashboarding.
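To make the idea of a code-based transformation step more tangible, here is a minimal sketch in plain Python of the two manipulations mentioned above, find and replace and data normalization. It deliberately does not use the Dataiku API; the function names and the sample rows are hypothetical and serve only as an illustration.

```python
from typing import Dict, List


def find_and_replace(
    rows: List[Dict[str, str]], column: str, old: str, new: str
) -> List[Dict[str, str]]:
    """Return copies of the rows with `old` replaced by `new` in one column."""
    return [{**row, column: row[column].replace(old, new)} for row in rows]


def min_max_normalize(values: List[float]) -> List[float]:
    """Scale values linearly into [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical sample data, for illustration only.
rows = [
    {"country": "DE ", "revenue": "10"},
    {"country": "US ", "revenue": "20"},
    {"country": "FR ", "revenue": "30"},
]

# Step 1: strip the stray blanks from the country codes.
cleaned = find_and_replace(rows, "country", " ", "")

# Step 2: bring the revenue column onto a common 0..1 scale.
scaled = min_max_normalize([float(row["revenue"]) for row in cleaned])
```

In a visual pipeline, each of these functions would correspond to one transformation step; the point of the sketch is only that such steps are small, composable functions over a dataset.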

H2O Driverless AI, on the other hand, runs on existing big data infrastructure, either on bare metal or on top of existing Hadoop, Spark, or Kubernetes clusters. Data can be ingested out-of-the-box directly from Hadoop HDFS, Spark, Amazon S3, Azure Data Lake, and many other data sources. Manual and custom transformation of ingested data is limited: modification functions have to be written as code in a simple text box with basic syntax highlighting. The strength of H2O Driverless AI is its automatic feature generation and transformation, which automates the engineering of new, high-value features for a given dataset. Additionally, H2O Driverless AI provides automated visualizations that help users get a quick understanding of their data before the model building process starts. Monitoring of data ingestion processes is not possible in H2O Driverless AI and needs to be implemented with separate tools.

2. Experimentation Zone

Dataiku's AutoML capabilities provide automated solutions for feature engineering and for selecting and training model algorithms. For code-based experimentation, Dataiku supports Jupyter-based notebooks in Python, R, and Scala. For deep learning models, data scientists can use Keras and TensorFlow to take advantage of the additional performance that GPUs provide for training and deployment. In addition, Dataiku offers good model management and a variety of visualizations for understanding model outputs and behavior. The whole experimentation process, from data ingestion to deployment, can be operated and governed in one unified view in the Dataiku Visual Flow.

H2O Driverless AI is first and foremost an AutoML platform. It is fully focused on automated ML development and makes it easier and faster to build, train, and evaluate ML models for all kinds of analytics personas, including people with no coding background. Custom code snippets, which need to be provided in external code repositories such as Git, can be used to extend the built-in AutoML capabilities. Manual development of ML models via code is not possible in H2O Driverless AI. Robust techniques and customizable visualizations are provided for experiment tracking, which makes it easy to interpret and explain the results of ML models. It is worth mentioning that H2O Driverless AI puts an emphasis on model explainability and provides a large suite of visualizations to tackle this emerging topic in AI.

3. Continuous Integration

Dataiku integrates with Git, which covers version control of projects, importing Python and R code, developing and importing reusable plugins, and more. Datasets created in the Dataiku Visual Flow are versioned automatically when data pipelines are executed multiple times. Models developed with the building blocks provided by Dataiku are versioned by default, together with their metadata. Dataiku does not provide a comprehensive feature store, but one can build a set of recipes that acts as a functionally limited feature store.

H2O Driverless AI delivers a comprehensive model store that persists and versions models developed on the platform, as well as a basic dataset manager that displays all usable and connected datasets including their metadata. H2O recently introduced a feature store, which was not part of this assessment. H2O MLOps and H2O Driverless AI provide a shared production model repository via projects. It enables teams to collaborate and to deploy models to test or production environments easily, and it creates a well-functioning link between experimentation and industrialization of models on the platform. ML models developed externally via code can be deployed with H2O MLOps, provided the necessary code wrappers are used.

4. Industrialization Zone

Dataiku Data Science projects neatly bundle developed ML models into a ready-to-deploy package that includes all environment variables needed to run in a production environment. Containerization requires extra plugins, with the option of integrating with Kubernetes. The Dataiku unified deployer manages the movement of the packaged project between experimentation and production for batch and real-time scoring. The Dataiku production environment can schedule everyday project tasks such as monitoring, updating data, and retraining models, triggered either by a schedule or by alerts. Additionally, Dataiku can be integrated into an existing CI/CD landscape for automated testing, retraining, and deployment with the help of DevOps tools like Jenkins and GitLab CI.

H2O MLOps makes it easy to package models and deploy them to production environments, either as a single instance or on a Kubernetes cluster. MLOps teams can manage multiple environments for development, testing, and production, all running in different locations, directly from H2O MLOps. H2O MLOps includes monitoring for different service levels as well as for data drift, with real-time dashboards provided through a Grafana integration. For model lifecycle management, H2O MLOps gives the operations team the tools to seamlessly update and promote models in production, troubleshoot models, and run deployment strategies like A/B tests on connected environments.
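For readers unfamiliar with data drift monitoring, the following minimal sketch shows one common way to quantify drift for a single numeric feature, the population stability index (PSI). This is not how H2O MLOps computes drift, which we did not inspect at that level; the function names, bin edges, sample values, and the alert threshold of 0.2 (an often-quoted rule of thumb) are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def bin_shares(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Share of values per bin; values outside the edges go to the first or last bin."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        # Count how many inner edges the value has passed to find its bin index.
        idx = sum(1 for edge in edges[1:-1] if v >= edge)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def population_stability_index(
    expected: Sequence[float],
    actual: Sequence[float],
    edges: Sequence[float],
    eps: float = 1e-6,
) -> float:
    """PSI = sum over bins of (a - e) * ln(a / e), with empty bins floored at eps."""
    psi = 0.0
    for e, a in zip(bin_shares(expected, edges), bin_shares(actual, edges)):
        e, a = max(e, eps), max(a, eps)
        psi += (a - e) * math.log(a / e)
    return psi


# Hypothetical feature values seen at training time and in production.
training = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
live = [0.6, 0.7, 0.8, 0.9, 0.8, 0.9, 0.95, 0.7]
edges = [0.0, 0.25, 0.5, 0.75, 1.0]

psi_same = population_stability_index(training, training, edges)
psi_shifted = population_stability_index(training, live, edges)

# Assumed rule-of-thumb threshold; tune it for your own use case.
DRIFT_THRESHOLD = 0.2
drift_detected = psi_shifted > DRIFT_THRESHOLD
```

A monitoring dashboard would typically plot such a score per feature over time and raise an alert once it crosses the chosen threshold.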

5. Data Presentation

Dataiku offers effective visualizations to analyze outcomes and share data insights across the team or organization. Interactive, data-driven dashboards are easy to build, view, and share with stakeholders across the company in just a few clicks. Additionally, integration with existing BI platforms like Tableau, Qlik, and Power BI is provided out-of-the-box, and models can be deployed as a REST API for consumption by other applications. With Dataiku Apps, you can easily create AI apps and publish a project as a usable business application with a few clicks.

With H2O Driverless AI, on the other hand, models can be deployed automatically across several environments as a REST API endpoint to be used in any kind of application, run automatically as a service in the cloud via AWS Lambda, or simply exported as a highly optimized JAR file for edge devices. H2O Driverless AI also integrates with KNIME and Snowflake. H2O Wave provides an easily accessible, integrated web app platform that leverages ML models developed in H2O Driverless AI. This product was not part of our assessment but is worth mentioning.
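Since both platforms can expose models as REST endpoints, a consuming application usually boils down to a small HTTP client. The sketch below uses only the Python standard library. The endpoint URL, the payload layout with a `rows` key, and the bearer-token header are hypothetical and do not reflect the actual request schema of either Dataiku or H2O; consult the respective product documentation for the real format.

```python
import json
import urllib.request
from typing import Any, Dict, List

# Hypothetical endpoint; replace it with the URL your deployment exposes.
SCORING_URL = "https://ml-platform.example.invalid/score"


def build_scoring_request(
    url: str, rows: List[Dict[str, Any]], api_key: str
) -> urllib.request.Request:
    """Build a JSON POST request carrying the rows to score (assumed payload layout)."""
    body = json.dumps({"rows": rows}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
        method="POST",
    )


def score(request: urllib.request.Request, timeout: float = 10.0) -> Any:
    """Send the request and decode the JSON response; needs a reachable endpoint."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Build (but do not send) a request for one hypothetical customer record.
request = build_scoring_request(
    SCORING_URL, [{"age": 42, "income": 55000}], api_key="demo-key"
)
```

Calling `score(request)` against a real deployment would return the predictions as decoded JSON; keeping request construction separate from sending makes the client easy to test without network access.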

6. Outcome and things to look forward to

Dataiku shines as a standalone platform with a focus on ease of use, visual pipelines, and no coding requirements. This benefits Data Citizens but also satisfies coding-heavy Data Scientists. The plugins supported by Dataiku make it possible to handle different types of ML use cases and challenges. Overall, Dataiku provides great coverage across all functional areas and their components, but sometimes lacks the necessary maturity, for example in model serving.

H2O showcases advanced AutoML tools with sophisticated visuals as part of the Driverless AI platform, which is especially focused on Data Citizens and no-code solutions. H2O MLOps provides a comprehensive model deployment solution with almost perfect integration into H2O Driverless AI. H2O Driverless AI + MLOps lacks coverage in Data Ingestion and Storage as well as in the Industrialization Zone, where, for example, components for data pipeline monitoring or model retraining are not supported.

To conclude our first assessment: there are many more points and nuances to discuss for each assessed platform. We hope this article serves as a first appetizer that encourages you to take a closer look at these ML platforms and makes you curious about other platform offerings, which are also evolving every day.

We hope you enjoyed reading this article and that we could share some of the insights we gained. Stay tuned for our next installment, where two new candidates will be discussed.
