Euclid: Blueprint to Create End-to-End AI Application Pipelines

By Mohan Reddy, CTO of The Hive

Introduction

In the last few years, we’ve become capable of building real-world AI products that automate tasks that were previously infeasible because of computational complexity. With the technology we have today, it is possible to solve a specific instance of a problem, but it is much more difficult to solve a generalized version of the same problem. This is the primary reason why AI product development is still specialized and relatively expensive, rather than just another service purchased from a vendor or built by simply plugging in an open source module. Euclid, developed at The Hive, aims to bridge this gap by providing building blocks for developing end-to-end AI application pipelines.

Background on Synapse

Synapse was started at The Hive in 2013 with the goal of building end-to-end big data pipelines. Its design was driven by several trends: fast-changing open source technologies adding complexity to big data application design; real-time stream analytics for operations that respond to patterns in live data streams; rethinking trade-offs between scale-up and scale-out architectures; faster analysis through smarter partitioning of data and parallelism in model building; and reducing the overhead of product development in data management, lineage, and curation.

It has been used successfully by many of The Hive portfolio companies to scale their data applications.

Euclid — Motivation and Introduction

More and more organizations are adopting machine learning to gain knowledge from data across a broad spectrum of use cases and products. The workflow of applying machine learning to a specific use case looks simple: in the training phase, learning algorithms take a dataset as input and emit a trained model; in the inference phase, the model takes features as input and produces predictions. The actual workflow becomes far more complex when machine learning is deployed in production. Building this kind of automation is non-trivial. Creating and maintaining a platform for reliably producing and deploying machine learning models requires careful orchestration of many components: a learning component for generating models from training data, modules for analyzing and validating both data and models, and infrastructure for serving models in production. This becomes particularly challenging when data changes over time and fresh models must be produced continuously. Unfortunately, such orchestration is often done ad hoc, with glue code and custom scripts developed by individual teams for specific use cases, leading to duplicated effort and fragile systems with high technical debt.

Hence the motivation for Euclid. Euclid is a general-purpose machine learning platform with deep learning pipelines and blueprints, implemented at The Hive by integrating various components into one platform. Euclid standardizes the platform components, simplifies platform configuration, and reduces time to production. The goal is to enable teams to easily deploy machine learning in production for a wide range of applications, ensure best practices, and limit one-off implementations that cannot be reused.

Euclid Components

Data Ingestion

Data is ingested from multiple sources in both streaming and batch modes using a rich set of I/O connectors available from Apache Beam. It supports a wide variety of data formats as well.
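As a rough illustration, batch ingestion with the Apache Beam Python SDK might look like the sketch below; the bucket paths and the newline-delimited JSON format are hypothetical placeholders, not part of Euclid.

```python
# A minimal sketch of batch ingestion with the Apache Beam Python SDK.
# The bucket paths and the JSON record format are assumptions for
# illustration; Euclid's actual connector configuration is not shown.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def parse_record(line):
    # Assume each input line is one JSON-encoded record.
    return json.loads(line)

with beam.Pipeline(options=PipelineOptions()) as pipeline:
    (
        pipeline
        | "ReadRawEvents" >> beam.io.ReadFromText("gs://example-bucket/events/*.json")
        | "ParseJson" >> beam.Map(parse_record)
        | "WriteParsed" >> beam.io.WriteToText("gs://example-bucket/parsed/events")
    )
    # For streaming ingestion, the file source can be swapped for a
    # streaming connector such as beam.io.ReadFromPubSub, with streaming
    # pipeline options enabled.
```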

Data Preprocessing

Data processing is the most important part of the pipeline; data scientists and data engineers typically spend 80% of their time in this step, and rightfully so: machine learning models are only as good as their training data, so understanding the data and applying the right transformations is critical. Euclid’s data processing component comprises libraries for data analysis, validation, missing-data handling, feature mappings, vocabularies, feature wrangling, and data linting.
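As a minimal sketch of what this step covers, the snippet below uses plain pandas to perform simple data linting and missing-value imputation; it is illustrative only and does not reflect Euclid’s actual preprocessing libraries.

```python
# A minimal sketch of data linting and missing-data handling using plain
# pandas; illustrative only, not Euclid's preprocessing library.
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Basic linting: fail fast on columns that contain no data at all.
    empty_cols = [col for col in df.columns if df[col].isna().all()]
    if empty_cols:
        raise ValueError(f"columns with no data: {empty_cols}")

    # Impute missing numeric values with the column median, and missing
    # categorical values with an explicit token.
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna("<missing>")
    return df
```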

Figure 1: Euclid Components

Feature Engineering

Feature engineering transforms raw data into features that represent the problem and serve as inputs to the machine learning models. Model accuracy depends on these features, which are often domain specific. Euclid performs statistical analysis on the datasets: for continuous data, the statistics include histograms, mean and standard deviation, and quantiles; for categorical/discrete data, they include the top-k values by occurrence frequency. Looking at these feature statistics provides general insight into the dataset.
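The snippet below sketches how such per-feature statistics can be computed with pandas: mean, standard deviation, quantiles, and a histogram for continuous columns, and top-k value counts for categorical columns. The function name and bin count are illustrative assumptions.

```python
# A sketch of per-feature dataset statistics, computed with pandas.
import pandas as pd

def feature_statistics(df: pd.DataFrame, top_k: int = 10) -> dict:
    stats = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            # Continuous feature: summary statistics and a coarse histogram.
            stats[col] = {
                "mean": df[col].mean(),
                "std": df[col].std(),
                "quantiles": df[col].quantile([0.25, 0.5, 0.75]).to_dict(),
                "histogram": pd.cut(df[col], bins=10).value_counts().sort_index().to_dict(),
            }
        else:
            # Categorical/discrete feature: top-k values by frequency.
            stats[col] = {
                "top_k": df[col].value_counts().head(top_k).to_dict(),
            }
    return stats
```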

Data Labeling

The main goal of data labeling is to convert non-servable knowledge into servable models and to help with feature engineering. A significant corpus of historical information in a specific domain enables an AI application to extract key concepts, recognize entities, and identify associations and hierarchies, generating what we call smart data by merging domain knowledge with ontologies. A common bottleneck in deploying supervised learning systems is collecting human-annotated examples; in many domains, annotators form an opinion about the label of an example incrementally. Euclid’s data labeling tools are applied to text data and machine data and use active learning techniques for semi-automatic data annotation. They also make use of source heuristics, content heuristics, and model- and graph-based labeling.
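One common active learning technique is uncertainty sampling: train a model on the examples labeled so far and route the examples it is least confident about to human annotators. The sketch below illustrates the idea with scikit-learn; the model choice and batch size are assumptions for illustration, not Euclid internals.

```python
# A minimal uncertainty-sampling loop for semi-automatic annotation,
# sketched with scikit-learn.
import numpy as np
from sklearn.linear_model import LogisticRegression

def select_for_annotation(X_labeled, y_labeled, X_unlabeled, batch_size=20):
    """Return indices of the unlabeled examples the model is least certain about."""
    model = LogisticRegression(max_iter=1000)
    model.fit(X_labeled, y_labeled)

    # Uncertainty = 1 - probability of the most likely class.
    probs = model.predict_proba(X_unlabeled)
    uncertainty = 1.0 - probs.max(axis=1)

    # The most uncertain examples are the most valuable to label next.
    return np.argsort(uncertainty)[-batch_size:]
```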

Model Training

Euclid standardizes the training process across multiple use cases, whether it uses TensorFlow, PyTorch, or classical machine learning frameworks. It takes minimal effort to convert IPython notebooks to this framework, and the framework is algorithm agnostic. Trained models are stored in a model store. Efforts are underway to make this step declarative using a DSL.
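One way to picture such a standardized, algorithm-agnostic training step is a thin trainer interface whose output is written to a versioned model store. The sketch below is purely hypothetical and only illustrates the idea; it is not Euclid’s actual abstraction.

```python
# A hypothetical sketch of a framework-agnostic trainer interface and a
# file-based model store, for illustration only.
import os
import pickle
from abc import ABC, abstractmethod

class Trainer(ABC):
    @abstractmethod
    def train(self, dataset):
        """Train on the dataset and return a serializable model object."""

class SklearnTrainer(Trainer):
    def __init__(self, estimator):
        self.estimator = estimator

    def train(self, dataset):
        X, y = dataset
        self.estimator.fit(X, y)
        return self.estimator

def save_to_model_store(model, store_dir, name, version):
    # Serialize with pickle for simplicity; real pipelines would use
    # framework-native formats (SavedModel, TorchScript, ONNX, ...).
    path = os.path.join(store_dir, name, str(version))
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "model.pkl"), "wb") as f:
        pickle.dump(model, f)
    return path
```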

Model Evaluation

Euclid employs metrics such as the confusion matrix, AUC-ROC, gain and lift charts, root mean squared error, and cross-validation to evaluate model accuracy after the training step and to determine the right candidate model to push to the model store for the evaluation/serving module.
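The snippet below sketches these metrics with scikit-learn on a held-out set; the function names and the split between classification and regression metrics are illustrative assumptions.

```python
# A sketch of common evaluation metrics computed with scikit-learn.
import numpy as np
from sklearn.metrics import confusion_matrix, mean_squared_error, roc_auc_score
from sklearn.model_selection import cross_val_score

def evaluate_classifier(model, X_test, y_test):
    preds = model.predict(X_test)
    scores = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    return {
        "confusion_matrix": confusion_matrix(y_test, preds),
        "auc_roc": roc_auc_score(y_test, scores),
    }

def evaluate_regressor(model, X_test, y_test):
    preds = model.predict(X_test)
    return {"rmse": np.sqrt(mean_squared_error(y_test, preds))}

def cross_validate(model, X, y, folds=5):
    # k-fold cross-validation as an additional stability check.
    return cross_val_score(model, X, y, cv=folds)
```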

Model Serving

The model serving component uses Docker for serving the serialized candidate model. It supports both TensorFlow Serving and NVIDIA’s TensorRT frameworks for scaling.
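For example, a model exported to TensorFlow Serving can be queried over its REST API; the container command, model name, and paths in the sketch below are illustrative.

```python
# A sketch of querying a model served by TensorFlow Serving over REST.
# The serving container is typically started along these lines
# (illustrative paths):
#
#   docker run -p 8501:8501 \
#     --mount type=bind,source=/path/to/my_model,target=/models/my_model \
#     -e MODEL_NAME=my_model -t tensorflow/serving
#
import requests

def predict(instances, host="localhost", model_name="my_model"):
    # TensorFlow Serving's predict endpoint expects {"instances": [...]}.
    url = f"http://{host}:8501/v1/models/{model_name}:predict"
    response = requests.post(url, json={"instances": instances})
    response.raise_for_status()
    return response.json()["predictions"]

# Example: predictions = predict([[1.0, 2.0, 5.0]])
```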

Orchestration

Euclid supports Airflow and Kubeflow for end-to-end pipeline orchestration. It also includes support for visualization and collaboration, along with runtime metrics and lineage tracking.
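A minimal Airflow DAG mirroring the pipeline stages described above might look like the sketch below; the task callables, schedule, and DAG id are placeholders, not Euclid’s actual orchestration code.

```python
# A minimal Airflow DAG sketching the pipeline stages described in this
# post. The task bodies are placeholders for the real ingestion, training,
# evaluation, and deployment steps.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest(): ...
def preprocess(): ...
def train(): ...
def evaluate(): ...
def deploy(): ...

with DAG(
    dag_id="euclid_example_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    tasks = [
        PythonOperator(task_id=name, python_callable=fn)
        for name, fn in [
            ("ingest", ingest),
            ("preprocess", preprocess),
            ("train", train),
            ("evaluate", evaluate),
            ("deploy", deploy),
        ]
    ]
    # Run the stages sequentially.
    for upstream, downstream in zip(tasks, tasks[1:]):
        upstream >> downstream
```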

Conclusion

Software engineering practices such as Test Driven Development (TDD), continuous integration, rollback and recovery, change control, and so on are being introduced into advanced machine learning practice. It is not enough for a specialist to develop in a Jupyter notebook and throw it over the wall to a team to make it operational. The same end-to-end DevOps practices that we find today in the best engineering companies are also going to be demanded in machine learning endeavors. Many interesting challenges remain. While Euclid is general purpose and already supports a variety of model and data types, it is flexible and accommodates new innovations from the machine learning/deep learning community.

Euclid is currently available to The Hive portfolio companies. We plan to open-source it at a later time.


To find out more about the benefits of building a startup with The Hive, visit our website or email us at jobs[@]hivedata.com.