An end-to-end guide with code

Photo by pickup on Adobe Stock under license to Zer0 To 5ive

*Disclaimer: I am the Machine Learning lead for Splice Machine.

If you’ve clicked on this article, I don’t need to convince you that machine learning is a game-changing tool. Building ML models is now accessible, easy, and beginner-friendly (most of the time). Deploying those models, however, is another story. Keeping a deployment simple, scalable, and fast at inference time are just a few of the hurdles you have to clear.

One of the biggest problems with creating ML models is that they are often built in environments poorly suited to deployment. Maybe you’ve started with a single-instance Jupyter notebook running Python. You could integrate Spark for big data processing, which is great, but the resulting inference speeds are extremely slow. Maybe you’ve created a pipeline for putting your individual model in production with Flask, which works, but it’s missing reproducibility (and potentially scalability). …


By Ben Epstein, Sergio Ferragut, and Monte Zweben

Source: Adobe Stock Ribkhan

I’ve spent the last few months thinking heavily about feature stores. It’s the hottest new buzzword in the ML space, and everyone has a distinct implementation laser-focused on their own use cases.

A recent article¹ I read covered this exact topic and did a great job of summarizing the fundamental problem: these implementations don’t create a general-purpose, conceptual framework for what a feature store is, focusing instead on the outcomes of their particular use cases. …


If you read my first article, you hopefully have a good understanding of what a feature factory is, why it’s important, and a general idea of how best to cultivate one. If you haven’t, I suggest you check it out first. In this follow-up article, I want to begin diving into MLflow and introduce its major concepts with some code examples.

ML Lifecycle

To begin, let’s create a common understanding of a classic machine learning lifecycle:

A business problem is identified where the application of machine learning might be valuable

A team gathers a large dataset

Data engineers begin cleaning and standardizing the data, preparing it for…


As Monte Zweben, co-founder and CEO of Splice Machine, wrote in an earlier post, How Data Science Silos Undermine Application Modernization, there are a number of important steps any team must take to avoid costly silos that can hinder the modernization journey. Two of the most important are creating the right team for the right features and creating a culture of experimentation through feature factories. The former is quite straightforward: find the statistics wiz, the subject matter expert, and the SQL genius, and you’re well on your way to your data science dream team. The big challenge we see in many companies is the latter, the feature factory, especially with something I like to call feature organization. Before we begin, however, let’s define what a feature is: a feature in data science is a piece of information that your machine learning model can use to predict the outcome (label). …

About

Ben Epstein

Machine Learning Engineer at Splice Machine with a passion for production ML and everything outdoors
