#3: Developing a Machine Learning Model from Start to Finish

This is part 3 of the 6-part tutorial, The Step-By-Step PM Guide to Building Machine Learning Based Products.

You should be familiar with all the technical concepts, so now we can move on to the nitty-gritty of turning an idea into an actual model in production.

Modeling at a Glance

At a high level, building a good ML model is like building any other product: You start with ideation, where you align on the problem you’re trying to solve and some potential approaches. Once you have a clear direction, you prototype the solution, and then test it to see if it meets your needs. You continue to iterate between ideation, prototyping and testing until your solution is good enough to bring to market, at which point you productize it for a broader launch. Now let’s dive into the details of each stage.

Since data is an integral part of ML, we need to layer data on top of this product development process, so our new process looks as follows:

  • Ideation. Align on the key problem to solve, and the potential data inputs to consider for the solution.
  • Data preparation. Collect and get the data in a useful format for a model to digest and learn from.
  • Prototyping and testing. Build a model or set of models to solve the problem, test how well they perform and iterate until you have a model that gives satisfactory results.
  • Productization. Stabilize and scale your model as well as your data collection and processing to produce useful outputs in your production environment.


Ideation

The goal of this phase is to align as a team on the key problem the model solves, the objective function and the potential inputs to the model.

  • Align on the problem. As discussed, machine learning needs to be used to solve a real business problem. Make sure all the stakeholders on your team and in the company agree on the problem you’re solving and how you’ll use the solution.
  • Choose an objective function. Based on the problem, decide what the goal of the model should be. Is there a target value the model is trying to predict, i.e. some measure of “truth” that you can verify against “ground truth” data, such as home prices or stock price changes? Alternatively, are you just trying to find patterns in the data, for example clustering images into groups that have something in common?
  • Define quality metrics. How would you measure the model’s quality? It is sometimes difficult to foresee what acceptable quality is without actually seeing the results, but a directional idea of the goal is helpful.
  • Brainstorm potential inputs. Your goal is to decide what data could help you solve the problem or make decisions. The most helpful question to ask is: “How would an expert in the space approach this problem?” Think about which variables or pieces of data that person would base a solution on. Every factor that may affect human judgement should be tested, so at this stage go as broad as possible. Understanding the key factors may require knowledge of the business problem space, which is one of the reasons it’s important for business and product people to be heavily involved at this stage. The data team will have to translate these potential inputs into model features. Note that turning inputs into features may require additional processing; more on that next.

Data Preparation

The goal of this phase is to collect raw data and get it into a form that can be plugged as an input into your prototype model. You may need to perform complex transformations on the data to achieve that. For example, suppose one of your features is consumer sentiment about a brand: You first need to find relevant sources where consumers talk about the brand. If the brand name includes commonly used words (e.g. “Apple”), you need to separate the brand chatter from the general chatter (about the fruit) and run it through a sentiment analysis model, and all that before you can begin to build your prototype. Not all features are this complex to build, but some may require significant work.
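
To make the “Apple” example concrete, here is a deliberately crude sketch of separating brand chatter from fruit chatter. It assumes a hand-picked list of context keywords (`TECH_HINTS`) and a hypothetical helper `mentions_brand`; a real pipeline would more likely use a trained classifier, and the surviving posts would then go on to a sentiment model.

```python
# Hypothetical context words suggesting the post is about the company.
TECH_HINTS = {"iphone", "macbook", "ipad", "ios", "mac"}

def mentions_brand(post, brand="apple", hints=TECH_HINTS):
    """Return True if the post names the brand AND contains a tech hint."""
    # Lowercase each word and strip surrounding punctuation.
    words = {w.strip(".,!?\"'").lower() for w in post.split()}
    return brand in words and bool(words & hints)

posts = [
    "My new Apple MacBook is great!",
    "I ate an apple pie.",
]
brand_posts = [p for p in posts if mentions_brand(p)]
print(brand_posts)
```

Even a rough filter like this is often enough for a prototype; the point is to learn whether the feature is useful before investing in a better one.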

Let’s look at this phase in more detail:

  • Collect data for your prototype in the fastest way possible. First, identify your missing data. In some cases you may have to break down the necessary inputs to get to the “building blocks” level of raw data that is more easily available, or settle for data that is a close proxy to what you need and is easier to get. Once you have identified the gaps, figure out the quickest, easiest way to get the data. Non-scalable methods such as a quick manual download, a rudimentary scraper or buying a sample of data (even if it is a little expensive) may be the most practical approach. Investing too much in scaling your data acquisition at this stage usually doesn’t make sense, since you don’t yet know how useful the data will be, what format would be best etc. Business people should be involved: they can help brainstorm ways to find data that is not readily available, or simply get it for the team. The relevant business functions to involve depend on the data needs and the org structure; partnerships, business development or marketing may be helpful here. Note that in the case of a supervised learning algorithm, you need data not just for the model features; you also need “ground truth” data points for your model’s objective function in order to train and then verify and test your model. Back to the home prices example: in order to build a model that predicts home prices, you need to show it some homes with prices!
  • Data cleanup and normalization. At this stage the responsibility largely moves to your data science / engineering team. There is significant work involved in translating ideas and raw data sets into actual model inputs. Data sets need to be sanity checked and cleaned up to avoid using bad data, irrelevant outliers etc. Data may need to be transformed into a different scale in order to make it easier to work with or align with other data sets. Especially when dealing with text and images, pre-processing the data to extract the relevant information is usually required. For example, plugging too many large images into a model results in an enormous amount of information that may not be feasible to process, so you may need to downgrade the quality, work with a portion of the image or use only the outlines of objects. In the case of text, you may need to detect the entities that are relevant to you in the text before you decide to include it, perform sentiment analysis, find common n-grams (frequently used sequences of a certain number of words) or perform a variety of other transformations. These are usually supported by existing libraries and don’t require your team to reinvent the wheel, but they take time.
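
As a rough illustration of the cleanup steps above, the sketch below drops an obviously bad value, rescales the rest to a common 0 to 1 range and counts frequent n-grams in a piece of text. The function names, the outlier rule (`max_ratio`) and the sample values are all assumptions made for the example; your data team would normally rely on existing libraries for this.

```python
from collections import Counter
from statistics import median

def drop_outliers(values, max_ratio=10.0):
    """Keep values within max_ratio of the median (a crude, assumed rule)."""
    mid = median(values)
    return [v for v in values if mid / max_ratio <= v <= mid * max_ratio]

def min_max_scale(values):
    """Rescale values linearly so the smallest is 0.0 and the largest 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def top_ngrams(text, n=2, k=3):
    """Return the k most frequent sequences of n consecutive words."""
    words = text.lower().split()
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams).most_common(k)

# Made-up home prices; the 3 stands in for a data-entry error.
raw_prices = [250_000, 310_000, 275_000, 3, 420_000]
clean_prices = drop_outliers(raw_prices)
scaled_prices = min_max_scale(clean_prices)

print(clean_prices)
print(scaled_prices)
print(top_ngrams("great phone great phone bad battery", n=2, k=1))
```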

Prototyping and Testing

The goal of this stage is to get to a prototype of a model, test it and iterate on it until you get to a model that gives good enough results to be ready for production.

  • Build prototype. Once the data is in good shape, the data science team can start working on the actual model. Keep in mind that there’s a lot of art in the science at this stage. It involves a lot of experimentation and discovery — selecting the most relevant features, testing multiple algorithms etc. It’s not always a straightforward execution task, and therefore the timeline of getting to a production-ready model can be very unpredictable. There are cases where the first algorithm tested gives great results, and cases where nothing you try works well.
  • Validate and test prototype. At this stage your data scientists will work to ensure the final model is as good as it can be. They’ll assess model performance based on the predefined quality metrics, compare the performance of the various algorithms they tried, tune any parameters that affect model performance and eventually test the performance of the final model. In the case of supervised learning, they’ll need to determine whether the model’s predictions, when compared to the ground truth data, are good enough for your purposes. In the case of unsupervised learning, there are various techniques to assess performance, depending on the problem. That said, there are many problems where just eyeballing the results helps a lot. In the case of clustering, for example, you may be able to plot the objects you cluster across multiple dimensions or, if the objects are a form of media, view or listen to them to see if the clustering seems intuitively reasonable. If your algorithm is tagging documents with keywords, do the keywords make sense? Are there glaring gaps where the tagging fails or important use cases are missing? This doesn’t replace the more scientific methods, but in practice it helps to quickly identify opportunities for improvement. That’s also an area where another pair of eyes helps, so make sure not to leave it just to your data science team.
  • Iterate. At this point you need to decide with your team whether further iterations are necessary. How does the model perform vs. your expectations? Does it perform well enough to constitute a significant improvement over the current state of your business? Are there areas where it is particularly weak? Are more data points required? Can you think of additional features that will improve performance? Are there alternative data sources that would improve the quality of inputs to the model? Some additional brainstorming is often required here.
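
The validate-and-test step for a supervised model can be sketched as follows, using the home prices example. Everything here is an assumption made for illustration: the data points are made up, the model is a one-feature straight-line fit and the quality metric is mean absolute error. The idea it shows is general, though: hold out some ground truth data the model never saw, and compare the model against a naive baseline before deciding whether to iterate.

```python
from statistics import mean

# Made-up (size in square feet, price) pairs, for illustration only.
homes = [(800, 160_000), (1000, 205_000), (1200, 238_000), (1500, 310_000),
         (1800, 355_000), (2000, 410_000), (2200, 436_000), (2500, 505_000)]

# Hold out every third home for testing; train on the rest.
test = homes[1::3]
train = [h for i, h in enumerate(homes) if i % 3 != 1]

def fit_line(points):
    """Least-squares fit of price = slope * size + intercept."""
    mx = mean(x for x, _ in points)
    my = mean(y for _, y in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, my - slope * mx

def mean_abs_error(predict, points):
    """Average absolute gap between predictions and ground truth."""
    return mean(abs(predict(x) - y) for x, y in points)

slope, intercept = fit_line(train)
baseline_price = mean(y for _, y in train)  # naive baseline: always guess the average

model_mae = mean_abs_error(lambda size: slope * size + intercept, test)
baseline_mae = mean_abs_error(lambda size: baseline_price, test)

print(f"model error: {model_mae:,.0f}  baseline error: {baseline_mae:,.0f}")
```

If the model barely beats the baseline, that is a strong signal to go back and look for better features or more data.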


Productization

You get to this stage when you decide that your prototype model works well enough to address your business problem and can be launched in production. Note that if you’re not ready to commit to full productization, you need to figure out which dimensions you want to scale your model on first. Say your product is a movie recommendation tool: you may want to open access to only a handful of users but provide a complete experience for each user, in which case your model needs to rank every movie in your database by relevance to each of those users. That’s a different set of scaling requirements than, say, providing recommendations only for action movies but opening up access to all users.

Now let’s discuss the more technical aspects of productizing a model:

  • Increase data coverage. In many cases you prototype your model based on a more limited set of data than you would actually use in production. For example, you prototype the model on a certain segment of customers and then need to broaden it to your entire customer base.
  • Scale data collection. Once you have verified which data is useful for the model, you need to build a scalable way to gather and ingest it. In the prototyping phase it was fine to gather data manually and in an ad-hoc fashion, but for production you want to automate that as much as possible.
  • Refresh data. Create a mechanism that refreshes the data over time — either updates existing values or adds new information. Unless for some reason you don’t need to keep historical data, your system needs to have a way to store growing quantities of data over time.
  • Scale models. There is both a data science and an engineering aspect to this. From a data science perspective, if you changed the underlying data, e.g. expanded the number of customer segments you include, you need to retrain and retest your models. A model that works well on a certain data set won’t always work on a broader or otherwise different data set. Architecturally, the model needs to be able to scale to run more frequently on growing quantities of data. In the movie recommendations example that would likely be more users, more movies and more information about each user’s preferences over time.
  • Check for outliers. While the model as a whole may scale very well, there may be small but important populations that the model doesn’t work well for. For example, your movie recommendations may work very well for users on average, but parents will see mostly kids’ movies because they choose movies for their kids from their own account. This is a product design problem: you need to separate the recommendations for the parent from the recommendations for their kids in the product, but this is not something the model will just tell you.
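
One way to look for such populations is to break your quality metric down by user segment instead of only looking at the average. The sketch below assumes you have logged, per recommendation, a segment name and whether the user found the recommendation relevant; the segment names, the sample records and the `min_accuracy` threshold are all hypothetical.

```python
from collections import defaultdict

def accuracy_by_segment(records):
    """records: iterable of (segment, was_relevant) pairs."""
    hits, totals = defaultdict(int), defaultdict(int)
    for segment, relevant in records:
        totals[segment] += 1
        hits[segment] += int(relevant)
    return {s: hits[s] / totals[s] for s in totals}

def weak_segments(records, min_accuracy=0.6):
    """Return segments whose accuracy falls below the chosen threshold."""
    scores = accuracy_by_segment(records)
    return sorted(s for s, acc in scores.items() if acc < min_accuracy)

# Made-up feedback log: the overall average hides the weak segment.
records = ([("no_kids", True)] * 8 + [("no_kids", False)] * 2
           + [("parents", True)] * 1 + [("parents", False)] * 3)

overall = sum(relevant for _, relevant in records) / len(records)
print(f"overall: {overall:.2f}")
print(accuracy_by_segment(records))
print(weak_segments(records))
```

The breakdown only tells you where the model is weak; deciding what to do about it, such as separate profiles for parents and kids, is still a product decision.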

What I have described so far is a conceptual flow. In reality the lines often blur, and you have to go back and forth between phases quite often. You may get unsatisfactory results from your data sourcing efforts and have to rethink the approach, or you may productize the model and see that it works so poorly with production data that you need to go back to prototyping.

A Note on Outsourcing

Model building often involves some very time-consuming and Sisyphean tasks such as generating labeled data and testing the model. For example, you may need to label hundreds or thousands of data points with the right categories as input for a classification algorithm, and then check whether the output of the classification model is correct. It is very useful to set up an on-demand way to outsource such tasks as they come up. In my experience you can get decent results from Mechanical Turk if you get several people to perform the same simple task and take the most frequent answer or some kind of average. There are platforms like CrowdFlower that give more reliable results, but they are also more expensive. Certain tasks require more pre-training of the people performing them (e.g. if the task is specific to your space and/or requires prior knowledge), in which case you may want to check out platforms such as Upwork.
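
Taking the most frequent answer from several workers can be sketched in a few lines. The helper name `majority_label` and the `min_agreement` threshold are assumptions for this example and are not part of any crowdsourcing platform’s API; items where the workers disagree too much are returned as `None` so that someone can review them by hand.

```python
from collections import Counter

def majority_label(labels, min_agreement=0.5):
    """Return the most frequent label, or None if agreement is too low."""
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    # Require strictly more than min_agreement of the workers to agree.
    return label if count / len(labels) > min_agreement else None

# Made-up answers from three workers per image.
answers = {
    "img_001": ["cat", "cat", "dog"],
    "img_002": ["dog", "cat", "bird"],
}
final_labels = {item: majority_label(votes) for item, votes in answers.items()}
print(final_labels)
```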

Now that you understand the process of building a model, next we’ll discuss who does what in this process and how to build an organization that would be the most effective in making that happen.
