Face detection and landmark extraction on a $100 bill. Spoiler: this bill won’t buy you too much ML.

The Cost of Machine Learning Projects

Raul Incze
Published in Cognifeed
8 min read · Sep 12, 2019

Estimating the cost of a generic machine learning project, without knowing most of the details, is a titanic endeavor. We are, nevertheless, going to try… but take the results with a grain of salt.

In order to make this estimation possible we need to eliminate, from the start, two types of projects: the trivial and the academic. Projects of the first type already have a solution out there: both the dataset and the model architecture already exist. These types of projects are basically free, so we're ignoring them.

The second type covers projects that require fundamental academic research: applying ML to a whole new domain, or to data structures entirely different from what mainstream models handle. Or maybe you're working with extreme hardware limitations. Whichever the case, the cost of such a project is quite likely beyond what you can afford (be it in terms of money or partnerships).

In between these two types lies the majority of ML initiatives, and those are the projects we're going to focus on. You take some algorithms or model architectures that already exist and tweak them to suit the original data you're working with. Sounds quite straightforward and simple, right? How much could all of this cost? You might be surprised…

Disclaimer: In order to estimate some of the costs, a yearly compensation of $60,000 has been assumed for every machine learning engineer. According to Glassdoor, this is the average salary for ML engineers in Europe. We're aware that in the US it's roughly twice as expensive, but we wanted somewhat more optimistic numbers.

The cost of data

Data is fuel for any machine learning project. Most of the existing research and solutions focus on variations of supervised learning. It is well known that deep supervised learning approaches are particularly data hungry. And not only do they crave a lot of samples, but the data also has to be manually annotated beforehand.

A recent study by Dimensional Research, on behalf of Alegion, shows that 96% of all organisations run into problems related to training data quality and quantity. The same study shows that most projects need over 100,000 data samples to perform well.

A chart from the Dimensional Research study illustrating the most common issues companies are facing when it comes to data.

If you don’t have your data yet, it would be fair to assume that one person can collect 5–10 samples, together with their labels and annotations, in about an hour. Using a service like Amazon’s Mechanical Turk to crowdsource the whole process, it would cost you around $70,000 to generate a 100,000-sample dataset.
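As a quick sanity check on that figure, here is a minimal sketch of the arithmetic. The per-hour collection rate and the hourly worker cost are rough assumptions for illustration, not actual Mechanical Turk pricing:

```python
# Rough cost model for crowdsourced data collection.
# All figures are illustrative assumptions, not vendor quotes.
SAMPLES_NEEDED = 100_000
SAMPLES_PER_HOUR = 7.5   # midpoint of the 5-10 samples/hour estimate
HOURLY_RATE = 5.25       # assumed cost per worker-hour, in USD

hours = SAMPLES_NEEDED / SAMPLES_PER_HOUR
cost = hours * HOURLY_RATE
print(f"{hours:,.0f} worker-hours, about ${cost:,.0f}")  # ~$70,000
```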

But if you’re lucky and have already gathered a fair amount of data, you can use a service like Scale to get it annotated. Getting 100,000 data samples labeled could cost anywhere from $8,000 to $80,000 (ouch), depending on the complexity of the annotation.

So now we have the quantity. But what about the quality? It is almost as time-consuming to check and correct data samples as it is to generate and annotate them. The same Dimensional Research study mentions that 66% of companies run into bias and error problems in their data sets.

This is a fairly well-known issue when outsourcing data collection and annotation. To offset it, most organisations have their own in-house team responsible for annotation and data cleaning. Some opt for a fully in-house approach (doing all the labeling themselves); others go for a mix of outsourcing and in-house work. The latter is the most common scenario: outsource the bulk of the work, then have one or two people in charge of validating and cleaning the data samples and the labels.

This is what an annotated image for object detection (in this case, car detection) looks like. Human annotators have to draw the blue bounding boxes around the desired objects. Image from Playment, an image annotation tool.

Having a small in-house validation team can add around $2,500 to $5,000 to the initial cost of outsourcing the 100k data samples. This assumes that you can find cheaper talent for your annotation team than for the ML one.

Bottom line: a solid dataset will set you back anywhere from $10,500 to $85,000, depending on the nature of your data and the complexity of your annotations.
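That range is simply the outsourced annotation cost plus the in-house validation cost, sketched out:

```python
# Dataset cost range: outsourced labeling of 100k samples plus a small
# in-house validation team, using the figures discussed above.
annotation_low, annotation_high = 8_000, 80_000  # labeling service (e.g. Scale)
validation_low, validation_high = 2_500, 5_000   # in-house QA on top

low = annotation_low + validation_low
high = annotation_high + validation_high
print(f"dataset cost: ${low:,} to ${high:,}")  # $10,500 to $85,000
```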

The cost of research

The research, for the type of project we narrowed down at the beginning of the article, consists of the initial feasibility study (can this be solved with AI?) and the algorithm search and experimentation phase: basically, the early exploratory stages that every project goes through before production.

Photo by Lukas from Pexels

The Dimensional Research report states that the majority of enterprise AI teams have under 10 members. It is therefore safe to assume that teams have, on average, 5 members. Out of these 5, maybe 3 are outsourced (either services or freelancers). In this configuration, the team can probably work on 2 projects in parallel and conduct the research for a project in one to two months (let's average it to 1.5).

So that’s 2 employees (2 × $5,000) and 3 freelancers (3 × $3,000), resulting in $19,000 per month. If the team can handle two projects at the same time and the research phase lasts 1.5 months, the cost of this phase comes to about $14,250 per project.
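The arithmetic above can be sketched as:

```python
# Monthly research-team burn rate and the resulting per-project cost,
# using the team composition and salary assumptions stated earlier.
employees, employee_monthly = 2, 5_000     # $60,000/year in-house engineers
freelancers, freelancer_monthly = 3, 3_000 # outsourced team members

monthly_burn = employees * employee_monthly + freelancers * freelancer_monthly
research_months = 1.5    # average length of the research phase
parallel_projects = 2    # the team splits its time across two projects

cost_per_project = monthly_burn * research_months / parallel_projects
print(f"${monthly_burn:,}/month -> ${cost_per_project:,.0f} per project")
```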

Please remember the initial assumption for this estimate: the methods and algorithms needed to solve the problem at hand already exist, and you can just take the state-of-the-art solution, or whichever solution works best within your computational constraints.

The cost of production

The cost of production includes infrastructure costs (cloud compute, data storage), integration costs (data pipeline development, API development, documentation) and maintenance costs. We will estimate these costs over a 12-month period.

Of these, the smallest expenditure is probably the cloud compute. Of course, it varies depending on the complexity of the algorithm being deployed. If the model is not deep and is trained on low-dimensional tabular data, you can get away with 4 virtual CPUs running on 1 to 3 nodes for $100–$300 per month, meaning $1,200–$3,600 each year. On the other end of the spectrum, for low-latency deep learning inference you can shell out anywhere from $10,000 to $30,000. Realistically, an instance with 4 vCPUs and one older GPU will do decently enough for most use cases. Such a virtual machine would cost you approximately $4,000 per year.

Integration can be quite tricky. In most cases, putting an API endpoint in the cloud and documenting it for the rest of the system is all that's needed. Preparing machine learning models for serving and writing the scaffolding for the API should take 20 to 30 development hours at most, including testing. This means a cost of around $1,500, plus whatever it takes to modify the rest of the system to use the new API (which in itself could prove troublesome, but we'll ignore that here). A solid data pipeline will take significantly more, probably somewhere around 80 hours. So that's $2,500 more.

One aspect people seem to overlook when setting off to develop a machine learning system is that it needs continuous support throughout its life cycle. The data that keeps coming in through the APIs has to be cleaned and annotated. Then models need to be retrained on the new data, tested and deployed. New, better architectures might appear that improve on what's currently implemented, so the main algorithm might need a change from time to time. Hopefully it will be an effortless swap.

According to the same Dimensional Research study, most organisations keep committing 25%–75% of the resources used to build the initial solution to the machine learning project. But let's be optimistic and assume that the initial solution was very well thought out and that a good part of these recurring tasks is automated. In this case, one part-time engineer is probably enough, so maintenance could end up costing you around $30,000.

This puts the total cost of production at around $38,000 ($4,000 + $1,500 + $2,500 + $30,000).
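Summing the components discussed in this section, using the ~$4,000 GPU-instance figure for compute:

```python
# Yearly production cost from the components above. The cloud figure
# assumes the mid-range "4 vCPUs + one older GPU" instance scenario.
cloud_compute = 4_000    # GPU instance, per year
api_integration = 1_500  # ~20-30 dev hours: model serving + API scaffolding
data_pipeline = 2_500    # ~80 dev hours
maintenance = 30_000     # one part-time ML engineer for a year

production_total = cloud_compute + api_integration + data_pipeline + maintenance
print(f"production: ${production_total:,}/year")
```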

The opportunity cost

The biggest cost of implementing AI is by far the opportunity cost. Too many people get enamored with the buzzwords that ML and AI represent and sink their development budgets into pursuing the technology rather than addressing a real problem.

We’re at a stage where ML is still a highly experimental technology, reflected in a high variance in success rates. Gartner predicts that through 2022, 85% of AI projects will deliver erroneous outcomes. We’re not going to go into too much detail on why they fail, but here’s a very good write-up on the subject by Alison Doucette.

If your users' or customers' pain points are addressable through rule-based systems or traditional non-learning algorithms, it's a good idea to implement those first. You should carefully assess the return on investment of ML compared to these traditional alternatives before jumping in headfirst and hiring a data science team.

Total

Based on our assumptions, a machine learning project can cost your company roughly $62,000 to $137,000, excluding the hard-to-determine opportunity cost. The wide range comes mostly from the nature of your data and annotations. And this is a very optimistic estimate: if your enterprise is located in the US, or you're working with sensitive data (which freelancers won't touch), the talent-related costs surge, putting ML projects upwards of $108,500.
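Pulling the three phases together, with production taken as the sum of its components ($4,000 + $1,500 + $2,500 + $30,000):

```python
# End-to-end total: data + research + production, using the per-phase
# figures derived throughout the article.
data_low, data_high = 10_500, 85_000  # dataset creation and validation
research = 14_250                     # feasibility study + experimentation
production = 4_000 + 1_500 + 2_500 + 30_000  # compute, API, pipeline, upkeep

total_low = data_low + research + production
total_high = data_high + research + production
print(f"total: ${total_low:,} to ${total_high:,}")
```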

The steep price of machine learning makes it less accessible for individuals, small teams and startups that want to tackle new problems or automate their processes and decision making. And the steepest step is the first one: getting the data. Without data, it's almost impossible to validate a machine learning solution in the research phase, resulting in a near-deadlock.

This is one of the issues we are trying to address with Cognifeed. We enable our users and customers to quickly experiment and iterate on machine learning solutions with very little data. All you need is a human teacher who has the knowledge that needs to be passed on to the machine.

Getting a machine learning solution to market as quickly as possible is the secret to generating large amounts of quality data. We provide the tools so you can do just that!

Be sure to sign up if you want to give Cognifeed a spin!
