Should You Use Machine Learning?

How to know if machine learning can solve your engineering or business problem

Devin Soni
Jul 19, 2019 · 5 min read
Photo by Franck V. on Unsplash

Introduction

How do you know if you really need machine learning?

Even though these techniques can be very useful, they don’t fit every situation. If you try to apply machine learning to an inappropriate task, you may waste time and money, and could end up with a poorly performing model that is of little use.

In this article, I will go through some questions you should ask when determining whether or not machine learning is right for your situation. These should act as a framework to guide your decision-making process.

The Task

Do you have a well-defined problem with clear inputs and outputs? It is essential that you have a clear idea of what your model would have as inputs and outputs. Otherwise, you may have a difficult time in the feature engineering and evaluation stages of producing a machine learning model.

Do you have metrics that you can use to evaluate a model’s performance and to compare different models? Without an easy way to evaluate models, you will have a difficult time determining whether your model was successful and choosing which model to use. You will also have a difficult time iterating on and improving your model as your use case evolves over time.
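As a minimal sketch of what comparing models on shared metrics can look like, the snippet below scores two candidate models against the same held-out labels. The labels, the predictions, and the names `model_a` and `model_b` are made up for illustration, and accuracy, precision, and recall are only one possible choice of metrics for a binary task.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    return tp, fp, fn, tn


def evaluate(y_true, y_pred):
    """Return accuracy, precision, and recall for one model's predictions."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical held-out labels and predictions from two candidate models.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
model_a = [1, 0, 0, 1, 0, 1, 1, 0]
model_b = [1, 1, 1, 1, 0, 1, 1, 0]

scores_a = evaluate(y_true, model_a)
scores_b = evaluate(y_true, model_b)
print(scores_a)
print(scores_b)
```

In this toy data both models have the same accuracy, but they trade precision against recall differently, so which one is "better" depends on which metric matters for your problem.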

Does the problem require an approximate solution? Most machine learning algorithms are used in situations where there is no exact way to find a solution, or the exact solution is too costly to implement. If your problem can be solved exactly, for example with regular expressions, with classical optimization techniques such as linear programming, or with older AI techniques such as constraint satisfaction solvers, then you may be better off using these methods instead.
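To illustrate the regular expression case, here is a small sketch of a task that needs no model at all. The identifier format (`ORD-` followed by exactly six digits) is an assumption made up for this example, as are the names `ORDER_ID` and `extract_order_ids`.

```python
import re

# Hypothetical format: "ORD-" followed by exactly six digits.
ORDER_ID = re.compile(r"\bORD-\d{6}\b")


def extract_order_ids(text):
    """Return every well-formed order ID found in the text."""
    return ORDER_ID.findall(text)


message = "Please refund ORD-004217 and ORD-118830, but not ORD-12."
ids = extract_order_ids(message)
print(ids)
```

When the rule can be written down this precisely, the result is deterministic, needs no training data, and is easy to test, which is hard to beat with a learned model.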

Does the problem fit the machine learning paradigm? Most machine learning algorithms rely on the idea that current data will be useful in predicting or classifying future data. If your situation is prone to external events invalidating previous data, then machine learning will most likely not be effective. Similarly, if previous data has no relevance to future data, your model will not learn any useful trends that help you understand incoming data in a real-world setting. It is essential that your model sees relevant past data in order to use machine learning effectively.

The Data

Do you have reliable data labels? The most widely used machine learning methods are supervised, which means they rely on a label for each data point you have. These labels should be as free of noise as possible, and should be obtainable at a reasonable cost. If your labels are too noisy, either because the data is inherently difficult to collect or because the labeling quality is poor, then your models will most likely fail to learn the relationships in your data. Additionally, if labels are too costly to obtain, you may not be able to gather enough training data over your model’s lifetime for it to learn properly.
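One way to get a rough sense of label quality, which is my suggestion here rather than a required step, is to have two people label the same sample and measure how often they agree beyond chance. The sketch below computes Cohen's kappa for that purpose. The annotator lists are invented for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: probability both pick the same class independently.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # Both annotators used a single identical class.
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two annotators on the same ten items.
annotator_1 = ["spam", "spam", "ham", "ham", "spam",
               "ham", "ham", "spam", "ham", "ham"]
annotator_2 = ["spam", "ham", "ham", "ham", "spam",
               "ham", "spam", "spam", "ham", "ham"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

A value near 1 suggests the labeling task is well defined, while a value near 0 suggests the annotators agree about as often as chance would predict, which is a warning sign for noisy labels.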

Does the data suit machine learning? The data you use to train your model must accurately represent the real-world data that it will be used on. This does not mean that it must reflect it perfectly, but the closer it does, the more useful and accurate your model will be. Even though there are techniques to ameliorate issues surrounding class imbalance and lack of data availability, it is always best to supply your models with enough training data that reflects their real-world inputs. If you train your model with biased training data, and the available feature engineering and preprocessing methods are not sufficient, your model may perform unexpectedly poorly when it faces real-world data. For example, this may occur if you train your model with a heavily imbalanced data set in a classification setting. If your model is expecting to see 1% class A and 99% class B based on its training data, it will perform poorly if the real-world situation has 50% class A and 50% class B (assuming your goal is to maximize accuracy).
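The 1%/99% example can be made concrete with a deliberately extreme sketch: a "model" that simply predicts the most common training label. Real classifiers usually degrade less drastically than this, and the function names and data set sizes below are assumptions chosen for illustration.

```python
from collections import Counter


def train_majority_classifier(labels):
    """Return a 'model' that always predicts the most common training label."""
    majority_label, _ = Counter(labels).most_common(1)[0]
    return lambda _features=None: majority_label


def accuracy(model, labels):
    """Fraction of labels the model predicts correctly."""
    return sum(1 for y in labels if model() == y) / len(labels)


train_labels = ["A"] * 10 + ["B"] * 990        # 1% class A, 99% class B
real_world_labels = ["A"] * 500 + ["B"] * 500  # 50% class A, 50% class B

model = train_majority_classifier(train_labels)
train_acc = accuracy(model, train_labels)
real_acc = accuracy(model, real_world_labels)
print(train_acc, real_acc)  # 0.99 0.5
```

The model looks excellent on data shaped like its training set and is no better than a coin flip once the class balance changes.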

The Model

Are the effects and risks of the model well-understood? Are you fully aware of the societal effects of your machine learning model? For example, have you researched whether or not this model may further socioeconomic inequality, or if it may create divisions between people in different socioeconomic groups? It is important that you fully understand the effects of your model beyond your specific engineering or business problem. If your model is prone to algorithmic bias, it is important that you try to address this problem ahead of time, and try to create a training pipeline that removes as much bias as possible from the data. In addition, it is important to be aware of how this model may be abused by adversaries. Are there any ways for personal information to be obtained by reverse engineering the model’s outputs? Can industry secrets be leaked? With respect to these issues, it is important that you understand what data is visible through the outputs of your model, and who is able to access these outputs directly. It may be useful to obfuscate the outputs, so that you can control what information is revealed, and provide strictly what is necessary.
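As a minimal sketch of obfuscating outputs, the function below exposes only the top label and a coarsely rounded confidence instead of the full probability vector. The class names, the probabilities, and the name `public_prediction` are hypothetical. This limits what a caller can observe, but it is not a guarantee against a determined adversary.

```python
def public_prediction(class_probabilities, decimals=1):
    """Expose only the top label and a coarsely rounded confidence."""
    label, probability = max(
        class_probabilities.items(), key=lambda item: item[1]
    )
    return {"label": label, "confidence": round(probability, decimals)}


# Hypothetical raw model output that should stay internal.
raw_output = {"approve": 0.8734, "review": 0.1021, "deny": 0.0245}

response = public_prediction(raw_output)
print(response)  # {'label': 'approve', 'confidence': 0.9}
```

Keeping the raw scores internal and returning strictly what the consumer needs also makes it easier to reason about who can see what.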

Can you maintain the model over time? In most cases, machine learning models are used continuously over a long period, not just once. So, it is important that your organization has people who can maintain the model over time. As real-world situations change and drift, you are likely to need to retrain your model on current data. For example, a model that takes natural language as input will need to be retrained periodically to incorporate changes in language usage such as slang. Even if the real-world data does not change over time, you may want to study the model’s errors and continuously iterate on the model in order to improve performance. Therefore, you most likely need a dedicated group of employees who are able to monitor and improve upon the model. Otherwise, it may quickly become obsolete or even useless, depending on how prone its domain is to change.
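Monitoring for drift can start very simply. The sketch below compares the mean of one numeric input feature in recent data against the training data, measured in units of the training standard deviation. This is a crude check that I am assuming for illustration, and the `threshold` of 0.5, the function names, and the sample values are all hypothetical.

```python
from statistics import mean, stdev


def mean_shift(training_values, recent_values):
    """Shift of the recent mean, in units of the training standard deviation."""
    shift = abs(mean(recent_values) - mean(training_values))
    return shift / stdev(training_values)


def needs_review(training_values, recent_values, threshold=0.5):
    """Flag a feature whose recent mean has moved past the chosen threshold."""
    return mean_shift(training_values, recent_values) > threshold


# Hypothetical values of one input feature.
training = [10, 12, 11, 13, 9, 10, 12, 11]
recent_stable = [11, 10, 12, 11]
recent_shifted = [15, 16, 14, 17]

stable_flag = needs_review(training, recent_stable)
shifted_flag = needs_review(training, recent_shifted)
print(stable_flag, shifted_flag)  # False True
```

A flag like this does not say the model is wrong, only that its inputs no longer look like what it was trained on, which is a reasonable trigger for someone to investigate and possibly retrain.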

Better Programming

Advice for programmers.
