How to define and contextualize a machine learning problem

12 min read · Jan 5, 2020

Albert Einstein once said that ‘If I have an hour to solve a problem, I’d spend 55 minutes thinking about the problem and 5 minutes thinking about the solution’.

Spending time consciously defining a business problem is vital for successful change. How a problem is framed or described can determine the kinds of options you will consider for addressing it. You can have the most powerful machine learning algorithms, talented data scientists, and tons of data, but the results will be meaningless if you are solving the wrong problem.

“There’s nothing worse than solving the wrong problem correctly.”

In my earlier post, “How to identify viable AI use case?”, I discussed that topic in detail. This post is a continuation of it: you will learn a process for thinking deeply about machine learning problem framing before you get started with actual model design and development.

Solving a business problem with machine learning is not a one-size-fits-all exercise. It depends on the specific situation of the organization and on the industry in which the system is implemented and used. For example, fraud detection in banking and in healthcare calls for different approaches to what looks like the same problem. In healthcare, the most common frauds are committed by dishonest providers or insurers, who are known participants for healthcare companies. In banking, the most common frauds are account takeover, identity theft, phishing, and the like, mostly carried out by external fraudsters.

Each problem should, therefore, be carefully considered in the context in which it is carried out.

Building blocks for machine learning problem framing

Define Business Problem:

Most often, business problems don’t arrive fully formed for machine learning, and in some cases you have to deal with a broad category of qualitative requirements. This is fine as long as the problem statement captures your actual business goal, not an indirect one. Hence, articulate clearly what you would like the machine learning system to do for your business.

Machine learning systems are built and evaluated against exact quantitative requirements, but the business usually states its requirements in qualitative terms. Translating a business problem into a machine learning problem effectively is critical, and framing it incorrectly can have far-reaching consequences for the business. For example:

Show personalized ads based on customer visit history -> it’s a recommendation problem. This framing works well if the company’s priority is to raise brand awareness in the digital marketplace and keep the company at the forefront of consumers’ minds.
Show personalized ads to hot-lead customers -> it’s a classification (identify hot leads) + recommendation (personalize) problem. This framing would have a much better ROI and is a good fit for a company trying to optimize marketing spend.

Align the business problem with your business drivers so that you can frame the problem correctly.
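The difference between the two framings above can be sketched in a few lines of Python. This is only a toy illustration: the customer records, the `is_hot_lead` rule (a stand-in for a trained classifier), and the "most viewed category" recommender are all hypothetical and not part of any real system.

```python
from collections import Counter


def recommend_ad(visited_categories):
    """Toy recommender: advertise the category the customer viewed most often."""
    if not visited_categories:
        return None
    return Counter(visited_categories).most_common(1)[0][0]


def is_hot_lead(customer):
    """Stand-in for a trained classifier: a hypothetical rule on cart activity."""
    return customer["cart_additions"] >= 1


def brand_awareness_campaign(customers):
    """Framing 1: recommendation only, every customer gets a personalized ad."""
    return {c["id"]: recommend_ad(c["visits"]) for c in customers}


def roi_campaign(customers):
    """Framing 2: classify hot leads first, then recommend only to them."""
    return {c["id"]: recommend_ad(c["visits"]) for c in customers if is_hot_lead(c)}


# Hypothetical sample data.
customers = [
    {"id": "a", "visits": ["shoes", "shoes", "bags"], "cart_additions": 2},
    {"id": "b", "visits": ["bags"], "cart_additions": 0},
]

print(brand_awareness_campaign(customers))
print(roi_campaign(customers))
```

The same recommender serves both campaigns; only the business driver decides whether a classification step sits in front of it.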

Define business outcome:

The real ask of a machine learning system is to produce some desirable outcome or to take some decision at scale. But before thinking about how to fit machine learning to the business problem, or which metric to use for ML training and evaluation, you need to state your ideal business outcome clearly. This sets the context for your machine learning system, because the same business problem in a different business context calls for a different approach.

For example, consider the business outcome of the marketing problem, as outlined in the table below. Here the ideal business outcome is a much better ROI on marketing spend, so the problem can be solved with a combination of classification and recommendation. Now change the context: say the business wants to raise brand awareness in the digital marketplace by recommending new products.

You can see that the problem in the new context only needs a recommendation system. Context matters!

Defining the outcome of your product or service determines what can and can’t be done using machine learning, and how.

Define business objective:

The main goal of machine learning is to support business strategy; ML/AI is a tool for achieving the organization’s vision. It is therefore vital for leaders to reconcile the business and technical visions and stay focused on the activities that matter most to the organization. Clarity about the business objective (reducing cost, increasing revenue, or increasing brand awareness, for example) lets you frame and narrow down the problem to the point where it becomes solvable by machine learning. As the previous section illustrated, the same marketing problem can be framed and solved with different machine learning approaches.

Defining the objective of your product or service determines how you stay focused.

Define your ideal model output

For any supervised learning problem, you must define the output as something quantifiable, with a clear definition, that the ML system can produce. The ML model will be trained to optimize that output. Hence, make sure the model output is something you care about.

Sometimes you need a proxy label to define the model output, because the ideal output is not known and you can’t measure it directly. For example, to understand client attrition risk from complaints, you could have the ML system predict complaint sentiment (the proxy label) and then feed the angry customer complaints into topic modeling (unsupervised learning) to identify key pain points, so that you can act to resolve them. Sentiment is quantifiable and provides a decent predictive signal because it correlates strongly with attrition risk.

A proxy label is not a perfect approximation of your ideal business outcome, but the stronger the connection between the proxy label and the true outcome, the more confident you can be in your decisions. You need to experiment to find out which proxy label produces the desired outcome.
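One simple way to gauge how strongly a proxy label is connected to the true outcome is to measure their correlation on a small sample where the true outcome is already known. The sketch below computes a Pearson correlation by hand; the negativity scores and churn flags are made-up illustrative values, and correlation is only one of several possible checks.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical sample: complaint negativity score (proxy label) and
# whether the client later left (true outcome, 1 = left).
negativity = [0.9, 0.8, 0.2, 0.1, 0.7, 0.3]
churned = [1, 1, 0, 0, 1, 0]

strength = pearson(negativity, churned)
print(f"proxy/outcome correlation: {strength:.2f}")
```

A value close to 1 on a representative sample suggests the proxy carries useful signal; a value near 0 suggests trying a different proxy.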

Define business success and failure criteria:

“Numbers have an important story to tell. They rely on you to give them a voice.” — Stephen Few

How will you know whether your system has succeeded or failed? A machine learning system learns complex patterns purely from data and normally doesn’t throw errors the way a traditional program does. So how will you know that the ML system has failed? Similarly, the benefit or impact of a prediction or decision is not known immediately and becomes obvious only after a few days, weeks, or months. So how will you know that the ML system is a success?

For example, the business success metric for your marketing ad campaign might be to achieve a 1% conversion rate within 14 days of the campaign start date. This is your end goal. In practice, business metrics and machine learning metrics are not always tied to each other. In the marketing example, you might build the machine learning system around click-through rate (CTR), because CTR is known to correlate strongly with conversion rate and may even cause it, but from the business perspective you don’t really care about CTR. As models get better and better at optimizing CTR, they might end up “just driving clicks” without any actual effect on conversion. In that case the ML system succeeds at optimizing CTR but fails to achieve the business success metric, which is conversion.

Hence you always need to define your business failure criteria clearly, for example: the ML model is deemed unsuccessful if the conversion rate is below 0.25% within 7 days of the campaign start date. The numbers here are only indicative; they show how to define acceptable thresholds for both the success and the failure metrics. Monitoring them is critical, because a breach raises an early warning that the ML model may not be targeting the right customers and that some intervention is needed to analyze and handle the issue.

Defining success and failure metrics separately helps you look ahead, anticipate problems, and correct them.
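The success and failure thresholds from the campaign example can be written down as an explicit check that a monitoring job runs at each checkpoint. This is a minimal sketch under stated assumptions: the class and function names are hypothetical, the conversion rate is taken as conversions divided by customers targeted, and the failure rule is read as "still below 0.25% once day 7 is reached".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CampaignCriteria:
    success_rate: float  # e.g. 0.01 for a 1% conversion rate
    success_day: int     # success must be reached by this day
    failure_rate: float  # e.g. 0.0025 for 0.25%
    failure_day: int     # failure is assessed from this day onward


def evaluate_campaign(conversions, customers_targeted, day, criteria):
    """Return 'succeeded', 'failed', or 'undecided' for one checkpoint."""
    if customers_targeted <= 0:
        raise ValueError("customers_targeted must be positive")
    rate = conversions / customers_targeted
    if rate >= criteria.success_rate and day <= criteria.success_day:
        return "succeeded"
    if day >= criteria.failure_day and rate < criteria.failure_rate:
        return "failed"
    return "undecided"


# Indicative numbers from the text: 1% within 14 days, 0.25% within 7 days.
criteria = CampaignCriteria(
    success_rate=0.01, success_day=14, failure_rate=0.0025, failure_day=7
)

print(evaluate_campaign(120, 10_000, day=10, criteria=criteria))
print(evaluate_campaign(20, 10_000, day=7, criteria=criteria))
print(evaluate_campaign(20, 10_000, day=3, criteria=criteria))
```

Keeping the two thresholds separate leaves room for an "undecided" zone, which is where a closer look at the model is usually cheapest.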

Do you have enough historic decisions?

Labeled data, or historic decisions, are the foundation of supervised machine learning. If they don’t exist, you need either to allocate time for data annotation (data labeling) or to re-frame your business problem and goal so that you can train a model on the data you have. You also need to be aware that historic decisions and manual labels can be biased. This is one source of unfairness, and it can arise either from your past decisions or from the implicit bias of the experts involved in manual labeling.

Recently I read an article about how Audi built its car for the 24-hour Le Mans race. Let’s say you are tasked with designing a racing car to win the 24-hour Le Mans race. What would you aim for? The obvious reaction would be: “I want to build the fastest car possible”. Audi’s chief engineer took a different approach when developing their new car and challenged the team by re-framing the problem: “How can we win Le Mans if our car is not the fastest?”

By re-framing the problem, the design team came up with a simple yet powerful solution: a fuel-efficient car. Fewer pit stops not only offset not being the fastest car but also helped Audi win four years in a row. This happens all the time in machine learning. You can improve your critical thinking by challenging assumptions, and diversify your thinking by asking different questions.

On the other hand, if you find that re-framing the problem will not help your business and you decide to annotate data before training a machine learning model, you can either do it internally or outsource it to third-party service providers. The choice depends on your data labeling approach:

Do your annotators need specialized expertise?
Do you have the capability and bandwidth to build the annotation tools yourself?
How much labeled data will you need to achieve reasonable model performance?

Define how business will use model output

Machine learning allows products, services, and processes to be much smarter, but how the model will be integrated and used in live systems influences the way you design and build it.

Will it be used in a batch, online, or streaming application? Will it be used in a distributed environment? Will the model produce a prediction or a decision? How can a model prediction be converted into a decision? What are the latency requirements? How will the model output be used in the workflow, and how will it help in making intelligent decisions within that workflow?

These are important questions to ask and understand. If certain features are not available at prediction time, or are too expensive from a latency standpoint, there is no point in using them during ML training. Understanding the business and data context is vital for planning and executing your machine learning pipeline effectively.
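The feature constraint described above can be made concrete by screening candidate features before any training starts. The sketch below is one possible approach, not a standard recipe: the feature names and latencies are invented, and it assumes lookups happen one after another so that their latencies add up.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    name: str
    available_at_prediction: bool
    lookup_latency_ms: float


def usable_features(candidates, latency_budget_ms):
    """Keep features that exist at prediction time and fit the latency budget.

    Cheapest lookups are admitted first, and lookups are assumed to be
    sequential, so the budget applies to the running total.
    """
    available = [f for f in candidates if f.available_at_prediction]
    selected, spent = [], 0.0
    for feature in sorted(available, key=lambda f: f.lookup_latency_ms):
        if spent + feature.lookup_latency_ms > latency_budget_ms:
            break
        selected.append(feature.name)
        spent += feature.lookup_latency_ms
    return selected


# Hypothetical candidate features for an online scoring service.
candidates = [
    Feature("visit_count", True, 2.0),
    Feature("last_purchase_amount", True, 5.0),
    Feature("credit_bureau_score", True, 80.0),
    Feature("chargeback_within_30_days", False, 1.0),  # only known after the fact
]

print(usable_features(candidates, latency_budget_ms=20.0))
```

Admitting the cheapest features first ignores how predictive each feature is; in practice you would weigh latency against feature value, but the point here is that both filters are applied before training.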

Image credit: NASA/JPL/Cornell University, Maas Digital LLC

How data is generated

Data dependencies in an ML system cost more than code complexity and can be difficult to untangle.

More than ever, the ability to understand, measure, and manage data is crucial for an organization to succeed in the big data era. For the sake of a quick improvement, it is often convenient to use features produced by other systems. Over time, as those systems evolve, some of these features may become unstable and fail to deliver value in the long term, and this can happen silently when the behavior of a feature changes.

Write down your input data sources and assess how much work is required to construct the input features for training. Initially, focus on inputs that are easily available and can be obtained from a single system, and scale your data pipeline slowly toward more complex ones. Make sure all your inputs are available at prediction time in exactly the format you used for model training, to avoid any misalignment between training and prediction.
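A lightweight guard against this kind of misalignment is to compare every prediction-time record with the feature schema recorded at training time. The following is a minimal sketch; the schema format (feature name mapped to a Python type) and the example fields are assumptions made for illustration.

```python
def schema_mismatches(training_schema, prediction_row):
    """List the differences between a prediction-time record and the training schema."""
    problems = []
    for name, expected_type in training_schema.items():
        if name not in prediction_row:
            problems.append(f"missing feature: {name}")
        elif not isinstance(prediction_row[name], expected_type):
            actual = type(prediction_row[name]).__name__
            problems.append(f"{name}: expected {expected_type.__name__}, got {actual}")
    for name in prediction_row:
        if name not in training_schema:
            problems.append(f"unexpected feature: {name}")
    return problems


# Hypothetical schema captured when the model was trained.
training_schema = {"age": int, "country": str, "avg_basket": float}

# A record arriving at prediction time with three different problems.
row = {"age": "42", "country": "DE", "session_id": "abc"}

for problem in schema_mismatches(training_schema, row):
    print(problem)
```

An empty list means the record matches what the model saw in training, at least in names and types; value ranges and units would need their own checks.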

Image source: https://www.visualcapitalist.com/wp-content/uploads/2019/12/map-us-population-change-county.html

Define user/population of interest

“The world represented by your training data is the only world you can expect to succeed in.” — Cassie Kozyrkov

Supervised machine learning (ML) assumes that the data used to train a model comes from the same distribution as the data to which the model will be applied. However, in many real-world scenarios this assumption can fail dramatically, especially when data is generated, collected, and integrated from multiple sources over a long period of time. The resulting discrepancy between the training and test distributions leads to poor generalization and hence to biased predictions in production.

Generalization is the ability of a trained ML model to predict accurately on examples that were not used for training. It is a key goal of any practical machine learning system. Hence, it is important to understand the user context and the population of interest that will use the system in production. Designing your training and test data to reflect real-world data is of paramount importance for delivering business value from your machine learning initiative. For example, a speech recognition system trained on voice data from users with US and UK accents will be less effective in non-English-speaking countries. The same goes for building a food recommender system around Asian tastes and then launching it in the European market.
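One way to check whether the training population resembles the population you will serve is to compare how a feature is distributed in each. The sketch below uses the Population Stability Index (PSI), a technique brought in here for illustration rather than one named above; the function name, the `floor` smoothing value, and the accent samples are all assumptions.

```python
from collections import Counter
from math import log


def population_stability_index(expected, actual, floor=1e-4):
    """PSI between two samples of a categorical feature.

    `expected` is the training sample, `actual` the production sample.
    Shares are floored at `floor` so that a category missing from one
    sample does not cause a division by zero or log(0).
    """
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")
    expected_counts = Counter(expected)
    actual_counts = Counter(actual)
    psi = 0.0
    for category in set(expected_counts) | set(actual_counts):
        e = max(expected_counts[category] / len(expected), floor)
        a = max(actual_counts[category] / len(actual), floor)
        psi += (a - e) * log(a / e)
    return psi


# Hypothetical accent mix in training data versus the launch market.
training_accents = ["US"] * 60 + ["UK"] * 40
production_accents = ["US"] * 20 + ["UK"] * 10 + ["IN"] * 70

print(f"PSI = {population_stability_index(training_accents, production_accents):.2f}")
```

A PSI of zero means the two samples have identical shares. A commonly quoted rule of thumb treats values above roughly 0.25 as a significant shift, but that cut-off is a convention, not a guarantee.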

Define business risk for wrong decision

Machine learning generates tremendous business value, but it is also giving rise to a plethora of unwanted and sometimes serious or even disastrous consequences. These pose significant challenges for organizations, from diminished public trust, reputational damage, and revenue losses to regulatory backlash. Machine learning systems make predictions or take decisions under uncertainty, on unseen data, and they can make mistakes. As the decision maker, you must therefore consider which mistakes you can live with and how costly one kind of mistake is compared with another. Making these decisions early helps data scientists build more robust models, and statistics helps them make better predictions under uncertainty.
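Once the business has put a price on each kind of mistake, that judgment can feed directly into how model scores are turned into decisions. The sketch below picks the decision threshold with the lowest total cost on a labeled validation sample; the scores, labels, candidate thresholds, and cost figures are made up for illustration.

```python
def total_cost(scores, labels, threshold, cost_fp, cost_fn):
    """Cost of all mistakes made when predicting positive for score >= threshold."""
    cost = 0.0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and label == 0:
            cost += cost_fp  # false positive, e.g. blocking a genuine customer
        elif not predicted_positive and label == 1:
            cost += cost_fn  # false negative, e.g. missing a fraud case
    return cost


def cheapest_threshold(scores, labels, cost_fp, cost_fn, candidates):
    """Return the candidate threshold with the lowest total cost."""
    return min(
        candidates,
        key=lambda t: total_cost(scores, labels, t, cost_fp, cost_fn),
    )


# Hypothetical validation scores and true labels (1 = positive class).
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]
labels = [0, 0, 1, 0, 1, 1]
candidates = [0.2, 0.5, 0.8]

# Missing a positive is ten times worse than a false alarm.
print(cheapest_threshold(scores, labels, cost_fp=1, cost_fn=10, candidates=candidates))
# A false alarm is ten times worse than missing a positive.
print(cheapest_threshold(scores, labels, cost_fp=10, cost_fn=1, candidates=candidates))
```

The same model and the same data lead to different operating points as soon as the relative cost of the two mistakes changes, which is why that decision belongs to the business and should be made early.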

If you understand where risks may be lurking, ill-understood, or simply unidentified, you have a better chance of catching them before they catch up with you. Considering risks during problem framing therefore lets you plan your effort and work effectively. Trading off effectively between different risks is key to applied machine learning. For example, a business may find it acceptable to sacrifice some model performance in order to mitigate bias. Throughout the entire model lifecycle, from conceptualization to operational deployment, you need to ensure that the AI/ML solution is lawful, ethical, and robust.

The 2019 McKinsey Quarterly article “Confronting the risks of artificial intelligence” provides a pragmatic approach to mitigating the risks of applying artificial intelligence.

Conclusion

While models and data get all the attention, you can see that machine learning problem framing can pay big dividends when done right. It is important to take the time to define the machine learning problem and to look at it from different angles in order to understand it completely.

Ultimately, training an AI platform — it is very much like molding a child. If you treat it the right way and teach it the right things, train it to know what’s right and wrong, it will inherently grow up to become a productive member of society that cares about people and the future. Just like any one of us.” — John Stecher, Group Managing Director at Barclays Investment Bank

Some key takeaways:

Use a pragmatic approach to define and explain your machine learning problems to stakeholders and co-workers.
Each machine learning problem should be considered in the context in which it is carried out.
Always start with the business outcome and then work backwards to understand what details or data you have to achieve it.
Set your performance criteria for machine learning success and failure in advance, so that they can’t be gamed.
Take time to understand your users and how and where the system will be used.
Machine learning is almost never an unconstrained optimization problem; there are always constraints. So use your resources effectively, frame your problem appropriately, and build robust, lawful, ethical AI.

Thank you for reading my post. Good luck, and keep an eye out for my next posts!

Written by Vivek Kumar