Beyond the Numbers: The Role of Problem Framing in Data Science

Table of contents:

6 min read · Jun 21, 2023

· What is problem framing?
∘ Defining the dependent variable:
∘ Defining the level of granularity of your analysis:
∘ Assessing data availability:
∘ Defining potential features:
∘ Defining potential modeling approaches:
· The problem-framing gap for new data scientists
· Six additional tips for problem framing

Problem framing is a vital and often overlooked step in the data science process. Here’s how data scientists can do it successfully:

What is problem framing?

Problem framing is setting up your business problem in a way that can be addressed with data. It’s the process of taking an abstract goal like “we want to know which customers are likely to churn” and translating it into what data will be used, how the data will look, and what modeling approaches might be applicable. It’s a combination of understanding the underlying mechanics of your problem and matching your problem to techniques and approaches that fit.

Problem framing shows up in how you compile your data for analysis (What are your potential features? How are you defining your target? What is the appropriate level of granularity?) and in the techniques you use to address your problem, since many modeling approaches require specific data configurations.

Problem framing should be the first thing a data scientist does when working on a new project. The process of problem framing involves asking questions about the system you’re trying to model. It typically includes:

Defining the dependent variable:

What are you trying to predict or model? What outcome are you trying to model against?
Customer churn example: How does the company define churn? Are they interested in downgrades vs. complete churn? What time window of no activity defines churn?
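
To make the time-window question concrete, here is a minimal Python sketch. It assumes churn is defined purely by days since last activity; the customer IDs, dates, and the `label_churn` helper are hypothetical and only illustrate how the choice of window changes the target.

```python
from datetime import date, timedelta

# Hypothetical last-activity dates per customer (illustrative data only).
last_activity = {
    "cust_a": date(2023, 5, 30),
    "cust_b": date(2023, 2, 10),
    "cust_c": date(2023, 4, 1),
}

def label_churn(last_seen, as_of, inactivity_days):
    """True if there has been no activity for more than `inactivity_days` as of `as_of`."""
    return (as_of - last_seen) > timedelta(days=inactivity_days)

as_of = date(2023, 6, 21)  # assumed date on which labels are computed

# The same customers, labeled under two candidate churn definitions.
labels_30 = {c: label_churn(d, as_of, 30) for c, d in last_activity.items()}
labels_90 = {c: label_churn(d, as_of, 90) for c, d in last_activity.items()}

print(labels_30)
print(labels_90)
```

In this toy data, `cust_c` counts as churned under a 30-day window but not under a 90-day window, so the definition agreed on with the business directly changes what the model is trained to predict.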

Defining the level of granularity of your analysis:

What represents a single record in your data? At what level are you making predictions?
Customer churn example: Are you interested in individual users, or in customer accounts that could contain many users? Are you evaluating churn for individual teams using a product, or for entire accounts?
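
The choice of granularity determines what one row of your data set looks like. As a rough sketch, assuming user-level login counts are available, the snippet below rolls users up into one record per account. The field names and the `to_account_level` helper are made up for illustration.

```python
from collections import defaultdict

# Hypothetical user-level records: (account_id, user_id, logins_last_30_days).
user_rows = [
    ("acct_1", "u1", 12),
    ("acct_1", "u2", 0),
    ("acct_2", "u3", 0),
    ("acct_2", "u4", 0),
]

def to_account_level(rows):
    """Aggregate user-level rows into one record per account."""
    accounts = defaultdict(
        lambda: {"n_users": 0, "active_users": 0, "total_logins": 0}
    )
    for account_id, _user_id, logins in rows:
        rec = accounts[account_id]
        rec["n_users"] += 1
        rec["active_users"] += int(logins > 0)  # a user with any login counts as active
        rec["total_logins"] += logins
    return dict(accounts)

account_rows = to_account_level(user_rows)
for account_id, rec in account_rows.items():
    print(account_id, rec)
```

At the user level, `u2` looks inactive, but at the account level `acct_1` still has an active user. Which view is correct depends on the level at which the business needs predictions.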

Assessing data availability:

What do you think affects the dependent variable you’re trying to predict? How are you tracking these impacts? What data can help you model the system you’re evaluating? Is the data accessible and available in a timely fashion? How can you securely and compliantly access this data? Can this data be responsibly used?
Customer churn example: Is there customer behavior data that you believe influences churn? Is there customer demographic data that you believe separates customer behavior? When is this data available? Can this customer data be used responsibly to protect customer privacy? Is this data at the right level of granularity?

Defining potential features:

How can you separate the available data into things that represent unique attributes of the system you’re evaluating? What do you think has an impact on the thing you’re trying to predict? Can you create data columns to represent the impact of this data?
Customer churn example: What customer demographic data might make churn more or less likely? What customer behaviors are being captured that might make churn more or less likely? How are customers using your products or interacting with your company?

Defining potential modeling approaches:

What approaches provide a reasonable framework for evaluating the system? What are the best ways to model the dependent variable? What does your data set need to look like for these approaches?
Customer churn example: What classification or regression modeling approaches might be appropriate? Are you predicting whether churn happens, the time until churn, or a numeric value like the dollar impact of churn? Are you focused on accurate predictions of churn or the accurate ordering of customers likely to churn?
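
The difference between accurate predictions and accurate ordering can be shown with two simple metrics. The sketch below uses made-up scores and outcomes. Accuracy at a 0.5 threshold and precision among the top k scored customers are just one possible pair of metrics, not a recommendation.

```python
# Hypothetical model scores (churn probability) and actual outcomes (1 = churned).
scores = [0.90, 0.80, 0.55, 0.45, 0.30, 0.10]
actuals = [1, 1, 0, 1, 0, 0]

def accuracy(scores, actuals, threshold=0.5):
    """Share of customers whose thresholded prediction matches the outcome."""
    preds = [int(s >= threshold) for s in scores]
    return sum(p == a for p, a in zip(preds, actuals)) / len(actuals)

def precision_at_k(scores, actuals, k):
    """Share of actual churners among the k highest-scored customers."""
    ranked = sorted(zip(scores, actuals), key=lambda pair: pair[0], reverse=True)
    return sum(a for _, a in ranked[:k]) / k

print("accuracy:", accuracy(scores, actuals))
print("precision@2:", precision_at_k(scores, actuals, 2))
```

Here the model misclassifies two of six customers at the chosen threshold, yet its two highest-scored customers both churned. A model can be mediocre on one framing and useful on the other, which is why the framing should be settled before choosing an approach.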

The problem-framing gap for new data scientists

Problem framing isn’t explicitly discussed in most data science university programs, bootcamps, or online data science classes, even though it’s a key part of being an effective data scientist.

Data science academic programs and bootcamps tend to focus on technical skills like classification or regression modeling approaches, only providing glimpses into real-world problems through pre-built data sets that are already formulated for machine learning. Real-world problems require the communication skills to work with both technical and nontechnical users to understand the system or process you’re trying to evaluate and the creativity to map the problem to appropriate data and machine learning models.

There’s a large gap between these academic programs and what’s expected of data scientists when they move into industry roles, and this disconnect between learning about algorithms and learning about applications is a part of that.

Given these expectations, data scientists often need additional training. Data scientists who are used to collaborating in academic programs and bootcamps often find themselves embedded in functional teams with limited opportunities to work directly and exchange ideas with other data scientists. Companies that pair new data scientists with more experienced ones for collaboration and mentorship build problem-framing skills more quickly. Deep collaboration, meaning more than just brown-bag lunches and occasional review sessions, is the best way to improve problem-framing output and more quickly benefit from the skills of new data scientists.

One recommendation for data scientists looking for more problem-framing experience is to get exposure to as many different types of data science problems as possible. Take to heart the real-life examples you’re exposed to in classes and training, and try to internalize how these examples are being solved and what the possible data sources could have been.

What data transformations may have been needed to address the problem? What other data sources could have been added to make the model better? Are other industries or organizations experiencing a similar problem? Does this type of modeling approach apply to other problem sets?

Often, we see really interesting developments and results by transferring approaches from other industries to problems they weren’t designed for. We’ve seen computer vision approaches used to predict protein structure and survival analysis models (used heavily in healthcare) used to identify vehicles likely to break down. Exposure to many different approaches and understanding how they were used enable you to draw new connections.

In data science, we’re usually interested in how our models will generalize to new data. The exciting thing is that problem approaches can generalize to new industries and problem sets in the same way.

Six additional tips for problem framing

Don’t be afraid to ask simple or “dumb” questions. These are necessary to get a good understanding of the system you’re modeling.
Research the problem! There are typically blog posts, research papers, or instructional videos about what you’re trying to model. Use them and pull from existing knowledge bases.
Reach out to others you think can help and collaborate. Problem framing can be a great project step for data scientists to collaborate and learn from one another.
Consider the timing of your data. Know when it’s available and what will be known at the time of prediction when you’re creating your data set.
Simplify, simplify, simplify. We often see data scientists relying on overly complex models when a simpler option might be the better choice. Don’t assume complexity is better. The simple solution might be the right one.
Know what information is important for your problem and define what success looks like before you move on to problem framing. To use the customer churn example, do you want an accurate prediction of whether each customer will churn, or do you want a list of the 100 customers most likely to churn so you can target them with some sort of promotion? This will inform the output you’re looking for from your model.
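
The tip about data timing lends itself to a small illustration. The sketch below assumes a simple event log and builds features only from events that occurred before the prediction date. The event types, dates, and the `features_as_of` helper are hypothetical.

```python
from datetime import date

# Hypothetical event log: (customer_id, event_date, event_type).
events = [
    ("cust_a", date(2023, 3, 1), "login"),
    ("cust_a", date(2023, 4, 15), "support_ticket"),
    ("cust_a", date(2023, 5, 20), "cancellation_request"),
    ("cust_b", date(2023, 4, 2), "login"),
]

def features_as_of(events, prediction_date):
    """Count events per customer using only what was known before prediction_date."""
    counts = {}
    for customer_id, event_date, _event_type in events:
        if event_date < prediction_date:  # exclude anything not yet known
            counts[customer_id] = counts.get(customer_id, 0) + 1
    return counts

feats = features_as_of(events, date(2023, 5, 1))
print(feats)
```

The cancellation request on May 20 is excluded because it would not have been known on May 1. Including it would make the model look better in training than it could ever be at prediction time.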

Problem framing is an absolutely vital step in data science projects that is sometimes overlooked. Often, issues with problem framing only reveal themselves much later on in the data science process, and team collaboration can help prevent some common pitfalls. With the tips above, you’ll be able to improve your problem-framing output and better enable data science impact.