How to ask questions data science can solve.
My students frequently have trouble finding good data science questions.
Usually, this is because they’ve yet to figure out how questions map to data solutions. I’ve found it insightful to use Bloom’s Taxonomy with data technologies to draw a clearer picture.
Data science tools may seem very limited at first, but we can rephrase most real-world questions into the language of our tools.
What kinds of questions can we ask?
Bloom’s Taxonomy categorizes learning objectives that educators use to lead their students. I find it useful for categorizing insights as well. After all, if we provide applicable insight, we take partial roles as educators.
Bloom’s Taxonomy also puts forward questions we can ask to test students. These same questions lead to excellent insights.
Bloom’s Taxonomy splits the learning process into six objectives, with related questions for each objective. As data scientists, we can ask, answer, and share these questions to build insight.
Cognitive objectives in Bloom’s Taxonomy
- Remember — Who, what, where, or when did something happen?
- Understand — Can you summarize what happened?
- Apply — What happens when …?
- Analyze — What are the key parts and relationships of …?
- Evaluate — Is this the best approach?
- Create — Can you predict what will happen to … under new conditions?
Available tools
(practitioners may want to skip this overview)
There are many tools of the trade, but you can break them down into a few areas.
R/Python/SQL/Etc
Data manipulation with SQL, R, Python, etc. allows us to search and aggregate data.
These tools answer questions related to remembering and understanding. “When did my largest user make their last purchase?”
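As a rough illustration, the question above can be phrased as a single query. The sketch below uses Python's built-in sqlite3 module with a made-up `purchases` table. The table name, columns, and rows are all assumptions for the example, and "largest user" is taken to mean the user with the highest total spend.

```python
import sqlite3

# Hypothetical purchase log: (user, amount, purchase date in ISO format).
rows = [
    ("ana", 120.0, "2023-01-05"),
    ("ben", 40.0, "2023-02-11"),
    ("ana", 300.0, "2023-03-02"),
    ("cho", 75.0, "2023-03-09"),
    ("ben", 60.0, "2023-04-01"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (user TEXT, amount REAL, day TEXT)")
conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)

# Assumption: "largest user" means the highest total spend.
# ISO dates sort correctly as text, so MAX(day) is the latest purchase.
query = """
    SELECT user, SUM(amount) AS total, MAX(day) AS last_purchase
    FROM purchases
    GROUP BY user
    ORDER BY total DESC
    LIMIT 1
"""
largest_user, total_spent, last_purchase = conn.execute(query).fetchone()
print(largest_user, total_spent, last_purchase)
```

The same lookup could be written in R or with a Python data frame library; the point is only that a "remember" question maps onto filtering, grouping, and sorting.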
Hypothesis Testing
Splitting data by group and seeing a difference doesn’t mean we found a real relationship. Hypothesis testing tells us whether a pattern in our sample is likely to hold in new situations or could be due to chance. “Do cat pictures drive more traffic than dog pictures?”
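One simple way to test the cat-versus-dog question is a permutation test: shuffle the group labels many times and see how often chance alone produces a gap as large as the one observed. This is a minimal sketch, not the only valid test; the visit counts, the function name `permutation_p_value`, and the number of shuffles are assumptions made up for illustration.

```python
import random

# Hypothetical daily visits for posts with cat pictures vs. dog pictures.
cat_visits = [132, 141, 128, 150, 137, 145, 139, 148]
dog_visits = [120, 125, 131, 118, 127, 122, 129, 124]

def average(values):
    return sum(values) / len(values)

def permutation_p_value(a, b, n_shuffles=5000, seed=0):
    """One-sided p-value for mean(a) > mean(b) under random relabeling."""
    rng = random.Random(seed)
    observed = average(a) - average(b)
    pooled = list(a) + list(b)
    at_least_as_large = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        # Treat the first len(a) shuffled values as the "cat" group.
        diff = average(pooled[:len(a)]) - average(pooled[len(a):])
        if diff >= observed:
            at_least_as_large += 1
    return observed, at_least_as_large / n_shuffles

observed_diff, p_value = permutation_p_value(cat_visits, dog_visits)
print(f"difference in means: {observed_diff:.1f}, p-value: {p_value:.4f}")
```

A small p-value suggests the difference is unlikely to come from random labeling alone. It does not say why cat pictures do better, only that the gap is probably not noise.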
Scenario Analysis
Scenario analysis explores many possible future outcomes under various conditions. We create a range of scenarios and then predict what will happen in each. “What would happen if we raise the price of our product?”
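Here is a small sketch of the pricing question as a scenario analysis. It assumes a very simple demand model where each 1% price increase loses 1.5% of unit sales. The base price, base volume, and elasticity are invented numbers; in practice the demand model would come from a regression or an experiment.

```python
# Assumed inputs for illustration only.
BASE_PRICE = 20.0
BASE_UNITS = 1000
ELASTICITY = -1.5  # assumed: % change in units per 1% change in price

def simulate(price_change_pct):
    """Return projected revenue for a given percentage price change."""
    price = BASE_PRICE * (1 + price_change_pct / 100)
    units = BASE_UNITS * (1 + ELASTICITY * price_change_pct / 100)
    return price * max(units, 0)  # unit sales cannot go negative

# Evaluate several "what if" scenarios side by side.
scenarios = {pct: simulate(pct) for pct in (-10, 0, 10, 20)}
for pct, revenue in scenarios.items():
    print(f"price change {pct:+d}% -> revenue {revenue:,.0f}")

best_scenario = max(scenarios, key=scenarios.get)
print("best scenario:", best_scenario)
```

Under these assumed numbers, raising the price lowers revenue, but the conclusion depends entirely on the elasticity we plugged in. That is why scenario analysis usually sits on top of a model we trust.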
Optimization
Optimization is a huge field, but it generally asks simple, yet hard to answer, maximization and minimization questions. “What supply routes minimize the cost of delivering packages?”
Reinforcement learning
Reinforcement learning observes data and optimizes an outcome in real-time. “When should I click to survive in the game Flappy Bird?”
Statistical Modeling and Machine Learning
This gets tricky because these are enormous fields. Let’s look at several major tasks:
- Classification and Regression — “How many …?”, “What kind…?”
- Feature Selection — “What variables are relevant?”
- Dimensionality Reduction — “What are the key components of my data?”
- Clustering — “Can I categorize my data?”
- Anomaly Detection — “Is this observation weird?”
Classification and Regression
Classification and regression answer questions like “Is there a relationship between my data and one or more outcomes?” Classification focuses on predicting groups: “Is this A or B?” Regression focuses on quantities: “How much?” or “How many?”
Feature Selection
Feature selection identifies which features in our data relate to a specified outcome. Imagine we want to identify whether a fruit is an apple or an orange. We use color and sweetness in our data as characteristics of fruit. A feature selection algorithm would narrow in on color as a useful discriminator since apples and oranges are both sweet.
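To make the apple and orange example concrete, here is a minimal filter-style sketch. It scores each feature by how far apart the two class means are relative to the overall spread. The measurements and the `separation_score` helper are hypothetical, and real feature selection methods are more careful than this.

```python
from statistics import mean, pstdev

# Hypothetical fruit data: color hue (0 = red, 1 = orange), sweetness (0-10).
fruit = [
    {"label": "apple", "color": 0.10, "sweetness": 7.0},
    {"label": "apple", "color": 0.15, "sweetness": 6.5},
    {"label": "apple", "color": 0.05, "sweetness": 7.5},
    {"label": "orange", "color": 0.90, "sweetness": 7.2},
    {"label": "orange", "color": 0.95, "sweetness": 6.8},
    {"label": "orange", "color": 0.85, "sweetness": 7.1},
]

def separation_score(rows, feature):
    """Absolute gap between class means, in units of the overall spread."""
    apples = [r[feature] for r in rows if r["label"] == "apple"]
    oranges = [r[feature] for r in rows if r["label"] == "orange"]
    spread = pstdev([r[feature] for r in rows])
    return abs(mean(apples) - mean(oranges)) / spread

scores = {f: separation_score(fruit, f) for f in ("color", "sweetness")}
best_feature = max(scores, key=scores.get)
print(scores, "->", best_feature)
```

With this made-up data, color separates the two fruits far better than sweetness, which matches the intuition in the paragraph above.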
Dimensionality Reduction
Dimensionality Reduction takes data and reduces it to its core components. This is like image compression where we show the same image using less information. Imagine we have data about disposable fork, knife, and plate sales. A dimensionality reduction might show one column of disposable utensil sales. We approximately ask “What are the key patterns in my sales data?”
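The fork, knife, and plate example can be sketched with a bare-bones version of principal component analysis. The code below uses power iteration on the covariance matrix to find the single direction that captures the most variance. It is a stand-in written with plain Python, not a library implementation, and the sales figures are invented.

```python
# Hypothetical weekly sales of three items that rise and fall together.
sales = [
    # forks, knives, plates
    [100, 95, 50],
    [120, 118, 61],
    [80, 78, 39],
    [140, 135, 72],
    [110, 108, 54],
]

def first_principal_component(rows, n_iter=200):
    """Power iteration on the sample covariance matrix."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d  # starting guess
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Project each week onto the component: one column instead of three.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    total_variance = sum(cov[i][i] for i in range(d))
    share = (sum(s * s for s in scores) / (n - 1)) / total_variance
    return v, scores, share

component, utensil_index, variance_share = first_principal_component(sales)
print("weights:", [round(x, 2) for x in component])
print("share of variance kept:", round(variance_share, 3))
```

The `utensil_index` column plays the role of the single "disposable utensil sales" column: one number per week that keeps most of the information in the original three.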
Clustering
Clustering attempts to take data and automatically group together similar observations. We can organize and approach our data as a collection of several types of observations. We ask “Do I have any distinct types of customers or are they all completely unique?”
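As a toy version of the customer question, the sketch below runs k-means on a single variable, monthly spend. The data, the choice of two clusters, and the simple starting centers are all assumptions for illustration; real customer data would have many columns and need a proper library.

```python
# Hypothetical monthly spend per customer.
spend = [12, 15, 14, 11, 95, 102, 99, 13, 98, 16]

def kmeans_1d(values, k=2, n_iter=20):
    """Lloyd's algorithm in one dimension, seeded from sorted quantiles."""
    ordered = sorted(values)
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    groups = [[] for _ in range(k)]
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for v in values:
            # Assign each value to its nearest center.
            nearest = min(range(k), key=lambda c: abs(v - centers[c]))
            groups[nearest].append(v)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers, groups

centers, groups = kmeans_1d(spend)
for center, group in zip(centers, groups):
    print(f"center {center:.1f}: {sorted(group)}")
```

Here the customers fall into a low-spend group and a high-spend group. If the groups had overlapped heavily, that would suggest our customers are not made up of distinct types.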
Anomaly Detection
Anomaly detection answers whether an observation belongs to the dataset. We approximately ask “Is this temperature reading normal, or is it weird?” It’s important to notice that we can often simplify this question. A classification problem that asks “Is this weird or not?” is like anomaly detection.
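A very simple way to ask "is this reading weird?" is a z-score check against past readings. This is only a sketch under the assumption that normal readings cluster around a stable mean. The history values, the `is_weird` name, and the threshold of three standard deviations are choices made for the example.

```python
from statistics import mean, stdev

# Hypothetical history of normal temperature readings (degrees Celsius).
history = [21.0, 21.4, 20.8, 21.1, 20.9, 21.3, 21.2, 20.7]

def is_weird(reading, reference, threshold=3.0):
    """Flag a reading more than `threshold` standard deviations from the mean."""
    z = abs(reading - mean(reference)) / stdev(reference)
    return z > threshold

for reading in (21.2, 27.5):
    print(reading, "weird" if is_weird(reading, history) else "normal")
```

Note that this needs no labeled examples of weird readings, only a sample of what normal looks like.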
How does it all fit together?
I’ve listed common data science questions below organized by Bloom’s Taxonomy. Each question is rephrased to work with a common data science technique. Questions are ordered from easiest to answer to most difficult.
Remember — Who, what, where, or when did something happen?
We answer remember questions by collecting data and manipulating it with SQL, R, Python, etc.
What browser is a particular user using to browse this site?
Using SQL, R, or Python, we find the user in our data and look up the browser they were using.
How did that particular user find this site?
Using SQL, R, or Python, we find the user in our data and look up the recorded source of traffic.
Understand — Can you summarize what happened?
We answer understanding questions by aggregating or summarizing data.
What browsers do my users tend to use?
Again, using SQL, R, or Python we can count the number of users by browser in our data.
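In Python, this kind of count takes only a few lines. The visit log below is made up, with one record per user assumed.

```python
from collections import Counter

# Hypothetical visit log: one (user_id, browser) pair per user.
visits = [
    ("u1", "Chrome"), ("u2", "Firefox"), ("u3", "Chrome"),
    ("u4", "Safari"), ("u5", "Chrome"), ("u6", "Firefox"),
]

# Count users per browser, then list browsers from most to least common.
users_by_browser = Counter(browser for _, browser in visits)
for browser, n_users in users_by_browser.most_common():
    print(browser, n_users)
```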
Apply — What happens when …?
We answer application questions by requiring our results to generalize. Hypothesis testing, cross-validation, and experimental approaches are techniques that ensure generalization.
Is there a relationship between time spent under the sun and the height of a plant?
This is a regression problem, Y = f(X). Y = height of plant. f represents any model that captures the relationship. X = the quantity of time the plant has spent in sunlight.
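As a sketch, we can take f to be a straight line and fit it with ordinary least squares. The choice of a linear model, the `fit_line` helper, and the measurements are assumptions for illustration. Any model that captures the relationship would do.

```python
# Hypothetical measurements: daily hours of sunlight (X) and plant height in cm (Y).
sun_hours = [2, 3, 4, 5, 6, 7, 8]
height_cm = [10.5, 13.2, 14.8, 17.1, 19.4, 20.6, 23.3]

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

intercept, slope = fit_line(sun_hours, height_cm)

def predict_height(hours):
    return intercept + slope * hours

print(f"height = {intercept:.2f} + {slope:.2f} * hours")
print(f"predicted height at 9 hours: {predict_height(9):.1f} cm")
```

A positive slope suggests taller plants go with more sunlight in this sample. To claim the relationship generalizes, we would still test it on data the model has not seen.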
Will this air conditioner fail in the next 3 years: Yes or no?
This is a classification problem, Y = f(X). Y = {fail, don’t fail}. f represents any model that captures the relationship. X is data that records the history of air conditioner failures and related characteristics.
Which animal is in a given image?
This is also a classification problem, Y = f(X), sometimes called multi-class classification. Y = {dog, cat, horse, other}. f represents any model that captures the relationship. The data, X, would be images encoded into tabular form.
What is the likelihood that this customer will buy?
This is a classification problem, Y = f(X). Y = {buy, not buy}. X is data related to customer purchasing habits. Many algorithms can also give you the probability that an observation falls into a particular class.
Is this bank transaction fraudulent?
This is a classification problem, Y = f(X). Y = {fraudulent, not fraudulent}. X is bank transaction data. Anomaly detection may also handle this problem. Anomaly detection may work even if you don’t have past data that labels fraud, but it’s a harder problem.
Analyze — What are the key parts and relationships of …?
To answer analysis questions you break your data apart and look for patterns. Feature selection, dimensionality reduction, and clustering are the key tools.
What factors best predict electricity demand?
This is a regression problem with feature selection, Y = f(X). Y = quantity of electricity demanded. f represents any model that captures the relationship between your data and electricity demanded. X probably includes price, temperature, season, region, and many other features. To find the most important factors we use feature selection to remove factors that don’t predict electricity demand.
What are the key differences between apples and oranges?
This is a classification problem with feature selection, Y = f(X). Y = {apples, oranges}. f represents any model that captures the relationships in your data. X has many characteristics such as height, weight, color, taste, and toughness. Feature selection finds the characteristics that best distinguish apples and oranges.
Which groups of sensors in my HVAC system tend to vary with (and against) each other?
This is a clustering problem because we group similar sensors with each other. We organize the data with sensors as rows and ‘time of reading’ as columns.
What combination of sensors in my HVAC system best displays the overall health of the system?
This is a dimensionality reduction problem. We take a wealth of data and turn it into a few key performance indicators. In this case, we organize the data with different sensors as different columns.
Which viewers like the same kind of movies?
This one is unusual because we try to group similar users and similar movies at the same time. This is typical for a recommendation engine. We can also write a simpler application as “Does this user like this group of movies?” or even simpler as “Will this user like this movie?”
What leadership practices do successful CEOs have in common?
This looks like a grouping question at first. It comes back to key differences once you read between the lines. All successful CEOs eat, but so do all unsuccessful CEOs. We are much more interested in what predicts success.
Evaluate — Is this the best approach?
To answer evaluation questions, you need to extrapolate your data into complex hypothetical cases.
Can we save money by pricing different products better?
This comes down to scenario analysis. We come up with several pricing schemes, then predict their effects using models. This will likely involve classification, regression, and critical thinking.
Create — Can you predict what will happen to … under new conditions?
Creation questions ask you to create new optimal solutions.
What route should my delivery truck take?
This is a well-known optimization problem. The main criterion is to minimize money spent on fuel while making all deliveries on time.
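For a handful of stops we can simply try every ordering and keep the shortest. The sketch below does that, using straight-line distance as a stand-in for fuel cost and ignoring delivery deadlines. The depot, stops, and coordinates are invented. Brute force stops being practical quickly as stops are added, which is why real routing relies on dedicated optimization methods.

```python
from itertools import permutations
from math import dist

# Hypothetical depot and delivery stops on a flat grid (coordinates in km).
depot = (0, 0)
stops = {"A": (2, 0), "B": (2, 2), "C": (0, 2)}

def route_length(order):
    """Total distance for depot -> stops in the given order -> depot."""
    points = [depot] + [stops[name] for name in order] + [depot]
    return sum(dist(p, q) for p, q in zip(points, points[1:]))

# Try every possible visiting order and keep the shortest one.
best_route = min(permutations(stops), key=route_length)
best_length = route_length(best_route)
print(best_route, best_length)
```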
Where should we set up our new location?
Here we need to optimize against a specific criterion. A simple one is to maximize profits, but in reality, more goes into consideration. To write an optimization we need to be able to evaluate locations, which takes us back to the apply, analyze, and evaluate phases.
Where should I place this ad on the webpage so that the viewer is most likely to click it?
You might write this as an optimization, but there are better options. It is inexpensive to move an ad around and see how it performs. That means we can experiment rather than decide ahead of time. Try to position the ad and test its effectiveness. You can even automate this process through A/B testing or reinforcement learning.
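One way to automate the experiment is an epsilon-greedy multi-armed bandit, among the simplest reinforcement-learning-style approaches. Most of the time it shows the ad in the position with the best click rate so far, and occasionally it tries a random position. The sketch below runs against simulated viewers. The ad positions, the "true" click rates, and the parameter values are assumptions for the simulation, not measurements.

```python
import random

# Assumed true click rates per ad position (unknown to the learner).
TRUE_CLICK_RATE = {"top": 0.05, "sidebar": 0.02, "footer": 0.01}

def run_epsilon_greedy(n_views=20000, epsilon=0.1, seed=1):
    """Simulate ad placement that mostly exploits the best-looking position."""
    rng = random.Random(seed)
    positions = list(TRUE_CLICK_RATE)
    shows = {pos: 0 for pos in positions}
    clicks = {pos: 0 for pos in positions}
    for _ in range(n_views):
        if rng.random() < epsilon or min(shows.values()) == 0:
            pos = rng.choice(positions)  # explore a random position
        else:
            # Exploit: pick the position with the best observed click rate.
            pos = max(positions, key=lambda p: clicks[p] / shows[p])
        shows[pos] += 1
        # Simulated viewer clicks with the assumed true probability.
        if rng.random() < TRUE_CLICK_RATE[pos]:
            clicks[pos] += 1
    return shows, clicks

shows, clicks = run_epsilon_greedy()
for pos in shows:
    print(pos, shows[pos], clicks[pos], round(clicks[pos] / shows[pos], 4))
```

On a real site, the simulated click would be replaced by what the viewer actually does. Compared to a fixed A/B test, this approach shifts traffic toward the better position while it is still learning.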
Should my automated cooling and heating system adjust the temperature higher, lower, or leave it where it is?
This is a good area for reinforcement learning. Your cooling system adjusts to input data such as electricity price, time of day, and your stated preferences.
Everything is a nail when we have a hammer
We should ask questions first. It’s easy to get caught in our data and tools. We forget there are broader questions we can tackle.
Data questions follow a continuum from easy to difficult. Asking many small questions builds progress and leads you toward big insights you never expected.