Problem Solving Framework Part 2: Problem Chunking

Timothy Lu
7 min read · Jan 9, 2023


This is part 2 of a series on using problem decomposition to solve coding problems as a data scientist: problem chunking. See part 1 (here) for an overview of the framework.

Now that we have an understanding of the problem-solving framework, let’s dig deeper into the first step: problem chunking. We will cover how to form your chunks using the “mutually exclusive, collectively exhaustive” (MECE) principle, and I will apply it to a data science problem so you can see how this step fits into the framework.

Although this step is less technical in the sense that you are not writing (much) code, it is an important storyboarding step and a great moment to clarify objectives, metrics, and design needs. It is here that you abstract your problem into its components and understand each step necessary to accomplish the task. We want to create chunks that are mutually exclusive, meaning each chunk performs a specific task without overlapping with the others, and we want the set of chunks to be collectively exhaustive, meaning together they cover every task necessary to reach our goal.

In general, when we are problem chunking we will follow a series of steps:

  1. Take the larger problem and divide it into major chunks
  2. Take each major chunk and break it down into smaller sub-chunks
  3. Check whether these sub-chunks are MECE; if not, break them into smaller chunks
  4. Confirm that the chunks, taken together, accomplish the task
  5. Communicate with stakeholders regarding deliverables, metrics, and milestones

The final step requires some finesse. It is a great idea to go back and jot down notes on how you might tackle each chunk. At this point, check whether any metrics are necessary for each chunk and whether those need to become additional steps. Check that the end result is the desired deliverable. Write down the stakeholders who may be interested at each step and the milestones you need to communicate.

You may need to go through these steps multiple times to accomplish the task, and you might make multiple storyboards with different options. You might be halfway through solving your current storyboard and realize you missed a step. That is all fine. By having these small chunks, we can easily slot in more chunks as needed, move parts around, change our flow depending on infrastructure changes, and so on. It is easier to move small pieces around than one gargantuan whole.

Let’s apply these steps to a real problem and dig deep into one sub-chunk so we can see what this actually means in practice. To get a better sense of the framework’s scope, let’s talk about developing the end-to-end deployment of a machine learning model. Imagine you are working for a startup that wants to use user feedback in a restaurant app to provide recommendations. You are part of the team solving this problem and have been asked how you would tackle it. How could we go about this?

Let’s start with step 1: taking the bigger problem at hand and breaking it down into smaller parts. Our goal is to build a recommendation system that uses user feedback. What are the major components? The major chunks we need are things like collecting user data, processing and storing the data (ETL), building the model, validating the model, and deploying the model. Of all of these, let’s say we want to get into the nitty-gritty of building the model.

Now that we have the major chunk of building the model, what are the general parts of building a model? We need to get data from our data storage system, pre-process the data, perform exploratory data analysis, wrangle the data, perform feature engineering, and finally train and test the model. We don’t consider validation here because that is a separate chunk, and the process of validating a model can be quite involved. These might not be your exact chunks, and the definitions of phases may differ from team to team, but this would be my breakdown.
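Before getting into each chunk, it can help to jot this breakdown down in your notebook as nothing more than plain data, so you can rearrange chunks as your storyboard evolves. The sketch below is just one way to do that; the chunk names are my own hypothetical labels for the sub-chunks above.

# A rough storyboard of the "build the model" chunk, captured as plain data
build_model_chunks = {
    "get_data": ["connect to SQL server", "pull the data", "save it locally"],
    "pre_process": ["handle missing values", "drop empty columns", "clean text"],
    "eda": ["summary stats", "correlations", "distributions", "outliers"],
    "wrangle": ["one-hot encode", "standardize", "vectorize text"],
    "feature_selection": ["combine features", "drop collinear or constant features"],
    "train_test": ["train/test split", "baseline", "candidate models", "tuning"],
}

for chunk, sub_chunks in build_model_chunks.items():
    print(f"{chunk}: {', '.join(sub_chunks)}")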

Let’s get into each of these chunks. Is there anything smaller for getting data from our data storage system? Potentially! Maybe we need to connect to a SQL server, collect the data from the server, and save it locally. Are these worth being separate chunks?
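Even a few throwaway lines can help you decide. Here is a minimal sketch of that chunk, assuming a local SQLite database called reviews.db with a reviews table (both names are hypothetical).

import sqlite3
import pandas as pd

conn = sqlite3.connect("reviews.db")                            # connect to the database
reviews_raw = pd.read_sql_query("SELECT * FROM reviews", conn)  # collect the data
conn.close()

reviews_raw.to_csv("reviews_raw.csv", index=False)              # save a local copy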

What about pre-processing the data? There might be a chunk for filling in or removing missing values, deleting empty columns/features, and so on. In general, we are cleaning the data to make sure that all we have is good, clean data, and that might involve some initial exploration to better understand it. If we are working with text-based data, this may also be a good time to perform some initial pre-processing such as lemmatization and removal of stop words. This leads us into exploratory data analysis, where we look at our data visually or statistically. Maybe we plot some histograms, look at summary statistics and correlations between values, and explore the text reviews.
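For the text cleanup in particular, a rough sketch using NLTK might look like the following, assuming the table from the previous step has a free-text review column (a hypothetical column name).

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("punkt")       # tokenizer models
nltk.download("stopwords")   # stop word lists
nltk.download("wordnet")     # lemmatizer data

reviews_raw = reviews_raw.dropna(subset=["review"])   # drop rows with no review text
reviews_raw = reviews_raw.dropna(axis=1, how="all")   # drop completely empty columns

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    tokens = nltk.word_tokenize(text.lower())
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
    return " ".join(lemmatizer.lemmatize(t) for t in tokens)

reviews_raw["clean_review"] = reviews_raw["review"].apply(clean_text)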

Afterwards, we dive deep into wrangling and feature selection. Perhaps in wrangling we are now one-hot encoding our categorical variables, standardizing our data, and vectorizing our text features. In feature selection, we are using the information gained from EDA to decide which features we wish to keep. Perhaps we combine features which belong together, remove features with high collinearity, or remove features which are constant.
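Sticking with the same hypothetical column names (a categorical cuisine column, a numeric price column, and the cleaned review text from earlier), a sketch of these two chunks with pandas and scikit-learn might look like this.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer

wrangled = pd.get_dummies(reviews_raw, columns=["cuisine"])         # one-hot encode categorical vars

scaler = StandardScaler()
wrangled[["price"]] = scaler.fit_transform(wrangled[["price"]])     # standardize numeric features

vectorizer = TfidfVectorizer(max_features=5000)
text_features = vectorizer.fit_transform(wrangled["clean_review"])  # vectorize the review text

# Feature selection: drop columns flagged during EDA (names are placeholders)
wrangled = wrangled.drop(columns=["constant_feature", "collinear_feature"])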

Lastly, we are training/testing our model. We need to create a train/test split. Then we need to decide which models/algorithms we wish to test, how we want to establish our baseline performance, how we want to tune the models, and how we will use the testing data.
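If, purely for illustration, we framed this as a simple supervised problem with a feature matrix X and target y coming out of the wrangling step (both assumed here), a first sketch might set a trivial baseline and then compare a tuned candidate against it.

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression

# X, y are assumed to come from the wrangling/feature selection chunks above
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Baseline: always predict the most common class
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
print("baseline accuracy:", baseline.score(X_te, y_te))

# Candidate model with simple hyperparameter tuning
candidate = GridSearchCV(LogisticRegression(max_iter=1000),
                         param_grid={"C": [0.1, 1.0, 10.0]},
                         cv=5)
candidate.fit(X_tr, y_tr)
print("candidate accuracy:", candidate.score(X_te, y_te))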

Taking a step back and looking at all these chunks together, this is the process just for understanding how to build the model. We will now go back and take notes on how we might accomplish these tasks. Are we gathering the data using SQL? Are we pre-processing with Pandas? Are we exploring using seaborn? What are some of the tools we will use? Additionally, are we going to create benchmarks for when we want to check in with certain teams? What metrics will we use to measure performance? Note that I’ve laid out the chunks here in a more free-flowing manner, primarily for visual formatting; you may find it beneficial to lay them out in a more linear format.

If we were to do this in our Jupyter Notebook or IDE of choice, we might want to write some pseudocode. Take a look below at how you might format these chunks in Python. The goal isn’t to write the exact code but to begin structuring it out so that we can get an idea of what we might need and make sure we’ve got all of our packages and workflow ready.

## -- BUILDING MODEL -- ##
### -- Imports -- ###
import sqlite3
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from tensorflow import keras

### -- Getting Data -- ###
conn = sqlite3.connect("reviews.db")                        # connect to the database
data_df = pd.read_sql_query("SELECT * FROM reviews", conn)  # pull all the data
conn.close()

### -- Pre-Processing Data -- ###
data_df = data_df.fillna(...)           # fill in our N/As
data_df = data_df.replace(...)          # replace values as needed
data_df = data_df.drop(columns=[...])   # drop unnecessary columns
stop_words = set(stopwords.words("english"))   # remove stop words from the reviews
lemmatizer = WordNetLemmatizer()               # lemmatize the review text

### -- EDA -- ###
sns.heatmap(data_df.corr())   # look for correlations
sns.pairplot(data_df)         # visualize relationships
data_df.hist()                # see how the data is distributed
data_df.describe()            # check summary stats
data_df.boxplot()             # look for outliers

### -- Wrangling Data -- ###
wrangle_df = pd.get_dummies(data_df)                               # one-hot encode our categorical vars
wrangle_std = StandardScaler().fit_transform(wrangle_df)           # consider standardizing our data
wrangle_vect = TfidfVectorizer().fit_transform(data_df["review"])  # vectorize our text

### -- Feature Selection -- ###
wrangle_df = wrangle_df.drop(columns=bad_features)   # remove unnecessary features
wrangle_df["combined_feat"] = wrangle_df["column_1"] + wrangle_df["column_2"]
wrangle_df = wrangle_df.drop(columns=["column_1", "column_2"])   # drop the originals once combined

### -- Training and Testing -- ###
X_tr, X_te, y_tr, y_te = train_test_split(X, y)   # X = features, y = target from above

# Candidate 1: a simple neural network
model_1 = keras.Sequential([...])    # input layer, Dense, Dense, output layer
model_1.fit(X_tr, y_tr)              # train
model_1.evaluate(X_te, y_te)         # test / metrics

# Candidate 2: a transformer-based model
model_2 = ...                        # e.g. a pre-trained transformer to fine-tune
model_2.fit(X_tr, y_tr)              # train
model_2.evaluate(X_te, y_te)         # test / metrics

We will take this initial storyboard to any internal stakeholders who may be involved and get their input. Get insight into which chunks work and which are missing. Consider whether each chunk is mutually exclusive and whether we are being collectively exhaustive in our work.

In the next part of the series, we finally get into the exciting part… how we actually *solve* each of these chunks and get into the technical parts of writing code and building out each component!
