Unlocking Warehouse Efficiency: A Data-Driven Journey through Operational Statistics — Pt 2: Linear Coefficient Estimation and Sample Size

Macklin Fluehr
9 min read · Feb 23, 2024


This is part of a 4-part series on Operational Statistics:

  1. Pt 1: Problem Background
  2. Pt 2: Coefficient Estimation and Sample Size
  3. Pt 3: Coefficient Estimation and Information Density
  4. Pt 4: Coefficient Estimation with Constrained Optimization

By now, you should have read Part 1 of this series, which sets up the background for this article. The goal today is to build an intuition around the question, "Do I have enough data?"

To recap, our goal is to generate accurate coefficients in a linear regression problem. The training data for my model consists of:

  • X — counts of item types completed by an operator during a shift
  • y — the total time working on the shift

Our linear regression model looks like the following, where we are trying to estimate 𝛼, the time it takes to process each item category.
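(The original post shows the model as an image; written out from the definitions above, it is the standard linear form, where k is the number of categories:)

T = \sum_{i=1}^{k} \alpha_i I_i + \varepsilon

Here T is the total shift time, I_i is the count of items completed in category i during the shift, \alpha_i is the per-item processing time we want to estimate, and \varepsilon absorbs noise such as operator efficiency.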

In other words, if I have an operator that worked for T = 4 hours and they worked on I = 4 sofas, I would run my linear regression and find that 𝛼 = 1 hour / sofa.

The Dilemma

In most traditional machine learning setups, I'm interested in model accuracy. I would create a train-test split, run my model at different sample sizes, and assess the bias-variance tradeoff to help me answer, "Do I need more data?"

For the problem above, we're interested in accurate coefficients. While this often correlates with low-bias/low-variance models, model accuracy doesn't directly tell you how far off your sofa processing estimate is. It just tells you how well the coefficients, added together, can predict a shift length. I personally wanted a better sense of whether my sofa estimate was off by 20%, or by 30%.

When I was building my model, I felt like I lacked intuition about how accurate my coefficients would be with more or less data. Specifically, I was interested in how the ratio of the number of categories to my sample size impacted coefficient accuracy.

In the problem I set up in the last article, I had around 20 categories, which happened to be the furniture categories we use at my business (Chairs, Ottomans, Recliners, etc). While it would be amazing to get processing times for every category, I was worried that for the amount of data we had, 20 categories might be too big of a problem space. Could I get accurate coefficients with that many categories?

In the case that I didn’t quite have enough data to do every category, I figured one way to reduce the dimension of the problem would be to, quite literally, create fewer categories, or group my categories into more general bins. For example, rather than categories, I could do item sizes {small, medium, large, X-large}. 5 categories vs 20 categories would be a much smaller space for my model to explore and optimize over, resulting in likely improved coefficient estimates.

The problem was, I didn’t have an intuition about how much improvement my model would see if I did this. I also didn’t have any ground truths to test my assumptions.

When you don’t have real data to work with, simulations are one way to get traction. Hence, I decided to simulate my problem.

As we go through the next couple steps, I’ll walk you through how I generated my synthetic datasets and then how I ran my modeling simulations.

Creating the Datasets

To create my simulation data, I ultimately wanted to create dataframes that looked like this:

Training Dataset

Each row would have a total_time the operator worked that day (in minutes). The remaining dataframe columns would be the categories, holding counts of how many items the operator completed in each category during their shift.

I would need to know and store the “true” category time for each category so that I could compare a ground truth with my model simulations later on.
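To make that target shape concrete, here is a toy example of the two artifacts I was after (the values are purely illustrative, not output from the simulation):

import pandas as pd

# Ground truth: "true" minutes per item for each category (illustrative numbers)
category_times = {0: 2.7, 1: 8.1, 2: 4.4}

# Training data: one row per shift, per-category item counts plus total minutes worked
df = pd.DataFrame({
    "total_time": [212.5, 341.0, 97.3],
    0: [12, 20, 4],
    1: [9, 15, 3],
    2: [14, 22, 6],
})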

Some basic assumptions:

  1. The simulated category times were uniformly distributed between 0 and 10 minutes per item. In real life they might follow a stronger pattern, but I thought the time range mapped well to the reality of my problem, so a uniform distribution was an acceptable choice.
def get_category_time():
    return np.random.random_sample() * 10  # we can take up to 10 minutes to process a category

  2. Worker shifts could be any length, but max out at 8 hours. This loosely reflects the reality at our warehouses.

max_minutes_worked = 8 * 60  # 8 hour workday

def get_min_worked_limit(max_minutes_worked):
    return np.random.random_sample() * max_minutes_worked

  3. Operators vary in their efficiency. Some operators are better at their jobs than others, some days are better or worse than others, and sometimes you get a weird item that takes longer or shorter than a normal item. I made the assumption that, on any given shift, an operator could work 25% better or worse than expected, drawn completely at random to lump together the three sources of noise above.

def get_operator_efficiency():
    """
    The assumption here is that no operator works more than 25% faster or slower
    than expected on a given shift, so the efficiency multiplier ranges from 0.75 to 1.25.
    """
    return 0.75 + (0.5 * np.random.random_sample())

Now that we've got that out of the way, we can look at my dataset-generation code below. I created datasets on multiples of 5 categories, ranging from 5 to 50 categories (5, 10, 15, 20, … categories). As you can see, I'm storing my ground-truth category_times to a pandas dataframe for use later. The code below simulates training data shift by shift, incorporating the randomness outlined in the three assumptions above.

def get_samples(num_categories, num_samples=10000, max_minutes_worked=max_minutes_worked):

    # Set up dictionaries
    categories = list(range(num_categories))
    category_times = {k: get_category_time() for k in categories}
    item_counts = {k: [0] * num_samples for k in categories}
    totals = [0] * num_samples

    for sample in range(num_samples):

        # Create a shift length
        max_mins = get_min_worked_limit(max_minutes_worked)
        total_minutes_worked = 0

        # Fill the shift with items
        while total_minutes_worked < max_mins:
            category = np.random.choice(categories)
            item_counts[category][sample] += 1
            total_minutes_worked += category_times[category]

        # Every shift has a different efficiency; create some noise
        totals[sample] = total_minutes_worked * get_operator_efficiency()

    item_counts["total_time"] = totals

    return category_times, item_counts

num_categories = range(5, 55, 5)
category_times = []
dfs = []

for idx in num_categories:
    category_time, out_dict = get_samples(idx, num_samples=idx * 1000)

    category_times.append(category_time)
    dfs.append(pd.DataFrame(out_dict))

    pd.DataFrame(pd.Series(category_time)).to_parquet("data/category_times_" + str(idx) + ".parquet")
    dfs[-1].to_parquet("data/data_" + str(idx) + ".parquet")  # idx is the category count, so grab the latest dataframe

Modeling

Now that I created my training sets, it was time to run my simulations. For each number of categories, I wanted to run a linear regression on a range of sample sizes and calculate the accuracy of my coefficients.

To estimate coefficient accuracy, I opted for mean absolute percentage error (MAPE). This kind of accuracy metric lets me ask simple questions like "what percent am I off with 200 samples versus 2,000 samples?"
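For reference, MAPE over the k coefficient estimates is just the average relative error:

\text{MAPE} = \frac{1}{k} \sum_{i=1}^{k} \left| \frac{\alpha_i - \hat{\alpha}_i}{\alpha_i} \right|

where \alpha_i is the simulated "true" category time and \hat{\alpha}_i is the coefficient the regression recovers.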

The code to run a simulation on a dataframe for a given sample size and get the error is pretty simple:

def get_error(df: pd.DataFrame, n_samples: int):

    # Sample DataSet
    sample = df.sample(n_samples)

    # Fit Model
    X = sample.drop("total_time", axis=1)
    y = sample["total_time"]

    model = LinearRegression()
    model.fit(X, y)

    # Get Coefficients
    estimated_coefficients = {
        k: v for k, v in zip([int(c) for c in model.feature_names_in_], model.coef_)
    }

    # Estimate Error
    result_df = pd.concat([
        pd.Series(estimated_coefficients, name="estimates"),
        pd.Series(category_times, name="actuals")
    ], axis=1)

    return mean_absolute_percentage_error(
        result_df["actuals"],
        result_df["estimates"]
    )

The code to loop through all my dataframes and sample sizes is only a little longer:

import pandas as pd
import numpy as np

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error

error_dict = {
    "n_categories": [],
    "n_samples": [],
    "coeff_mape": []
}

for n_category in range(5, 55, 5):

    df = pd.read_parquet("data/data_" + str(n_category) + ".parquet")
    category_times = pd.read_parquet("data/category_times_" + str(n_category) + ".parquet")[0].to_dict()

    sample_size_step = int(n_category / 5)
    n_repeats = 10

    for n_samples in range(sample_size_step, n_category * 1000, sample_size_step):

        errors = [get_error(df, n_samples) for _ in range(n_repeats)]

        # Write Results
        error_dict["n_categories"].append(n_category)
        error_dict["n_samples"].append(n_samples)
        error_dict["coeff_mape"].append(np.mean(errors))

        if n_samples + sample_size_step > df.shape[0]:  # can't take more samples than we have available
            break

Note n_repeats. For each sample size, I repeated the sampling and model fitting n_repeats times to bring a little more stability to the estimate at that sample size. This ultimately gave me smoother, more intuitive plots.
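The chart in the next section appears as an image in the original post; for reference, a minimal matplotlib sketch along these lines (my own, not the article's code) could reproduce it from the error_dict built above:

import matplotlib.pyplot as plt
import pandas as pd

errors = pd.DataFrame(error_dict)

fig, ax = plt.subplots(figsize=(10, 6))
for n_category, group in errors.groupby("n_categories"):
    ax.plot(group["n_samples"], group["coeff_mape"], label=f"{n_category} categories")

ax.set_xlabel("n_samples")
ax.set_ylabel("coeff_mape")
ax.legend()
plt.show()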

Results

To recap, the research question of interest is how coefficient accuracy changes as a function of the ratio of n_category to n_samples. I figured that for more categories I would need more data. But how much more?

Alongside this, I needed to define what counted as a good MAPE. Did a change of 2% mean a lot, or a change of 20%? For my problem, I thought anything below 20% error would be acceptable to the business. Anything higher and you get the uncomfortable feedback from your stakeholders that "this data stuff is too fancy and not even accurate. Let's just time it and we can get better estimates." I prefer to avoid arguing statistical significance with non-technical stakeholders, so let's see if we can hit those numbers on the first try.

Below is a plot, for each dataset of 5–50 categories, of the coefficient MAPE as we increase the number of samples:

I think the cool part of this graph is that, even though the curves for each n_category spread a bit vertically, the general horizontal shapes are the same. Using the famous elbow method (i.e., 80–20 this thing), it seems like, across the board, once I hit 200 x n_category samples, I start getting diminishing returns on coefficient accuracy. In other words, to decide how much data I'd want in real life, or how much I'd want to cut down my number of categories given the amount of data I have, I could use this 200:1 ratio as a general rule of thumb.
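As a back-of-the-envelope way to apply that rule of thumb (my own illustration, not part of the analysis above; the 3,000-shift figure is hypothetical):

def max_supported_categories(n_samples: int, ratio: int = 200) -> int:
    """Roughly how many categories the data can support under the 200:1 heuristic."""
    return n_samples // ratio

max_supported_categories(3000)  # with ~3,000 labeled shifts, roughly 15 categories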

To speak to the vertical spread a bit: even with a proportionate amount of data, there is a negative correlation between the number of categories and coefficient accuracy. In other words, it's not enough to just collect more data for problems with more categories. These problems are more complex. Either you accept that your measurements won't be as accurate, or you need additional techniques to make up for it.

Conclusion and Next Steps

The 200:1 sample-to-category ratio was super interesting. I realized that, with the amount of data I had available, I likely wanted to bin my categories into more abstract groups to get coefficients within that 20% accuracy window.

At this point, however, I still had more questions. The first was around information density. As I increased the number of categories, the model's performance flatlined at a higher MAPE than models with fewer categories. I wondered whether this was more a result of how I had set up the constraints outlined earlier than of the feature dimension itself. If I had richer training data, could the degradation at higher dimensions be mitigated?

More concretely, given how long an operator's shift is, they can only process a handful of items in a day. As the number of categories increases while n_items stays fixed, the likelihood of collecting any data on a given category goes down. That means the sparsity of our training data goes up and the potential amount of information we collect per sample goes down. How that ratio changes for different problems is an interesting question to answer: how does the ratio of n_category to %_sparsity impact model performance?
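A quick sketch of how that sparsity could be measured on one of the simulated dataframes (my own helper, ahead of the next article):

import pandas as pd

def training_sparsity(df: pd.DataFrame) -> float:
    """Fraction of category-count cells that are zero across all shifts."""
    counts = df.drop("total_time", axis=1)
    return float((counts == 0).to_numpy().mean())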

Secondly, I was wondering if there were ways to improve coefficient accuracy just with smarter modeling techniques. I had some ideas about how my model coefficients should look and relate to each other. Could I bring them into my model?

Stay tuned for the next two articles in this series to find out more!

  1. Pt 3: Coefficient Estimation and Information Density
  2. Pt 4: Coefficient Estimation with Constrained Optimization


Macklin Fluehr

Macklin Fluehr is an engineer, machine learning specialist, and designer working to transform sustainability with a cross-disciplinary approach.