# Machines can now tell you how long your move will take (Part 2 of 3)

Core contributors: Abraar Ahmed, Naveen Lekkalapudi
Editor: Ben Cake

In our previous post, we described methods of statistical analysis employed for move-length estimation and the early stages of transitioning from statistical analysis to machine learning in order to improve and personalize predictive performance. Our drive to work on move-length estimation comes from the belief that estimation plays a key role in setting the proper expectations for customers. In this post, we describe the machine-learning algorithms we use and the process of settling on them.

## Enter learning machines

At a start-up like Bellhops, which is constantly experimenting and focused on scalable growth, every sale brings an influx of new information in the form of categorical and numerical data. Early on, every batch of moves we completed required manual analytical work, which, as volume grew, was bound to become unscalable. The manual process we were going through falls neatly into a class of problems called supervised learning. With good data preprocessing and model validation, this is a class in which algorithms can outperform (and scale faster than) humans. Of course, just knowing the class of problems is not enough to decide which specific algorithm will best solve a particular prediction problem. We decided to use the following:

- Gradient boosting
- Extreme gradient boosting
- Adaptive boosting

The reasons for picking these are outlined in subsequent sections of this post, but the gist is that the methods used in each of the algorithms are traceable and thereby prevent the machine learner from being an absolute black box. This makes the algorithms easy to explain and to adjust, and therefore the default-estimate-setting method can smoothly transition to serving estimates through an application program interface (API) via trained models held in memory.

## Decision-tree recap

Decision trees serve as the foundation of the algorithms we selected to address this problem. They use the idea of inductive reasoning: they generalize from a number of specific examples to arrive at a conclusion. The anatomy of a decision tree consists of three main parts: a root node, branches that represent the outcomes of decisions, and leaf nodes that carry terminal conclusions (in a binary problem, represented by a 1 or 0, or a probability). It is worth noting that the bias, the training time, and the prediction time for a decision tree can all be controlled by changing the maximum depth of the tree.

We use decision-tree-based regression because the move-length-estimation problem isn’t binary; the target is continuous and unbounded. (Theoretically, move length can range from zero hours to infinity, although we hope not.) Regression using decision trees is fairly complex, but it follows the same general principles. For regression, the number of nodes, the number of leaves, and the depth of the tree can be closely monitored and assigned suitable stopping criteria to appropriately partition the curve of the data.
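As a minimal sketch of the idea (the data here is synthetic, not Bellhops order data), a shallow scikit-learn regression tree with a capped depth:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: property size (bedrooms) vs. move length (hours)
rng = np.random.default_rng(0)
X = rng.uniform(1, 5, size=(200, 1))
y = 1.5 * X.ravel() + rng.normal(0, 0.5, size=200)

# Capping max_depth controls bias, training time, and prediction time
tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(X, y)

pred = tree.predict([[3.0]])  # estimated hours for a 3-bedroom move
```

Raising `max_depth` lowers bias at the cost of longer training and prediction, which is the trade-off noted above.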

## Ensembling

Unfortunately, decision trees are fundamentally weak learners, so we use a technique called ensembling, which turns weak learners into strong learners. To qualify as a weak learner, a model merely has to be consistently better than random chance. A strong learner is a model whose residuals (the differences between actual and predicted values) are near zero. There are a number of common ensembling methods, such as boosting, bootstrap aggregating, Bayesian model combination, and stacking. We will give a quick rundown of the methods that produced the best results for us.

Boosting works well when a number of weak learners are combined to build a strong learner. Weak learners, like decision trees, usually have high bias and low variance. Within the decision-tree model, the building blocks are shallow trees, sometimes as small as single-split trees called decision stumps. Boosting reduces error mainly by reducing bias (and, to some extent, variance) by successively reducing loss in the output of a large number of consecutive models. The loss represents the error residuals, and, using this value, the predictions are updated to minimize the residuals. To illustrate, picture a sequence of intermediate predictions, each generated from the residuals of the last, converging on a final prediction that closely approximates the true solution. Extreme gradient boosting uses the same principle, but it has a more regularized model formalization that controls over-fitting. This allows it to push the machines it runs on to their computational limits.
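A small illustration of boosting’s residual-chasing behavior, using scikit-learn’s gradient boosting regressor on synthetic data (parameters are illustrative, not our production settings):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X.ravel()) + rng.normal(0, 0.2, size=300)

# Shallow trees are the weak learners; each stage fits the residuals of
# the ensemble built so far, successively reducing the loss
gbr = GradientBoostingRegressor(
    n_estimators=200, max_depth=2, learning_rate=0.1, random_state=1
)
gbr.fit(X, y)

# Training error shrinks as stages accumulate
errors = [mean_squared_error(y, p) for p in gbr.staged_predict(X)]
```

Plotting `errors` against stage count reproduces the successive-approximation picture described above.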

Adaptive boosting is another advanced modeling technique, but one that is fairly simple to explain. It is a sequential ensemble model: a weak learner runs a prediction, and the next weak learner is employed to work on the resultant output. What distinguishes it is that each subsequent step depends on weights assigned to the regressors and samples, in a manner that makes the regressors focus on the observations with the highest loss.

## Why not use weak learners?

After completing thousands of moves, we’ve learned from the data that our company is not the poster child for the central limit theorem. Even after running a logarithmic transformation on the data, we can see that the bottom half of the data deviates from the expected regression line. Because weak learners have an error rate only slightly below 0.5, they would not perform adequately on their own. And the problem is further compounded when the distribution of the target data does not fit neatly into a normal distribution.
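The transformation mentioned above can be sketched like this on synthetic, right-skewed data (a stand-in for real move lengths):

```python
import numpy as np

rng = np.random.default_rng(3)
# Right-skewed positive values, a stand-in for raw move lengths in hours
hours = rng.lognormal(mean=1.2, sigma=0.5, size=1000)

# log1p compresses the long right tail toward a more symmetric shape
log_hours = np.log1p(hours)

def skewness(x):
    x = np.asarray(x, dtype=float)
    return float(np.mean((x - x.mean()) ** 3) / x.std() ** 3)

raw_skew = skewness(hours)      # strongly positive for the raw data
log_skew = skewness(log_hours)  # much closer to zero after the transform
```

Even so, as noted above, a transform like this does not guarantee a fit to a normal distribution.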

## Separate models for different high-level features?

We have experimented with the idea of building separate models for high-level features. In Bellhops’ case, these features could be a particular market or move type, which inherently determine the effect of other characteristics. This turned out to be an unwieldy development strategy, and model performance was inconsistent across the separate models. Featured above are examples of two markets for which separate models were built. As it turns out, when everything goes well (the charts on the left), the testing performance (the green line on the lower chart) is close to the training performance (the blue line on the lower chart) in terms of error, but when things don’t go well, we see a big divergence (the charts on the right).

There are several reasons why “things don’t go well,” and chief among them is that splitting on selected features leaves far less data to build each model on. In a way, we made a split akin to one a decision tree makes, but on an arbitrary feature. This is not acceptable, so we designed a unified, integrated model. We observed that learning transfers across all the features, and that information is valuable.

## Feature engineering

As we developed the model, we knew that we were working within the constraints of a lookup-table data structure, which was a short-term solution. This resulted in two models with differing philosophies: One model purely targeted the reduction of runovers by estimating quantiles from the explicit questions asked in the order flow; the other used data from the explicit questions, behavioral data, and metrics of importance, targeting hit rate while preventing the runover rate from increasing. This meant that performance metrics were different for each model and the features had to be engineered differently. For example, the numerical market identifiers are not supposed to be used as continuous numerical features, so the markets were ranked on the basis of historical executed order count and a forecast of expected orders, and that ranking was used as an ordinal feature.
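A hypothetical sketch of that ranking step (column names and figures are invented for illustration; our real pipeline also folds in a forecast of expected orders):

```python
import pandas as pd

# Hypothetical order history; market IDs are labels, not magnitudes,
# so they must not be fed to the model as continuous numbers
orders = pd.DataFrame({
    "market_id": [101, 101, 101, 205, 205, 309],
    "move_hours": [3.5, 4.0, 5.0, 2.5, 3.0, 6.0],
})

# Rank markets by historical executed order count (busiest market = 1)
# and use the rank as an ordinal feature in place of the raw ID
counts = orders["market_id"].value_counts()
market_rank = counts.rank(method="first", ascending=False).astype(int)
orders["market_rank"] = orders["market_id"].map(market_rank)
```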

This was a working solution, but it added a subjective, preprocessed element of market prioritization. A more appropriate solution would have been to categorize the market identifiers and reshape the output, as we were doing with all other categorical features. We didn’t go down that route because all the other features we used as categories had a finite number of categories that would hold over a long period, so we could estimate the time it would take to convert the prediction output to a lookup table. Being privy to Bellhops’ market-launch schedule, however, we recognized that the linear increase in conversion time for every new categorical market identifier would not be sustainable for the lookup-table data structure. Another drawback of lookup tables is that continuous data creates infinite splits, so we had to create equidistant categorical variables from the continuous features available to us. The trade-offs we made with the former model in the feature-engineering space were another indication that we needed a solution whose estimate accuracy wasn’t limited by the structure of the table, which was defined by the constraints the CX engineering team had at the moment.

Eventually, we developed a different architecture that is summarized later in this post (and explained in detail in future posts). Its salient feature is holding the most recent batch-processing model in memory and making predictions on the fly. A lot of feature-engineering complexity is better handled this way, as quantitative and categorical features can be used in their true nature no matter the range of options. In the featured example, the market identifiers, when used as categories, still held a majority of the feature importance. This led us to develop structural metrics for the markets, study their value as features in place of the market identifiers, and add those of value as features as well. The new architecture also opened up avenues for using customer behavior on the website as features predictive of customer preparedness.

## Hacking uncertainty estimates

Quantile regression is a method by which uncertainty in regression estimates can be quantified. We figured that, to meet our estimation goals within the lookup-table framework, we could modify the loss function of the models we had trained and, using the measure of uncertainty as the target instead, predict a specific quantile so as to keep runovers consistently below a certain threshold. After running a number of tests, we settled on the 85th quantile as the target to keep runovers consistently around 17 percent.

Our implementations of quantile regression:

1. The popular machine-learning library scikit-learn’s gradient boosting regressor has parameter provisions for this: the loss parameter can be set to target quantiles, and the alpha parameter selects the quantile to be estimated.
2. The extreme gradient boosting and adaptive boosting regressors, on the other hand, do not have parameter provisions to do this easily; a custom loss function must be written. We tweaked both regressors by making changes in the base-score or weight cumulative-distribution functions within the pertinent Python classes to get results at the 85th quantile. Developing special gradients for the loss function is a work in progress that we will use with models held in memory for the original purpose of estimating uncertainty.
3. The methods we employed to set the lookup-table defaults were appropriate to meet runover-rate goals, but using them in the transition to personalized estimates would result in high estimates per order. In the latter case, we use loss functions that target the 50th quantile, and other quantiles are used to maintain a record of uncertainty per order estimate.
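As one concrete example of the first approach above, scikit-learn’s quantile loss targeting the 85th quantile on synthetic data (everything here is illustrative, not our production configuration):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(4)
X = rng.uniform(1, 5, size=(500, 1))
y = 1.5 * X.ravel() + rng.normal(0, 0.5, size=500)

# loss="quantile" with alpha=0.85 targets the 85th percentile of move
# length, so only a small fraction of actuals should exceed the estimate
q85 = GradientBoostingRegressor(
    loss="quantile", alpha=0.85, n_estimators=200, random_state=4
)
q85.fit(X, y)

runover_rate = float(np.mean(y > q85.predict(X)))  # fraction of runovers
```

Roughly 15 percent of actuals should exceed such an estimate, which is the lever for keeping runovers near a target rate.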

When computed for a model, feature (or variable) importance yields the feature set on which the model makes its splitting decisions. Above is one example, using a gradient-boosted regression model in development; we used it to prune features that provide little or no value to the model, which reduced both training time and memory use in subsequent training runs.
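A sketch of importance-based pruning on synthetic data with a deliberately useless third feature (threshold and data are invented for illustration):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(5)
n = 400
X = np.column_stack([
    rng.uniform(1, 5, size=n),    # informative (e.g., property size)
    rng.integers(0, 3, size=n),   # informative (e.g., ordinal move type)
    rng.normal(0, 1, size=n),     # pure noise: should earn low importance
])
y = 1.2 * X[:, 0] + 0.8 * X[:, 1] + rng.normal(0, 0.3, size=n)

gbr = GradientBoostingRegressor(random_state=5).fit(X, y)

# Importances sum to 1; features under a small threshold are pruning
# candidates, which shortens later training runs and saves memory
importances = gbr.feature_importances_
keep = importances > 0.01
```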

## Building a model picker, stacked ensembles, engineering a better solution for model deployment, and serving move-length estimates

Once we completed development of the model, we needed an algorithm for picking the customer segments to which estimates are delivered. Here we employed divergent concepts: The lookup table of estimates did not require a picker, as there was no personalization involved in the estimates produced. We simply computed the root mean square error (RMSE) between the actual move lengths and the estimates each week to ascertain that the model was consistently performing well.

In the case of the models used to serve the personalized estimates, we built a more complex picker. First, we computed the RMSE of the estimates against the actual move lengths, inclusive of the training set and the weekly additional orders. Next, we took the business tolerance for hits and runovers and assigned thresholds for all of the metrics. Then the customer segments, which have hundreds of features, were rolled up into constituent elements of four features (market, property type, property size, and move type), on which the metrics were applied to measure whether they met the standard. This was done for each of the three models (gradient boosted, extreme gradient boosted, and adaptive boosted), and the segments were assigned to the respective models for the next week. Only customers who fell into these segments were served move-length estimates from the appropriate models. Those who didn’t fit into any of the segments were served estimates from the lookup table. The strategy is to work on different modeling techniques to capture as many customers as possible in the segments being served. To that end, we experimented with stacked ensembles.
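The segment roll-up and thresholding above might be sketched like this (segment names, thresholds, and numbers are invented, and real segments roll up far more than two features):

```python
import numpy as np
import pandas as pd

# Invented weekly results: actual move lengths vs. one model's estimates
results = pd.DataFrame({
    "market":    ["ATL", "ATL", "CHA", "CHA", "CHA"],
    "move_type": ["full", "full", "full", "labor", "labor"],
    "actual":    [4.0, 5.0, 3.0, 2.0, 6.0],
    "estimate":  [4.5, 4.8, 3.2, 2.1, 3.0],
})

RMSE_THRESHOLD = 1.0  # illustrative business tolerance, in hours

# Roll orders up into segments and keep only segments whose RMSE meets
# the threshold; failing segments fall back to the lookup-table defaults
results["sq_err"] = (results["actual"] - results["estimate"]) ** 2
segment_rmse = np.sqrt(
    results.groupby(["market", "move_type"])["sq_err"].mean()
)
served = segment_rmse[segment_rmse <= RMSE_THRESHOLD].index.tolist()
```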

Stacking is a technique in which another learner, called a meta-learner, is trained on the outputs of strong base learners to produce a final output. Since a lot of research has gone into the strong learners already developed, we will work on methods of ensembling them to provide better estimates.
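A minimal stacked-ensemble sketch with scikit-learn’s StackingRegressor (the base learners and meta-learner here are illustrative choices, not a committed design):

```python
import numpy as np
from sklearn.ensemble import (
    AdaBoostRegressor,
    GradientBoostingRegressor,
    StackingRegressor,
)
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(6)
X = rng.uniform(1, 5, size=(300, 2))
y = 1.5 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.3, size=300)

# Strong base learners feed a simple meta-learner, which is trained on
# their cross-validated predictions to produce the final output
stack = StackingRegressor(
    estimators=[
        ("gbr", GradientBoostingRegressor(random_state=6)),
        ("ada", AdaBoostRegressor(random_state=6)),
    ],
    final_estimator=RidgeCV(),
)
stack.fit(X, y)
pred = stack.predict(X[:3])
```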

We hinted at new engineering work done at Bellhops that is central to this initiative. In the next post, we will dive into details of the work done in the data-engineering space: the infrastructure, the architecture, the APIs needed to deliver machine-learning recommendations/estimates/forecasts—and how all of this interfaces with existing applications.