By the summer of 2017, Bellhops had been around for six years and completed more than 100,000 moves. Along the way, we expanded the services we arrange from labor-only to labor with the option to include a truck. As the company grew, we noticed that one of the biggest contributors to customer satisfaction was our ability to complete moves in the time we estimated. In the initial stages, the company’s estimates relied solely on the size and type of property (as provided by the customer), but as we grew, we noticed a wide variance based on just these two factors. The rate of runovers kept increasing, and the rate of misses also increased. This is when we began to study move-length estimation in earnest.
Move-length estimation plays a key role in setting the proper expectations for customers. The length of a move varies significantly based on the market, the type of service, and customer behavior. It is important for us to understand the effects of each of these factors before providing a customer with an estimate. In this post, we describe the process of transitioning from statistical analysis to machine learning in order to improve and personalize predictive performance.
Till now, you probably never thought a crew of data scientists would be just who you wanted to help you move. This investigation should help change your mind.
Estimation and customer joy
As data scientists, we seek insights that will inform the development of our product while remaining cognizant of the product’s effect on the end user. In this case, the end user is the customer, and the metrics we use are industry standards such as net promoter score and the percentage of moves that require a cash appeasement. Because reliability and satisfaction are core company goals, these customer metrics are used throughout the business. We explored data collected over the history of executed moves, with a focus on the executed move length, to better understand appeasement tolerance and customer satisfaction in relation to move-estimation accuracy. We learned that customers are happy if we complete their moves in half the time that they expected, and that they tolerate a move going over the expected time by up to 10 percent, while appeasement payouts increase drastically for moves that exceed 110 percent of the estimate. Bearing this in mind, we defined the runover-rate metric as the share of moves whose length is greater than 110 percent of the estimate and the hit-rate metric as the share of moves whose length is between 70 percent and 110 percent of the estimate.
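To make the two metrics concrete, here is a minimal Python sketch. It assumes each move is an (estimated hours, actual hours) pair, counts a move as a runover when it exceeds 110 percent of its estimate, and treats both ends of the 70-to-110-percent hit band as inclusive. The function names and sample numbers are illustrative assumptions, not production code.

```python
def runover_rate(moves, upper=1.10):
    """Share of moves whose actual length exceeds `upper` times the estimate."""
    runovers = sum(1 for estimate, actual in moves if actual > upper * estimate)
    return runovers / len(moves)


def hit_rate(moves, lower=0.70, upper=1.10):
    """Share of moves whose actual length falls within [lower, upper] times the estimate."""
    hits = sum(
        1 for estimate, actual in moves
        if lower * estimate <= actual <= upper * estimate
    )
    return hits / len(moves)


# Hypothetical (estimated_hours, actual_hours) pairs, for illustration only.
sample_moves = [(4.0, 4.2), (4.0, 5.0), (6.0, 3.0), (3.0, 3.2), (5.0, 4.0)]

print(runover_rate(sample_moves))  # one of five moves exceeds 110 percent
print(hit_rate(sample_moves))      # three of five moves land in the hit band
```

Note that the two metrics do not sum to one: a move that finishes in less than 70 percent of the estimate is neither a hit nor a runover.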
Strategy and feature additions
As a data team working in an industry that traditionally hasn’t leveraged data science, we realized that we needed to approach this problem in increments rather than attempt a large overhaul. We decided to update the default estimates shown to customers and then use new features that would both have an impact and also fit into the existing framework of our lookup table.
We added markets and move types to the lookup table, as they create a discernible effect — plus, during the exploratory-analysis phase we saw quick benefits of doing so. While these weren’t the only additional features we considered, we went into this with the knowledge that at some point the Cartesian product of the options under each feature would create such a massive lookup table that it would become inefficient to parse through. In our initial analysis, we noticed effects of the following features aside from those discussed above but were held back by size constraints:
● Booking lead time
● Booking day of week
● Booking time of day
● Day of week of move
● Time of day of move
● Number of stops
● Years spent by occupants at the property
● Square footage of the property
● Presence of elevators
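The size constraint mentioned above comes down to simple arithmetic: the number of rows in a lookup table is the product of the number of options under each feature. The sketch below uses made-up option counts (they are placeholders, not Bellhops figures) purely to show how quickly the table grows when a few of the listed features are added.

```python
from math import prod

# Hypothetical number of options per feature; every count here is an assumption.
base_features = {
    "property_size": 6,
    "property_type": 4,
    "market": 30,
    "move_type": 3,
}
extra_features = {
    "move_day_of_week": 7,
    "move_time_of_day": 4,
    "number_of_stops": 4,
    "has_elevator": 2,
}


def table_rows(features):
    """Rows in a lookup table keyed on the Cartesian product of all options."""
    return prod(features.values())


base_rows = table_rows(base_features)
all_rows = table_rows({**base_features, **extra_features})

print(base_rows)  # 2160
print(all_rows)   # 483840
```

Under these placeholder counts, four extra features multiply the table by 224, and most of the resulting cells would hold few or no historical moves.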
Product moment correlation coefficient
The Pearson correlation coefficient takes values in [-1, 1], with -1 signifying a perfect negative correlation, +1 signifying a perfect positive correlation, and 0 signifying no linear correlation between the two variables. We confirmed that the features we use in our default setting showed no strong linear correlation with one another, which prevented the unintentional use of a proxy variable and the unintended amplification of signal from pairs of proxy variables.
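For reference, the coefficient can be computed in a few lines of plain Python. This is a generic sketch of the standard formula (covariance divided by the product of the standard deviations); the function name and the sample columns are illustrative assumptions.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical columns, for illustration only.
square_footage = [600, 800, 1000, 1200]
number_of_stops = [1, 2, 3, 4]
print(pearson_r(square_footage, number_of_stops))  # perfectly linear toy data
```

A value near zero only rules out a linear relationship; two features can still be related in a nonlinear way, which is one reason to look at plots as well.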
Generalized pair plots
We observed that our feature set included categorical variables as well as quantitative ones. To seek out relationships between variables that have this fundamental difference, we ran a generalized pair plot, which gave us interesting insight into the data we had collected. For example, the booking lead hours for people moving from properties with an elevator had a tighter scatter than the lead hours for people moving from properties without an elevator.
Box and whisker plots
We dissected all existing features in relation to newly proposed features for the purpose of studying their effect on move length. For example, we charted box plots for the length of moves measured in man hours and noted the property type and service selection.
Empirical cumulative distribution function (ECDF)
An empirical cumulative distribution function is a nonparametric estimator that assigns an equal probability to each data point, orders the points from smallest to largest in value, and for each point calculates the sum of the assigned probabilities up to and including that point. Plotting these provided a descriptive sense of both the range of move lengths for each interaction of two features and the percentiles of the man hours for each interaction.
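A minimal sketch of an ECDF and of reading a percentile off it, in plain Python. The sample man-hour values are hypothetical, and the helper names are assumptions made for this illustration.

```python
def ecdf(values):
    """Return (value, cumulative probability) pairs, sorted by value.

    Each of the n observations gets probability 1/n. Tied values appear once
    per observation; the last pair of a tie carries the ECDF value for it.
    """
    xs = sorted(values)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]


def percentile_from_ecdf(values, q):
    """Smallest observed value whose cumulative probability is at least q."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    for x, p in ecdf(values):
        if p >= q:
            return x


# Hypothetical man hours for one combination of two features.
man_hours = [3, 4, 4, 5, 5, 6, 6, 7, 8, 10]

print(percentile_from_ecdf(man_hours, 0.5))  # 5
print(percentile_from_ecdf(man_hours, 0.8))  # 7
```

Because no distribution is assumed, the same two functions work for any feature combination, including ones with skewed or multimodal move lengths.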
Distribution bar charts
The addition of new service options introduced new variables that affected the length of moves, such as the time required to drive from a customer’s first location through all the stops to the final location. We noticed that every 10 minutes of drive time pushed about 10 percent of orders to run over, and that many orders had a drive time of less than 30 minutes. In response, we developed a new order flow that allows us to add small increments of time (15 minutes) to a move estimate. We also worked through user-interface solutions with the customer-experience (CX) engineering team to provide a granular, additive layer on top of the earlier crew-hour increments in booking.
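As an illustration of the 15-minute increments, the sketch below adds a drive-time allowance to a labor estimate. Rounding the drive time up to the next increment is an assumption made for this example; the text above only says that increments of 15 minutes can be added. All names are hypothetical.

```python
import math

INCREMENT_MINUTES = 15  # increment size mentioned in the text


def drive_time_allowance(drive_minutes, increment=INCREMENT_MINUTES):
    """Round a drive time up to the next whole increment (assumed rule)."""
    if drive_minutes < 0:
        raise ValueError("drive time cannot be negative")
    return math.ceil(drive_minutes / increment) * increment


def estimate_with_drive_time(labor_minutes, drive_minutes):
    """Labor estimate plus the rounded drive-time allowance, in minutes."""
    return labor_minutes + drive_time_allowance(drive_minutes)


# A 3-hour labor estimate with a 22-minute drive becomes 3.5 hours.
print(estimate_with_drive_time(180, 22))  # 210
```

Rounding up rather than to the nearest increment errs on the side of a longer estimate, which fits the goal of reducing runovers, but the opposite choice is just as easy to express.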
How do we ensure we’re providing the right recommendations?
We needed a system by which we could ascertain that any method we employed for estimating moves could generalize and sustain itself, and not be overfit to existing data. We planned on employing batch processing, so for back-testing we decided to train the model on the batch of data that would be available to us at the end of every week, with a look-back of two years, and then apply it to the orders a week ahead. We did this every week for a couple of months before resetting the defaults and noticed that we could’ve prevented approximately 10 percent more orders from running over. The effect over the previous year is even more significant. We back-tested our recommendations for the year prior to deployment and confirmed that our recommendations would have consistently reduced the weekly runover rate by 20 percent.
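The weekly back-testing scheme described above can be sketched as a walk-forward split: at each weekly cutoff, train on the preceding two years and evaluate on the following week. The code below only produces the date windows and splits a list of orders. The two-year look-back is approximated as 730 days, and the order representation is a hypothetical (move date, payload) tuple.

```python
from datetime import date, timedelta

LOOKBACK = timedelta(days=730)  # roughly two years (approximation)
HORIZON = timedelta(days=7)     # orders one week ahead


def weekly_backtest_windows(first_cutoff, n_weeks):
    """Yield (train_start, cutoff, test_end) for consecutive weekly cutoffs."""
    for week in range(n_weeks):
        cutoff = first_cutoff + week * HORIZON
        yield cutoff - LOOKBACK, cutoff, cutoff + HORIZON


def split_orders(orders, train_start, cutoff, test_end):
    """Train on orders before the cutoff, test on the week that follows it."""
    train = [o for o in orders if train_start <= o[0] < cutoff]
    test = [o for o in orders if cutoff <= o[0] < test_end]
    return train, test


# Hypothetical orders: (move_date, order_id).
orders = [
    (date(2017, 12, 30), "a"),
    (date(2018, 1, 3), "b"),
    (date(2018, 1, 9), "c"),
]

for train_start, cutoff, test_end in weekly_backtest_windows(date(2018, 1, 1), 2):
    train, test = split_orders(orders, train_start, cutoff, test_end)
    print(cutoff, [o[1] for o in train], [o[1] for o in test])
```

Keeping the training window strictly before the cutoff is what guards against leakage: no order in a test week can influence the estimates it is scored against.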
From the perspective of individual service options, for our best-selling service, which includes labor with a truck, we could have prevented another 20 percent of orders from running over during the months before resetting the defaults by making use of the new default estimates while still maintaining the hit rate, which has consistently been the more difficult metric to measure ourselves against using the lookup-table-based default-setting method.
In 2018, we focused on stability and reliability while also launching in a few metropolitan cities to test our ability to expand our markets. In a two-month testing period for those new markets—an example of which is Washington, D.C., in which we have much less data—our new estimates outperformed booked man hours by more than 20 percent. This is compelling evidence that we’re on the right track. For 2019, we have laid the groundwork for reliable, rapid expansion.
Initially, we aggregated the drive time between stops into the actual length of a move. The issue with this method was that it introduced noise to move-length targets that could not be accounted for consistently, such as traffic and weather conditions.
The back-testing process indicated that we could further reduce the runover rate, as well as increase accuracy, by using the actual drive time instead. In future iterations, drive time will be computed by leveraging Google’s API and then adding it to the estimate within the order flow.
User behavior, processes, and automating default settings
The tale: 834 orders were booked and executed to defaults, and 25 percent of them ran over. Another 1,449 orders were adjusted by the customer, and 37 percent of them ran over. Orders that were booked to default yielded an appeasement rate of about 2 percent, compared to 3.71 percent for orders whose estimates had been overridden.
Clearly, creating defaults that meet expectations benefits both the customer and the company, which raises the question: how do we convince customers to trust our defaults?
Were we really using the defaults in practice?
The answer is no. Sixty percent of orders were adjusted by the customer, and 80 percent of those adjustments were downward revisions. A similar trend was seen with customers who placed orders over the phone: seventy percent of those orders were revised, although the revisions alternated between increasing and decreasing man hours. After extending our analysis, we found that fewer customers revised their estimates in the four weeks after we deployed the new defaults.
The process to reduce user intervention and improve engineering response time
To reduce runovers and increase hits, we needed to do more than deliver a high-quality estimate; we also needed to work closely with the concierge/operations team and customer-experience (CX) engineering team to establish methods to provide more context to customers.
To make the case to the concierge team, we analyzed their performance estimating orders and then ran the numbers by them to illustrate the value of the recommended defaults. Additionally, we helped the CX engineering team design new pages in the order flow to set better expectations for the customer, and then assisted them with A/B tests to ascertain that conversion rate did not drop. In the process, we also prepared them to set up a backend process to consume new weekly defaults.
The process to improve data collection
Our work with the concierge team helped us prioritize a list of questions for them to ask while completing orders over the phone. We used methods discussed in the exploratory-analysis phase to validate the intuition behind those questions before prioritizing them as questions to be asked in the customer-facing order flow. This is a low-cost, high-impact solution to the cold-start problem of studying the effect of a new feature in estimation modeling, as the risk to the conversion rate is mitigated. The CX engineering team then added these questions to the online order flow.
We observed that, because of the variety of moves we completed, the behavior was fluid. We saw this as an opportunity to design an automated system that captures runovers caused by constantly changing probability distributions, instead of manually reviewing these evolving patterns every week and then changing defaults through a lengthy analysis-and-release process. For this, we used Airflow to trigger training and testing runs at the beginning of every week, and then we set the defaults by updating a Postgres database table based on a target lookup table consisting of all the features to be used in the following week.
● In our Raleigh-Durham-Chapel Hill market, from January to April 2018, we completed 80 full-service, 1-bedroom apartment/condo orders, which averaged a total of 5 man hours, with an 80th percentile of 7 hours. From April to June, we had 128 orders of the same category, but the average total man hours increased to 6 hours, with an 80th percentile of 8 hours.
● In Atlanta, from January to April 2018, we completed 27 full-service orders for 3-bedroom homes, which averaged a total of 5 man hours, with an 80th percentile of 9 hours. From April to June, we had 61 orders of the same category, but the average total man hours increased to 15 hours, with an 80th percentile of 25 hours.
This kind of disparity over small periods of time caused us to seek a more proactive solution through automation, which I will describe in part two of this series.