How Urban Company uses Machine Learning to improve reliability of the marketplace
By Resham Wadhwa (Data Scientist, Data)
At Urban Company, we act as a matchmaker between customers and service professionals for home services like salons and cleaning. Our goal is to find the right professional, ensure they accept the request, arrive on time, and deliver quality service that meets the customer’s expectations.
However, one of the most important aspects of fulfilling a request is the reliability of a service partner — will a partner chosen by Urban Company fulfill the promise?
Flexibility of professionals and customers
To understand why reliability is even a concern, we need to keep in mind that both UC customers and professionals have extreme flexibility in how they operate.
For customers, there are currently 24 possible slots on the app to choose from, running from 8 am to 7.30 pm. Slots are usually offered for up to 3 days, resulting in 24 x 3 = 72 possible slots.
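The slot arithmetic can be sketched as follows (the date is an arbitrary example; only the counts come from the text above):

```python
from datetime import datetime, timedelta

# Half-hour start times from 8:00 am to 7:30 pm give 24 slots per day.
day_start = datetime(2024, 1, 1, 8, 0)  # arbitrary example date
slots_per_day = [day_start + timedelta(minutes=30 * i) for i in range(24)]

# Slots are offered for up to 3 days, so one session sees 24 x 3 = 72 slots.
all_slots = [slot + timedelta(days=d) for d in range(3) for slot in slots_per_day]
```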
For partners, when a request is placed, we send out ‘leads’ — partners then may accept/decline the lead. As such, lead acceptance is an important step in fulfilling customer requests.
There are several reasons why a professional might not accept a given request: being delayed in a previous job, being physically or mentally fatigued, or unforeseen circumstances like bad weather. A small fraction of providers are also very selective about the type of jobs they want to do. The list goes on.
The Problem
Given this flexibility for both customers and providers, we often end up with requests placed by customers for which no provider is found or assigned. This situation is termed a No-Response, or NR. It not only causes customer anxiety but also works against the company's values. We want to minimize NR, i.e., fulfill every request that we accept. The higher the NR, the higher the number of unsatisfied customers and affected future services.
A simple solution would be to look at historic data, find the slot and order combinations that result in the most NRs, and simply stop showing those to customers. Problem solved, right? If we look only at NRs, then yes. But we cannot ignore the fact that most requests actually do get served across categories. Blocking all tricky slots would mean losing requests that could actually have been served: we lose customers by not showing them the slot they need, along with the revenue that would eventually have followed. We call such scenarios Requests Lost, or RL. The higher the RL, the higher the loss of potential business for UC. We have trained a predictive model that takes session data as input and returns the corresponding RL.
This is how we arrive at our problem statement: how can we minimize NR while keeping RL in check?
The solution? Let's only show customers the slots for which the probability that at least one available provider will accept the job is high. We call this probability the Request Score.
Request Score
Before going into details of Request Score, here are a few terms that are relevant for deriving the score.
Lead is the notification of a customer's request, carrying the information a provider needs to decide whether to accept or reject it.
Lead Score (LS) of a lead is the probability that a provider p will accept a lead of request R with features as known at the time of request creation.
Request Score (RS) of a slot shown to the customer is the probability that at least 1 available provider eligible for the request will accept the request lead.
In the most basic form of the model, the Request Score for a slot can be calculated as follows -
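Assuming each eligible provider's accept/decline decision is independent, the definitions above give, for a slot with eligible provider set $P$:

```latex
RS = P(\text{at least one provider accepts}) = 1 - \prod_{p \in P} \left(1 - LS_p\right)
```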
Lead Score Model
Lead Score is the fundamental block to calculate reliability of any slot.
Here, Lead Score is calculated using a Machine Learning pipeline and subsequently, Lead Score becomes an input to the Request Score.
Constraints and Restrictions
a. Explainability : Business teams continuously monitor and improve the systems. If we don't fan out a lead to a provider because of a low score, the business team needs to know what can be done to improve that provider's score. Hence, explainability is important.
b. Point-in-time data : We have to make the decision at the moment slots are shown, so all data used for training needs to carry the values as they were at request creation time.
c. Performance : For any slot, 15 or more providers can be eligible. So at the time of showing slots, we need to calculate roughly 24 (slots) x 15 (providers) x 3 (days) = ~1100 scores for one customer session. Thus, our response time needs to be in milliseconds.
d. Scalability : UC is frequently expanding with newer products and categories. We wanted a solution that doesn't go into the specifics of categories but uses features that are applicable and available for all eligible categories.
e. Matchmaking filters : Not every eligible provider is shown the lead — matchmaking systems have multiple filtering criteria in place (e.g. skill-based filtering: facial vs. pedicure experts), so slot-level data for every provider isn't necessarily available.
f. Feature freshness : Not all features are available with real-time updated values, so we have to live with stale values for some features.
Each of these constraints poses its own challenge, but every challenge also makes this a more interesting problem to solve.
Feature thinking
Since this revolves around three entities — providers, requests and the matchmaking system — we needed to understand the thinking that goes into a provider's decision to accept or reject a lead.
To come up with a first list, we called several randomly chosen providers across categories and asked about their reasons for not accepting leads in the preceding few days. This was in addition to looking at the reasons providers select when rejecting a lead.
Provider’s point of view
The major reasons were, understandably, the distance between their home or last job and the customer's location. The second reason was not having the inventory required to deliver the service. Other reasons included sudden family problems, sickness and problems with a customer's behaviour. This was very helpful, but it seemed that providers would only give politically correct answers and wouldn't talk about reasons where they were at fault, like cherry-picking jobs.
Third person’s point of view
Next, we surveyed UC tech employees, category teams and product managers, asking them to think of reasons why they wouldn't take a job. This was very insightful and led us to consider the provider's historical behaviour pattern on the platform as a feature.
By the end of this activity, we were left with 20 features to use for EDA and training. Our amazing Data Analytics team helped us pipeline these features with data accurate as of the time of request creation. The derived tables contained not only the actual historic leads but also simulated leads for providers who were eligible for the slots but didn't actually receive a lead.
Training Criteria
Since we have historic data on lead details, providers, requests and system states, along with the actual acceptance of each lead, this is a supervised learning problem. The target is whether the lead was accepted (1) or rejected (0), so we modelled it as a classification problem.
Evaluation Criteria
Before training any model, it is important to decide on an evaluation metric, and to decide on a metric, it's important to understand how the score will be used. A classification score is a fraction between 0 and 1. It can be used in absolute terms, as we do, or a threshold can be applied to turn it into a decision. The right evaluation metric differs depending on which of these you do.
We want to use the scores in their raw form — as what they are — probabilities. So for us, it's important that all leads with a 0.5 lead score have a 50% acceptance rate. For this, we use the weighted absolute error of grouped lead scores.
Following are the steps to calculate error (and hence performance)
- Group the leads into lead score buckets: 0 to 10, ..., 90 to 100.
- Get the weighted average absolute error for each bucket.
Given the criticality of the score's application, the acceptance criterion for any model is an error of less than 5% (percent because we multiply the scores by 100) — the lower the better. This also becomes our criterion for comparing multiple models and solutions.
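The two steps above can be sketched as follows (the exact bucket-edge handling and weighting are our assumptions):

```python
import numpy as np

def grouped_weighted_absolute_error(scores, accepted, bucket_size=10):
    """Bucket lead scores (as percentages) and compare each bucket's
    mean predicted score with its observed acceptance rate.

    scores: predicted acceptance probabilities in [0, 1]
    accepted: 0/1 lead acceptance outcomes
    Returns the count-weighted mean absolute error in percentage points.
    """
    scores = np.asarray(scores, dtype=float) * 100.0
    accepted = np.asarray(accepted, dtype=float) * 100.0
    edges = np.arange(0, 100 + bucket_size, bucket_size)
    # Assign each score to a bucket; scores of exactly 100 go in the top bucket.
    buckets = np.clip(np.digitize(scores, edges) - 1, 0, len(edges) - 2)
    errors, weights = [], []
    for b in range(len(edges) - 1):
        mask = buckets == b
        if mask.any():
            errors.append(abs(scores[mask].mean() - accepted[mask].mean()))
            weights.append(mask.sum())
    return float(np.average(errors, weights=weights))
```

A perfectly calibrated model scores 0; a model that predicts 0.9 for leads that are never accepted scores 90.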
Why a grouped error ?
Since we are looking at probability predictions, the more samples per group, the more confident we can be in the metric. However, if the buckets are too large, the metric generalizes to a lower error; if they are too small, some buckets rest on very few data points and aren't reliable. We observed the trends in the data and chose a bucket size of 10pp.
Exploratory Data Analysis (EDA)
So far, we have a gut feeling about the selected features. Exploratory Data Analysis (EDA) is a critical initial step in understanding the data before building any machine learning models. It helps in understanding the data distribution, identifying patterns, and finding potential issues such as missing values, outliers, or imbalanced classes.
As standard practice, we performed all the standard tests on the training data. Below are the charts from two of those tests -
Data Distribution
Nullity Correlation
Feature Engineering
Many of the features from the final 20-feature list made sense intuitively but needed processing to make sense quantitatively.
For instance, distance is an important piece of information. However, it comes with constraints -
- Distance is relative to a reference location. Is it the distance from the provider's current location to the customer's place, or from the provider's home?
- Distance is incomplete without location context. A distance of 5 km is nothing in a city like Gurgaon, translating to a travel time of 15–20 minutes, but in a smaller city it can mean up to 40–45 minutes, and in Bangalore traffic it can easily take an hour.
Hence, we replaced it with the relative location group of the provider: whether, at request creation, the provider's last known location (within the last 30 minutes) was another job, their home, or Unknown (no ping in the last 30 minutes). The following chart shows the different values created for the location feature.
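A sketch of how such a grouping might be derived — the group names, inputs and 30-minute window here are our assumptions for illustration, not UC's actual schema:

```python
from datetime import datetime, timedelta

def location_group(last_ping_time, last_ping_source, now=None):
    """Map a provider's last known location ping to a categorical feature.

    last_ping_time: timestamp of the most recent location ping, or None
    last_ping_source: hypothetical label for the ping, e.g. "job" or "home"
    """
    now = now or datetime.utcnow()
    # No ping in the last 30 minutes -> location is effectively unknown.
    if last_ping_time is None or now - last_ping_time > timedelta(minutes=30):
        return "UNKNOWN"
    if last_ping_source == "job":
        return "AT_LAST_JOB"
    if last_ping_source == "home":
        return "AT_HOME"
    return "OTHER"
```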
EDA on these categories gave us confidence in the grouping, which showed great variation in lead acceptance across groups. Similar engineering was applied to other raw data features. By the end of this exercise, we had three categories of features, some of which are listed in the table below -
Training, Testing and Evaluation
As explainability was an important requirement, we began training with the most basic logistic regression model, which gave us feature importances as well as feature weights. The interesting observation was that the weighted error on the validation data was merely 2%. But we didn't go ahead and use it. Why?
Because of the following error table -
Overall WAE = 1.6% — isn't this great? Well, no. Like every other data scientist, a model this good on the first attempt gave me anxiety instead of making me happy.
If you look closely, you will see that while the logistic model worked beautifully at predicting acceptance for good leads, it failed badly on bad leads — those where the probability of acceptance is very low. If we want to block bad slots, identifying bad leads is a crucial, unstated step.
So in this case, the 1.6% error is unacceptable and irrelevant to us.
Final Model
We finally trained an XGBoost classification model with hyperparameter tuning via Amazon SageMaker. After several iterations of feature selection (forward selection) and feature changes, we arrived at a stable model for our 22 supercategories, all with WAE < 3% for lead acceptance prediction.
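A minimal sketch of the training step on synthetic data — note that scikit-learn's gradient boosting is used here as a stand-in for XGBoost, and the features and dataset are invented purely for illustration:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the lead dataset: a few engineered features and a
# 0/1 "lead accepted" target correlated with the first feature.
X = rng.normal(size=(2000, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=2000) > 0).astype(int)

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Gradient-boosted trees (shown here in place of the XGBoost classifier
# tuned on SageMaker in production).
model = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=0)
model.fit(X_train, y_train)

# predict_proba gives the acceptance probability, i.e. the lead score.
lead_scores = model.predict_proba(X_val)[:, 1]
```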
Consuming the Lead Score : Request Score V0
Now we had a pretty good model that could predict whether a given provider would accept an exclusive lead for a given request; the next step was to decide whether a slot should be opened or blocked.
As the most basic version, v0, we considered just the probability scores -
So, for every slot, we needed -
- The list of all available providers
- Their features
- Predicted lead scores
- An evaluation of the aforementioned probability equation
While we knew that slot acceptance depends on more than individual lead acceptances, we didn't want to stay blocked while collecting features and pipelining data for a more complex Request Score model.
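Under the independence assumption, the v0 computation reduces to a few lines. This is a sketch: the 0.9 threshold below is purely illustrative, the real cut-off being a business decision.

```python
def request_score(lead_scores):
    """V0 Request Score: probability that at least one eligible provider
    accepts, assuming independent accept/decline decisions.

    lead_scores: per-provider acceptance probabilities for this slot.
    """
    p_all_decline = 1.0
    for ls in lead_scores:
        p_all_decline *= (1.0 - ls)
    return 1.0 - p_all_decline

def slot_is_open(lead_scores, threshold=0.9):
    # Hypothetical threshold; in practice this is a business decision.
    return request_score(lead_scores) >= threshold
```

For example, two providers with lead scores of 0.5 each give a Request Score of 1 - 0.5 x 0.5 = 0.75 for the slot.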
Trade off between NR and RL
Now that we are ready to block slots, we are excited to reduce NR. But we cannot forget that we will cause some RL as well.
What is the right balance between the two? After launching, how do we know whether we are losing more than we gain, or sitting at the right balance?
Cost Function
This is where causal ML comes to our rescue. Using a causal inference model developed by the Data Science team, we were able to establish the long-term effect of both NR and RL on customers over the next 3 and 6 months.
While over 3 months NR had a higher impact on a customer's lifetime value (CLTV) to Urban Company, over 6 months both had a similar impact on CLTV. How does this information help us?
It means that for every blocked bad request, we cannot afford to lose more than one good request. Mathematically,
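In our notation (not the original formula), the finding that one NR and one RL have a similar 6-month CLTV cost means blocking is worthwhile only when:

```latex
\mathbb{E}\left[\Delta NR_{\text{saved}}\right] \;\ge\; \mathbb{E}\left[\Delta RL_{\text{caused}}\right]
```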
A/B experiment
With one of the most popular categories across cities, Salon Prime, we launched the Request Score model (RSM) in half the hubs, with the other half as a control group. Once we saw an immediate improvement in the numbers, we scaled it up pan-India. Our product managers took care of the experiments and their evaluation with respect to the business metrics, NR and RL, while the data science team tracked the model side of things.
Scaling any experiment across different categories means keeping track of the following -
- Increased throughput: ensuring our hosted model and micro-service are ready for the increased load
- Response times
- Payload sizes, given that we predict for 1200+ provider x slot combinations at the lead score level
- Errors and feature values
- Data drift
- Relevant alerts for all of the above
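One common way to track the data drift mentioned above is the Population Stability Index; a minimal sketch (the bin count and smoothing constant are our choices, not UC's production setup):

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a reference (training-time) feature distribution and a
    live one. Values near 0 mean stable; a common rule of thumb flags
    PSI > 0.2 as significant drift.
    """
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_counts, _ = np.histogram(expected, bins=edges)
    a_counts, _ = np.histogram(actual, bins=edges)
    # Smooth to avoid division by zero and log(0) in empty bins.
    e_pct = (e_counts + 1e-6) / (e_counts.sum() + bins * 1e-6)
    a_pct = (a_counts + 1e-6) / (a_counts.sum() + bins * 1e-6)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))
```

An alert would fire when the PSI of any model feature crosses the chosen threshold.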
Thanks to our ever-active data science and data platform teams, this was rarely an issue.
Results
The NR% in Salon Prime for India came down by 2pp by the end of April (v0 fully scaled to India).
However, since we had started blocking slots, RL was expected to increase.
Looking at the success in Salon Prime, we scaled the model across categories. Similar trends in cost functions and NR savings were observed.
However, there was still room for improvement, which is why we trained a more complex model that consumes lead scores and produces a slot reliability score: the Request Score.
Wait for Part 2 for the technical details.
About the author
Resham Wadhwa is a Staff Data Scientist at Urban Company. She leads the Data Science team with a passion for turning information into actionable insights. Her love for exploration extends beyond the data world — you might find her behind the camera, planning her next travel adventure or immersed in a captivating novel.
Sounds like fun?
If you enjoyed this blog post, please clap 👏(as many times as you like) and follow us (@UC Blogger). Help us build a community by sharing on your favourite social networks (Twitter, LinkedIn, Facebook, etc).
If you are interested in finding out about opportunities, visit us at http://careers.urbancompany.com