How Urban Company uses Machine Learning to improve reliability of the marketplace

By Resham Wadhwa (Data Scientist, Data)

UC Blogger
Urban Company – Engineering
12 min read · May 9, 2024

--

At Urban Company, we act as a matchmaker between customers and service professionals for home services like salons and cleaning. Our goal is to find the right professional, ensure they accept the request, arrive on time, and deliver quality service that meets the customer’s expectations.

However, one of the most important aspects of fulfilling a request is the reliability of a service partner — will a partner chosen by Urban Company fulfill the promise?

Flexibility of professionals and customers

To understand why reliability is even a concern, we need to keep in mind that both UC customers and professionals have extreme flexibility in how they operate.

For customers, there are currently 24 possible slots on the app to choose from, running from 8 am to 7.30 pm. Slots are usually offered for up to 3 days, resulting in 24 x 3 = 72 possible slots.

Customers can choose from a number of slots throughout the day.

For partners, when a request is placed, we send out ‘leads’, which partners may then accept or decline. As such, lead acceptance is an important step in fulfilling customer requests.

Providers get a notification of an incoming request with an option to Accept or Decline

There are several reasons why a professional might not accept a given request: being delayed in a previous job, being physically or mentally fatigued, or unforeseen circumstances like bad weather. A small fraction of providers are also very selective about the type of jobs they want to do. The list goes on and on.

The Problem

Given this flexibility for both customers and providers, we often end up with requests placed by customers for which no provider is found or assigned to the job. This situation is termed No-Response, or NR. It not only causes anxiety for the customer but also works against the company’s values. We want to minimize NR, i.e., fulfill every request that we accept. The higher the NR, the higher the number of unsatisfied customers and affected future services.

A simple solution would be to look at historic data, find the slot and order combinations that result in the most NRs, and simply stop showing those to customers. Problem solved, right? If we look only in terms of NRs, then yes. But we cannot ignore the fact that most requests actually do get served across categories. Blocking all tricky slots would mean losing requests that could actually have been served: we lose the customers by not showing them the slot they need, and the revenue that would eventually have come with them. We call such request scenarios Requests Lost, or RL. The higher the RL, the higher the loss of potential business for UC. We have trained a predictive model that takes session data as input and returns the corresponding RL.

If we take a request (slot), it can lead to NR. If we do not take that request, it can lead to RL.

That is how we land on our problem statement: how can we minimize NR while keeping RL under control?

Solution? Let’s only show customers the slots for which the probability that at least one available provider will accept the job is high. We call this probability the Request Score.

Request Score

Before going into details of Request Score, here are a few terms that are relevant for deriving the score.

Lead is the term given to the notification of a customer’s request, containing the information a provider needs to decide whether to accept or reject it.

Lead Score (LS) of a lead is the probability that a provider p will accept a lead of request R with features as known at the time of request creation.

Request Score (RS) of a slot shown to the customer is the probability that at least 1 available provider eligible for the request will accept the request lead.

In the most basic form of the model, the Request Score for a slot can be calculated as follows -

Mathematical notation for Request score and Lead Score calculation
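
In plain terms (our reading of the formula above, which assumes providers decide independently), the Request Score is one minus the probability that every available provider declines: RS(slot) = 1 − (1 − LS₁)(1 − LS₂)…(1 − LSₙ), where LS₁ … LSₙ are the Lead Scores of the n providers available and eligible for that slot.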

Lead Score Model

Lead Score is the fundamental block to calculate reliability of any slot.

Relationship between Request Score and Lead Score

Here, the Lead Score is calculated using a Machine Learning pipeline and then becomes an input to the Request Score.

Constraints and Restrictions

The following constraints apply to the Request Score and, by extension, the Lead Score models.

a. Explainability: Business teams continuously monitor and improve these systems. If we don’t fan out a lead to a provider because of a low score, the business team needs to know what can be done to improve that provider. Hence, explainability is important.

b. Point-in-time data: We take the decision at the moment slots are shown, so all data used for training needs to have values as of the request creation time.

c. Performance: For any slot, 15 or more providers can be eligible. So at the time of showing slots, we need to calculate 24 (slots) x 15 (providers) x 3 (days) ≈ 1,100 scores for a single customer session. Thus, our response time needs to be in milliseconds.

d. Scalability: UC is frequently expanding into newer products and categories. We wanted to build a solution that doesn’t depend on category specifics but uses features that are applicable and available for all eligible categories.

e. Matchmaking filters: Not every eligible provider is shown the lead; matchmaking systems have multiple filtering criteria in place (e.g., skill-based filtering: facial vs. pedicure experts), so slot-level data for every provider isn’t necessarily available.

f. Feature freshness: Not all features are available with real-time updated values, so we have to live with stale values for some of them.

Each of these constraints poses its own challenge in building a solution, but every added challenge also makes it a more interesting problem to solve.

Feature thinking

Since this revolves around three entities (providers, requests, and the matchmaking system), we needed to understand the core thinking that goes into a provider’s decision to accept or reject a lead.

To come up with a first list of features, we called several randomly chosen providers across categories and asked about their reasons for not accepting leads in the last few days. This was in addition to looking at the reasons providers select when rejecting a lead.

Provider’s point of view

The major reason, understandably, was the distance between their home or last job and the customer’s place. The second reason was not having the required inventory to deliver the service. Other reasons included sudden family problems, sickness and problems with customer behavior. This was very helpful, but it seemed providers would only give politically correct answers and wouldn’t talk about the reasons where they themselves were at fault, like cherry-picking jobs.

Third person’s point of view

Next, we surveyed UC tech employees, category teams and product managers, asking them to think of reasons why they wouldn’t take a job. This was very insightful and led us to consider the provider’s historical behavior pattern on the platform as a feature.

Filtering out features at different levels.

By the end of this activity, we were left with 20 features to use for EDA and the training process. Our amazing Data Analytics team helped us pipeline these features with accurate values as they were at the time of request creation. The derived tables consisted not only of actual historic leads but also of simulated leads for providers who were eligible for the slots but didn’t actually get the leads.

Training Criteria

Since we have historic data of lead details, providers, requests and system states, along with the actual acceptance of the leads, this is a supervised learning problem. The target is whether the lead was accepted or rejected: 1 for an accepted lead, 0 for a rejected one. We modelled this as a classification problem.

Evaluation Criteria

Before training any model, it is important to decide on an evaluation metric, and to decide on a metric, it’s important to understand how the score will be used. A classification score is a fraction between 0 and 1. It can either be used in absolute terms, as we do, or thresholded into a decision. The right evaluation metric differs based on which of the two you do.

We want to use the scores in their raw form, as what they are: probabilities. So for us, it’s important that leads with a 0.5 lead score actually show a 50% acceptance rate. For this, we use the weighted absolute error of the grouped lead scores.

Following are the steps to calculate the error (and hence performance) -

  1. Group the leads into lead score buckets: 0 to 10, 10 to 20, ..., 90 to 100.
  2. Take the absolute error between predicted and observed acceptance for each bucket, and average across buckets weighted by the number of leads in each bucket.
Mathematical equation for the error metric used to evaluate Lead Score Model
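
As a rough sketch of this metric (with synthetic, illustrative data; the actual pipeline and column names differ), the bucketed weighted absolute error can be computed as follows:

    import numpy as np
    import pandas as pd

    # Illustrative data: one row per historic lead, with the model's predicted
    # lead score and the observed outcome (1 = accepted, 0 = rejected).
    rng = np.random.default_rng(0)
    scores = rng.uniform(0, 1, 10_000)
    leads = pd.DataFrame({
        "lead_score": scores,
        "accepted": rng.binomial(1, scores),  # a perfectly calibrated toy model
    })

    # Step 1: group leads into score buckets 0-10, 10-20, ..., 90-100
    leads["bucket"] = (leads["lead_score"] * 10).astype(int).clip(upper=9)

    # Step 2: per-bucket absolute error between predicted and observed acceptance,
    # averaged across buckets with weights equal to the number of leads per bucket
    g = leads.groupby("bucket").agg(predicted=("lead_score", "mean"),
                                    observed=("accepted", "mean"),
                                    n=("accepted", "size"))
    wae = ((g["predicted"] - g["observed"]).abs() * 100 * g["n"]).sum() / g["n"].sum()
    print(f"WAE = {wae:.2f}%")  # acceptance criterion in the text: < 5%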

Given how critical the score’s application is, the acceptance criterion for any model is that this error should be less than 5% (percent, because we multiplied the scores by 100); the lower the better. This also becomes our criterion for comparing multiple models and solutions.

Why a grouped error?

Since we are looking at probability predictions, the more samples per group, the more confident we can be about the metric. However, if the buckets are too large, errors average out and the metric understates them; if the buckets are too small, some will be based on very few data points and hence won’t be reliable. We observed the trends in the data and took the bucket size to be 10pp.

Exploratory Data Analysis (EDA)

So far, we have a gut feeling about the selected features. Exploratory Data Analysis (EDA) is a critical initial step in understanding the data before building any machine learning models. It helps in understanding the data distribution, identifying patterns, and finding potential issues such as missing values, outliers, or imbalanced classes.

As standard practice, we perform the usual checks and tests on the training data.

Following are the charts of two of those tests -

Data Distribution

Percent of requests generated in different slots during the day for a category. This shows that most requests are placed by the evening, with relatively fewer requests booked for early morning or night. The top preferred slot among customers is the noon slot.

Nullity Correlation

Observed nullity correlation between the independent features. If the duration of a request is missing in the records, the end hour is also missing, with a nullity correlation value of 1, which validates the hypothesis (and the formula). The lack of any other strong correlation shows that no two columns have a strong relationship with respect to missing values; there is no pattern in the null values of the features.
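
For reference, a nullity-correlation check of this kind can be reproduced with plain pandas (the data frame and column names below are made up for illustration); libraries such as missingno render the same matrix as a heatmap:

    import numpy as np
    import pandas as pd

    # Toy frame: duration and end_hour are always missing together,
    # while distance goes missing independently of them.
    df = pd.DataFrame({
        "duration": [1.0, np.nan, 2.0, np.nan, 1.5],
        "end_hour": [10.0, np.nan, 14.0, np.nan, 18.0],
        "distance": [np.nan, 3.0, 5.0, 2.0, np.nan],
    })

    # Correlation between the missingness indicators of each column.
    # duration vs end_hour comes out as 1.0: they are always missing together.
    null_corr = df.isnull().astype(int).corr()
    print(null_corr)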

Feature Engineering

Many of the features from the final 20 feature list made sense intuitively but needed processing in order to make sense quantitatively.

For instance, distance is an important piece of information. However, there are some constraints to it -

  1. Distance is relative to a reference location. Is it the distance of the provider’s current location from the customer’s place, or of the provider’s home location from the customer’s place?
  2. Distance has incomplete relevance for us without location context. A distance of 5 km is nothing in a city like Gurgaon, translating to a travel time of 15–20 mins, but in a smaller city this can mean a travel time of up to 40–45 mins. In Bangalore traffic, it can easily take an hour.

Hence, we replaced it with the relative location group of the provider at the time of request creation: whether their last known location (within the last 30 mins) was another job, their home, or Unknown (no ping in the last 30 mins), and so on. The following chart shows the different values created for the location feature.

Instead of using multiple features to indicate distance, a categorical column was engineered.
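
A minimal sketch of this kind of feature engineering (the actual groups, location sources and thresholds used in our pipeline may differ) could look like this:

    from datetime import datetime, timedelta, timezone

    def location_group(last_ping_time, last_ping_source, now=None):
        """Bucket a provider's last known location into a categorical feature.

        last_ping_source says where the last ping came from (e.g. "job" or "home").
        The groups and the 30-minute threshold here are illustrative.
        """
        now = now or datetime.now(timezone.utc)
        if last_ping_time is None or now - last_ping_time > timedelta(minutes=30):
            return "UNKNOWN"            # no ping in the last 30 minutes
        if last_ping_source == "job":
            return "AT_ANOTHER_JOB"     # last seen at a previous job
        if last_ping_source == "home":
            return "AT_HOME"            # last seen at their home location
        return "OTHER"

    # e.g. location_group(datetime.now(timezone.utc) - timedelta(minutes=10), "home") -> "AT_HOME"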

EDA on these categories helped us establish confidence in the grouping, which showed a large variation in lead acceptance. Similar engineering was applied to other raw data features as well. By the end of this exercise, we had three categories of features, with some of the features listed in the table below -

EDA resulted in a table with all features categorised in these three segments.

Training, Testing and Evaluation

As explainability was an important aspect, we began our training with the most basic Logistic Regression model, which gave us feature importances as well as feature weights. The interesting observation was that the weighted error on the validation data was merely 2%. But we didn’t go ahead and use it. Why?

Because of the following error table -

Weighted Absolute Percentage Error for Lead Score Model using a simple model.

Overall WAE = 1.6%. Isn’t this great? Well, no. Like it would for any other data scientist, a model this good on the first attempt gave me anxiety instead of making me happy.

If you look closely, you will see that while the logistic model worked beautifully in predicting acceptance for good leads, it failed badly at predicting the bad leads, where the probability of acceptance is very low. If we want to block the slots that are bad, identifying bad leads is a crucial, if unstated, step.

So in this case, the error of 1.6% is unacceptable and irrelevant to us.

Final Model

We finally trained an XGBoost classification model using hyperparameter tuning via Amazon SageMaker. After several iterations of feature selection (forward selection) and feature changes, we arrived at a stable model for our 22 supercategories, all with WAE < 3% for Lead Acceptance Prediction.
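
As an illustrative stand-in for that setup (plain open-source XGBoost with a scikit-learn randomized search and synthetic data, rather than the actual SageMaker tuning job or our real features), the training step looks roughly like this:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import RandomizedSearchCV
    from xgboost import XGBClassifier

    # Synthetic stand-in for the ~20 engineered lead features and the 0/1 acceptance target
    X_train, y_train = make_classification(n_samples=5000, n_features=20, random_state=42)

    # Randomized search over a small XGBoost hyperparameter space
    search = RandomizedSearchCV(
        estimator=XGBClassifier(objective="binary:logistic", eval_metric="logloss"),
        param_distributions={
            "max_depth": [3, 4, 5, 6],
            "learning_rate": [0.01, 0.05, 0.1],
            "n_estimators": [100, 300, 500],
            "subsample": [0.7, 0.85, 1.0],
        },
        n_iter=20,
        scoring="neg_log_loss",  # calibration-friendly loss; the bucketed WAE is checked separately
        cv=3,
        random_state=42,
    )
    search.fit(X_train, y_train)
    lead_score_model = search.best_estimator_  # predict_proba(...)[:, 1] gives the Lead Score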

Consuming the Lead Score: Request Score V0

Now that we had a pretty good model that could estimate whether an exclusive lead for a given request, if fanned out to a given provider, would be accepted, the next step was to decide whether a slot should be opened or blocked.

As the most basic version, v0, we considered just the probability scores -

Mathematical relationship between Request score and the Lead Scores

So, for every slot, we needed to have -

  1. List of all available providers
  2. Their features
  3. Their predicted Lead Scores
  4. The aforementioned probability equation evaluated over those scores (a sketch putting these together follows).
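
Put together, v0 amounts to something like the sketch below. The provider lookup, the point-in-time feature fetch and the trained Lead Score model are assumed to exist; all names here are illustrative:

    import numpy as np

    def request_score_v0(slot, eligible_providers, features_at_request_time, lead_score_model):
        """Request Score v0: probability that at least one eligible provider accepts."""
        # 1-2. available providers and their point-in-time features
        X = np.vstack([features_at_request_time(p, slot) for p in eligible_providers])
        # 3. predicted Lead Scores
        lead_scores = lead_score_model.predict_proba(X)[:, 1]
        # 4. the probability equation: 1 - P(every provider declines)
        return 1.0 - np.prod(1.0 - lead_scores)

    # A slot is then shown to the customer only if its score clears a threshold:
    # show_slot = request_score_v0(slot, providers, feature_fn, model) >= THRESHOLD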

While we knew there are more constraints to slot acceptance than mere lead acceptance, we didn’t want to stay blocked while collecting features and pipelining data for a more complex Request Score model.

Trade-off between NR and RL

Now that we are ready to block slots, we are excited to save NRs. But we cannot forget that we will be causing some RL as well.

Notation for interpreting the impact of the experiment

What is the right balance between the two? And after launching, how do we know whether we are losing out more than we save or have struck the right balance?

Cost Function

This is where causal ML comes to our rescue. Using a causal inference model developed by the Data Science team, we were able to establish the long-term effect of both NR and RL on customers over the next 3 and 6 months.

While over 3 months NR had a higher impact on Customer Lifetime Value (CLTV) to Urban Company, over 6 months both had a similar impact on CLTV. How does this information help us?

It means for every blocked bad request, we cannot lose more than 1 good request. Mathematically,

Cost function to understand and evaluate the system status.
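
Our reading of this cost function, based on the finding above: since a saved NR and a lost request end up roughly equal in long-term CLTV impact, blocking remains worthwhile only while the requests lost to blocking stay at or below the NRs saved, i.e. RL incurred / NR saved ≤ 1.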

A/B experiment

With one of the most popular categories across cities, Salon Prime, we launched the Request Score model (RSM) in half the hubs while treating the other half as the control group. Once we saw an immediate improvement in the numbers, we soon scaled it up pan-India. Our product managers took care of the experiments and their evaluation wrt the business metrics, NR and RL, while the data science team tracked the model side of things.

Scaling any experiment across different categories means keeping track of the following -

  1. Increased throughput: ensuring our hosted model and micro-service are ready for the increased load.
  2. Tracking response times
  3. Tracking payload sizes, given we predict for 1200+ provider x slot combinations at the Lead Score level.
  4. Tracking errors and feature values
  5. Tracking data drift
  6. Setting up relevant alerts for all of the above.

Thanks to our ever-so-active data science and data platform teams, this was rarely an issue.

Results

The NR% in Salon Prime for India came down by 2pp by the end of April (with v0 fully scaled across India).

Model’s performance wrt NR and RL in production after stabilisation of the experiment.

However, since we started blocking slots, RL was expected to increase.

Cost function evaluated for the experiment wrt Salon Prime Pan India.

Looking at the success of Salon Prime, we scaled it across categories.

Similar trends of cost functions and NR savings were observed.

However, there was still room for improvement, which is where we trained a more complex model that consumes the lead scores and produces a slot reliability score, i.e., the Request Score.

Wait for Part 2 for the technical details.

About the author

Resham Wadhwa is a Staff Data Scientist at Urban Company. She leads the Data Science team with a passion for turning information into actionable insights. Her love for exploration extends beyond the data world — you might find her behind the camera, planning her next travel adventure or immersed in a captivating novel.

Sounds like fun?
If you enjoyed this blog post, please clap 👏(as many times as you like) and follow us (@UC Blogger). Help us build a community by sharing on your favourite social networks (Twitter, LinkedIn, Facebook, etc).

If you are interested in finding out about opportunities, visit us at http://careers.urbancompany.com
