How Urban Company is using Microsoft’s cognitive APIs to build Heimdall — the gatekeeper!

By — Rahul Teotia (Product, Fraud.Identity.Safety.Trust. aka. FIST)

UC Blogger
Urban Company – Engineering
14 min read · Jan 29, 2020

This post is from Urban Company’s service delivery experience vertical, where we solve customer and partner problems so that every service delivered on the UC platform is a stellar, consistent experience.

Today, we will walk you through our journey over the last few weeks, in which we built a partner identity verification flow to ensure that only verified and trained partners deliver services to our customers.

At UC, this project is code-named Heimdall — the gatekeeper, which will become clear after reading this post :)

Source: https://sciencefiction.com/

Like always at UC, let us start with — the Why?

What is the big deal anyway?

We think sometimes visual imagery speaks much louder than words, so we will let this meme do the talking for us.

Source: “You” TV series on Netflix

For those who aren’t binge watchers, “Joe Goldberg” is a character played by Penn Badgley in the show “You” on Netflix. Joe, to put it mildly, is a charming gentleman with nice intentions but sometimes horrifying actions.

So, basically, you never ever want someone else to show up at a customer’s house. Every service must be delivered by a trained UC professional, in a safe manner.

On rare occasions, professionals send someone else on their behalf for very minor reasons. Once, a professional sent his brother because he was occupied with something else, did not want to miss the earnings from the job, and did not want to be charged the no-show penalty that applies when a professional fails to show up at the customer’s house after accepting a job. That professional is now permanently barred from the UC platform; we have a zero tolerance policy for such things.

So, we wanted to put an automated identity verification check on every job. We are a growing services marketplace that will be delivering more than 10 million jobs per month by 2022, in 30+ cities across multiple countries, and at that scale it will become very difficult to control such instances.

We invest a lot in on-boarding our partners:

  1. Screening: A screening test when they walk in which includes personality assessment and a basic skill assessment for the professional
  2. Training: We provide 1–2 weeks of training, where professional is up-skilled on UC’s standardized way of delivering services with safe practices while creating customer delight. It has classroom training, behavioral training, on-job scenarios, mock diagnosis for appliances, mock customer interactions, App training etc. This is provided by trainers with over 10–15 years of hands on experience and experience in training other professionals.
  3. Assessment: After training, a detailed assessment is done on pro’s hard and soft skills. Only those who match UC’s world-class delivery standards are on-boarded on the platform
  4. Background checks: Complete background verification done by a third party which includes a series of criminal databases and address verification.

All this effort is done to ensure a safe and world-class service experience to our customers.

After putting in so much effort, the last thing you want is for someone else to show up at the customer’s doorstep!

OTP and Selfie check

To avoid this, from the very start we put an OTP check before the job begins: only the UC professional can start the job, after taking the start-job OTP from the customer. Later on, we started capturing selfies of professionals on the job, which were checked manually on a sampling basis. Combined with the zero tolerance policy, this is also a massive deterrent for professionals.

However, these checks alone aren’t foolproof and had gaps that someone could exploit if they really wanted to. Also, manually checking selfies is not real-time and will not be viable at our scale.

What is the fuss about, just use face recognition?

We also thought it would be that simple before we approached the problem in more detail. Stay a bit patient to understand this better and we promise that it will be an interesting read.

Face recognition works on machine learning, which always gives a probabilistic and not a deterministic answer. Simply put, any face recognition algorithm will only tell you a probability that there is a match between the two faces being compared e.g. 53.64%.

Now, it is up to you to use this information, i.e. to set a threshold value which decides whether the case passes or fails. For a 53.64% score, a 50% threshold gives the output “face successfully matched”, whereas a 60% threshold gives “face verification failed”.

So, this threshold value is clearly quite important, but what is the best value to set here? Like always, the answer is not very straightforward and it depends on your use-case :)

The key thing you look at is the confusion matrix on your data-set. After running it on various threshold levels, plot the ROC curve and figure out the threshold value that maximizes true positives while minimizing false positives.

  1. True positive: This means that correct professional took the selfie and algorithm output said “face successfully matched”.
  2. True negative: This means that Joe Goldberg started the job and the algorithm successfully stopped it from happening i.e. output said “face verification failed”
  3. False negative: This means that correct professional started the job and algorithm output said “face verification failed”. If this number is not kept under check, the professional experience goes for a toss!
  4. False positive: This is the worst case scenario, which means that Joe Goldberg started the job and algorithm said “face successfully matched”!
Confusion matrix showing various scenarios you come across in a machine learning problem
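
To make the four buckets concrete, here is a minimal Python sketch that sorts a handful of selfies into them for a given threshold. The scores and labels are made up for illustration, and the function names are ours, not part of any vendor API.

```python
from collections import Counter

def confusion_counts(samples, threshold):
    """Count TP / FN / FP / TN for (match_score, is_genuine) pairs.

    A selfie counts as "face successfully matched" when its score
    is at or above the threshold.
    """
    counts = Counter()
    for score, is_genuine in samples:
        matched = score >= threshold
        if is_genuine:
            counts["TP" if matched else "FN"] += 1
        else:
            counts["FP" if matched else "TN"] += 1
    return counts

# Illustrative data only: three genuine selfies, two impostor selfies.
samples = [(0.91, True), (0.5364, True), (0.48, True), (0.72, False), (0.30, False)]

for threshold in (0.50, 0.60):
    print(threshold, dict(confusion_counts(samples, threshold)))
```

Raising the threshold from 50% to 60% moves the 53.64% selfie from the true positive bucket to the false negative bucket, while the impostor scoring 72% still slips through as a false positive.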

All of these scenarios will happen, yes, they will happen, no matter which threshold you pick. Even Apple’s Face ID is quoted at about 1 false positive in 1 million attempts, despite custom hardware and hundreds of the best engineering minds in the world working on it!

However, what you can control is the threshold value which distributes the cases in these four buckets. Think of it as a classifier line on observed labels which divides your data set in four parts below:

Source: https://classeval.wordpress.com/

So, what is an ideal output of confusion matrix? 100% true positives and true negatives, but that will never happen. So, you strive to maximize true positives while keeping false positives in check, it is a classic trade-off problem. The metrics which are used to measure the same are:

  1. True positive rate (TPR)= (#True positives)/(#Actual positives)
  2. False positive rate (FPR)= (#False positives)/(#Actual negatives)

We plot these two on a curve while varying the threshold from 100% to 0%. This curve is known as the ROC curve (Receiver Operating Characteristic curve). Yes, a fairly complex name :), but we didn’t name it so. It is actually a fairly old concept, much older than you think. It was first used in World War II for analysis of radar signals and is still relevant. The ROC curve looks like this:

An ROC curve for a face recognition algorithm, compared with a coin flip

So, imagine a perfectly random face recognition algorithm that works like a coin flip. What will its curve look like? It will essentially be a straight diagonal line: at every threshold it passes the same share of genuine selfies as impostor selfies, so TPR equals FPR.

Then, what would be the curve of a perfect face recognition algorithm? It will be shaped like an inverted L, rising straight up to point (2) in the above figure (100% TPR at 0% FPR) and then running flat along the top. So, you can tell a lot about the effectiveness of a face recognition algorithm by simply looking at its ROC curve.
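
For readers who want to trace such a curve themselves, here is a small Python sketch that sweeps the threshold from 100% down to 0% and records (FPR, TPR) at each step. The scores are invented for illustration; in practice you would feed in match scores from your own labelled data-set.

```python
def roc_points(samples, thresholds):
    """Return a list of (threshold, fpr, tpr) for (match_score, is_genuine) pairs."""
    positives = sum(1 for _, genuine in samples if genuine)
    negatives = len(samples) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one genuine and one impostor sample")
    points = []
    for t in thresholds:
        tp = sum(1 for score, genuine in samples if genuine and score >= t)
        fp = sum(1 for score, genuine in samples if not genuine and score >= t)
        points.append((t, fp / negatives, tp / positives))
    return points

# Illustrative data only: four genuine and four impostor selfies.
samples = [(0.9, True), (0.8, True), (0.7, True), (0.4, True),
           (0.6, False), (0.3, False), (0.2, False), (0.1, False)]

for t, fpr, tpr in roc_points(samples, [1.0, 0.75, 0.5, 0.25, 0.0]):
    print(f"threshold={t:.2f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

The sweep always starts at (0, 0) with the strictest threshold and ends at (1, 1) with a threshold of zero; how quickly TPR climbs before FPR does is what separates a good algorithm from a coin flip.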

So, where do we operate on the curve?

  1. Worst case: This is worst point (1) to operate on the curve as it has 100% false positives and threshold value is set to zero.
  2. Ideal case: You would aspire to operate as close to point (2) as possible, but will never reach there. This has 100% true positive rate with 0% false positive rate
  3. Conservative play: Point (3) on this curve has relatively lower false positive rate but it also reduces true positive rate to ~50%, which is bad user experience but highly secure
  4. Optimistic play: Point (4) on this curve has relatively higher true positive rate, which gives better user experience but compromises security with ~30% false positive rate.

As you can expect, we set out to create a highly secure system, while still delivering a great user experience :)
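
One way to turn "highly secure, but still a great experience" into a rule is to cap the false positive rate and then take the threshold with the best true positive rate under that cap. The Python sketch below does exactly that; the cap values and the sample scores are assumptions for illustration, not the numbers we used.

```python
def rates(samples, threshold):
    """Return (fpr, tpr) for (match_score, is_genuine) pairs at one threshold."""
    positives = [s for s, genuine in samples if genuine]
    negatives = [s for s, genuine in samples if not genuine]
    tpr = sum(s >= threshold for s in positives) / len(positives)
    fpr = sum(s >= threshold for s in negatives) / len(negatives)
    return fpr, tpr

def pick_threshold(samples, thresholds, max_fpr):
    """Highest-TPR threshold whose FPR stays within max_fpr, or None."""
    best = None
    for t in thresholds:
        fpr, tpr = rates(samples, t)
        if fpr <= max_fpr and (best is None or tpr > best[2]):
            best = (t, fpr, tpr)
    return best

# Illustrative data only.
samples = [(0.9, True), (0.8, True), (0.7, True), (0.4, True),
           (0.6, False), (0.3, False), (0.2, False), (0.1, False)]
thresholds = [1.0, 0.75, 0.5, 0.25, 0.0]

print(pick_threshold(samples, thresholds, max_fpr=0.0))   # conservative play
print(pick_threshold(samples, thresholds, max_fpr=0.25))  # more optimistic play
```

Tightening the cap pushes you towards point (3) on the curve, loosening it pushes you towards point (4).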

How did we decide to go ahead with Microsoft?

We assessed the face recognition solutions from various firms i.e. Microsoft, Amazon, and many others which we can’t name as we are bound by NDA terms. We had a total of seven firms in the pool, some of them being large tech giants, some firms specializing in bio-metrics identification, AI and others being small nimble start ups.

So, then it was time for… yes the try-outs!

We decided to test out all of the available solutions on our data set and thankfully we had loads of it, as we deliver more than 0.5 million jobs every month and had been capturing on job selfies from professionals for a few months, which actually served its purpose at the time.

The key parameters which we were looking to get right:

  1. Zero impersonation i.e. as close to perfect face recognition algorithm as possible, ensuring zero impersonation on the platform
  2. Professional experience i.e. deliver a great experience to professionals by ensuring low latency, real-time UX feedback
  3. Data security i.e. ensuring that professional data is not compromised and that we comply with global data protection guidelines across countries
  4. Commercials (you know this one)

Problem with the data-set

Most of these firms limit the number of API calls that you get for free. So, how do you decide which professionals and selfies to pick? We picked professionals across different geographies, with a sufficient mix of gender, ethnicity and service categories, which gave us ~7500 selfies.

However, this data-set wasn’t challenging enough to really test the algorithms, as it didn’t have any negatives. So, we added a negative data-set of 500 selfies to make this a head-on comparison.

Initial results were promising, but not workable

We were getting only ~80% TPR for ~0% FPR with Microsoft and the numbers for other players were even less encouraging.

Initial face recognition results with Microsoft’s APIs

When we delved a bit deeper, we figured out that loads of bad results were coming because of imperfections in the selfie taken by professionals with key aspects being:

  1. Face angle: Professional facing away from the camera or at an awkward angle
  2. Low light: Underexposed images, usually taken in dark environment
  3. Low quality/blurred image: Shaky hands or badly taken image
  4. Incomplete image: Professional’s complete face not visible in the image
  5. Glasses: Professionals wearing dark shades while taking photographs
  6. Masks: Professionals taking selfies with their masks on (mostly Delhi)

Microsoft’s Edge — Face detection API with attributes

This is where Microsoft’s solution separates itself from the pack. The solution has two APIs — one for face detection and another for face verification.

The face detection API itself gives useful information which can be used to give feedback to user and help take selfie in the correct manner to maximize TPR.

It includes all six attributes above and a lot more; the full list of attributes is in Microsoft’s Face API documentation. We found most of them to give good results during our testing.

So, we manually labelled all images (8000+ images) for such imperfections and removed the imperfect ones from the data set, since those can be handled by designing a smart UX on the app.
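
As a rough illustration of how detection attributes can drive selfie feedback, here is a Python sketch that maps the six imperfections to on-screen hints. The dictionary keys, the 20 degree yaw limit and the messages are hypothetical simplifications of our own; they are not the field names of the Microsoft API response, and no API call is made here.

```python
def selfie_feedback(face, max_yaw_deg=20.0):
    """Turn simplified face attributes into hints for the professional.

    `face` is a hypothetical, already-simplified dict (or None when no
    face was detected). An empty result means the selfie is good to verify.
    """
    if face is None:
        return ["No face found. Keep your whole face inside the frame."]
    checks = [
        (abs(face.get("yaw_deg", 0.0)) > max_yaw_deg, "Look straight at the camera."),
        (face.get("exposure") == "under", "Move to a brighter spot."),
        (face.get("blur") == "high", "Hold the phone steady."),
        (face.get("face_cut_off", False), "Keep your whole face inside the frame."),
        (face.get("sunglasses", False), "Remove your sunglasses."),
        (face.get("mask", False), "Remove your mask."),
    ]
    return [message for failed, message in checks if failed]

print(selfie_feedback({"yaw_deg": 35.0, "sunglasses": True}))
print(selfie_feedback({"yaw_deg": 3.0, "exposure": "good", "blur": "low"}))
```

Running these checks before the verification call means a bad selfie costs the professional a retake hint rather than a failed match.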

Now, we reached up to 94% TPR at ~0% FPR. This was good, but still 6% of partners will get rejected falsely.

ROC curve after adjusting for UX feedback
UX Feedback given to professionals while clicking their photograph

How do you solve for these 6% jobs?

Do you allow these partners to start the job or not? Thing is, it is not possible to separate these 6% false negatives from actual true negatives. However, one thing is clear, that we cannot ban these professionals from starting the job else 6% of jobs will get hampered.

So, we used a simple trick: allow up to three attempts per job to take the selfie. Now, you are left with the funnel below, where on 5–6% of jobs professionals have to make two attempts to verify themselves, and a 3rd attempt is needed on roughly every 300th job of a professional, i.e. once in every 4 months.
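
The funnel numbers can be reproduced with a few lines of Python, under the simplifying assumption (ours) that every attempt by a genuine professional is rejected independently with the same 6% probability.

```python
def attempt_funnel(false_reject_rate, max_attempts=3):
    """Share of genuine jobs reaching each attempt, and share failing all of them.

    Assumes each attempt fails independently with the same probability.
    """
    reach = [false_reject_rate ** k for k in range(max_attempts)]
    fail_all = false_reject_rate ** max_attempts
    return reach, fail_all

reach, fail_all = attempt_funnel(0.06)
print(f"second attempt needed on {reach[1]:.1%} of jobs")
print(f"third attempt needed on about 1 in {round(1 / reach[2])} jobs")
print(f"all three attempts fail on {fail_all:.4%} of jobs")
```

With a 6% rejection rate this gives a third attempt on about 1 in 278 jobs, and roughly 0.02% of jobs where a genuine professional fails all three attempts. Real retries are unlikely to be fully independent (the same bad lighting can spoil all three selfies), so treat this as a best case.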

Microsoft supports a self-learning solution

Also, we came across a few cases of false negatives where professionals weren’t getting matched to original profile picture because of changes that happen over time:

  • Gaining/Losing loads of weight
  • Growing/removing a thick beard after on-boarding

Microsoft allows you to keep on adding more images to professional’s profile over time so that the profile remains up to date. This gives opportunity to not only handle the cases above, but also further improve performance of algorithm e.g. add a selfie to professional’s person id after every 50 jobs which had a match score of >70%.
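
The refresh rule from the example above fits in a one-line check. This Python sketch only decides when a selfie qualifies; the function name is ours, and the actual upload of the image to the professional’s profile is deliberately left out.

```python
def should_add_selfie(jobs_since_last_added, match_score,
                      every_n_jobs=50, min_score=0.70):
    """True when a verified selfie should be added to the professional's profile.

    Defaults follow the example in the text: once every 50 jobs, and only
    for selfies that matched with a score above 70%.
    """
    return jobs_since_last_added >= every_n_jobs and match_score > min_score

print(should_add_selfie(50, 0.83))  # due and confidently matched
print(should_add_selfie(50, 0.65))  # due, but the match is too weak to learn from
print(should_add_selfie(12, 0.95))  # strong match, but not due yet
```

Keeping the score bar high matters here: adding a weakly matched selfie could teach the profile the wrong face.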

Microsoft also had good measures to ensure data security and offered solution at a price point which made sense for us.

Just two more problems to solve!

Now, we only had two more problems to solve, however quite non-trivial:

  1. Automated blocking: How to automate blocking professionals from delivering the job, when you still have 0.02% false negatives?
  2. Live detection: What if Joe Goldberg kills the professional, takes his photograph and then starts the job by taking selfie of his photograph. Just kidding :) , one can get hands on a photograph very easily today.
Don’t believe that for a second :)

Heimdall — Checkpoint system to start the job

We solved these two problems by introducing additional identifiers on the job, and not just relying on face recognition output.

  1. Primary Device ID check: We started capturing device ID (device id, android id and our own App install identifier i.e. gui id) from professional’s UC App on various API calls which happen through the day. Then created our own simple logic to identify which device is professional’s primary device, in case multiple devices are being used. Non-primary device indicates someone else went on to deliver the job.
  2. Location check: Check the distance between customer’s house to the place where job is getting started. A large distance (>500 m) indicates job is being started remotely, and someone else might be delivering the job.

We made location capture mandatory at all times and restricted professionals to stay logged in from only one device to ensure both of these data points are captured reliably.
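
The location check boils down to a distance calculation between two coordinates. Below is a minimal Python sketch using the haversine formula; the 500 m limit comes from the text, while the function names and the sample coordinates are made up for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def location_check_passes(customer, job_start, max_distance_m=500):
    """True when the job is started within max_distance_m of the customer's address."""
    return haversine_m(*customer, *job_start) <= max_distance_m

customer = (28.4595, 77.0266)  # illustrative coordinates only
print(location_check_passes(customer, (28.4605, 77.0266)))  # about 111 m away
print(location_check_passes(customer, (28.4695, 77.0266)))  # about 1.1 km away
```

The allowed distance is a tunable knob: GPS drift and imprecise addresses both argue against setting it too tight.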

So, now you have two additional data points for those 0.02% of cases which fail all three attempts of the identity verification check.

The sad part is, neither the primary device ID check nor the location check gives you a 100% reliable answer, and neither is workable on its own, because they also have false negatives. What if the professional bought a new phone? What if the customer entered the address incorrectly?

But, when you combine all three of them together, it becomes quite powerful.

>90% of professionals end up using same device for a window of up to 30 days, as can be seen from the chart below. Here again, the sweet spot will be choosing the correct time window.

  • Short window: With too short a window (e.g. 1 day), the device can actually change without us noticing, and false positives happen
  • Long window: A genuine device change keeps giving a negative result for much longer than desired, leading to many false negative alerts

With the correct window and correctly implemented primary device id logic, one can conservatively expect >95% hit rate on this variable.
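
We have not spelled out our primary device logic here, so the Python sketch below is only one plausible version of it: take the device seen most often across API calls inside the chosen window. The function name, the event format and the device labels are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta

def primary_device(events, now, window_days=30):
    """Most frequently seen device ID within the window, or None if none was seen.

    `events` is an iterable of (timestamp, device_id) pairs collected from
    API calls made by the professional's app.
    """
    cutoff = now - timedelta(days=window_days)
    counts = Counter(device for seen_at, device in events if seen_at >= cutoff)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

now = datetime(2020, 1, 29)
events = (
    [(now - timedelta(days=40), "old-phone")] * 5   # outside a 30-day window
    + [(now - timedelta(days=3), "phone-A")] * 3
    + [(now - timedelta(days=1), "phone-B")]
)
print(primary_device(events, now, window_days=30))
print(primary_device(events, now, window_days=60))
```

The two calls show the window trade-off: with 30 days the current phone wins, while with 60 days the phone the professional has already replaced still counts as primary.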

Spread of #devices used in last 30 days by UC professionals

Similarly for location, you can expect an even higher hit rate of >99% by playing around with the distance allowed between customer location and start job location.

So, now you are only left with (0.02% × (1 − 95%) × (1 − 99%)) of jobs where false negatives will happen once the device ID and location checks come into the picture, i.e. 0.00001% of jobs, which, to put it in perspective, is 1 out of every 1 crore (10 million) jobs on the UC platform.
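
The multiplication above is easy to check in Python. Note that multiplying the three rates assumes the face, device and location checks fail independently of each other, which is a simplification.

```python
face_fails_all_attempts = 0.0002  # 0.02% of jobs fail all three selfie attempts
device_miss_rate = 1 - 0.95       # primary device check hit rate of 95%
location_miss_rate = 1 - 0.99     # location check hit rate of 99%

# Probability that a genuine professional fails all three checkpoints.
combined = face_fails_all_attempts * device_miss_rate * location_miss_rate

print(f"{combined:.7%} of jobs")
print(f"about 1 in {round(1 / combined):,} jobs")
```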

Even in this case, partner will get re-instated after customer confirmation. So, basically now you have a Checkpoint based system (similar to police roadblocks where they check multiple things before letting you through), which looks like this:

Heimdall — Checkpoint system to eliminate potential impersonation at Urban Company
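
In code, a checkpoint system of this kind could look like the Python sketch below. The exact routing is our illustrative assumption, not a transcript of the production flow: a face match starts the job, a failure on all three checkpoints blocks it, and anything in between goes to manual review.

```python
from enum import Enum

class Decision(Enum):
    START_JOB = "start_job"
    MANUAL_REVIEW = "manual_review"
    BLOCK = "block"

def checkpoint(face_matched, on_primary_device, near_customer):
    """Combine the three checkpoints into a single decision.

    `face_matched` is True when any of the allowed selfie attempts passed.
    """
    if face_matched:
        return Decision.START_JOB
    if not on_primary_device and not near_customer:
        # Every signal points to someone else delivering the job.
        return Decision.BLOCK
    # Face failed but at least one other signal looks genuine: let a human decide.
    return Decision.MANUAL_REVIEW

print(checkpoint(face_matched=True, on_primary_device=False, near_customer=True))
print(checkpoint(face_matched=False, on_primary_device=True, near_customer=True))
print(checkpoint(face_matched=False, on_primary_device=False, near_customer=False))
```

The point of the design is that no single noisy signal can block a genuine professional on its own.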

Closed-loop system

For such products, it is crucial to ensure they are water-tight, even one leakage can drain all of your hard work. So, for all cases where we are unsure, we have a human-in-the-loop system (manual inspection) where a dedicated central trust and safety desk looks at such cases to take necessary action.

Since, we have a zero tolerance policy on such behavior, professional gets immediately barred from the platform. Also, because of our strict on-boarding process, such professionals can never come back to the platform.

Device ID, location and the closed-loop system, combined with the fact that the professional has to take the selfie while entering the OTP in front of the customer, ensure that both the live detection and the false negative problems are solved without any additional hardware requirement on the device.

Meet UC’s Gate-keepers!

Here is the FIST team (Fraud, Identity, Safety and Trust) which worked on the Heimdall project to enable our vision:

Urban Company’s vision is to empower millions of professionals worldwide to deliver services at home like never experienced before.

FIST team — Left to right: Rahul Teotia (Product), Mayur Garg and Vijay Gupta (Backend), Shreya Saxena (Design), Vaibhav Jain (Android), Sumit Kumar (Backend).

So, what’s next?

At UC, one thing we believe quite strongly in is to create platform capabilities rather than solving for specific use-cases. We have designed this capability in a manner, that it will extend to many upcoming use-cases:

  • Uniform detection: Whether UC professional is wearing official uniform on the job to further establish customer trust
  • Grooming detection: Whether UC professional is well-groomed on the job to exude professionalism
  • Equipment detection: Whether professional is carrying UC provided equipment to deliver the job flawlessly in a safe manner

Want to join UC team?

Feel free to connect via Linkedin and express interest to join the team in whatever domain your passion lies — product, design, engineering or even business team.

I am just a ping away: Rahul Teotia (Linkedin)

Sounds like fun?
If you enjoyed this blog post, please clap 👏 (as many times as you like) and follow us (UrbanClap Blogger). Help us build a community by sharing on your favourite social networks (Twitter, LinkedIn, Facebook, etc).

You can read up more about us on our publications —
https://medium.com/urbanclap-design
https://medium.com/urbanclap-engineering
https://medium.com/urbanclap-culture
https://www.urbanclap.com/blog/humans-of-urbanclap

If you are interested in finding out about opportunities, visit us at http://careers.urbanclap.com
