Using Computer Vision to Evaluate Scooter Parking

Andrew Xia
Published in Lime Engineering
Jul 13, 2020

This article was jointly written by Yi Su and Andrew Xia

With more people integrating electric scooters into their daily transportation, an increasing number of dockless scooters have appeared in all corners of cities around the world. While the flexibility of a dockless micromobility service is convenient for riders, it is also very important that our scooters comply with each city’s transportation regulations, particularly around sidewalk riding and parking. Our latest blog post describes how Lime uses machine learning to curb sidewalk riding among users. In this blog, we will explain how Lime applies computer vision to tackle non-compliant scooter parking.

Our problem can be framed as follows: given a scooter image, determine whether the scooter is parked correctly and, if not, return the violation label. Possible violations include the scooter not appearing in the image at all, the scooter being parked too close to the road, or the scooter blocking the sidewalk.

Datasets and Modeling

As our problem is inherently one of image classification, convolutional neural networks are a natural choice for this task. All machine learning projects start from a well-labeled training dataset, and this one is no exception. Intuitively, we would like the dataset to contain good quality scooter pictures, in that a human would easily be able to distinguish a correctly parked scooter from an incorrectly parked scooter.

Datasets

In the Lime app, there are two sources that generate image data:

  1. Rider app: riders are asked to take a photo of the parked scooter at the end of each trip.
  2. Deployment or “Juicer” app: partners (“juicers,” logistics partners, or subcontractors, depending on the market) who help Lime retrieve, charge, and redeploy scooters are required to take a photo of the parked scooter(s) after deployment.

There are pros and cons to each of these data sources.

Rider Parking Photos

  • Pros — high photo quantity. Riders take many more photos than deployment partners.
  • Cons — low photo quality. The drawbacks of rider photos are twofold: photo quality varies widely, with only 7% of rider photos deemed good, and the photos are unlabeled.
A sample of photos taken by Lime riders at the end of their trip.

Deployment Photos

  • Pros — high photo quality and labeling quality. Deployment photos have been reviewed since late 2018 by well-trained human reviewers, yielding high-quality labels. Furthermore, the parking requirements, or non-violation classes, are well documented based on city regulations. Consequently, there is much less ambiguity for human labelers.
  • Cons — different parking requirements from riders. The requirements for good deployment photos are not exactly the same as rider parking photos.
A sample of deployment photos

The foundation of any successful supervised machine learning project is a high-quality labeled dataset; otherwise, garbage in, garbage out. Hence, rider parking photos are not well suited for training, though we can use them later for testing. The requirements for good deployment photos are stricter than those for rider parking, so if we can build an accurate classifier for deployments, the same model may be reused to solve the problem on the rider side. With this in mind, deployment photos were chosen as the training set for our CNN model.

End-to-End vs Modularized Approach

We considered two approaches to building our classification model. In the first, an end-to-end approach, a single model takes the input photo and outputs a multiclass vector with one entry per violation class label. In the second, a modularized approach, we train a series of binary classifiers, each outputting either “parked correctly” or its respective violation label, and then combine the models in a waterfall.

On the surface, an end-to-end approach seems the natural fit for this classification task, i.e., training a multiclass classification model. Given a scooter parking photo as input, the model outputs a vector of length n, corresponding to the n non-compliance requirements. The value in the ith position represents the probability of violating requirement i; if that probability exceeds a predetermined threshold, the photo is said to violate requirement i. In the end-to-end approach, a photo may therefore have more than one violation.

End-to-End Approach
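As a concrete illustration, the thresholding step might look like the minimal sketch below. The requirement names, the single shared threshold, and the example score vector are all hypothetical; in practice, per-class thresholds would be tuned on validation data.

```python
import numpy as np
from typing import List

# Hypothetical violation requirements; the real set comes from city regulations.
REQUIREMENTS = ["scooter_not_in_photo", "too_close_to_road", "blocking_sidewalk"]
THRESHOLD = 0.5  # illustrative; each class could use its own tuned threshold

def violations_from_scores(scores: np.ndarray) -> List[str]:
    """Map the end-to-end model's output vector to violation labels.

    scores[i] is the predicted probability that the photo violates
    requirement i; a photo may violate several requirements at once.
    """
    return [req for req, p in zip(REQUIREMENTS, scores) if p > THRESHOLD]

# Example: a made-up score vector from the end-to-end model
print(violations_from_scores(np.array([0.02, 0.81, 0.12])))  # ['too_close_to_road']
```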

Our experiments showed us that the multiclass classification model in the end-to-end approach required more data to train, took longer to converge, and did not have a satisfactory result. Furthermore, if violation requirements were added or removed from the current set of requirements, the whole model would have to be retrained for this new set of requirements.

On the other hand, a modularized approach is more flexible when the parking requirements change frequently. In this approach, n different binary classification models are trained and arranged in a waterfall architecture, ordered by the importance of the violations, as illustrated in the following image. A photo is first passed into Model 1, which outputs a binary decision on whether the photo violates requirement 1. If it does, the prediction process stops and the photo is labeled as violating requirement 1. Otherwise, Model 2 takes the photo as input and makes its prediction. This procedure continues as long as the photo does not violate any requirement; if none of the n binary classifiers finds a violation, the photo is labeled as compliant parking.

Modularized Approach

As the architecture makes clear, if a new requirement is added, only one new binary classifier needs to be trained, and if a requirement is removed, no additional training is needed at all. With the constantly changing regulatory landscape of the micromobility industry, this modularized, lightweight approach better suits our needs.
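The waterfall logic itself is simple. Below is a minimal sketch, assuming each binary classifier is exposed as a callable that returns True when its requirement is violated; the labels and stand-in classifiers are placeholders.

```python
from typing import Callable, Sequence, Tuple

def classify_parking(photo, checks: Sequence[Tuple[str, Callable]]) -> str:
    """Run binary classifiers in priority order and stop at the first violation.

    `checks` holds (violation_label, classifier) pairs ordered by importance;
    each classifier returns True if the photo violates its requirement.
    """
    for label, violates in checks:
        if violates(photo):
            return label               # stop the waterfall at the first violation found
    return "compliant_parking"         # no model flagged the photo

# Hypothetical usage with stand-in classifiers
checks = [
    ("scooter_not_in_photo", lambda img: False),
    ("too_close_to_road",    lambda img: False),
    ("blocking_sidewalk",    lambda img: False),
]
print(classify_parking("photo.jpg", checks))  # 'compliant_parking'
```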

Model Architecture, Training and Explainability

Architecture

For ease of model maintenance, the same model architecture is used for all binary classifiers; a separate training set is fed into this architecture to train one model per requirement.

We first trained with several well-known architectures such as AlexNet, VGG-19, and InceptionNet, but their performance was not satisfactory. We then constructed our own customized architecture from several blocks of convolutional, max pooling, and batch normalization layers, followed by fully connected layers, with hyperparameters tailored to this specific problem. This architecture yielded the best results.

Our CNN Model Architecture
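The post does not prescribe a framework, so the sketch below uses Keras purely for illustration. Only the block structure comes from the description above (convolution + max pooling + batch normalization blocks, followed by fully connected layers and a single sigmoid output for the binary decision); the filter counts, depth, and input size are made-up hyperparameters.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_parking_classifier(input_shape=(224, 224, 3)) -> keras.Model:
    """Binary parking classifier built from conv + max pooling + batch norm blocks."""
    inputs = keras.Input(shape=input_shape)
    x = inputs
    for filters in (32, 64, 128):                 # one block per entry; counts are illustrative
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D()(x)
        x = layers.BatchNormalization()(x)
    x = layers.Flatten()(x)
    x = layers.Dense(256, activation="relu")(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)   # P(photo violates this requirement)
    return keras.Model(inputs, outputs)

model = build_parking_classifier()
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```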

Training

Amazon SageMaker notebook and GPU instances were used for model training. Each model is trained on a dataset of around 54k images for 50 epochs, which takes around 40 minutes, and convergence is fast and steady.
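Continuing the hypothetical Keras sketch above (reusing the `model` it builds), training one requirement’s classifier might look like the following. The directory layout, image size, and batch size are placeholders; only the 50-epoch figure comes from our setup.

```python
from tensorflow import keras

# Hypothetical layout: one sub-directory per class ("compliant" / "violation"),
# e.g. synced locally from S3 onto the SageMaker notebook instance.
train_ds = keras.utils.image_dataset_from_directory(
    "data/train", image_size=(224, 224), batch_size=64, label_mode="binary")
val_ds = keras.utils.image_dataset_from_directory(
    "data/val", image_size=(224, 224), batch_size=64, label_mode="binary")

history = model.fit(train_ds, validation_data=val_ds, epochs=50)
```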

Explainability

One of the biggest criticisms of deep learning models is that it is difficult to explain what is really happening under the hood. We therefore made an effort to visualize what the model learns after several convolutional layers.

The following example is chosen for the not-too-close-to-road requirement, which states that the scooter should not be parked so close to the curb of the sidewalk that it may obstruct vehicles traveling on the adjacent road.

Visualization of Activation Intermediate Convolution Layer

If a person looks at the original picture in the upper left, the most important feature to capture is whether any part of the scooter intersects with the curb line on the sidewalk. In the upper right picture, the highlighted area shows what this deep learning model pays attention to. Noticeably, the curb line and the scooter frame are well captured in those highlighted areas. This sheds some light on why the model for this specific classification task performs well.
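For reference, this kind of intermediate-activation visualization can be produced with a few lines of Keras and matplotlib. The sketch below is generic rather than our exact tooling; the layer index and the number of channels shown are arbitrary choices.

```python
import matplotlib.pyplot as plt
import numpy as np
from tensorflow import keras

def show_intermediate_activations(model, image, layer_index=2, n_channels=8):
    """Plot feature maps from one intermediate convolutional layer.

    `image` is a single preprocessed photo of shape (height, width, 3).
    """
    probe = keras.Model(model.input, model.layers[layer_index].output)  # sub-model up to that layer
    feature_maps = probe.predict(np.expand_dims(image, 0))[0]           # shape: (h, w, channels)
    for i in range(n_channels):
        plt.subplot(1, n_channels, i + 1)
        plt.imshow(feature_maps[..., i], cmap="viridis")
        plt.axis("off")
    plt.show()
```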

Performance Analysis

Recall that we trained one binary classifier for each requirement. Some of the models achieve over 95% accuracy on a balanced test set, while others reach only ~85%. To evaluate model performance objectively, we asked a group of human reviewers to review a sample of photos from the test sets in order to approximate the Bayes error rate of each classification task. Ten reviewers independently reviewed the same set of sample images, and their average pairwise agreement was used to approximate the Bayes error rate. Our models perform at around the same level as this approximation.
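The agreement computation itself is straightforward. Below is a minimal sketch with made-up labels, where the complement of the average pairwise agreement serves as the Bayes error estimate.

```python
from itertools import combinations

def average_pairwise_agreement(labels_by_reviewer):
    """Average pairwise agreement across reviewers labeling the same image set.

    `labels_by_reviewer` holds one list of labels per reviewer, all the same
    length; 1 minus the returned value approximates the Bayes error rate.
    """
    agreements = [
        sum(x == y for x, y in zip(a, b)) / len(a)
        for a, b in combinations(labels_by_reviewer, 2)
    ]
    return sum(agreements) / len(agreements)

# Toy example: three reviewers labeling the same four photos
reviews = [
    ["ok", "violation", "ok", "ok"],
    ["ok", "violation", "violation", "ok"],
    ["ok", "violation", "ok", "ok"],
]
print(average_pairwise_agreement(reviews))  # ~0.83, i.e. ~17% estimated Bayes error
```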

Inspecting a sample test set of photos, we found that some of the photos are intrinsically difficult for humans to label, which may contribute to the Bayes error rate.

In summary, the performance of these models is high enough to replace the current manual parking photo compliance workflow. The next section illustrates how Lime integrates this automated compliance technology into the user workflow to enhance the user experience.

Production System Design

With our trained computer vision model in place, every time someone takes a photo of their scooter deployment, we asynchronously call the model to determine whether the parking is compliant. If the model determines that the scooter is not in compliance, we send a friendly reminder that the deployment was non-compliant and encourage the person to deploy correctly next time.

A diagram of our backend model architecture
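As a rough illustration only, the asynchronous scoring step might resemble the handler below. It assumes the model is served behind a SageMaker inference endpoint, which the post does not specify; the endpoint name, payload format, and `send_parking_reminder` helper are all hypothetical.

```python
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

def send_parking_reminder(partner_id: str, violation: str) -> None:
    # Stand-in for the real notification service.
    print(f"Reminder to {partner_id}: deployment photo flagged as '{violation}'")

def handle_deployment_photo(photo_bytes: bytes, partner_id: str) -> None:
    """Asynchronously invoked when a deployment photo is uploaded."""
    response = runtime.invoke_endpoint(
        EndpointName="parking-compliance",       # placeholder endpoint name
        ContentType="application/x-image",
        Body=photo_bytes,
    )
    result = json.loads(response["Body"].read())  # e.g. {"violation": "too_close_to_road"}
    if result.get("violation"):
        send_parking_reminder(partner_id, result["violation"])
```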

However, if a deployment partner repeatedly fails to park in compliance, we enforce penalties. To mitigate false positives, we also give them the option to contest our review. This flexible approach lets deployment partners challenge the model’s decision, with the final determination made by manual review, and the resulting feedback helps us continually improve model performance.

Deployment partners are able to contest our review in the Lime app. We also give them the opportunity to provide feedback, which helps us continually improve our model.

Experiment Results

To determine the effectiveness of our model and feature, we launched an experiment in Seoul and Busan in Korea. We set up an A/B test, randomized on the deployment partner, in which users in the control group would not receive automated emails to give parking feedback, while users in the test group would receive emails when their deployments had parking violations. We saw that the deployment compliance rate increased by 14% compared to the control group, while there was no deterrent effect on juicer task collection activity. We also removed the need for human labelers to review deployment photos, representing a cost saving for Lime. With this encouraging result, we are planning a global rollout to all markets to ensure better deployment and meet city regulations.

Acknowledgments

Special thanks to Chen Zheng, Adrian Luan, Ben Laufer, Colin McMahon, Michele Weiner, Lei Xu, Aaron Nojima, Charlie McIntyre, Tucker Risman, Jinsong Tan, and many others who made this project possible.
