Exploring New Machine Learning Models for Account Security

Lenny Evans, data scientist, and Karthik Ramasamy, data science manager

Protecting Uber accounts from abuse and unauthorized use is a core responsibility for many engineers, analysts, and data scientists across multiple security teams. A traditional approach to account security is to train your systems and your analysts to differentiate unauthorized activity from legitimate activity. Once you can identify fraudulent behavior, you can more easily block it.

However, when we look at something like unauthorized login attempts, the difference between fraudulent and legitimate behavior can be harder to recognize because we never know the ground truth. Phishing and password reuse can give a fraudster the correct credentials to access an account, so we built multi-layered defenses to protect these accounts.

Our account security system includes foundational elements like rate limiting and rules based on heuristic features as well as more advanced components like machine learning models. Our analysts are able to deploy rules in quick response to specific attacks while rate limiting and machine learning models target a broader range of attack patterns.

Machine Learning Models

Building machine learning models for account security at Uber is an exciting challenge because we must build them in an adversarial environment: fraudsters will try to figure out how to change their attack patterns to get around the model.

Below are two models we were able to build despite these challenges, both of which allow us to detect suspicious logins. Once a suspicious login is identified, we can ask the user to verify their identity through methods such as two-factor authentication (2FA). If a login attempt is particularly suspicious, we take proactive actions to further protect the account, such as resetting the password and notifying the account owner.

Semi-Supervised Approach

One way to detect anomalous login patterns is to look at patterns of logins across entities shared by attackers, such as IP addresses. Depending on their sophistication, attackers use anywhere from a few IPs to several hundred thousand. Common sources of these IPs are hosting providers, Tor proxies, and personal devices compromised by malware as part of massive botnets.

A visualization of the output of our IP-based clustering model, with PCA used to reduce the dimensionality of the features to two for easier visualization. Each point in the image is an IP address. The top and bottom images show the clusters for logins in mid-2016 and mid-2017, respectively. The clusters were built with different features, as the login patterns of attackers changed considerably over the intervening year.

We use a semi-supervised approach to group malicious IPs together. For each IP, we use the labels we have to find features that are helpful in separating good IPs from bad ones. We tune the features and the parameters of our clustering algorithm by looking at how effectively the clusters separate the known-good labels from the known-bad labels. We preferentially use features that are harder for an attacker to control so that our models remain robust. So that each feature has the same influence on the clustering, we opt for features that are percentages. We select about ten features and then use the DBSCAN clustering algorithm to find clusters of IPs.
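As a rough sketch of what this clustering step looks like, the example below runs scikit-learn's DBSCAN over synthetic percentage-valued features for a set of IPs. The feature meanings, data, and parameters here are invented for illustration and are not the production features or tuned values.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical percentage-valued features per IP, all in [0, 1] so each
# feature has a comparable influence on distance, e.g. fraction of failed
# logins, fraction of logins from new devices, fraction of flagged accounts.
rng = np.random.default_rng(0)
benign = rng.normal(loc=[0.05, 0.10, 0.02], scale=0.02, size=(50, 3))
botnet = rng.normal(loc=[0.90, 0.80, 0.60], scale=0.02, size=(50, 3))
features = np.clip(np.vstack([benign, botnet]), 0.0, 1.0)

# eps and min_samples would be tuned by checking how well the resulting
# clusters separate known-good from known-bad labels.
clustering = DBSCAN(eps=0.1, min_samples=5).fit(features)
labels = clustering.labels_  # -1 marks noise points outside any cluster
print(sorted(set(labels)))
```

On this toy data the two behavioral regimes fall into separate clusters; in practice, points labeled -1 (noise) are IPs that match no dense pattern and need other signals.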

Once the model returns clusters, we compute custom per-cluster metrics that identify each whole cluster as good or bad. This lets us expand our limited set of labels to the unlabeled IPs. When new clusters form in which we don't have enough labels, we obtain them either by challenging the users in the cluster with 2FA or through manual review.
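A minimal sketch of this label-expansion step, assuming a simple bad-fraction metric per cluster; the helper name `label_clusters`, the thresholds, and the toy IPs are all illustrative, not the production metrics.

```python
# Toy cluster assignments and sparse per-IP labels ("good", "bad", or absent).
clusters = {
    "1.2.3.4": 0, "1.2.3.5": 0, "1.2.3.6": 0,
    "9.9.9.1": 1, "9.9.9.2": 1, "9.9.9.3": 1,
}
known = {"1.2.3.4": "good", "9.9.9.1": "bad", "9.9.9.2": "bad"}

def label_clusters(clusters, known, min_labels=1, bad_frac=0.5):
    """Propagate sparse IP labels to whole clusters.

    Returns a verdict per cluster; None means the cluster lacks enough
    labels and should be escalated (2FA challenge or manual review).
    """
    by_cluster = {}
    for ip, cluster_id in clusters.items():
        by_cluster.setdefault(cluster_id, []).append(known.get(ip))
    verdicts = {}
    for cluster_id, labels in by_cluster.items():
        labeled = [l for l in labels if l is not None]
        if len(labeled) < min_labels:
            verdicts[cluster_id] = None  # not enough signal yet
        else:
            frac_bad = labeled.count("bad") / len(labeled)
            verdicts[cluster_id] = "bad" if frac_bad >= bad_frac else "good"
    return verdicts

print(label_clusters(clusters, known))  # → {0: 'good', 1: 'bad'}
```

One label on cluster 0 and two on cluster 1 are enough here to label all six IPs, which is the point of the expansion: a handful of known labels classify the whole cluster.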

Unsupervised Approach

The semi-supervised approach above still relies on labels and is reactive to how fraudsters are attacking our system. To proactively protect against a broader range of attacks, we train models only on good user behavior and flag anything inconsistent as anomalous. Unless the attack is highly targeted (in which case the fraudster will be limited by the economics of scaling), the fraudster cannot know the normal behavior of the user and thus it is more difficult for them to get around anomaly detection models.

For example, a user who has historically taken trips only in Uberlândia, Brazil is unlikely to start taking trips in Hyderabad, India. We built a deep learning model to learn these relationships between cities. The model takes as input the anonymized sequence of a user's trips and food orders with Uber and predicts the city where the next trip will take place.

Deep learning is well suited to this problem, as traditional machine learning approaches such as tree models have difficulty learning from large, high-dimensional datasets. Neural networks have far more parameters than these traditional approaches and thus require more data to fully constrain those parameters. In addition, traditional machine learning approaches are ill-suited to handling sequences of varying length: they require such sequences to be awkwardly encoded as a long list of features, whereas neural networks have recurrent layers such as LSTMs that naturally handle variable-length sequences.
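To make the shape of such a model concrete, here is a minimal NumPy forward pass: an embedding lookup feeding a single LSTM cell over a variable-length city sequence, followed by a softmax over cities. The sizes are toy values and the weights are random stand-ins for trained parameters; this sketches the architecture, not the production model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cities, d_embed, d_hidden = 500, 16, 32  # toy sizes, not production values

# Randomly initialized parameters stand in for trained weights.
E = rng.normal(0, 0.1, (n_cities, d_embed))                 # city embedding table
W = rng.normal(0, 0.1, (4 * d_hidden, d_embed + d_hidden))  # stacked LSTM gate weights
b = np.zeros(4 * d_hidden)
W_out = rng.normal(0, 0.1, (n_cities, d_hidden))            # softmax projection

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_next_city(trip_cities):
    """Run one LSTM pass over a variable-length city-ID sequence and
    return a probability distribution over the next trip's city."""
    h = np.zeros(d_hidden)
    c = np.zeros(d_hidden)
    for city in trip_cities:
        z = W @ np.concatenate([E[city], h]) + b
        i, f, g, o = np.split(z, 4)  # input, forget, candidate, output gates
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    logits = W_out @ h
    p = np.exp(logits - logits.max())  # numerically stable softmax
    return p / p.sum()

probs = predict_next_city([42, 42, 17, 42])  # arbitrary toy city IDs
print(probs.shape, probs.sum())
```

Because the recurrence consumes one city at a time, the same model handles a two-trip history and a two-thousand-trip history without any padding tricks, which is the property the paragraph above highlights.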

We investigated representing cities in an embedding: a low-dimensional mapping of cities in which the distance between two cities represents how likely users are to travel between them. The problem has a natural embedding, latitude and longitude, but it fails to capture the tendency of users to travel between big cities that are not necessarily the closest ones. We applied the word2vec algorithm, common in natural language processing (NLP), to our trip sequences in order to capture these relationships better than latitude and longitude can. We then used our internal GPU infrastructure to train this model on hundreds of millions of training examples.

A visualization of our city2vec embedding globally (top) and for cities in the US and Canada (bottom), with PCA used to reduce the number of dimensions to two. Note that the scales of the two plots differ. In both plots, each point represents a city, but not all cities are labeled, for visibility.

What’s Next?

We’re already exploring semi-supervised and unsupervised approaches to related fraud problems where detecting anomalies is useful. For example, a fake account behaves very differently from normal users and thereby shows up as an anomaly in models trained on legitimate users. The unsupervised model described here is also generic enough to detect anomalies in other event sequences, such as the types of devices (iOS, Android, Mac, etc.) used by a specific user.

This work will be presented at the USENIX Summit on Hot Topics in Security in Vancouver and the 2017 Annual Data Institute Conference in San Francisco.