Anomaly Detection at Scale

My summer internship prepared me to be a Full-Stack Data Scientist

Liyishan
Wish Engineering And Data Science
9 min read · Mar 16, 2022


Last summer, I did my co-op as a Data Scientist intern on the centralized data science team at Wish. I was assigned to a high-impact project called anomaly detection. I never thought that an intern could make such a large impact. It was a valuable experience because I was able to drive the project from scratch: understanding the business needs, communicating with stakeholders, designing and implementing the solution, building a forecasting model, improving model accuracy, building a visualization dashboard, and eventually collaborating with stakeholders to realize the impact of the project.

I really appreciate Wish giving me the autonomy to drive the project with the support of experienced coworkers, allowing me to face complex challenges and grow in different ways. Throughout the summer, my mentor (Max Li) and manager (Pavel Kochetkov) were extremely supportive and helped me succeed in the role along the way. I am certain that this on-the-ground experience provided me with valuable lessons and prepared me well for my future career.

Project Background

Business metrics are crucial indicators of business health. As a data-driven company, Wish tracks metrics that measure a wide range of business outcomes, often broken down by different dimensions (e.g., country, device type). Monitoring the health of those metrics manually is challenging because the metrics naturally change due to a variety of factors such as seasonality, holidays, and special events. Imagine we have more than a hundred business metrics to monitor every day, and each metric has more than 100 time series broken down by different dimensions: each day, we need to continuously monitor more than 10k time series. As the number of metrics rises, manual monitoring is simply not feasible in today's big data world, and without a well-designed anomaly detection system, finding and diagnosing the root cause of anomalies in time only becomes harder.

Anomaly detection is crucial because it helps the company proactively monitor application performance and produce actionable insight to remediate issues, preventing catastrophic consequences. Negative anomalies could be extreme spikes in web traffic that point to severe server outages or production defects; if not handled in time, they may harm the customer experience and lead to revenue loss. Conversely, positive anomalies can help the company spot opportunities for growth and boost business performance. Wish has millions of daily active users worldwide, so it is critical to identify and resolve anomalous events before they negatively impact users' shopping experience. Therefore, we need a reliable and scalable surveillance system that automates the anomaly detection process and surfaces real anomalies before they become true problems.

In this article, I will walk you through how we constructed the solution using existing technology and highlight the key components of tackling the technical challenges.

Goals

The main goal of this project is to extend the scalability, reliability, and usability of anomaly detection so that we can deliver a robust anomaly detection service that seamlessly captures abnormal behavior in business metrics.

Scalability

One of our primary goals for this project is scalability. Each day, our anomaly detection service must screen tens of thousands of time series for anomalies, so speed is essential. If the system took the whole day to scan through all the candidate metrics, the delayed anomaly reports could lead to irreparable consequences and would defeat the purpose of having anomaly detection at all. Our goal is therefore to deliver timely, seamless anomaly reports daily or hourly, which can only be achieved with a system that automatically scales as the number of onboarded metrics rises.

Reliability

Another goal is to improve the reliability of our anomaly results. Constantly receiving false alarms is likely to annoy users and erode their trust in the alerts. We want to keep the false-positive rate low and the false-negative rate properly controlled so that our users trust the anomaly results we deliver. We took different actions in response to different challenges regarding model enhancement.

Usability

Finally, our goal is to enhance the usability of the system by improving the user experience and simplifying the metric onboarding flow. We aim to abstract away all the technical components and rely on an internal platform for onboarding, so that it only takes users a few minutes to onboard a metric of interest. We also aim to enhance usability by building a dashboard where users can interact with the time-series graphs, trace back anomalies, and investigate them more efficiently.

Problem Formulation

We used a time-series forecasting model to perform anomaly detection at Wish. There are a great many anomaly detection techniques, and which approach works best depends on the use case. Because our use case involves unlabeled data and needs to account for many factors, such as seasonality, holidays, and interpretability, we leveraged the forecasting approach to model the time series. Another benefit of the forecasting approach is that we can further improve the quality of the forecast by tuning the hyperparameters via cross-validation.

It is essential to incorporate a metric's historical data to determine whether today's value is an anomaly. To uncover the underlying patterns in the historical data, we built a forecasting model that accounts for the historical trend, seasonal pattern, and holiday effects before making predictions for the metric of interest. The predictions then let us compute a standardized distance, normalized by variance, and thereby determine the anomaly severity of the metric.

Our solution

I will discuss our anomaly detection solution at Wish, in particular how we built it to achieve our goals of scalability, reliability, and usability.

Scalability

Forecast at scale

To guarantee reliable forecasts at scale, we applied Facebook Prophet as our main forecasting model. Prophet is a generalized additive model in which a time series is decomposed into trend, seasonality, holidays, and error. More specifically, we perform one-step-ahead forecasting when carrying out anomaly detection: Prophet fits the overall trend, seasonal pattern, and holiday effects, which lets the model predict the expected range of the metric for today without looking at today's value.
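A minimal sketch of this one-step-ahead setup, assuming the open-source prophet package, a daily series stored in a pandas DataFrame with Prophet's standard ds/y columns, and an illustrative 95% interval rather than Wish's exact configuration:

```python
import pandas as pd
from prophet import Prophet

def one_step_ahead_forecast(history: pd.DataFrame) -> pd.Series:
    """Fit Prophet on all days up to yesterday and predict today's expected range.

    `history` must have Prophet's standard columns: `ds` (date) and `y` (metric value).
    """
    model = Prophet(
        yearly_seasonality=True,
        weekly_seasonality=True,
        interval_width=0.95,  # width of the expected range; illustrative default
    )
    model.fit(history)

    # Build a single-row future frame for "today" (one step past the training data).
    future = model.make_future_dataframe(periods=1, freq="D").tail(1)
    forecast = model.predict(future)

    # yhat is the point prediction; [yhat_lower, yhat_upper] is the expected range.
    return forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].iloc[0]
```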

At Wish, our goal is to check whether today's metric value is an anomaly, both at the company level and at the drilled-down dimension level. We compute the anomaly score of today's metric value from how far it falls outside the model's prediction interval, based on the following quantities:

y : actual metric value of today

yhat : model predicted value of today

yhat_upper : the upper bound of the expected range

yhat_lower : the lower bound of the expected range

We used the prediction interval provided by the Prophet model to check whether the metric adheres to its historical pattern; if not, we define it as an anomaly. The anomaly score is determined by how far the observed metric value deviates from the expected range: the further the data point falls outside the range, the more severe the anomaly.
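For illustration, a score with this behavior could be computed as the distance outside the interval normalized by the interval width. This is a hedged sketch, not necessarily the exact formula used in production:

```python
def anomaly_score(y: float, yhat_lower: float, yhat_upper: float) -> float:
    """Return 0 when today's value falls inside the expected range, otherwise
    the distance outside the range normalized by the range width.
    Illustrative formula only; the production score may differ.
    """
    width = max(yhat_upper - yhat_lower, 1e-9)  # guard against a degenerate interval
    if y > yhat_upper:
        return (y - yhat_upper) / width
    if y < yhat_lower:
        return (yhat_lower - y) / width
    return 0.0
```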

Scalability enhancement

Analyzing tens of thousands of time series consumes significant computational resources, so the application must scale to guarantee seamless anomaly alerts. First, we scaled up by leveraging the Python multiprocessing library, which lets the program fully utilize the resources of a given server: we fit the time-series models, make inferences, and flag anomalies in parallel. This layer is easy to scale up further by hosting the cron jobs on a server with more CPU cores. In addition, we leveraged internal Kubernetes (K8s) infrastructure tools to scale out the service, flexibly adding more identical workers in parallel to spread out the load.
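A simplified sketch of that per-series parallelism using the standard library. Here load_history and latest_value are hypothetical data-access helpers, and the forecasting and scoring functions refer to the sketches above; the real job also persists results and triggers alerts:

```python
from multiprocessing import Pool, cpu_count

def check_series(series_id: str) -> dict:
    """Fit the forecast model for one time series and score today's value."""
    history = load_history(series_id)                      # hypothetical data-loading helper
    forecast = one_step_ahead_forecast(history.iloc[:-1])  # train on data up to yesterday
    score = anomaly_score(
        y=latest_value(history),                           # hypothetical: today's observed value
        yhat_lower=forecast["yhat_lower"],
        yhat_upper=forecast["yhat_upper"],
    )
    return {"series_id": series_id, "score": score}

if __name__ == "__main__":
    series_ids = [...]  # tens of thousands of (metric, dimension) time series
    # One worker per CPU core; each worker fits models for its share of the series.
    with Pool(processes=cpu_count()) as pool:
        results = pool.map(check_series, series_ids)
```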

Reliability

Cross-validation

We introduced cross-validation to our model for better forecast accuracy. The Facebook Prophet model has quite a few tunable parameters, so there is a large parameter space to search in order to tune the model for better predictions. We automate the hyperparameter tuning process using Optuna, a black-box hyperparameter optimization framework. It effectively searches large spaces, prunes unpromising trials for faster results, and automatically finds good hyperparameter values for our Prophet model in a minimal amount of time, which substantially helps us detect anomalies with more confidence.
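A condensed sketch of such a tuning loop, reusing the history DataFrame from the earlier sketch. The search space and cross-validation windows below are illustrative, not the values used in production:

```python
import optuna
from prophet import Prophet
from prophet.diagnostics import cross_validation, performance_metrics

# `history` is the ds/y DataFrame from the earlier forecasting sketch.

def objective(trial: optuna.Trial) -> float:
    """Fit Prophet with trial-suggested hyperparameters and return the cross-validation MSE."""
    params = {
        "changepoint_prior_scale": trial.suggest_float("changepoint_prior_scale", 0.001, 0.5, log=True),
        "seasonality_prior_scale": trial.suggest_float("seasonality_prior_scale", 0.01, 10.0, log=True),
        "seasonality_mode": trial.suggest_categorical("seasonality_mode", ["additive", "multiplicative"]),
    }
    model = Prophet(**params).fit(history)
    # Rolling-origin cross-validation; window sizes are illustrative.
    df_cv = cross_validation(model, initial="180 days", period="30 days", horizon="30 days")
    return performance_metrics(df_cv)["mse"].mean()

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
best_params = study.best_params  # used to configure the production forecast model
```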

Model misspecification

Another observation is that our model tends to misbehave after an extreme anomaly happens. In other words, the expected interval produced by Prophet tends to be inaccurate after an extreme event occurs. To make the predictions more reliable, we borrowed an idea from an Uber Engineering blog post that addresses model misspecification after extreme events and suggests adjusting the prediction interval by a certain degree. In our case, the cross-validation MSE serves as an indicator of the quality of the fitted model, so the prediction interval can be adaptively adjusted depending on the model's accuracy: the smaller the cross-validation error, the more we trust the model's prediction intervals, and vice versa.
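The exact adjustment rule is not reproduced here; as a rough sketch of the idea, the interval half-widths could be inflated by a factor that grows with the cross-validation error relative to some baseline:

```python
def adjusted_interval(yhat: float, yhat_lower: float, yhat_upper: float,
                      cv_mse: float, baseline_mse: float) -> tuple[float, float]:
    """Widen the prediction interval when the model fits poorly.

    Illustrative rule only: the half-widths are inflated by the ratio of the model's
    cross-validation MSE to a baseline MSE (e.g. a historical value), so a poorly
    fitting model must see a larger deviation before an anomaly is flagged.
    """
    inflation = max(1.0, cv_mse / max(baseline_mse, 1e-9))
    half_width_low = yhat - yhat_lower
    half_width_high = yhat_upper - yhat
    return (yhat - half_width_low * inflation, yhat + half_width_high * inflation)
```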

Usability

Onboarding Metric

It’s essential to provide users with a smooth process for onboarding metrics. Two data sources are required before anomaly detection can run: a base table containing the metric's time-series data, and the metric's metadata. Our anomaly detection service abstracts away all the technical layers and automatically consumes the base table based on the anomaly requests behind the scenes. Combining these two procedures into one is key to improving the usability of the system, as users can manage source tables and anomaly requests with minimal effort.

At Wish, we have an internal platform called MRF (Metric Report Framework), where users can build ETL pipelines that produce a base table through a dashboard. To take advantage of this platform, we worked with the MRF team to enable creating anomaly detection requests directly on the MRF dashboard. In this way, users can onboard metrics for anomaly checks with a few button clicks.

Anomaly visualization

Aggregating information and delivering results in an intuitive way are crucial for stakeholders to spot problems in time. Our system sends an email alert to metric owners once an anomaly is detected for certain metrics on certain dimensions; the email lists all the affected metric segments, ordered by anomaly score. To further enhance usability and accelerate the anomaly investigation process, we developed a dashboard using Plotly and attached its link in the email, so that clicking the link redirects users to the dashboard. There, users can gain rich information by interacting with the plots, focus on segments of interest using filters, find correlations between time series and the anomaly, track the health of other metrics, review historical anomaly reports, and eventually discover the root cause in time.
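As a rough sketch of the kind of interactive plot such a dashboard serves (a hypothetical example rather than the production dashboard code), Plotly can overlay the observed metric, the expected range, and the flagged anomalies:

```python
import plotly.graph_objects as go

def anomaly_figure(forecast_df, actuals_df) -> go.Figure:
    """Build an interactive plot of actuals vs. the expected range.

    Hypothetical schema: `forecast_df` has columns ds, yhat_lower, yhat_upper;
    `actuals_df` has columns ds, y, is_anomaly.
    """
    fig = go.Figure()
    # Shaded expected range: draw the upper bound, then fill the lower bound up to it.
    fig.add_trace(go.Scatter(x=forecast_df["ds"], y=forecast_df["yhat_upper"],
                             line=dict(width=0), showlegend=False))
    fig.add_trace(go.Scatter(x=forecast_df["ds"], y=forecast_df["yhat_lower"],
                             fill="tonexty", line=dict(width=0), name="expected range"))
    # Observed metric values.
    fig.add_trace(go.Scatter(x=actuals_df["ds"], y=actuals_df["y"],
                             mode="lines", name="actual"))
    # Highlight the points flagged as anomalies.
    anomalies = actuals_df[actuals_df["is_anomaly"]]
    fig.add_trace(go.Scatter(x=anomalies["ds"], y=anomalies["y"],
                             mode="markers", marker=dict(size=10), name="anomaly"))
    return fig
```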

Future study

This project's primary goal was to solve the current pain points of our anomaly detection service: usability, scalability, and reliability. The current version glues together existing technologies and approaches. In the future, we will continue to expand the system by introducing more models into the framework to capture different types of anomalies and make the system more intelligent. We also plan to add more features to the anomaly detection dashboard, in particular grouping the data so that users can obtain more insightful takeaways when exploring it.

I am grateful for being assigned to such an impactful project last summer. I would like to take a moment to thank my mentor Max and my manager Pavel for guiding and mentoring me throughout the summer and for leaving me with a memorable and rewarding co-op experience at Wish.

Acknowledgment

Special thanks to Max Li, Pavel Kochetkov, Chandler Phelps, and Eric Zhang for their contributions to this project. I am also grateful for the support from Pai Liu, Simla Ceyhan, Yue Song, and Lance Deng during my entire co-op experience.

If you are interested in solving challenging problems in this space, join us! Click here to view open roles.
