Automatic and Self-aware Anomaly Detection at Zillow Using Luminaire

Sayan Chakraborty
Published in Zillow Tech Hub
Jan 7, 2021

Luminaire Homepage: Luminaire Github

At Zillow Group, we have built many data-driven products to empower our customers on their journey to unlock life's next chapter. These systems include Zestimates to estimate the value of a home given current market conditions, recommender systems that provide a personalized home shopping experience, and computer vision models that allow customers to virtually explore a potential property with Zillow 3D Home®. These complex products require processing massive amounts of data through many system pipelines, creating several vulnerability points that pose risks to product performance and customer experience.

In order to ensure system stability, it is critical to incorporate relevant monitoring at each potential vulnerability point. Anomaly detection systems are often used to monitor operational and business impacting metrics that are formatted as time series. We found that many available tools in this space were either not robust enough to be effective on different data profiles, or required very strong modeling and domain expertise in order to be accurate.

At Zillow’s size, hand-tuning an anomaly detection model for every metric that required monitoring would not scale. We were also interested in investing in a central internal platform that could serve a variety of data quality related needs, instead of duplicating this work across independent vertical systems in different teams.

Figure 1 — Examples of time series anomalies

At Zillow, we developed a centralized anomaly detection platform, called Luminaire, to democratize the data quality service across our organization. Luminaire can perform online detection of time series anomalies for a wide range of data profiles, with minimal onboarding and maintenance requirements.

Luminaire leverages state-of-the-art statistical and machine learning techniques at different stages of training the ML system for anomaly detection to produce reliable classifications of anomalous and non-anomalous data. It tracks temporal correlations or periodicities and the amount of signal present in the time series, which results in better predictions and uncertainty estimates. Specifically, the system leverages several classes of time series models to catch diverse anomalous instances, and a built-in configuration optimization feature allows it to run with almost no specification required from the end user. Moreover, a method on top of the automation layer makes the whole system self-aware by automatically adapting to changing patterns in the data and triggering corrective actions for any model underperformance.

Anomaly Detection Process

The automated and self-aware anomaly detection platform consists of two processing pipelines, the training pipeline and the scoring pipeline. During the training process, the time series data undergoes configuration optimization, data preparation and cleaning, and the actual modeling. During the scoring process, the incoming data points or data over a time-window are scored for anomaly classification using the trained models.

Training Components

Data Preparation and Cleaning

The two key components for an ML system are the data and the model. Typically, the system needs to prepare the unstructured data to make it ingestible by the training process. The data preparation step consists of several phases, including:

  1. Data aggregation: Raw data generally needs to be aggregated in order to make it efficient to monitor as a time series (for example, raw user traffic can be bucketed hourly to create an hourly traffic metric).
Figure 2 — Data Aggregation and slicing
  2. Imputation: Data points that are null in the training datasets are replaced using a mathematical imputation technique. Past detected anomalies are similarly imputed in training data in order to prevent the model from training on those instances.
  3. Change Point Detection: Change points are important transition points for a time series where its statistical properties change significantly. Such points are identified in order to determine whether the data before a change point needs to be adjusted or truncated due to a shift in the distribution.
  4. Stationarization and Transformation: Any stochastic process with fixed statistical properties over time is called a stationary process. Typical time series data shows non-stationarity at different levels, which creates challenges in the modeling process. Non-stationarity in the data is automatically identified and removed in order to get an unbiased fit (see the sketch after Figure 3 below). Some additional transformations are also required if the data tend to show higher order polynomial or exponential relationships within a sub-domain of the input space (for example, higher order percentile data tends to be right skewed).
Figure 3 — Example of a non-stationary time series — air passenger numbers between 1949 and 1961
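As a minimal sketch of the stationarization step above (illustrative only, assuming the statsmodels package is available; Luminaire identifies and applies such adjustments automatically):

    import numpy as np
    from statsmodels.tsa.stattools import adfuller

    def make_stationary(series: np.ndarray) -> np.ndarray:
        # Augmented Dickey-Fuller test: a large p-value suggests the series
        # is non-stationary, in which case first-order differencing can help
        p_value = adfuller(series)[1]
        return np.diff(series) if p_value > 0.05 else series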

Input Data Profiling

Data profile information about the input time series provides valuable context about a time series that cannot always be caught during an online scoring process. Luminaire detects both trend shifts and sustained shifts in the metric values of input time series. At Zillow, we compute and store this data profile information during the data preparation for consideration in the modeling process and to support potential investigations of data issues. Below, we walk through how Luminaire’s DataExploration module can be used to retrieve important information and prepare the input time series for training.

We can start by generating a simple time series to test. The sketch below uses synthetic daily data with an artificial level shift, so the profiling step has a change point to find (the values are illustrative):
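    import numpy as np
    import pandas as pd

    np.random.seed(42)
    index = pd.date_range(start='2020-01-01', periods=180, freq='D')
    values = np.concatenate([
        np.random.normal(loc=100, scale=5, size=90),  # pre-shift regime
        np.random.normal(loc=130, scale=5, size=90),  # post-shift regime
    ])
    # A single 'raw' column indexed by timestamp, as in Luminaire's documentation examples
    data = pd.DataFrame({'raw': values}, index=index)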

Let us plot the time series (a simple matplotlib sketch):
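    import matplotlib.pyplot as plt

    # Visualize the series and the level shift around the 90-day mark
    data.plot(figsize=(12, 4), legend=False, title='Synthetic series with a level shift')
    plt.show()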

Figure 4 — Example of time series change points

Now we can use the Luminaire DataExploration module to perform the necessary actions, such as log-transforming the input data or truncating the data from the last observed change point. Below is a sketch based on the module's documented interface (parameter values are illustrative):
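    from luminaire.exploration.data_exploration import DataExploration

    # Profile the series: impute missing values, detect change points, and
    # apply any needed transformations before training
    de_obj = DataExploration(freq='D', data_shift_truncate=True,
                             is_log_transformed=False, fill_rate=0.9)
    processed_data, pre_prc = de_obj.profile(data)
    print(pre_prc)  # preprocessing summary, including any detected change points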

Modeling

A key difference between time-dependent data and cross-sectional data is that the former tends to show higher levels of non-stationarity, since one of the features of the model (direct or indirect) is time itself. Hence, training needs to be triggered on a periodic schedule in order to adapt to the continuous structural changes in the data.

There are several ways to featurize a time series model, where any internal (temporal) or external (event-based) features can be integrated into the model directly or indirectly. A direct way to integrate such features is to explicitly add hours, days, holidays, or any other external events as features (as seen in sequential deep learning models). An indirect way to add temporal information is to incorporate structural information such as autocorrelation lags, moving-average lags, etc. (as seen in traditional statistical models like ARIMA).

The goal here is to build a generalized anomaly detection system that works for a wide range of time series profiles with a moderate amount of training data. It is important to develop models that retain at least moderately high degrees of freedom for the estimated model coefficients in order to attain favorable estimation performance. Therefore, rather than featurizing every possible piece of information, we define the model over the structural information when present (such as models with AR lags or MA lags, or filter-based models). In order to capture any periodic components or seasonalities in the data, we extract the most significant frequency terms from the Fourier transformation of the time series to be considered as model components. The model also allows some level of externalities to be considered during modeling, such as holidays or other significant events.
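As a rough illustration of the idea (not Luminaire's internal implementation), the dominant frequencies can be pulled from a discrete Fourier transform:

    import numpy as np

    def top_fourier_frequencies(series: np.ndarray, k: int = 2) -> np.ndarray:
        # Rank frequencies by spectral amplitude and keep the k strongest;
        # these correspond to the dominant periodic components of the series
        # (e.g., a weekly cycle in daily data shows up near 1/7 ≈ 0.14)
        detrended = series - series.mean()
        amplitudes = np.abs(np.fft.rfft(detrended))
        freqs = np.fft.rfftfreq(len(detrended))
        return freqs[np.argsort(amplitudes)[-k:]]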

The Luminaire structural modeling module is capable of catching various intrinsic and extrinsic signals present for a given time series. This module provides the full flexibility to configure the model with AR or MA lags, specifying a multiplicative model by doing log transformation of the inputs, adding intrinsic Fourier signals or adding holidays as extrinsic features.

The following example shows how the structural modeling module can be used to effectively model any time series having moderate structural signals. We start by creating a dummy time series with periodic signals (a synthetic sketch; values are illustrative):
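    import numpy as np
    import pandas as pd

    np.random.seed(0)
    index = pd.date_range(start='2020-01-01', periods=365, freq='D')
    seasonal = 20 * np.sin(2 * np.pi * np.arange(365) / 7)  # weekly cycle
    trend = 0.05 * np.arange(365)                           # mild upward trend
    noise = np.random.normal(scale=5, size=365)
    data = pd.DataFrame({'raw': 100 + trend + seasonal + noise}, index=index)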

Figure 5 — Simulated seasonal time series

We initiate the modeling process by calling the DataExploration class to preprocess the data, and then run the structural model with all the required configurations set manually. Below is a sketch based on the documented LADStructuralModel interface (hyperparameter values are illustrative):
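    from luminaire.exploration.data_exploration import DataExploration
    from luminaire.model.lad_structural import LADStructuralModel

    # Preprocess the raw series first
    de_obj = DataExploration(freq='D', is_log_transformed=False, data_shift_truncate=False)
    training_data, pre_prc = de_obj.profile(data)

    # Manually specified hyperparameters (illustrative values)
    hyper_params = {
        "include_holidays_exog": 0,  # no holiday features for this dummy series
        "is_log_transformed": 0,     # additive model; set 1 for a multiplicative fit
        "max_ft_freq": 2,            # number of significant Fourier terms to consider
        "p": 5,                      # autoregressive lags
        "q": 1,                      # moving-average lags
    }
    lad_struct_obj = LADStructuralModel(hyper_params=hyper_params, freq='D')
    success, model_timestamp, model = lad_struct_obj.train(data=training_data, **pre_prc)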

The scoring output contains several useful pieces of information related to the anomalous status of the data point, the forecast, confidence intervals, and so on.
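For example, scoring a single new observation against the trained model (the observed value and timestamp are illustrative):

    import pandas as pd

    # Score a new observed value for a timestamp after the training window
    scoring_output = model.score(135.0, pd.Timestamp('2021-01-01'))
    print(scoring_output)
    # A dictionary with fields such as 'IsAnomaly', 'AnomalyProbability',
    # 'Prediction', 'CILower', 'CIUpper', and 'ModelFreshness'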

Another way of dealing with time series anomaly detection models is to build a model that quantifies the level of uncertainty rather than focusing on prediction capabilities. These kinds of models are well suited for anomaly detection on time series that contain little to no predictable signal. Kalman filters are one example of such models.

The Luminaire filtering model focuses on the uncertainty level for a time series to quantify the anomaly information for a data point. This is particularly useful when the time series carries little to no structural signal. The following example illustrates how to manually configure the Luminaire filter-based modeling module for anomaly detection (we will use the same data shown in Figure 5 along with the same preprocessing information).
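A sketch based on the documented LADFilteringModel interface (configuration values are illustrative):

    from luminaire.model.lad_filtering import LADFilteringModel

    # Kalman-filter based model over the same preprocessed data
    lad_filter_obj = LADFilteringModel(hyper_params={"is_log_transformed": 0}, freq='D')
    success, model_timestamp, filter_model = lad_filter_obj.train(data=training_data, **pre_prc)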

Scoring outputs for the filter-based model include information about the anomalous status of the data (with a pre-specified level of significance), the anomaly probability, prediction information, the model freshness score (indicating how recently the model was trained relative to the data point being scored), and so on. Moreover, the underlying stepwise update logic for this type of model requires making minor updates to the model object at every scoring step in order to be able to score the next time instance.
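A sketch of that stepwise update (illustrative values):

    import pandas as pd

    # Filter-based scoring returns the scores plus an updated model object;
    # persist the updated object so the next time instance can be scored
    scores, model_update = filter_model.score(135.0, pd.Timestamp('2021-01-01'))
    print(scores['IsAnomaly'], scores['AnomalyProbability'])
    filter_model = model_update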

We store all trained models in a Postgres data store along with model metadata, such as a unique model identifier for retrieving the model for a specific time series and date, the model creation date, and model expiry information. Model expiry is very important for time series data because this type of data is often non-stationary, making predictions too far into the future invalid. Models should be retrained with new data on a periodic schedule.

Configuration Tuning

Achieving optimal anomaly detection results for a given dataset depends on running each model with the optimal configuration. At an organization level, setting the configuration manually is not scalable because of the diversity in data cleanliness and structure, the number of metrics to be monitored, and the domain expertise required for proper manual tuning. To address this problem in Luminaire, we created an optimization system on top of the training process. This tuning process builds the optimal anomaly detection model for a given time series by optimizing configurations such as selecting between a structural model and a filter-based model given the data pattern, truncating the time series if a change point is observed, deciding whether to apply a log transform to the data, and deciding how many significant Fourier terms should be considered to get a proper fit of periodic patterns without overfitting.

The Luminaire Hyperparameter Optimization module can be used to find the best configuration for a given time series. Here we revisit the time series considered in the structural modeling example (a sketch based on the module's documented interface):
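    from luminaire.optimization.hyperparameter_optimization import HyperparameterOptimization

    # Search for the best model class and configuration for this series
    # (can be slow, but runs far less often than training)
    hopt_obj = HyperparameterOptimization(freq='D')
    opt_config = hopt_obj.run(data=data)
    print(opt_config)
    # Includes the selected model class along with optimized settings such as
    # 'is_log_transformed', 'max_ft_freq', 'p', and 'q'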

This optimal configuration for the given time series can be used in both the data preparation and modeling steps instead of manually looking for the best configuration.

Since the underlying properties of a time series do not change often, this configuration optimization step can be called much less frequently compared to the training process. At Zillow, we trigger the configuration optimization process only for new time series or when a drastic change in the data (such as a change point or significant change in the correlational or seasonal pattern) is observed.

Serving and Alerting Process

New data points for a given time series can be scored by retrieving the model from the model store using the model identifier. This process computes the probability of a new data point being an anomaly based on the most recently trained model and stores the output to a result database.

Another critical component of an anomaly detection system is to alert the relevant stakeholders of anomalies that have been identified. This can be done by creating an alerting service which pulls the recent scoring data to flag the data point as anomalous vs stable (using a pre-specified threshold set by the stakeholders) and sends alerts for data points that exceed this anomaly threshold.
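A minimal sketch of such a threshold check (the helper below is hypothetical, not part of Luminaire; the field name mirrors the scoring output shown earlier):

    def should_alert(scoring_output: dict, probability_threshold: float = 0.99) -> bool:
        # Flag a scored data point when its anomaly probability exceeds the
        # stakeholder-defined threshold
        return scoring_output.get('AnomalyProbability', 0.0) > probability_threshold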

Figure 6 — Architecture for the proposed anomaly detection system

Performance Logging and Self Awareness

In order to keep our anomaly detection models healthy with minimal manual intervention, we implemented a process of continuous performance evaluation and logging, with a feedback loop that triggers the model optimization process when performance is poor. The key challenge in evaluating the performance of an unsupervised anomaly detection system is the absence of direct performance evaluation metrics due to the lack of labeled data.

A good alternative for assessing the classification accuracy of an anomaly detection system is to perform an indirect evaluation using the concepts of Mass Volume and Excess Mass. The main assumption behind these concepts is the alignment of the anomaly score distribution and the probability distribution of the underlying data. Ideally, low anomaly scores (i.e. stable data) should be observed in the high probability region of the data domain, whereas high anomaly scores (anomalous data) should come from the low probability regions. Therefore, under optimal performance of an anomaly detector, we will find a Lebesgue measure that can be treated as a true representation of the data distribution. Any deviation from the true distribution can be tracked under continuous monitoring of the Mass Volume (MV) and the Excess Mass (EM) of the anomaly detector, given for a mass level \alpha \in (0, 1) and a tolerance t > 0 as

MV_s(\alpha) = \inf_{u \geq 0} \mathrm{Leb}(s \geq u) \quad \text{subject to} \quad \mathbb{P}(s(X) \geq u) \geq \alpha

EM_s(t) = \sup_{u \geq 0} \left\{ \mathbb{P}(s(X) \geq u) - t \cdot \mathrm{Leb}(s \geq u) \right\}

where s(\cdot) is a scoring function integrable with respect to the Lebesgue measure \mathrm{Leb}(\cdot).

Figure 7 — Distributions of anomaly scores under ideal and unstable scenarios

A performance deterioration of an anomaly detector can occur for different reasons, such as a structural change in the data or training under suboptimal configurations. It is important to track performance across multiple aspects of the detection system. Moreover, any underperformance should be caught at an early stage in order to avoid over-alerting (which can cause alert fatigue) or under-alerting (where real problems go undetected). These are the key indicators that we use at Zillow to track the health of each anomaly detection model:

  1. Mass Volume
  2. Excess Mass
  3. Training failure
  4. Model freshness (whether scoring process pulls an expiring model)
  5. Rate of being anomalous

In order to create a self-aware loop, all of these key indicators are pulled during the scheduled training process and compared with a pre-specified expectation of the anomaly detector (for example, the proportion of anomalies for a metric over a rolling window of a certain length should be less than 5%). An optional configuration re-tuning can be triggered for any metric violating such expectations.
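A minimal sketch of such an expectation check (the helper is hypothetical, not part of Luminaire):

    def needs_retuning(recent_anomaly_flags: list, max_anomaly_rate: float = 0.05) -> bool:
        # Trigger configuration re-tuning when the anomaly rate over a rolling
        # window exceeds the pre-specified expectation (e.g., 5%)
        if not recent_anomaly_flags:
            return False
        return sum(recent_anomaly_flags) / len(recent_anomaly_flags) > max_anomaly_rate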

Conclusion

At Zillow, building a self-aware anomaly detection system has allowed us to democratize the process of data quality monitoring within our organization. Luminaire is used by a broad set of teams with very simple onboarding requirements and without any ML expertise required, reducing the friction of maintaining an ML-based monitoring system, reducing the time and effort required to onboard new metrics, and allowing us to host a central UI for users to interact with configurations, charts, and alerts.

To learn more about Luminaire, you can also check out our recent scientific paper that was published as part of the 2020 IEEE Big Data conference, watch the conference talk, and listen to our Python Podcast interview.

Get Started with Luminaire

The modeling library of the Luminaire anomaly detection platform is now available as open source. We would love to see contributions from the community in order to expand Luminaire's reach to a wider range of anomaly detection problems.
