Adopting Statistical Process Control for Monitoring and Alerting System on Business Metrics

An alternative anomaly detection technique for time-series data sets

Putri Wikie
Bukalapak Data
6 min read · Apr 14, 2020

Image Source: Unsplash

To quantify a company's performance, business metrics are defined and monitored on a daily basis. The monitoring aims to assess whether the company is performing well and to detect abnormal phenomena that could negatively impact the business. In a growing start-up, the number of business metrics can reach the thousands. Monitoring those metrics manually requires an enormous effort when no automatic diagnostic tool is available in the company.

This article presents an alternative solution, developed at Bukalapak, for automatically monitoring thousands of business metrics and alerting stakeholders when an unusual pattern is detected.

Figure 1. General workflow of business-metric monitoring and reporting at Bukalapak when an anomaly is detected

Introducing an internal automatic monitoring and alerting system

To answer this challenge, we set out to build an in-house tool that allows us and our stakeholders to monitor their metrics of interest directly and to quickly detect unusual patterns in the data. The tool automates and standardizes the diagnostic work across the company. It adopts the Statistical Process Control (SPC) framework and takes variations and seasonal effects into account.

SPC is commonly used in the manufacturing industry to determine whether a process is under control, i.e. whether the process is consistent and its estimated variations come from commonly known sources within the process. We focused on one of the basic quality-control tools in SPC, the Shewhart chart, better known as the control chart, mainly to monitor the variations in the measured data points (in our case, a data point is the value of a business metric at time t).

Figure 2. High-level flow diagram of the monitoring and alerting system

Let’s get technical!

Caution. This section covers the technical part of the system. If you are a business stakeholder who is less interested in the statistical technique, please feel free to jump to the next section (Section: An example).

Step 1
Suppose we have a series of data points with a seasonal period S (unit: “day”). At time point t, we want to determine whether the data point Y_t deviates significantly from the common seasonal pattern.

Step 2
At the data preprocessing step (first box in Fig. 2), we first take the ratio of the value to be monitored to the corresponding value S days earlier. The ratio R of the monitored value Y at a given time t (unit: “day”) is defined as

R_t = Y_t / Y_{t-S}    (1)

Next, R_t is transformed by a sigmoid function so that the values fall between 0 and 1; the transformed ratio is denoted R’_t. The monitoring then follows a proportional control chart.
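
Step 2 can be sketched in a few lines of Python. This is a minimal illustration, not the production code: it assumes that "sigmoid" means the standard logistic function 1 / (1 + e^(-x)), and the function names and sample numbers are hypothetical.

```python
import math

def seasonal_ratio(y, s):
    """Eq. (1): R_t = Y_t / Y_{t-S} for every t that has a reference S steps back."""
    if s <= 0 or len(y) <= s:
        raise ValueError("need S > 0 and more than S observations")
    # A zero reference value would raise ZeroDivisionError; handle it upstream.
    return [y[t] / y[t - s] for t in range(s, len(y))]

def sigmoid(x):
    """Logistic sigmoid (assumed form of the transform), mapping x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical daily transaction counts with a 3-day cycle, for illustration only.
y = [100, 150, 120, 110, 150, 60]
ratios = seasonal_ratio(y, s=3)          # 110/100, 150/150, 60/120
r_prime = [sigmoid(r) for r in ratios]   # positive ratios land between 0.5 and 1
```

Under this assumption a metric that simply repeats last season's value has a ratio of 1 and a transformed value of about 0.73.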

Step 3
The upper (UCL) and lower (LCL) control limits are estimated by

where N is the sample size (i.e. the number of data points) and R_bar is as follows,

It is worth noting that at least twenty-five data points are needed for the control limits to have enough power [1].
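
Step 3 can be sketched as follows. The sketch assumes the standard three-sigma limits of a proportion chart as described by Montgomery [1], i.e. R_bar ± 3 * sqrt(R_bar * (1 - R_bar) / N), with R_bar taken as the mean of the transformed ratios. The multiplier k = 3, the clipping of the limits to [0, 1], the function name and the sample numbers are assumptions made for illustration.

```python
import math

def control_limits(r_prime, k=3.0, min_points=25):
    """Return (lcl, center, ucl) for sigmoid-transformed ratios R'_t.

    Assumed form: center +/- k * sqrt(center * (1 - center) / N).
    """
    n = len(r_prime)
    if n < min_points:
        raise ValueError(f"need at least {min_points} points, got {n}")
    center = sum(r_prime) / n                          # R_bar: mean of R'_t
    half_width = k * math.sqrt(center * (1.0 - center) / n)
    # Clipping is an extra assumption: R'_t itself cannot leave [0, 1].
    return max(0.0, center - half_width), center, min(1.0, center + half_width)

# Hypothetical transformed ratios, for illustration only.
window = [0.70, 0.74, 0.72, 0.75, 0.73] * 5            # 25 points
lcl, center, ucl = control_limits(window)
```

Because the width shrinks with sqrt(N), a longer reference window yields tighter limits, which is one reason a minimum number of points matters.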

Step 4
The system creates an alert and notifies users when R’_t is greater than the UCL or less than the LCL. Hereafter, such a data point is called an anomaly. The treatment of a detected anomaly depends on whether its cause is known. An anomaly with an unknown cause indicates an intrusion in the data set, and a further investigation is worth initiating. Montgomery [1] suggested excluding known anomalies from the calculation of the LCL and UCL, as they are not part of the usual process that we would like to detect. We, however, replace the outlier values with the center point of the data set (e.g. median or mean), because we need this data point to determine the control limits for the following months. R_t is then adjusted by,

The step-by-step calculation above is summarized in Fig. 3.
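
The alerting rule and the replacement in Step 4 can be sketched as below. The sketch assumes that the transform in Step 2 is the logistic sigmoid, so that the adjusted R_t is obtained with its inverse (the logit); this is an assumption about Eq. (5), and all names and numbers are hypothetical.

```python
import math
import statistics

def is_anomaly(r_prime_t, lcl, ucl):
    """Step 4 rule: alert when R'_t is above the UCL or below the LCL."""
    return r_prime_t > ucl or r_prime_t < lcl

def logit(p):
    """Inverse of the logistic sigmoid; assumed form of the adjustment in Eq. (5)."""
    return math.log(p / (1.0 - p))

def replace_anomaly(previous_r_prime):
    """Return replacement (R'_t, R_t) built from the median of earlier R' values."""
    r_prime_new = statistics.median(previous_r_prime)
    return r_prime_new, logit(r_prime_new)

# Hypothetical numbers, for illustration only.
previous = [0.71, 0.73, 0.74, 0.72, 0.75]
lcl, ucl = 0.46, 0.99
observed = 0.995

flagged = is_anomaly(observed, lcl, ucl)             # True: 0.995 is above the UCL
if flagged:
    r_prime_adj, r_adj = replace_anomaly(previous)   # r_prime_adj is the median, 0.73
```

The median is used here because the text names it as one possible center point; the mean would work the same way but is more sensitive to earlier outliers.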

Figure 3. Flow diagram for calculating the lower and upper bounds of a control chart for a given data set within a certain time window. Please note that this process is repeated as the time window moves, so the thresholds keep being updated over time.
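
The moving-window behaviour can be illustrated with a short loop. This is only a sketch under stated assumptions: the window length of 25, the three-sigma width, the choice to treat the first window as in-control, and the function name are all hypothetical and not taken from the production system.

```python
import math
import statistics

def monitor(r_prime, window=25, k=3.0):
    """Walk through R'_t, recomputing limits from the previous `window` points.

    Returns (index, value, lcl, ucl) for every flagged point. A flagged point
    is replaced by the window median before later limits are computed.
    """
    cleaned = list(r_prime[:window])   # first window taken as in-control (assumption)
    alerts = []
    for t in range(window, len(r_prime)):
        recent = cleaned[-window:]
        center = sum(recent) / window
        half = k * math.sqrt(center * (1.0 - center) / window)
        lcl, ucl = center - half, center + half
        value = r_prime[t]
        if value > ucl or value < lcl:
            alerts.append((t, value, lcl, ucl))
            value = statistics.median(recent)   # keeps the limits from drifting
        cleaned.append(value)
    return alerts

# Hypothetical series: stable around 0.73 with one spike at index 30.
series = [0.72, 0.74, 0.73] * 12
series[30] = 0.999
alerts = monitor(series)
```

In this made-up series only the spike at index 30 is reported, and the points after it are judged against limits that no longer contain the spike.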

An example

Suppose a synthetic data set of 47 points with a weekly seasonal cycle (S=7) is available for monitoring and alerting (Fig. 4a). The scenario is as follows. Let’s assume that the values reflect the daily number of transactions. A proportional control chart is built using the first 46 data points, and the task is to determine whether the 47th data point is “normal” or an “anomaly”.

To that end, the values are transformed to the ratio scale (R’_t), and the UCL and LCL are then estimated, resulting in the proportional control chart in Fig. 4b. The last data point (t=47) crosses the UCL, which indicates that an anomaly is present. In this case, the system creates an alert and sends it to users.
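
A scenario of the same shape can be reproduced with made-up numbers. The data below are not the data behind Fig. 4; they are a hypothetical weekly profile with a deliberately inflated 47th point, and the logistic sigmoid and three-sigma width are assumptions.

```python
import math

def sigmoid(x):
    """Logistic sigmoid (assumed form of the transform)."""
    return 1.0 / (1.0 + math.exp(-x))

S = 7
weekly_shape = [100, 120, 130, 125, 140, 180, 160]              # made-up weekday profile
y = [weekly_shape[t % S] + 2 * (t // S) for t in range(46)]     # 46 "normal" days
y.append(600)                                                    # made-up day 47: four times the value a week earlier

r_prime = [sigmoid(y[t] / y[t - S]) for t in range(S, len(y))]   # 40 transformed ratios
reference, latest = r_prime[:-1], r_prime[-1]                    # 39 reference points, 1 to test

n = len(reference)
center = sum(reference) / n
half_width = 3.0 * math.sqrt(center * (1.0 - center) / n)        # assumed three-sigma width
lcl, ucl = center - half_width, center + half_width

status = "anomaly" if (latest > ucl or latest < lcl) else "normal"
print(f"R'_47 = {latest:.3f}, LCL = {lcl:.3f}, UCL = {ucl:.3f} -> {status}")
```

With these numbers the 47th point lands above the upper limit and is labelled an anomaly, while the first seven days only serve as references for the ratio.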

Figure 4. (a) Time-series plot of the synthetic data set with a weekly seasonal pattern. The x-axis is the time point t and the y-axis is the number of transactions at that time point; (b) Ratio plot of the synthetic data set (solid line) with its UCL and LCL (dashed lines). Note: the first seven data points are not plotted because they serve as references for calculating the ratio in Eq. (1).

As the UCL and LCL depend heavily on the average value of the data set, the thresholds may be overestimated if an outlier/anomaly exists in the data set. As a consequence, the control chart becomes much less powerful at detecting upcoming anomalies. To handle this limitation, the anomaly is modified for the purpose of further monitoring (t>47): first, R’_47 is re-estimated as the median of all previous R’_T (T = 1, …, 46); second, R_47 is re-estimated by Eq. (5).
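
If the transform in Step 2 is the logistic sigmoid, which is an assumption here, the two re-estimation steps amount to taking a median and then inverting the sigmoid. The tilde notation and the final back-calculation of the transaction count from Eq. (1) are illustrative additions, not notation from the original equations.

```latex
\begin{aligned}
R'_t = \frac{1}{1 + e^{-R_t}}
  \quad &\Longrightarrow \quad
  R_t = \ln\!\left(\frac{R'_t}{1 - R'_t}\right), \\[4pt]
\tilde{R}'_{47} &= \mathrm{median}\,\{\, R'_T : T < 47 \,\}, \\[4pt]
\tilde{R}_{47} &= \ln\!\left(\frac{\tilde{R}'_{47}}{1 - \tilde{R}'_{47}}\right), \\[4pt]
\tilde{Y}_{47} &= \tilde{R}_{47} \, Y_{47 - S} = \tilde{R}_{47} \, Y_{40}.
\end{aligned}
```

The last line is what allows the adjusted point to be drawn on the original transaction scale.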

Figure 5. (a) Ratio plot of the synthetic data set (solid line) with its UCL and LCL (dashed lines), where the point at t=47 is re-adjusted. (b) Time-series plot of the synthetic data set, where the number of transactions at t=47 is adjusted for further monitoring purposes.

We are aware that the proportional control chart may not be intuitive enough for spotting the seasonal trend in the data set (its y-axis may be confusing for business stakeholders) or for adjusting a new data point. To preserve the interpretability of the data, we transformed the proportional control chart, in particular the UCL and LCL, back to the original scale (Fig. 6).
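
One way to map the limits back to the original scale is sketched below. It assumes the logistic sigmoid and Eq. (1), so a limit on R’ becomes logit(limit) * Y_{t-S}; the handling of limits at or beyond 0.5 and 1, the function names and the sample numbers are assumptions for illustration.

```python
import math

def logit(p):
    """Inverse of the logistic sigmoid (assumed transform)."""
    return math.log(p / (1.0 - p))

def limits_on_original_scale(y_reference, lcl, ucl):
    """Map limits on R' back to the metric's own units.

    y_reference is Y_{t-S}. A lower limit of 0.5 or less corresponds to a
    ratio of 0 or less, so it is floored at 0 (assumes a non-negative metric).
    An upper limit of 1 or more has no finite counterpart.
    """
    lower = logit(lcl) * y_reference if lcl > 0.5 else 0.0
    upper = logit(ucl) * y_reference if ucl < 1.0 else math.inf
    return lower, upper

# Hypothetical values: last week's count and limits on the R' scale.
lower, upper = limits_on_original_scale(y_reference=150, lcl=0.52, ucl=0.95)
```

Because the reference value Y_{t-S} changes every day, the thresholds on the original scale follow the weekly pattern instead of forming flat lines.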

Figure 6. Time-series plot (solid line) of the synthetic data set and its thresholds (dashed lines), where the data point at t=47, marked by the orange crossed line, is (a) the true value and (b) the value adjusted for further monitoring purposes.

Remarks

We employed the concept of statistical process control to build an automatic system that helps monitor business metrics on a daily and real-time basis. The system detects data points that do not follow the common trend. Should such a data point occur, the system creates an alert and sends it to users. We are aware that time-series methods might be more relevant. This alternative approach, however, is more adaptable, easier to maintain, and scalable at production level. More importantly, it does not sacrifice the accuracy of anomaly detection.

Disclaimer

This tool was implemented in the internal Bukalapak monitoring platform when our databases were on-premises. The system has been deprecated since we moved to a cloud database system. We think, however, that early-phase start-ups facing similar concerns may benefit from this information. This article also shows an alternative anomaly detection technique for time-series data sets.

Contributors

Data Scientists behind the system: Fatia Kusuma Dewi, Nur Siti Muninggar, Arief Yudha Satria, Putri Wikie Novianti.

We are grateful to Adrianus Galang, Guruh Hapsara and Hafidh Rashemi Rafsanjany, our fellow engineers, who implemented the system and made it widely available in the company.

Main Reference

[1] Montgomery, Douglas C. 2005. Introduction to Statistical Quality Control, fifth edition. Hoboken, NJ: Wiley.
