Analytics Using Feature Selection for Anomaly Detection

Roger Cornejo
GSK Tech
Feb 17, 2020

GSK’s ultimate goal is to help patients do more, feel better, and live longer. One way we strive to achieve this is by harnessing emerging technologies to enhance the discovery and improvement of pharmaceutical treatment options. One important and necessary element in achieving this goal is the robust and accurate handling of data, as well as high performance from the database and infrastructure backbone. As a database specialist at GSK, my role for the past 10 years has been as a performance tuning expert, taking a metrics-based approach to finding the root cause of bottlenecks and solving them. To improve the performance of databases throughout the organization, I have developed code and a process that represents a step change away from the traditional methods of database performance metrics analysis. This method was published last year in Dynamic Oracle Performance Analytics (Apress: http://www.apress.com/9781484241363) and uses “big-data” and advanced analytical techniques to analyze all the available metrics in order to identify performance issues wherever they exist; I call it the DOPA process for short [the SQL code is available on GitHub: https://github.com/Apress/dynamic-oracle-perf-analytics]. Conceptually, the DOPA process is quite simple, but using it well is work for a committed hands-on practitioner. After giving a brief description of the technique and explaining how it differs from the traditional approach, I will provide a few examples of how it has been used with much success at GSK.

Typical approaches to database workload performance analytics use what I call the “small-model” approach: they analyze a relatively small set of metrics chosen from the many available, in the hope that the chosen metrics will be able to characterize the bottleneck [metric examples: host CPU usage; process memory; # of sessions/connections; disk usage; wait stats; …]. A typical database has a great many metrics to choose from, and they may be scattered across any number of tables [ex: Oracle instruments tens of thousands of metrics scattered across dozens of tables; similarly, tools such as Spotlight instrument many metrics for SQL Server as well as Oracle], so a pre-chosen set of metrics is unlikely to be suitable for solving any arbitrary performance issue. Personally, I found the small-model approach insufficient because it often misses relevant and important observations, which can lead to inaccurate root-cause conclusions. With small-model approaches, personal preference and/or familiarity (user bias) leads certain metrics to be relied on heavily while other available metrics are ignored. The DOPA process evaluates all available data, applying big-data methods and analytical techniques to flag the key metrics influencing the bottleneck, which helps the database performance domain expert draw impactful, focused performance improvement conclusions.

In short, the analysis is done by logically organizing the performance metrics into a particular format (key-value pairs) [Feature Engineering] and then running them through a statistical analysis in which data from “normal run times” [Establishing Normal Ranges] are compared to “problem run times” to identify atypical values [Anomaly Detection/Model Building]. Once particular metrics are identified as atypical, further analysis can be targeted to identify root cause; this is where the domain expert needs to step in to zero in further on the root cause. I detail the DOPA process steps below under: Feature Engineering; Establishing Normal Ranges; and Anomaly Detection/Feature Selection and Model Building.

Feature Engineering

Feature engineering is a well-known concept in the machine learning field. For my purposes, I treat it as a process for creating metrics from data sources whose structure doesn’t lend itself easily to large-scale analysis. Since many of the performance metric sources I use are multi-column tables [MCT], I logically unpivot the columns of the MCTs from many metric sources into key-value-pair [KVP] rows, then union all the KVPs together for analysis. This not only greatly simplifies the analysis but also allows me to massively expand it to tens of thousands of metrics [and more].

To illustrate the logical unpivot, consider the following:

With UNPIVOT, columns from an MCT become KVP rows. For example, a metric source from Oracle might be DBA_HIST_SQLSTAT, which provides many metrics on the execution of individual SQL statements, as per the truncated example below:

To unpivot this to a more usable KVP structure, I’d use the following SQL in Oracle [there are other ways to accomplish the unpivot in SQL dialects that don’t have an UNPIVOT operator]:

select snap_id, sql_id, metric_name, delta_value
from dba_hist_sqlstat sqlstat
unpivot include nulls
  ( delta_value for metric_name in
    (FETCHES_DELTA, SORTS_DELTA, EXECUTIONS_DELTA
    ,PARSE_CALLS_DELTA, DISK_READS_DELTA
    ,BUFFER_GETS_DELTA, ROWS_PROCESSED_DELTA
    ,DIRECT_WRITES_DELTA, PHYSICAL_READ_REQUESTS_DELTA
    ,PHYSICAL_WRITE_REQUESTS_DELTA))
where sql_id = nvl(:sql_id, sql_id)
order by snap_id, metric_name;

Executing the “unpivot” will produce a “normalized” KVP structure for analysis, for example:
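The values below are made up purely to illustrate the shape of the result, not actual output: each row carries the snapshot, the SQL_ID, the metric name, and that metric’s delta value.

SNAP_ID  SQL_ID         METRIC_NAME            DELTA_VALUE
-------  -------------  ---------------------  -----------
  51012  0abc123def456  BUFFER_GETS_DELTA            48213
  51012  0abc123def456  DISK_READS_DELTA               917
  51012  0abc123def456  EXECUTIONS_DELTA                12
  51012  0abc123def456  FETCHES_DELTA                   36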

Now, with the above KVP structure, I can easily apply the same analysis to every “column”, and, as I mentioned earlier, I can UNION the KVP structures from multiple metric sources together for a greatly expanded analysis.
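As a minimal sketch of that union step [the choice of DBA_HIST_SYSMETRIC_SUMMARY as a second source, the shortened column list, and the reduced (snap_id, metric_source, metric_name, value) grain are my own illustration, not the exact implementation], any metric source reduced to KVP rows can be appended with UNION ALL:

-- Sketch: combine two metric sources into one KVP stream for analysis.
select snap_id,
       'dba_hist_sqlstat' as metric_source,
       metric_name,
       sum(delta_value) as value           -- roll up per snapshot for this sketch
from   dba_hist_sqlstat
unpivot include nulls
  ( delta_value for metric_name in
    (FETCHES_DELTA, SORTS_DELTA, EXECUTIONS_DELTA) )
group by snap_id, metric_name
union all
select snap_id,
       'dba_hist_sysmetric_summary' as metric_source,
       metric_name,
       average as value                    -- already one value per metric per snapshot
from   dba_hist_sysmetric_summary;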

Establishing Normal Ranges

I use the statistical rule of thumb [plus or minus 2 standard deviations from the mean] to establish normal ranges for all the metrics in a reference interval: the mean plus 2 standard deviations is the upper bound of the normal range, and the mean minus 2 standard deviations is the lower bound. This is easily accomplished using standard statistical functions in a simple arithmetic expression; in my case, the STDDEV function for the standard deviation and the AVG function for the mean. I use the normal ranges computed for a reference interval to identify usual values for all the metrics. An example of what the normal ranges for the “Current Open Cursor Count” metric might look like, and how they are used to identify anomalies, can be found in the figure below:
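A minimal sketch of this computation in Oracle SQL, assuming the unpivoted KVP rows are available in a view I’ll call kvp_metrics and that the reference interval is passed in as bind variables (both the view name and the binds are illustrative, not from the book):

-- Compute per-metric normal ranges (mean +/- 2 standard deviations) over a reference interval.
select metric_name,
       avg(delta_value)                              as mean_value,
       stddev(delta_value)                           as std_dev,
       avg(delta_value) - 2 * stddev(delta_value)    as lower_bound,
       avg(delta_value) + 2 * stddev(delta_value)    as upper_bound
from   kvp_metrics
where  snap_id between :ref_start_snap and :ref_end_snap   -- the reference ("normal") interval
group by metric_name;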

Anomaly Detection/Feature Selection and Model Building

When running the DOPA process, metric values from a problem period/interval are compared to the values from a reference/normal period. The anomaly detection/feature selection is done by simply flagging each metric with a zero or a one according to whether its value is inside or outside the normal range (0 for within normal range; 1 for outside of normal range). I also calculate a “flag ratio” that expresses how far outside of normal the value is. This flagging produces, in a sense, a massive bit-map of analyzed metrics. The results are ordered so that the metrics flagged the most times (for the intervals considered) and the furthest out of range come first. When I apply the DOPA process to a particular performance issue, I do so as a model-building exercise, performing the analysis iteratively [tweaking the input variables that adjust the sensitivity of the model] to get to a set of metrics which I can use for further analysis (usually time-series analysis) [I’ll give a couple of examples below].
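A rough sketch of the flagging and ranking step follows, again assuming the illustrative kvp_metrics view plus the normal-range query above saved as a view I’ll call normal_ranges; the flag-ratio expression here is my own simplification, not the exact calculation used in the DOPA code:

-- Flag problem-interval values outside their normal range and rank the metrics.
select m.metric_name,
       count(*) as intervals_checked,
       sum(case when m.delta_value < r.lower_bound
                  or m.delta_value > r.upper_bound
                then 1 else 0 end)                           as times_flagged,   -- sum of the 0/1 anomaly flags
       max(m.delta_value / nullif(r.upper_bound, 0))         as max_flag_ratio   -- crude "how far outside normal" measure
from   kvp_metrics   m
join   normal_ranges r on r.metric_name = m.metric_name
where  m.snap_id between :prob_start_snap and :prob_end_snap  -- the problem interval
group by m.metric_name
order by times_flagged desc, max_flag_ratio desc nulls last;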

The DOPA process applied

The DOPA process is an example of how we engineer industrialized solutions that do things well consistently (in this case, using advanced analytics techniques). Not only does this innovative process help us lower costs by making performance problem solving faster and better, but there have also been tangible performance benefits in the cases where the DOPA process was used.

In one such example, I used the DOPA process to solve a performance problem with a batch loading process in the US Pharmaceuticals business; I was able to bring the load process down from hours to minutes. In this case, it was reported that a nightly load process had regressed in its performance. The DOPA process quickly flagged up high values for the less well-known metric “index leaf node splits”. This was a clear indicator that the indexes were left enabled during the load. Although the resolution was simple and effective (i.e. disable the indexes before the load and re-enable them after the load), without the DOPA process it would have taken much longer to find the root cause (especially since the standard out-of-the-box tools don’t expose the lesser-known metrics).
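For readers who want to reproduce that kind of drill-down on a flagged statistic, a sketch along these lines can chart it over time; the LIKE filter hedges on the exact statistic name in a given Oracle version, and the LAG-based delta (single-instance assumption, for simplicity) is my own way of turning AWR’s cumulative counters into per-interval values:

-- Trend a single flagged statistic across AWR snapshots.
select s.snap_id,
       sn.begin_interval_time,
       s.stat_name,
       s.value
         - lag(s.value) over (partition by s.stat_name order by s.snap_id) as delta_value
from   dba_hist_sysstat  s
join   dba_hist_snapshot sn
       on  sn.snap_id         = s.snap_id
       and sn.dbid            = s.dbid
       and sn.instance_number = s.instance_number
where  lower(s.stat_name) like '%leaf node splits%'
order by s.snap_id, s.stat_name;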

In another case, users of an application in the R&D space were getting application-level “server busy” errors: the database application was showing extreme performance degradation, impacting major functionality and thousands of users. This database application is hosted in the UK but has US users as well. The problem got worse as US users began to connect to the application and gradually eased as UK users logged off. Modeling the unusual workload using the DOPA process flagged up a few metrics involved in the root cause: high values for “Host CPU Utilization%”, “Current Logons Count”, “Session Connection Count”, and “Total Table Scans Per Sec”. A further time-series analysis of these metrics confirmed the root cause to be periodic logon storms (see side-bar note * below) of up to 20,000 connections per hour, together with lots of small SQL statements doing full table scans. The root cause turned out to be that the application makes a new SQL connection for each window it opens, and connection pooling was not used. This is obviously an un-scalable design that required remediation at the application level [indexes solved the full table scan problems].

* Logon Storm:
A logon storm is characterized by a sudden spike in the number of client connection requests or an unusually high number of connections. Since starting a database session is CPU intensive, a logon storm will likely have unwelcome consequences such as connection errors, applications waiting on CPU, timeouts, poor response time, impacts to other DBs on the same machine, elevated wait times for everything running on the instance, … . The root causes of logon storms are often traced back to improper connection pool settings, or to applications connecting, using a session, and then disconnecting. As a rule of thumb, we try to keep the number of simultaneous connections to < 100 per CPU (roughly 600 simultaneous connections on this 6-CPU machine).

The time-series figure below shows that the logon storms were coincident with the performance degradation experienced by the application and its users.

The time-series analysis (see below) of the “Host CPU Utilization %” metric shows the following (a sketch of a query for pulling this kind of series from AWR appears after the list):

1) the high CPU utilization present during the problem interval;

2) the sharp reduction in CPU after tuning the SQL and fixing the logon-storm issue;

3) CPU utilization stabilizing at a low steady state after the fixes.
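As a rough sketch of how such a series can be pulled from AWR (the choice of DBA_HIST_SYSMETRIC_SUMMARY and the LIKE filters on the metric names are assumptions, not the exact query used here):

-- Pull the per-snapshot time series for the CPU and logon metrics.
select sn.begin_interval_time,
       m.metric_name,
       m.average,
       m.maxval
from   dba_hist_sysmetric_summary m
join   dba_hist_snapshot          sn
       on  sn.snap_id         = m.snap_id
       and sn.dbid            = m.dbid
       and sn.instance_number = m.instance_number
where  m.metric_name like 'Host CPU Utilization%'
   or  m.metric_name like 'Logons Per%'
order by sn.begin_interval_time, m.metric_name;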

The case studies described here are just a small sampling of the types of problems for which the DOPA process can provide insightful analysis. I am persuaded that the DOPA process can be used effectively for any kind of performance problem: because it is metrics-based, the analysis is objective, and because it uses all the available metrics, it is exhaustive. This gives the DOPA process an advantage over small-model approaches that look at a few hand-picked metrics.

The DOPA process is not limited to database performance metrics; the method would easily allow the inclusion of metrics garnered from other sources (network, operating system, storage sub-systems), which would allow problems/bottlenecks in those areas to be identified as well. Every time a new layer is added to the technology stack (cloud or other database systems), the complexity of analyzing performance problems increases, and with it the need for an analysis that can direct tuning efforts.

Further work: My current implementation of the DOPA process is focused on “on-prem” databases, but I’d like to extend it to time-series metrics from databases hosted in the cloud. Moreover, I think there are plenty of opportunities to apply machine learning within the DOPA process, using known problem patterns as training sets.

Feedback

We’d love your feedback! Our aim is to describe our approach and its motivation as clearly as possible, and to include code examples that demonstrate our implementation and let you take it away and reproduce it. Did we do a good job? We’d love to hear from you in the comments below. We also welcome issues or pull requests against the repository ❤️

Roger Cornejo, GSK Tech
Oracle ACE; Database Architect; Performance Tuning
LinkedIn: https://www.linkedin.com/in/roger-cornejo-1805642/
Twitter: @OracleDBTuning