Stories by Rama Akkiraju on Medium

Harvesting insights from logs in IT Operations Management in an economically viable manner

Rama Akkiraju — Tue, 24 May 2022 08:15:13 GMT

By Rama Akkiraju, IBM Fellow, CTO AIOps, and Xiaotong Liu, Senior Data Scientist, Manager, AIOps

Collaborators: Mudhakar Srivatsa, Amitkumar Paradkar, Prateeti Mohapatra, Jae-Wook Ahn, Sarasi Lalithsena, Meenakshi Madugula, Neil Boyette, Jiayun Zhao, Anbang Xu, Lu An, Gargi Dasgupta, Karan Karuppiah, and Rakesh Ranjan.

Following our recent posts on Why logs are important [1], and Why Log Parsing and Processing are hard [2], in this article we offer strategies for parsing logs to effectively detect anomalies during IT operations management.

Introduction

Information Technology (IT) logs are events generated by software systems during the execution of a program in production environments for problem detection and diagnosis in IT operations management. Logs contain information about errors, exceptions, warnings, informational events, and other diagnostic information. Logs are semi-structured machine-generated data. They can come in many formats, structures, languages, and large volumes. These multi-dimensional attributes of logs pose many challenges in parsing and processing logs. However, since they contain valuable diagnostic information, it is important to mine them for insights.

Until recently, IT operations administrators and Site Reliability Engineers (SREs) searched logs manually for diagnostic information using text search strings. Log Analysis tools and products help by aggregating logs, enabling search via proprietary query languages, and allowing users to write custom rules that trigger events when specific thresholds are exceeded. However, these rules tend to be static and require ongoing maintenance.

Log anomaly detection (LAD) aims to detect anomalous behavior in IT logs automatically using Machine Learning (ML) techniques. More and more tool vendors are incorporating ML-based anomaly detection in their Log Analysis and AIOps products. However, log anomaly detection is a hard problem. It requires parsing logs in arbitrary formats, extracting meaningful information/entities from those logs, training ML models to learn normal log patterns so that they can detect anomalies, and explaining the detected anomalies. While techniques such as DRAIN [3] have been widely used for log parsing and feature generation for downstream log anomaly detection, they don’t take the variables in logs into account (constants and variables in logs are described in more detail later in this article). Not considering variables in logs misses important diagnostic information and thereby impacts the quality of downstream anomaly detection. For complete log comprehension, both constants and variables in logs must be harvested.

In this article, we present some state-of-the-art approaches to derive insights from logs. We are actively exploring some of these techniques in our labs for potential inclusion in the future releases of Cloud Pak for Watson AIOps. However, please note that this is not a product roadmap article. Our primary goal in this article is to suggest future directions for log analysis for better anomaly detection in IT operations environments.

Log format recognition and parsing: Logs can be written in plain text, XML, JSON, plain old java object (POJO), or any other format. Sometimes logs can have embedded JSONs within XMLs and vice versa. This complicates the processing of logs significantly, especially when XML and JSON structures are embedded inside each other. Extracting the needed entities to prepare features becomes more difficult as they get buried inside arbitrary levels and layers. To address this problem, one must build parsers to identify and parse standard formats such as XML, JSON, plain text, POJOs, etc., in logs so that the key entities can be extracted from them properly.
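To make the nesting problem concrete, here is a minimal Python sketch that guesses the format of a payload and flattens it into dotted paths, recursing whenever a leaf string itself turns out to be JSON or XML. The function names (detect_format, flatten) and the path notation are assumptions made for this illustration; they do not describe how any product parses logs.

```python
import json
import xml.etree.ElementTree as ET


def detect_format(payload: str) -> str:
    """Guess whether a payload is JSON, XML, or plain text by trying to parse it."""
    text = payload.strip()
    if text.startswith(("{", "[")):
        try:
            json.loads(text)
            return "json"
        except ValueError:
            pass
    if text.startswith("<"):
        try:
            ET.fromstring(text)
            return "xml"
        except ET.ParseError:
            pass
    return "text"


def flatten(payload: str, prefix: str = "") -> dict:
    """Flatten a payload into {dotted.path: leaf value}, following embedded formats."""
    kind = detect_format(payload)
    out = {}
    if kind == "json":
        _walk_json(json.loads(payload), prefix, out)
    elif kind == "xml":
        _walk_xml(ET.fromstring(payload), prefix, out)
    else:
        out[prefix or "message"] = payload.strip()
    return out


def _walk_json(node, prefix, out):
    if isinstance(node, dict):
        for key, value in node.items():
            _walk_json(value, f"{prefix}.{key}" if prefix else key, out)
    elif isinstance(node, list):
        for i, value in enumerate(node):
            _walk_json(value, f"{prefix}[{i}]", out)
    elif isinstance(node, str):
        # A string leaf may itself hold embedded JSON or XML.
        out.update(flatten(node, prefix))
    else:
        out[prefix] = node


def _walk_xml(elem, prefix, out):
    # Note: repeated sibling tags share a path here, so later ones overwrite earlier ones.
    path = f"{prefix}.{elem.tag}" if prefix else elem.tag
    for name, value in elem.attrib.items():
        out[f"{path}@{name}"] = value
    children = list(elem)
    if children:
        for child in children:
            _walk_xml(child, path, out)
    elif elem.text and elem.text.strip():
        # Element text may itself hold embedded JSON.
        out.update(flatten(elem.text, path))
```

For a hypothetical line such as `<event><source>orders</source><body>{"status": 500}</body></event>`, flatten returns the keys event.source and event.body.status, so the entity buried in the embedded JSON becomes directly addressable.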

Format recognition of well-known logs: Standard middleware products, operating systems, and infrastructure have well-defined and published schemas for logs. For example, Apache, Syslog (Linux and network vendor variants), MongoDB, WebSphere, Redis, Elastic, and Db2 have well-defined log formats. Therefore, when these formats are recognized, known schema definitions can be followed to extract specific entities of interest. In Cloud Pak for Watson AIOps, as of version 3.2, WebSphere logs are processed out-of-the-box. More is on the way.
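As an illustration of following a published schema, the sketch below parses a line in the Apache Common Log Format (host, identity, user, time, request, status, size). The sample request in the usage note is invented for this example, and the function name parse_clf is an assumption.

```python
import re
from datetime import datetime

# Apache Common Log Format: %h %l %u %t "%r" %>s %b
CLF = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_clf(line: str):
    """Return a dict of typed fields, or None if the line is not in Common Log Format."""
    match = CLF.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["time"] = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    fields["status"] = int(fields["status"])
    # Apache writes "-" when no bytes were sent.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields
```

Calling parse_clf on `127.0.0.6 - - [15/Sep/2021:10:48:30 +0000] "POST /orders HTTP/1.1" 500 1024` yields a typed record whose status field is the integer 500. Lines that do not match return None and can be handed to a more general parser.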

Tokenization: Logs can be written in many natural languages depending on who wrote the software program (e.g. English, German, Italian, Spanish, Japanese, etc). This means that the Natural Language Processing (NLP) software used for processing log messages must be able to detect the language and parse it accordingly with suitable tokenizers and dependency parsers to extract features from them.

Entity Recognition: One aspect of log parsing includes identifying entities such as IP addresses, port numbers, date-time stamps, UUIDs, etc., that occur frequently in logs. These can pose significant parsing challenges. Regular expressions, shallow semantic parsing, and dictionaries come in handy in identifying entities accurately. Depending on the format of logs, applying a suitable entity recognition technique is critical to extracting entities correctly. A separate entity recognizer might have to be built for each kind of entity: the entity type is identified or classified first, and then the extractor suited to that type is applied. For example, in our prior article ‘Why Log Parsing and Processing are hard’ [2], we presented several examples of date-time stamp format variations. One needs a date-time stamp entity recognizer to specifically deal with the different types of variations. Similarly, one needs recognizers for IP addresses, port numbers, and message codes, as each is an entity type of its own. While some entities come in multiple formats, others tend to be more standardized. In either case, specialized entity recognizers will come in handy for accurate entity extraction for downstream processing.
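A minimal sketch of regular-expression-based recognizers is shown below. The patterns, their ordering, and the kind labels are assumptions chosen for illustration; more specific recognizers run first and claim their character spans so that looser ones cannot re-match the same text.

```python
import re

# Ordered from most to least specific. The IP patterns do not validate octet ranges.
RECOGNIZERS = [
    ("uuid", re.compile(
        r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
        r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b")),
    ("timestamp", re.compile(
        r"\b\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:?\d{2})?")),
    ("ip_port", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}:\d{1,5}\b")),
    ("ip", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
]


def extract_entities(message: str):
    """Return (kind, text) pairs in order of appearance, without overlapping matches."""
    claimed = []   # character spans already assigned to an entity
    entities = []
    for kind, pattern in RECOGNIZERS:
        for match in pattern.finditer(message):
            overlaps = any(match.start() < end and start < match.end()
                           for start, end in claimed)
            if overlaps:
                continue
            claimed.append(match.span())
            entities.append((match.start(), kind, match.group()))
    return [(kind, text) for _, kind, text in sorted(entities)]
```

Date-time stamps in other layouts would need additional patterns or a dedicated recognizer, which is the reason each entity type tends to get its own extractor.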

Log Enrichment: Each log can be enriched with metadata that is useful for downstream tasks such as anomaly detection. An example of enrichment is classifying a logline as an error, exception, informational, latency-related, saturation-related, traffic-related, etc. Supervised machine learning algorithms are often employed to classify a logline into these categories. This requires collecting enough labeled log data, which in turn requires a Site Reliability Engineer (SRE) subject matter expert (SME) to label the data. We consider such activities human-in-the-AI-loop activities. Rule-based approaches are an alternative way to perform these enrichments. Rule-based approaches don’t need data labeling. However, SMEs must specify rules for what entities to look for and under what conditions a logline can be classified as an error, exception, informational, etc. Each approach comes with its own pros and cons. Depending on the domain and the availability of labeled data or SME time, specific techniques can be chosen.
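The rule-based alternative can be sketched in a few lines of Python. The categories follow the ones named above, but the specific patterns are hypothetical stand-ins for rules an SME would write for a given environment.

```python
import re

# Hypothetical SME-authored rules: (label, pattern). A logline may receive several labels.
RULES = [
    ("exception", re.compile(r"exception|traceback", re.I)),
    ("error", re.compile(
        r"\berror\b|\bfail(?:ed|ure)?\b|\bHTTP\s+(?:error\s+)?5\d{2}\b", re.I)),
    ("latency", re.compile(r"\b(?:timed?\s*out|latency|took\s+\d+\s*ms)\b", re.I)),
    ("saturation", re.compile(
        r"\b(?:queue\s+full|out\s+of\s+memory|disk\s+full)\b", re.I)),
]


def enrich(logline: str) -> dict:
    """Attach category labels to a logline; default to 'informational' when no rule fires."""
    labels = [label for label, pattern in RULES if pattern.search(logline)]
    return {"message": logline, "labels": labels or ["informational"]}
```

With these rules, ‘POST /orders HTTP error 500’ is labeled as an error, while the terser ‘POST /orders 500’ falls through to informational unless an SME adds a rule that encodes what the bare number means.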

Log Templatization: Log templatization is about clustering logs of a similar kind together and assigning them a template/group ID. One popular algorithm for log parsing and templatization is the Drain algorithm [3], which employs a fixed-depth tree parsing approach. In Drain, when forming a log template, constants are retained, and variables are ignored. For example, in the logline “received block blk_ID_2345987 of size 89456873 from 10.432.34.12”, the block ID blk_ID_2345987, the block size 89456873, and the source address 10.432.34.12 are variables, and the phrases ‘received block’, ‘of size’, and ‘from’ are constants. When this logline gets templatized, it would look like “received block <*> of size <*> from <*>”. A more sophisticated version of templatization would recognize the entity types of the variables and would templatize the logline as, for example, “received block <block-ID> of size <block size> from <IP address>”, thereby paving the way for a better explanation. The counts of these templates, known as count vectors, become features for downstream anomaly detection. When loglines of similar kinds arrive, they are grouped together, and the counts of such log templates can be used as vectors in time-series algorithms for anomaly detection. While the Drain algorithm works well overall, it fails to consider critical information that may be contained in the variables of logs. The variability in the variables contains useful diagnostic information. For example, in the above logline, if the block size varies beyond its normal range, that could be an indication of an anomaly. Ignoring the variables misses that useful information. Therefore, other techniques have to be developed to derive structured features from unstructured loglines, so that the pattern variations of variables are considered in the downstream task of anomaly detection in addition to the count vector features, which consider constants alone.
In a full log parsing approach that considers both constants and variables, key-value pairs are extracted from the variables in loglines. For example, ‘block-ID’ would be a key and ‘blk_ID_2345987’ would be its value. Similarly, ‘block size’ would be a key and ‘89456873’ its value. These key-value pairs can then be stored in databases for efficient downstream anomaly detection. The constant part of the log templates can be stored separately for full reconstruction of logs when needed for audit or explanations.
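The difference between the two styles of template can be sketched with simple regular-expression masking. This is not the Drain algorithm, which builds a fixed-depth parse tree; it is a simplified stand-in that only shows what the outputs look like. The variable names and patterns are assumptions tailored to the example logline.

```python
import re

# Hypothetical typed variable patterns for the example logline, applied in order.
VARIABLES = [
    ("ip_address", re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")),
    ("block_id", re.compile(r"\bblk_\w+\b")),
    ("block_size", re.compile(r"(?<=size )\d+\b")),
]


def templatize(logline: str):
    """Return (typed template, {key: value}) for one logline."""
    values = {}
    template = logline
    for key, pattern in VARIABLES:
        def mask(match, key=key):
            values[key] = match.group()   # a repeated key keeps its last value
            return f"<{key}>"
        template = pattern.sub(mask, template)
    return template, values


def to_wildcards(template: str) -> str:
    """Collapse typed placeholders into untyped <*> wildcards."""
    return re.sub(r"<\w+>", "<*>", template)
```

For the logline above, templatize returns the template “received block <block_id> of size <block_size> from <ip_address>” together with the three key-value pairs, and to_wildcards reduces that template to “received block <*> of size <*> from <*>”. Storing the template and the values separately is what allows the original line to be reconstructed later.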

Feature Extraction: Feature extraction is the process of deriving structured features from unstructured logs. Different kinds of features can be extracted from logs. Some of them are noted below.

Word embeddings: Extract word embeddings for the natural language words in a logline. The sequences of these word embeddings form time-series data.
Count vectors of log templates: The counts of the templates of each kind form the feature vectors. The process for templatization is briefly discussed in the previous item.
The variables in loglines: Variables that are captured as the values of the key-value pairs extracted using natural language processing techniques as mentioned in the ‘Log Templatization’ section of this article.
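As an example of the second kind of feature, the sketch below turns a stream of (timestamp, template ID) pairs into one count vector per time window. The 60-second window and the input shape are assumptions for illustration.

```python
from collections import Counter, defaultdict


def count_vectors(events, window_seconds=60):
    """Build per-window template count vectors.

    events: iterable of (epoch_seconds, template_id) pairs.
    Returns (template_ids, {window_start: [count per template_id]}).
    """
    per_window = defaultdict(Counter)
    for ts, template_id in events:
        window_start = int(ts // window_seconds) * window_seconds
        per_window[window_start][template_id] += 1
    template_ids = sorted({t for counts in per_window.values() for t in counts})
    vectors = {
        start: [counts[t] for t in template_ids]   # Counter returns 0 for unseen templates
        for start, counts in sorted(per_window.items())
    }
    return template_ids, vectors
```

Each vector is one observation in a multivariate time series, and each template's column on its own is a univariate series.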

Human inputs for data type identification in log parsing: How does the system know the correct data type of a variable so that it can apply the right algorithm to detect anomalies? Is the variable an IP address, a queue length, a date-time stamp, an HTTP status code, a byte count, or something else? Knowing the data type helps in applying appropriate algorithms for detecting the variations in that variable. For example, one can apply metric-based anomaly detection using z-scores for variables such as queue lengths and byte counts. For a counter type of variable, taking the first derivative followed by a z-score might be more appropriate. For HTTP status codes and IP addresses, pre-canned regular expressions can be used to do exact matches. Therefore, to apply the right kind of algorithm, one must know the data type of the variable correctly. This can be achieved either automatically or with human input. The point to note here is that this needs to be done only once for each log template. Once the data type is properly identified, either automatically or with human guidance, this information can be stored and appropriate algorithms can be selected for anomaly detection. In our experience, automatically detecting the data type of each variable can be hard and can require too much log and system context. This is a case where asking a human for input is much more time-efficient than trying to guess every data type automatically. Therefore, it is critical to incorporate appropriate user interfaces to take user inputs during log parsing.
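The way a stored data type selects the detection algorithm can be sketched as follows. The type names ("gauge" for values such as queue length, "counter" for ever-increasing values) and the threshold of 3 are assumptions for this example.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=3.0):
    """Indices whose z-score exceeds the threshold (suited to gauge-like values)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


def counter_outliers(values, threshold=3.0):
    """For monotonically growing counters: difference first, then apply z-scores."""
    deltas = [b - a for a, b in zip(values, values[1:])]
    # Delta i describes the step into values[i + 1].
    return [i + 1 for i in zscore_outliers(deltas, threshold)]


# Data type recorded once per template variable, by a human or automatically.
DETECTORS = {"gauge": zscore_outliers, "counter": counter_outliers}


def detect(values, data_type, threshold=3.0):
    return DETECTORS[data_type](values, threshold)
```

Applying the gauge detector directly to a counter would be misleading, because the raw values of a counter grow without bound even when the system is healthy. Differencing first turns the counter into a rate, which is the quantity that has a normal range.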

Once the log features are identified, these features can now be used for anomaly detection.

Detecting anomalies from log features

Any variations of these identified features from the learned normal ranges are considered anomalies. Since log data is real-time streaming data, typically, time-series algorithms are used for anomaly detection. Anomaly detection can be applied to a single variable/feature (univariate) or multiple variables/features at once (multi-variate). We list several algorithms that can be applied for anomaly detection on the time-series features.

1. Univariate Time-series Algorithms: Apply different algorithms such as Robust-bounds, Flat-line, variant/invariant, Granger, Finite Domain, Predominant Range, and Discrete Values. Several of these algorithms are implemented in IBM’s Cloud Pak for Watson AIOps already. These algorithms are detailed further in [4].

2. Multi-variate Algorithms: Principal Component Analysis (PCA), the Real-time Statistical algorithm [5], or deep learning algorithms such as LSTM can be applied to the derived features to detect anomalies. More details on how these algorithms are implemented in Cloud Pak for Watson AIOps are available in [6].
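To give a feel for what learning a normal range means for a single feature, here is a generic sketch that derives bounds from the median and the median absolute deviation (MAD), plus a simple flat-line check. It is a textbook technique used as a stand-in; it is not the implementation of the similarly named algorithms in any IBM product, and the parameters k and min_length are assumptions.

```python
from statistics import median


def learn_bounds(history, k=3.0):
    """Learn (lower, upper) bounds from normal data using the median and MAD."""
    med = median(history)
    mad = median(abs(x - med) for x in history)
    # 1.4826 scales the MAD so it is comparable to a standard deviation for normal data.
    spread = 1.4826 * mad
    return med - k * spread, med + k * spread


def is_flat(window, min_length=5):
    """True when a feature has been stuck at one value for at least min_length points."""
    return len(window) >= min_length and len(set(window)) == 1
```

Median-based bounds are less distorted by the occasional spike in the training data than mean-based ones, which matters when the history used for learning cannot be guaranteed to be anomaly-free.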

Strategies for dealing with log volumes

IT logs are expensive to process and can pose significant infrastructure requirements if the volume of logs to be processed is large (e.g., TBs of data/day). We suggest that companies be judicious about which and how many application stacks they should monitor and how to start and expand to derive insights from logs, given that log processing is infrastructure intensive. Below, we list some best practices for IT operations managers for leveraging logs in IT operations management that make economic and business sense.

1. Monitor IT logs of your critical customer-facing applications (tier-1) in real-time for proactive incident detection: Typically, tier-1 applications are the most critical applications and therefore it is recommended to use real-time log monitoring for those applications. This means that the required amount of infrastructure needs to be allocated to process logs in real-time.

2. Monitor IT logs of your non-critical applications (tier-2) for asynchronous near real-time proactive incident detection: For tier-2 applications, near real-time detection of anomalies might be good enough. That is, if an issue were to occur, it might be acceptable to be notified of that issue within a few minutes rather than within seconds.

3. Leverage logs for incident diagnosis and explanation for non-critical internal applications (tier-3): For tier-3 applications, instead of trying to proactively detect anomalies from terabytes of logs continuously, it might be advisable to use logs in a diagnosis use case. In the diagnosis use case, the log anomaly detector is invoked for only the subset of resources in the application stack that is identified as a probable cause of an already detected incident. The incident may have been detected using metric monitoring systems or application performance monitoring (APM) systems. In this setup, a select set of logs that correspond to a specific time window around the time of the incident is analyzed to detect anomalous patterns. In this approach, the log anomaly detector doesn’t take on the burden of detecting anomalies and incidents in real-time on all the resources in each application context. Logs are primarily analyzed to diagnose the source of the problem for an already detected incident. This would significantly reduce the infrastructure investments required to process the logs of all applications and infrastructure stacks and still allow SREs and CIOs to derive insights from logs for better issue diagnosis and resolution.
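The diagnosis use case in the third item boils down to a filter over resources and time. A minimal sketch follows; the record layout (time, resource, message) and the window sizes of 15 minutes before and 5 minutes after the incident are assumptions for illustration.

```python
from datetime import datetime, timedelta


def select_for_diagnosis(logs, incident_time, suspects,
                         before=timedelta(minutes=15),
                         after=timedelta(minutes=5)):
    """Keep only the logs of suspected resources inside the incident window.

    logs: iterable of dicts with 'time' (datetime), 'resource', and 'message' keys.
    suspects: set of resource names identified as probable causes.
    """
    start, end = incident_time - before, incident_time + after
    return [
        record for record in logs
        if record["resource"] in suspects and start <= record["time"] <= end
    ]
```

Only the records that survive this filter are passed to the log anomaly detector, so the processing cost scales with the size of the incident window rather than with the total log volume.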

Conclusions

IT logs are an important source of information in IT operations management. However, deriving insights from logs is a hard problem because logs are often not standardized, come in many formats, and are voluminous. As a follow-up to our prior article in which we discussed what makes log parsing hard [2], in this article we presented some approaches to parsing logs effectively and preparing structured features from logs to derive useful insights from them. These features then form the basis for downstream tasks such as anomaly detection, prediction, diagnosis, and incident explanation. We also discussed some best practices for IT operations managers for dealing with large volumes of logs so that they can harvest insights from logs in an economically viable manner.

References

1. [Ganti R. et al 2021] Why Logs are Important? https://community.ibm.com/community/user/aiops/blogs/raghu-kiran-ganti1/2021/11/30/why-logs-are-important?CommunityKey=6e6a9ff2-b532-4fde-8011-92c922b61214

2. [Akkiraju et al 2022] Why is log parsing and processing hard? https://medium.com/ibm-cloud/why-is-log-parsing-and-processing-hard-1e72bac55712

3. [He P et al 2017] He P., Zhu J., Zheng Z., Lyu M. Drain: An online Log Parsing Approach with Fixed Depth Tree. 2017 IEEE 24th International Conference on Web Services. https://jiemingzhu.github.io/pub/pjhe_icws2017.pdf

4. [IBM Operational Analytics Predictive Insights Documentation] Time-series algorithms in IBM’s Metric Anomaly Detection Component: https://www.ibm.com/docs/en/oapi/1.3.6?topic=concepts-algorithms

5. [Lu A et al 2022] Lu An, An-Jie Tu, Xiaotong Liu, Rama Akkiraju. Real-time statistical log anomaly detection with continuous AIOps learning. In the proceedings of the ACM International Conference on Cloud Computing and Services Science (CLOSER) 2022.

6. [Xiaotong et al 2022] AIOps Explained — Log Anomaly Detection: https://www.youtube.com/watch?v=DWkFMWi3GHY

Harvesting insights from logs in IT Operations Management in an economically viable manner was originally published in IBM Cloud on Medium.

Why is Log Parsing and Processing hard?

Rama Akkiraju — Mon, 23 May 2022 18:27:11 GMT

By Rama Akkiraju, IBM Fellow, CTO AIOps and Xiaotong Liu, Manager, Senior Data Scientist, AIOps

Following our recent post on Why are logs important [1], this blog explores why logs are hard to parse and process.

Information Technology (IT) Logs are events written by software systems during the execution of a program. Logs contain information about errors, exceptions, warnings, informational events, and other diagnostic information such as database query statements and the time an event has occurred. Logs are useful in detecting and diagnosing problems with IT systems in IT operations environments.

Log anomaly detection (LAD) aims to detect anomalous behavior in the logs produced by IT systems. Log parsing helps in extracting features from logs, which typically serve as the first step toward downstream log analysis tasks such as log templatization, log clustering, and anomaly detection. However, log parsing and processing are not easy.

In this article, we illustrate various aspects of logs that make parsing hard. In the follow-on article, we present approaches for parsing and processing them for useful insights in an economically viable manner.

Log parsing — why so hard?

IT application and system logs are semi-structured machine-generated data. They can come in many formats, structures, languages, and large volumes. These multi-dimensional attributes of logs pose many challenges in parsing and processing logs. In addition, if many business applications and systems are to be monitored in real-time for anomalies and application performance, then IT logs can be expensive to process. In large volumes, they can pose significant infrastructure requirements.

To better understand the complexities of log parsing, let us look at some of the challenges in parsing because of structural and format variations in logs:

Semi-structured: Logs typically have structured and unstructured portions. Structured portions may include timestamps, the name of the resource that is writing the logs, hostnames, IP addresses, errors, warning codes, etc. Unstructured portions may include the log message and query statements. Often, the formats of logs are not well-documented, thereby making it necessary to apply natural language processing (NLP) techniques to detect key entities. Deriving specific entity mentions such as hostnames, IP addresses, names of resources, time stamps, and error & exception codes, which might contain useful diagnostic information, requires a reasonable semantic understanding of log messages to extract features for downstream tasks such as anomaly detection and incident explanation. Here are some examples of logs containing structured and unstructured portions.

Varied formats: Logs can be written in plain text, XML, JSON, plain old java object (POJO), or any other format. Sometimes logs can have embedded JSONs within XMLs and vice versa. This complicates the processing of logs significantly, especially when XML and JSON structures are embedded inside each other. Getting to the entities needed to prepare features becomes more difficult as they get buried inside arbitrary levels and layers.

Example 1: Log as an XML: Connecting to outbound queueSOLQXYZ.beo11.LIVE2
Example 2: Log as a JSON: _line=“{\“date\”=\“20210805T1503\”, \“message\”=\“Processed the order in 23ms.\”}”

Date-time stamp variations: Logs often don’t follow standardized formats for printing date-time stamps. Some programmers choose to write dates and times in custom formats. Also, for software that is part of custom-written or proprietary legacy systems, date-time stamp formats could be significantly different from the standard formats that packaged software (e.g., software middleware products such as Db2, Oracle DB, SAP software, etc.) might use. Some date-time stamp format variations are listed below. These variations pose interesting challenges for natural language processing (NLP) software that detects entities such as date and time.

Example 1: “127.0.0.6 - - [15/Sep/2021:10:48:30 +0000]”
Example 2: “start_time=’2021-09-18T22:35:16.576677Z’,” “date”:”2021-09-18T22:48:43.982+0000”
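One common way to cope with such variations is to try a list of known formats in turn. The sketch below covers the layouts that appear in the examples in this article, assuming the dates are written with plain ASCII hyphens. The format list and the function name parse_timestamp are assumptions, and a real deployment would need a longer list.

```python
from datetime import datetime

# Candidate layouts, tried in order (Python 3.7+ accepts a literal "Z" for %z).
FORMATS = [
    "%d/%b/%Y:%H:%M:%S %z",      # 15/Sep/2021:10:48:30 +0000
    "%Y-%m-%dT%H:%M:%S.%f%z",    # 2021-09-18T22:35:16.576677Z, 2021-09-18T22:48:43.982+0000
    "%Y-%m-%dT%H:%M:%S%z",       # same layout without fractional seconds
    "%Y%m%dT%H%M",               # 20210805T1503
]


def parse_timestamp(text: str):
    """Return a datetime for the first format that fits, or None if none does."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None
```

The last format carries no time zone, so it yields a naive datetime. Deciding which zone such stamps belong to is exactly the kind of context a custom or legacy system forces the parser to obtain from elsewhere.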

Languages (e.g. English, German, Japanese, etc.): Logs can be written in many natural languages depending on who wrote the software program. They can be in as many languages as the business is conducted in. This means that the NLP software used for processing log messages must be able to detect the language and parse it accordingly with suitable tokenizers and dependency parsers to extract features from them. Complicating the language support problem further, some logs are written in a mix of languages. For example, the second example below shows a log message that is a combination of German and English. Tokenization gets trickier when non-space-delimited languages are mixed with space-delimited languages. For example, when English and Japanese or English and Chinese are mixed, parsing those logs gets even more complex than parsing logs written in a mix of space-delimited languages such as English and Spanish or English and German.

Example 1: Double byte languages (Japanese): ランタイム・プロビジョニング・フィーチャーが使用不可になっています。すべてのコンポーネントが開始されます
Example 2: Mixed languages (e.g., German and English): Das gemeldete problem ist: error serializing java object.
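A first step toward handling mixed-script messages is to split a message into runs of the same script, so that each run can be routed to a suitable tokenizer. The sketch below uses Unicode character names for this. The script classes are assumptions for illustration, and a run of Japanese or Chinese text would still need a dedicated segmenter, which is not shown.

```python
import unicodedata
from itertools import groupby


def script_of(ch: str) -> str:
    """Classify a character as 'cjk', 'word' (space-delimited scripts), or 'other'."""
    name = unicodedata.name(ch, "")
    if name.startswith(("CJK UNIFIED", "HIRAGANA", "KATAKANA")):
        return "cjk"
    if ch.isalpha() or ch.isdigit():
        return "word"
    return "other"


def split_by_script(text: str):
    """Return (script, run) pairs, dropping whitespace and punctuation."""
    runs = []
    for script, chars in groupby(text, key=script_of):
        run = "".join(chars)
        if script != "other":
            runs.append((script, run))
    return runs
```

On a hypothetical message such as “Failed to start ランタイム”, the English words come out as individual word tokens because spaces separate them, while the Japanese text comes out as a single cjk run that still has to be segmented.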

Poorly written logs: Sometimes logs are written poorly and in a cryptic manner. For example, consider the log phrase ‘POST /orders 500’. According to the subject matter expert, the number 500 in this log implicitly meant ‘HTTP error code 500’, which is a critical error that needs to be brought to the attention of an administrator immediately. However, in the absence of the phrase ‘HTTP error’, it is virtually impossible to distinguish it from a generic number 500. We like to refer to such logs as ‘read my mind’ logs.

Example: ‘POST /orders 500’ vs. ‘POST /orders HTTP error 500’

Confusing and Conflicting logs: Human programmers who design log messages are susceptible to making mistakes. When log formats are not standardized, they can design confusing and conflicting log messages. For example, in the log message below from one of the kernel applications, we noted all three log levels at once: info, exception, and error. This makes it difficult to discern whether this is an information-oriented, error-oriented, or exception-oriented log message.

Example: “php.INFO: User Deprecated: Since symfony/http-kernel 5.3: \”Symfony\\Component\\HttpKernel\\Event\\KernelEvent::isMasterRequest()\” is deprecated, use \”isMainRequest()\” instead. {\”exception\”:\”[object] (ErrorException(code: 0): User Deprecated: Since symfony/http-kernel 5.3:”

Log volumes: Apart from the structure and format variations, log volumes can pose challenges in parsing as well. As the number of business applications and systems that need to be monitored in real-time for anomalies increases, the volume of IT logs that need to be processed also increases, thereby posing increased infrastructure requirements. Sometimes, log volumes can run up to several terabytes of data per day. This increases the cost of infrastructure needed to process these logs thereby making the total cost of ownership of having a solution for log-based anomaly detection economically unattractive.

Conclusion

Acknowledgments

Thanks to our collaborators Mudhakar Srivatsa, Amitkumar Paradkar, Prateeti Mohapatra, Jae-Wook Ahn, Pooja Aggarwal, Sarasi Lalithsena, Meenakshi Madugula, Neil Boyette, Raghu Ganti, Hau-wen Chang, Jiayun Zhao, An-Jie Tu, Pujitha Kara, Ritu Singh, Keving Ng, Xiaocun Cue, Charles Wiecha, Suranjana Samantha, Amy Cu, Isaiah Kim, Wei Cheng Liu, Ragu Kattinakare, Gargi Dasgupta, Karan Karuppiah, and Rakesh Mohan.

References

[Ganti R. et al 2021] Why Logs are Important? https://community.ibm.com/community/user/aiops/blogs/raghu-kiran-ganti1/2021/11/30/why-logs-are-important?CommunityKey=6e6a9ff2-b532-4fde-8011-92c922b61214

Why is Log Parsing and Processing hard? was originally published in IBM Cloud on Medium.

Are we there yet? The First-Mile and Last-Mile problem with Machine Learning Models

Rama Akkiraju — Wed, 01 Dec 2021 18:53:38 GMT

by Rama Akkiraju, IBM Fellow, CTO AIOps.

In this article, I describe the first-mile and last-mile problems of machine learning models, give examples of their occurrence in a few domains, and suggest practical strategies for dealing with these problems from my experience.

The first-mile and last-mile problem — an everyday example

“Head north-east on Kirwin Lane and then turn left on to De Anza Blvd” suggested the mapping software on my smartphone, but I had no idea which direction was north-east. I had to guess by the direction of the sun. Thank goodness it wasn’t a cloudy day!

Finally, after 20 minutes of driving, the map lady announced that I had arrived at my destination — right in the middle of a busy three-lane road near an intersection!

Huh?

Where had I arrived? Where was that antique store I was looking for? To my right was a large shopping complex. I guessed the store I was looking for was in there somewhere. But I was not in the correct lane to make a turn. I had to miss the turn, go further, and loop around to come back. Once in the complex, the map lady announced, “go north” and then immediately suggested, “go south”. Ugh! That was frustrating!

At this point, I turned the mapping software on my smartphone off. It was time to park the car somewhere in the parking lot and walk around or ask the nearby storekeeper for directions.

My experience that day was a classic case of the first-mile and last-mile problems. Delivery companies lose billions of dollars annually due to last-mile delivery problems because the mapping software can’t finish the task of taking drivers all the way to their destinations.

Statistical Machine Learning algorithms suffer from the same first-mile and last-mile problems.

The first-mile and the last-mile problems with machine learning models

It is well-known that most machine learning algorithms need a lot of representative data to learn patterns and make predictions. Until such representative data is made available to the machine learning algorithms, their predictions may not be accurate enough for many use cases. I call this phenomenon the first-mile problem because the system is not able to get off the ground with the desired accuracy in its predictions, in the same way that mapping software doesn’t quite know how to guide a driver out of an unmapped neighborhood to the nearest well-known street that it has mapped.

Suppose we get past the first-mile problem by providing enough representative data to train a machine learning model, and the model is making predictions well enough to be put into action for specific use cases in production. Soon enough, you will realize that while the model performs well for 80–85% of the cases, it starts to falter in the remaining 15–20% of the cases. These are typically corner cases where it is not practically feasible to get enough representative samples during training because, by definition, they occur rarely. Yet these corner cases do occur, and the model needs to be able to deal with them. This is the last-mile problem. This is akin to the mapping software declaring that you have arrived at the destination even though, as in my case, I was still in the middle of a three-lane road.

Log Anomaly Detection: An illustrative example

Let’s examine these first-mile and last-mile problems in the context of log anomaly detection, a machine learning model used to detect anomalies from IT applications and system logs as part of IT operations management.

An anomaly is something that deviates from normal, standard, or expected behavior. The goal of log anomaly detection is to detect anomalies from IT applications and system logs in real-time. These may include logs written by an application, infrastructure, network device, operating system, middleware, and everything in between. Typically, organizations set either static thresholds or manual rules to define and manage deviations from normal behavior. The problem with static thresholds is that it takes a long time for subject matter experts (SMEs) to distill them from their experience and to create them. Moreover, these static thresholds don’t easily adapt to changes and, therefore, tend to get outdated and become irrelevant quickly. Therefore, it is better to use machine learning models to detect anomalies from logs.

Machine learning models are good at learning patterns. When faced with an anomalous pattern of log messages that do not conform to the normal pattern that has been learned, a machine learning model can raise an anomaly. This relieves organizations of the need to create and manage static thresholds or to rely on SMEs to write rules for every possible anomalous condition, which might be hard to do.

First-mile problems

Many techniques have been implemented for log anomaly detection (after converting the unstructured logs into structured features via log parsing) such as ARIMA, Seasonal ARIMA, XGBoost, Exponential Smoothing, Principal Component Analysis (PCA), and deep-learning algorithms like LSTM. Many of these techniques still require ‘normal’ data to learn the patterns from. The challenge is in collecting representative normal data without human intervention in a reasonable amount of time. Not all IT environments are guaranteed to produce representative data in the first few minutes of turning an algorithm on. Until the model sees enough variations, seasonality, and other patterns, the model’s baseline is not stable. Predictions made during that time tend not to be very accurate, much like the mapping software saying ‘go north … go south’ almost at the same time. Essentially, the model is still adjusting, getting its bearings, and establishing a baseline during this time. This is the first-mile problem. When faced with this type of first-mile problem, it is best to enable continuous learning mechanisms so that the model can learn fast with customer data, in a customer environment. Below I share a few other strategies to better deal with the first-mile problem with machine learning models.

Strategies for dealing with the first-mile problem

Build a broad-based base model: Whenever possible, try to build a good base model with as much representative data as can be obtained. Such models can be considered ‘base models’. Accuracy of these base models usually ranges from, say, 75% to 85% (+/- 5%-10%). Enriching these base models is a continuous activity involving collecting and cleansing data, and training and fine-tuning the model. For example, in the case of log anomaly detection, a base model can be built with a week’s or month’s worth of historic log data. This model can get you 75–85% of the way. Often, even this historic data fails to capture the variations triggered by user loads, seasonality, and other factors well enough to learn the patterns reliably. That’s why the ability to customize these base models and continuously improve them becomes critical to achieving the desired levels of accuracy.
Enable Model Customization: Model customization is needed either when good-enough base models cannot be built ahead of time due to the special nature of the data (e.g. anomaly prediction in IT system logs for proprietary applications) or when a base model built using general-purpose data does not transfer well to company-specific environments (e.g. a general-purpose chatbot may not work well as a specialized drive-in menu order-taking chatbot). By exposing APIs for the machine learning model to be customized, you make it easy for the model to take in external data, beyond the data with which it was initially trained, for in-the-field training. Model customization is the mechanism by which continuous learning happens.
Enable Continuous Learning: Enabling hooks for continuous learning is a must for any machine learning model, for multiple reasons: the initial training data may be insufficient and need to be augmented or customized, or the model may need to stay fresh to reflect changing input data patterns. Mechanisms for automatically retraining the models with new data are a critical aspect of deploying machine learning models, especially to address the first-mile problem.
Human AI-authoring & Feedback: Enable subject matter experts (SMEs) to guide various aspects of prediction tasks, including data selection, data preparation, and annotation of unknown patterns/templates/samples, to expedite learning. This is, in essence, the human authoring of AI. There is no shame in doing this; in fact, it is the best way to bootstrap the models and get them going in the right direction. It’s like asking a local person for directions to the nearest known main street when you get lost. It works! After all, you want the job to get done rather than sitting on a high tower of full automation. SMEs can further accelerate learning by giving the models regular feedback from which they can learn continuously.
Have realistic expectations: Machine learning models can’t do magic. Having realistic expectations goes a long way in avoiding premature disappointment with technology that can improve over time with the right feedback and more representative data. After all, you often know the way to the nearest street that is mapped in the map software. So, if the map software gets it wrong, you use your in-built sense of orientation to get going till the map software can guide you better. That is, treat the initial model as an intern-in-training until enough representative data can be collected to improve the accuracy of the model.
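To make the first-mile idea concrete, here is a minimal sketch of a baseline that keeps learning as data arrives and declines to make predictions until it has seen enough samples. It is not the method of any particular product; the class name, the warm-up length, and the z-score threshold are all assumptions chosen for illustration, and the input is assumed to be a simple numeric feature such as the count of error log lines per minute.

```python
import math


class RunningBaseline:
    """Online mean/variance (Welford's algorithm) with a warm-up period."""

    def __init__(self, warmup=30, z_threshold=3.0):
        self.warmup = warmup            # samples to observe before flagging anything (assumed)
        self.z_threshold = z_threshold  # standard deviations that count as anomalous (assumed)
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0                   # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the baseline (continuous learning)."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def is_anomaly(self, x):
        """Return None while the baseline is still warming up, else True/False."""
        if self.n < self.warmup:
            return None
        std = math.sqrt(self.m2 / (self.n - 1))
        if std == 0:
            return x != self.mean
        return abs(x - self.mean) / std > self.z_threshold

    def observe(self, x):
        """Score a new value, then learn from it unless it was flagged."""
        verdict = self.is_anomaly(x)
        if verdict is not True:
            self.update(x)
        return verdict
```

Returning None during warm-up is one way to set realistic expectations: the caller can see that the model is still getting its bearings rather than receiving a low-quality prediction.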

Last-mile problems

Continuing the log anomaly prediction problem as an example to illustrate first-mile and last-mile problems with machine learning models, let’s examine how the last-mile problem manifests. Say that you have built a good anomaly prediction model, did all the right things to get past the first-mile problem, achieved desired prediction accuracy, and deployed the model in production. Even so, the model might make mistakes every so often because of the infrequently occurring long-tail type of scenarios. For example, seasonality that occurs once in a year, or maintenance periods that trigger different behaviors of IT systems may confuse the model and may lead to inaccurate predictions. These are examples of the last-mile problem. In such cases, the best course of action might be to write a rule to deal with seasonality and maintenance windows. It takes too long, too much data, and too many repetitions of data for the model to learn these types of patterns. It is much easier to deal with these types of patterns via rules. Here are a few other strategies to better deal with the last-mile problem with machine learning models.

Strategies for dealing with the last-mile problem

Develop good test datasets with accurate ground truth: How does one know that there is a last-mile problem or a prediction accuracy problem? Well, one must first develop good test datasets with accurate ground truth to identify that there is a prediction problem. In some domains, there could be a lot of gray areas in the ground truth. What might be an anomaly for one person may not be one for another. For example, in the sentiment prediction problem domain, what is a negative statement for one person could be a neutral statement for another. So, establishing accurate ground truth is critical to identifying the model’s weaknesses.
Perform Error Analysis: Once the areas of the model’s weaknesses are identified, it is important to perform systematic error analysis. If possible, classify the errors manually or automatically, understand the source of errors, and have different strategies for fixing each error as needed.
Augment training data, if possible: If you know the exact patterns of mistakes, see if it is possible to collect more samples in those specific areas to augment the training data. Sometimes, this may not be possible if the errors are caused by corner cases that don’t occur that frequently.
Log payload & continuously learn: Whenever possible and privacy policies allow for it, log the input data of the machine learning model. Correlate the input data with the timing and error analysis to identify the input samples upon which the model is making prediction errors. This data needs to be annotated by the SMEs to correct the machine learning model’s weaknesses in the next iteration of learning.
Short-circuit learning with rules or micro-models: Often, the best way to deal with corner cases is by adding rules or by building targeted micro-models. Gathering enough samples of rarely occurring cases to teach the system might take too long and is not guaranteed to happen in a reasonable amount of time. Moreover, if you know the specific corner cases where the model is making mistakes, you already know the patterns. So, it’s much simpler to write a rule for these patterns than to collect many examples of that pattern to let the model learn the same rule that you already know. For example, in the anomaly prediction domain, what might be a high severity anomaly might not be as distinguishable from a low severity anomaly without further context. Whether an anomaly should be treated as an incident may depend on external factors such as how many times the problem occurs (i.e., error rate), how many other incidents have occurred this month, whether these incidents violate service level objectives or not, etc. These are best dealt with via post-processing rules. Also, sometimes, micro-models can be developed for targeted long-tail cases.
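As an illustration of post-processing rules, the sketch below applies two hand-written rules to anomalies that a model has already detected: suppress anything raised during a known maintenance window, and escalate to an incident only when the error rate or a service level objective warrants it. The window dates, the error-rate threshold, and the dictionary field names are hypothetical, invented for this example.

```python
from datetime import datetime

# Hypothetical maintenance windows (start, end); in practice these would come
# from a change-management calendar.
MAINTENANCE_WINDOWS = [
    (datetime(2022, 5, 21, 1, 0), datetime(2022, 5, 21, 3, 0)),
]
MIN_ERROR_RATE = 0.05  # assumed threshold below which an anomaly is not escalated


def in_maintenance(ts):
    """True if the timestamp falls inside any known maintenance window."""
    return any(start <= ts < end for start, end in MAINTENANCE_WINDOWS)


def triage(anomaly):
    """Map a model-detected anomaly to 'suppress', 'low', or 'incident'."""
    if in_maintenance(anomaly["timestamp"]):
        return "suppress"
    if anomaly["violates_slo"] or anomaly["error_rate"] >= MIN_ERROR_RATE:
        return "incident"
    return "low"
```

The rules encode knowledge the team already has, so no additional training samples of these rare cases are needed.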

Conclusion

Machine learning models are susceptible to first-mile and last-mile problems, similar to what happens with mapping software. Effectively dealing with the first-mile and last-mile problems requires purposeful data collection, preparation, planning, diligent error analysis, and subject matter expert input and guidance. By working together, humans and machines can co-create effective machine learning models.

Are we there yet? The First-Mile and Last-Mile problem with Machine Learning Models was originally published in IBM Cloud on Medium, where people are continuing the conversation by highlighting and responding to this story.

Watson AIOps: AI for IT Operations Management

Rama Akkiraju — Wed, 05 Aug 2020 20:19:01 GMT

At first, there were distributed computing systems, next, there were fault-tolerant systems, then, autonomic computing, and now, AI Operations. Someone once said that there is nothing new in Computer Science and that the same concepts keep coming back every few years. It’s like old wine being served in a new bottle!

Is it really? While the concepts and the vision for all these topics are the same, which is to have computer systems that are capable of self-management, the mechanisms, the means, and the standardization needed to achieve that vision fully are only coming into place now. Technologies such as Cloud computing, micro-service architectures, containerized software development with tools such as Docker, and open-source container orchestration systems (e.g. Kubernetes) for automating computer application deployment, scaling, and management are all making it possible to have the levels of abstraction needed to scale the self-management implementations. Even if the applications that are being managed have not themselves made their way to Cloud yet, the fact that the operations management solution can be built as containerized software managed by Kubernetes makes the solution more readily scalable to multiple environments. Also, the rise of Artificial Intelligence (AI), powered by the advancements in hardware architectures, Cloud computing, natural language processing (NLP) via language models such as BERT, machine learning (ML) via deep learning (DL) algorithms and frameworks (such as TensorFlow and PyTorch), and deep neural network architecture optimization frameworks (such as Katib), has opened up new opportunities for optimizing business processes in various industries. Operations management of IT systems is one such area that is ripe for optimization. By leveraging the advancements in AI and Cloud computing we can now set out to achieve the vision of self-managing computer systems. That’s where AI for IT Operations management, aka AIOps, comes into the picture.

Information Technology (IT) Operations management is a vexing problem for most companies that rely on IT systems for mission-critical business applications. Despite the best intentions of engineers, good designs, and solid development practices, software and hardware systems deployed in companies in service of critical business applications are susceptible to outages, resulting in millions of dollars in labor costs, revenue loss, and customer satisfaction issues each year. Even the best analytical tools fall short. This can be attributed to the complexity of the problem at hand. IT applications, the infrastructure that they run on, and the networking systems that support that infrastructure all produce large amounts of structured and unstructured data in the form of logs and metrics. The volume and the variety of data generated in real-time pose significant challenges for analytical tools in processing them to detect genuine anomalies, correlate disparate signals from multiple sources, and raise only those alerts that need the attention of IT operations management teams. To add to this, data volumes continue to grow rapidly as companies move to modular micro-services-based architectures, further compounding the problem.

AI can help solve these problems. AI can help IT operations management personnel/Site Reliability Engineers (SREs) in detecting issues early, predicting them before they occur, reducing event & alert noise by grouping events/alerts related to the same incident, locating the specific application or infrastructure component that is the source of the issue, determining the scope of incident impact, and recommending relevant and timely actions based on mining prior incident records. All these analytics help reduce the mean time to detect an incident (MTTD) and the mean time to identify/isolate the cause of an incident (MTTI) and, therefore, the mean time to resolve (MTTR) an incident. This, in turn, saves millions of dollars by preventing direct costs (lost revenue, penalties, opportunity costs, etc.) and indirect costs (customer dissatisfaction, lost customers, lost references, etc.). Below, we describe the AI in our Watson AIOps solution.

The IT operations environment generates many kinds of data. These include metrics, alerts, events, logs, tickets, application and infrastructure topology, deployment configurations, and chat conversations. Of these, metrics tend to be structured in nature while logs, alerts, and events are semi-structured, and the content in tickets and chat conversations tends to be unstructured. Also, among all the data types, logs and metrics can sometimes be leading indicators of problems, while alerts, tickets, and chat conversations tend to be lagging indicators. An advanced IT operations management system can take all of this data as input, detect incidents early, predict when incidents may occur, offer timely and relevant guidance on how to resolve incidents quickly and efficiently, automatically apply resolutions when applicable, and proactively prevent incidents from recurring by enforcing the required feedback loops into the various software development lifecycles. This can increase the productivity of IT operations personnel or Site Reliability Engineers (SREs) and thereby improve the mean times to detect, identify, and resolve incidents.

Enter Watson AIOps into the picture. It does exactly that!

IBM already has strong products in the market for event management, topology management, and metric-based anomaly prediction via the Netcool Operations Manager product. These capabilities draw insights from alerts, events, and metrics. Building on these strong foundations, we introduced Watson AIOps 1.0 in June 2020, bringing together insights from both structured and unstructured data types. It included anomaly prediction from logs, potential problem identification via fault localization analysis on topology, evidence and explanations to understand the problem, incident impact radius analysis to determine the scope of impact, and problem resolution suggestions via prior similar incident analysis. In Watson AIOps, insights such as anomaly prediction, the grouping of events, the probable cause of the incident, and next-best-action recommendations are all delivered in a ChatOps environment, such as Slack, a place where IT operations management personnel or Site Reliability Engineers (SREs) work.

With Watson AIOps for Cloud Pak for Data 2.0, we are bringing these capabilities even closer together as shown in Figure 1. Broadly speaking, Watson AIOps solution capabilities can be organized into event management, incident diagnosis, incident resolution, and insights delivery categories. These capabilities are supported by an ecosystem of connectors and platform capabilities to manage the AI model training, their lifecycle for improvements, etc. Below, we give a brief view of each of these capabilities whose overall flow is highlighted in Figure 2.

Figure 1: Watson AIOps for Cloud Pak for Data 2.0 Components.

Event Management

An event indicates that something noteworthy has happened in an IT operations environment. For example, an application has become unavailable, a disk is full, or a disk is reaching capacity. Event management is the process that monitors and manages all events that occur across a business application or IT infrastructure. Event management involves event collection, event classification, event normalization, deduplication, event enrichment for analytics, event correlation, and event grouping, either via manual rules or via automated means. The main goal of event management is not only to record and manage events but also to provide insights on those events that need operator attention, either because they are likely to turn into major incidents or because they are already major incidents and action must be taken. The goal of event grouping, classification, and deduplication is to reduce the noise for IT operations managers and to help them focus on the few important events that need their immediate attention. Event Manager in Watson AIOps 2.0 offers all of the event management capabilities noted above. The AI Manager complements this event grouping via entity-based correlation of events. The entity-based event grouping extracts entities, i.e., mentions of application and infrastructure component names that are referenceable via topology, and correlates them to further inform the event grouping. These entity mentions also help in isolating the faulty components and in determining the incident impact scope.

In Watson AIOps 2.0 we bring together the capability to group events generated from structured, semi-structured, and unstructured data types. These include anomalies detected from metrics, logs, and tickets, respectively. We use multiple algorithms, such as Temporal, Spatial, and Association Rule mining algorithms, in Watson AIOps for event grouping.
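To give a feel for what temporal grouping means, here is a deliberately simple sketch that puts events into the same group when each one arrives within a fixed gap of the previous one. This is only an illustration of the general idea, not the algorithm used in Watson AIOps; the function name, the `ts` field (seconds), and the 60-second gap are assumptions.

```python
def group_by_time(events, max_gap_seconds=60):
    """Group events so that consecutive events in a group are at most max_gap_seconds apart."""
    groups = []
    for event in sorted(events, key=lambda e: e["ts"]):
        if groups and event["ts"] - groups[-1][-1]["ts"] <= max_gap_seconds:
            groups[-1].append(event)   # close enough to the previous event: same group
        else:
            groups.append([event])     # gap too large: start a new group
    return groups
```

An operator would then see one entry per group instead of one entry per event, which is the noise reduction described above.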

Static and Dynamic Topology Management

Application and network topology refers to a map or a diagram that lays out the connections between different mission-critical applications in an enterprise. Static topology refers to a map that is constructed based on the build and deployment information of applications and infrastructure components. Dynamic topology, on the other hand, refers to a dynamic map that captures the resources and their relationships as the environment changes at run-time and provides near-real-time visibility of the same. Another important aspect of a dynamic topology is the ability to compare the current topology with a historical one. Real-time and historical views of the environment answer “What happened” and “What’s happening”, and let you see the details that led up to an incident and the topology (and status) changes over time. Watson AIOps’ Topology functionality is offered via Agile Service Manager. It supports observation and discovery of the application and infrastructure dependencies, regardless of type, vendor, or source. Topology Manager also supports cross-layer application and infrastructure dependency mapping where the information originates from distinct, disjoint sources of truth, so that the solution provides a comprehensive application and infrastructure dependency mapping up and down the stack. This topology functionality is tightly integrated with AI Manager and is leveraged in entity-correlation-based event grouping, in faulty component visualization, and in fault impact radius estimation.
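Comparing a current topology with a historical one can be pictured as a set difference over dependency edges. The sketch below assumes each snapshot is simply a collection of (source, target) pairs; the function name and the representation are hypothetical and say nothing about how Agile Service Manager stores topology.

```python
def topology_diff(previous_edges, current_edges):
    """Report which dependency edges appeared or disappeared between two snapshots."""
    previous, current = set(previous_edges), set(current_edges)
    return {
        "added": sorted(current - previous),
        "removed": sorted(previous - current),
    }
```

For instance, if a web tier was repointed from one database to another shortly before an incident, the removed and added edges would surface that change.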

Incident diagnosis: Incident diagnosis involves identifying incidents early via anomaly prediction, isolating the faulty component, and determining the impact scope. Watson AIOps offers all of these capabilities. We examine them below briefly.

Anomaly prediction:

The goal of anomaly detection and prediction is to detect anomalies from logs and metrics. An anomaly is something that deviates from normal, standard, or expected behavior. Typically, organizations set either static thresholds or manual rules to define and manage deviations from normal behavior. These rules are usually set on log aggregation systems (such as LogDNA, Splunk, etc.) and metric monitoring systems (such as SysDig, Prometheus, etc.). The problem with static thresholds is that, first, it takes a long time for subject matter experts (SMEs) to distill them from their experience and to create them and, second, they don’t adapt to changes and therefore tend to become outdated and irrelevant quickly. If not updated or deleted, these manual rule-based anomalies can start to flood SREs with irrelevant alerts. In our experience, approximately 30% of these threshold events are never actioned! Operations teams waste time and effort in managing these thresholds and end up missing important clues. Therefore, learning what is normal, baselining it, and using it to automatically detect anomalies can free up SME time from having to manually manage these rules. Watson AIOps offers anomaly detection from both metrics and logs.

· Log Anomaly prediction: IBM’s Watson AIOps’ state-of-the-art and multi-patent-pending log anomaly detection technology, available in AI Manager, is capable of automatically parsing IT application and infrastructure logs from log aggregation tools such as LogDNA, automatically learning normal log patterns from training data, understanding their semantic meaning, and detecting anomalies in real-time much sooner than traditional thresholding-based or error-string-matching alerting techniques can, thereby significantly reducing the mean time to diagnose an incident. We use deep-learning algorithms both to prepare features from logs during log parsing and to make anomaly predictions. Users don’t have to set static thresholds or manual rules to detect anomalies. The system will automatically detect these anomalies. The detected anomalies are then explained with back pointers to the specific log messages in which they were noted. We have applied this log anomaly detection system to an internal field management application, run by IBM’s own CIO office, that sellers use to track their incentives. In a specific test we did, by analyzing the Apache server logs, we were able to detect anomalies, on average across five different major incidents, up to 20 hours before a human opened incident tickets. In this experiment, training was done on one week's worth of aggregated access and error logs representing normal operation, that is, no major impact on business. Major incidents corresponding to these anomalies were not detected by any rules or existing thresholds and hence were missed till a major incident actually occurred and an IT operations management person created a ticket for it.

· Metric Anomaly prediction: Watson AIOps metric-based anomaly detection, available in Metric Manager, analyzes metrics data from various systems such as New Relic, AppDynamics, and SolarWinds to automatically learn the normal behavior of metrics in your company and automatically detect anomalies from metrics. It employs a set of time-tested time-series algorithms such as Granger Causality, Robust Bounds, Variant/Invariant, Finite Domain, and Predominant Range to capture seasonality and significant trends and to perform forecasting. Many metrics are seasonal. For example, what is normal for a metric at 2 pm in a time zone may not be normal for that metric at 10 pm in the same time zone. Therefore, taking the seasonality of a particular environment into account is critical to accurately predicting anomalies. The Metric Manager in Watson AIOps is equipped to do this. In a specific evaluation scenario, our metric anomaly predictor caught the problem two days before a server stopped collecting data and was rebooted as a result. In another evaluation scenario, our solution was able to detect memory leaks five days before the memory would have maxed out on the server, and so prevented an outage.
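The 2 pm versus 10 pm example can be sketched with a separate baseline for each hour of the day. This is a plain per-hour z-score, used here only because it is easy to read; it is not one of the algorithms named above, and the function names and threshold are assumptions.

```python
from collections import defaultdict
from statistics import mean, stdev


def fit_hourly_baseline(samples):
    """samples: iterable of (hour_of_day, value). Returns {hour: (mean, stdev)}."""
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)
    # stdev needs at least two points, so hours with less history are skipped
    return {h: (mean(v), stdev(v)) for h, v in by_hour.items() if len(v) >= 2}


def is_anomalous(baseline, hour, value, z_threshold=3.0):
    """Compare a value against what is normal for that hour of the day."""
    if hour not in baseline:
        return False  # no baseline for this hour yet, so nothing to compare against
    mu, sigma = baseline[hour]
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_threshold
```

With history of roughly 1000 requests per minute at 2 pm and roughly 100 at 10 pm, a reading of 1000 is unremarkable at 2 pm but is flagged at 10 pm, which a single static threshold could not express.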

Fault Localization & Blast Radius

Entity mentions are the names of the resources (e.g. service or application component names, server names, server IP addresses, pod ids, node ids, etc.) that are referenced in anomalous logs, alerts, tickets, and events. Once events are grouped and the entity mentions in anomalies, alerts and events are extracted, we perform entity resolution with topological resources to isolate the problem and to place the identified entities on the corresponding dynamic topology instances that match the time at which the mentions were noted. This enables us to map identified faults on topology. Traversing the topological graph in the application, infrastructure, and network layers enables us to map out the impacted components.
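Traversing the topology to map out impacted components can be sketched as a breadth-first search. The example assumes the topology is available as a dictionary mapping each component to the components that depend on it; that representation, the function name, and the component names in the usage note are hypothetical.

```python
from collections import deque


def blast_radius(dependents, faulty):
    """Return every component reachable from `faulty` by following 'is depended on by' edges."""
    impacted, queue = set(), deque([faulty])
    while queue:
        node = queue.popleft()
        for nxt in dependents.get(node, ()):
            if nxt not in impacted and nxt != faulty:
                impacted.add(nxt)
                queue.append(nxt)
    return impacted
```

Given a topology in which `orders` and `billing` depend on `db`, and `web` depends on both, a fault localized to `db` yields all three of the other components as its impact scope, while a fault in `web` yields none.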

Incident resolution

Watson AIOps ingests and mines prior ticket data to provide timely and relevant action recommendations for the currently diagnosed problem. Current incident symptoms are framed as a query to the indexed ticket data, not only to search and retrieve the top k relevant prior incident records but also to extract important entity-action (aka noun-verb) phrases from each relevant record, making it easy for SREs to get a quick glimpse of the suggested action. For example, from a long chat conversation that is pasted inside the ‘closing comments’ section of an incident record, we extract phrases such as ‘Scaled Compose data node’ and ‘Restarted Analytics pods’. In the first phrase, ‘Compose data node’ is the entity and ‘scaled’ is the action. In the second phrase, ‘Analytics pods’ is the entity and ‘restarted’ is the action. We apply various natural language processing techniques, including rule-based systems, to extract entity and action phrases.
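A rule-based extractor of this kind can be as simple as a regular expression anchored on a small vocabulary of action verbs. The sketch below is a toy illustration, not the extraction pipeline used in the product; the verb list, the stop words that end an entity phrase, and the function name are assumptions.

```python
import re

# Assumed vocabulary of remediation verbs; a real system would need a far larger list.
ACTION_VERBS = ("scaled", "restarted", "rebooted", "rolled back")

# An action verb, an optional "the", then one to four words that end at
# punctuation, at a connecting word, or at the end of the text.
PATTERN = re.compile(
    r"\b(" + "|".join(ACTION_VERBS) + r")\s+(?:the\s+)?"
    r"([\w-]+(?:\s+[\w-]+){0,3}?)"
    r"(?=\s+(?:and|to|after|because)\b|[.,;]|$)",
    re.IGNORECASE,
)


def extract_entity_actions(text):
    """Return (entity, action) pairs found in free text such as closing comments."""
    return [(m.group(2), m.group(1).lower()) for m in PATTERN.finditer(text)]
```

Applied to a sentence such as "We scaled Compose data node, then restarted Analytics pods.", this returns the two entity-action pairs discussed above.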

Insights Delivery and Action Implementation

In Watson AIOps, all of the insights described above are delivered both via ChatOps and dashboards. Real-time, in-the-moment insights are delivered via ChatOps to SREs directly in the place where they work. Within ChatOps, there is functionality to interact and share selected incident resolution suggestions with other collaborators, in addition to exploring the evidence behind the insights. From ChatOps, SREs can launch log, metric, and ticket monitoring tools to explore further details. Similarly, SREs can launch interactive dashboards powered by Event Manager, Metric Manager, and Topology features for detailed exploration of events, event groups, metric anomalies, and topology. Applicable actions/runbooks can then be automatically run via Runbook execution.

Quality Evaluations

Capabilities such as Event Manager, Metrics Manager, and Topology are already fielded in many clients’ environments. Therefore, we focused our performance evaluations on the new AI-infused capabilities offered through AI Manager. We applied Watson AIOps analytical pipelines to various internal IBM applications and services to test-drive some of the latest features in AI Manager. We also tested some of the newer AI capabilities, such as the log anomaly predictor, entity-linking-based event grouping, and incident similarity, with some of our clients as part of beta testing the product. Our results indicate that we achieve significant reductions in the mean time to diagnose and the mean time to resolve incidents. In some instances, we detected anomalies 20 hours ahead of a human creating a ticket; in other cases, we reduced the mean time to resolve incidents from 6 hours to less than 15 minutes. We are excited about the time and cost savings we are set to deliver to our clients.

A note on AI model life-cycle management

The AI models in Watson AIOps are unsupervised machine learning models. They don’t need labeled data, but they do need data to learn the normal behavior of metrics and logs and to index and analyze prior incident ticket records. Therefore, Watson AIOps takes a representative set of metrics, logs, and ticket data for training and building models. Watson AIOps models are set up to learn continuously using up-to-date data from your environment and to improve based on user feedback. Watson AIOps is not a black-box AI-infused solution. We believe in full transparency of the inner workings of our AI models. While Watson AIOps is set up to automatically retrain the models at regular intervals, IT operations administrators have access to our model (re)train scripts and can execute model retraining on demand at any time.

Figure 2: Watson AIOps at a glance

What’s next for Watson AIOps?

While we plan to keep enriching the various AI pipelines mentioned in this article, we are excited to bring together the enterprise-grade event management, predictive insights, and dynamic application topology management capabilities that you are already familiar with from the Netcool Operations Insights portfolio with the latest AI-infused machine learning and natural language processing capabilities that mine unstructured data sources such as tickets, logs, and chats, to offer an unparalleled IT operations management solution for our customers. We are looking forward to expanding our ecosystem of input connectors by integrating with various log, metric, and ticketing vendor products in the remainder of 2020 in an effort to bring out-of-the-box value to our customers. Netcool Operations Insights already offers more than 150 connectors to various open-source and vendor tools. We look forward to bringing them all together in Watson AIOps onto a single stack and to further expanding this set in 2020. Similarly, on the output front, we are expanding ChatOps platforms from Slack to Microsoft Teams and other platforms. We believe in delivering insights where SREs work, which increasingly means ChatOps environments. So, we will continue to invest in improving our user experience and explanations within ChatOps. However, we do realize the value of rich dashboards that offer interactive what-if analysis, decision support, and off-line analysis of what has happened. Therefore, throughout the rest of 2020 and beyond we will continue to bring these user interfaces together to allow users to seamlessly traverse both to derive the insights they need and to perform the actions they need to perform.

Furthermore, in the next generations of our Watson AIOps solution, we envision self-aware and autonomic IT operations environments that not only shift left in development-security-operations (DevSecOps) life cycles to influence deployment, test, build, code, and design processes but also close the loops with the operations phase through feedforward and feedback mechanisms. By doing so, we intend to fully equip each stage in the DevSecOps life cycle with full foresight and hindsight, enabling intelligent and consequence-aware decision-making at each stage. Our vision for shifting left in the DevSecOps life cycle, while closing the loops with virtuous feedback and feedforward cycles for efficient operations management, is shown in Figure 3. We envision various stages of IT application development processes being equipped with the smarts to proactively prevent issues from happening at run-time by not advancing IT application artifacts that do not meet the preset quality criteria to the next stage. For example, smart checks and gates prevent risky deployments from getting pushed to production, stop under-tested code modules from getting into deployment phases, block code with risky security vulnerabilities from getting to the deployment phase, and so on. We envision the Watson AIOps solution correlating past incidents with root causes that could be traced to under-tested deployment changes, security vulnerabilities, poor code test coverage, and the like. This information, when fed back, serves as a critical input to reinforcing the checks and gates in the earlier stages of the DevSecOps life cycle.

Figure 3: Shifting left in the DevSecOps life cycle while closing the loops with virtuous feedback and feedforward cycles for efficient operations management

So, after all, we do come back full circle to fault-tolerant, autonomic distributed systems with Watson AIOps. It’s just that this time around, we have the compute powered by Cloud Computing; state-of-the-art AI algorithms, thanks to the advances in Machine Learning and Natural Language Processing; standardized platforms for building scalable management systems via Docker and Kubernetes; and a standardized data and AI management platform for building solutions, powered by IBM’s Cloud Paks. To add to this, we have a wealth of IT operations management experience at IBM from having managed IT systems and infrastructure for our customers via various strategic outsourcing engagements, and we have the depth of product experience with our Netcool suite of products that have been in the market for over twenty years. We are bringing them all together with a vision toward optimizing IT operations management, not just in a reactive mode but to prevent issues from happening in the first place by designing the DevSecOps lifecycle activities for efficient operations right from the get-go. We can’t wait to shape the future and take you all with us on this journey!

Acknowledgments

A big shout out to all the global cross-organizational leaders and team members of IBM’s Watson AIOps team for all their wonderful contributions! You know who you are! There are too many to list here. Thank you!

Watson AIOps: AI for IT Operations Management was originally published in IBM watsonx Assistant on Medium, where people are continuing the conversation by highlighting and responding to this story.

Artificial Intelligence (AI) Service and Solution Development Methodology for Enterprises — Part 3

Rama Akkiraju — Sun, 17 Jun 2018 22:48:26 GMT

In the previous articles, I discussed:

1) What it means to build machine learning services, and

2) How to deal with biases in AI models

In this article, I will discuss an approach to building machine learning/AI services and solutions for enterprises.

Software and services development has gone through various phases of maturity over the past decades, and the community has come up with life cycle management theories and practices to disseminate best practices to developers, companies and consultants alike. For example, in the software field, the Capability Maturity Model (CMM), Software Development Life Cycle (SDLC) management, Application Life Cycle Management (ALM), and Product Life Cycle Management (PLM) models prescribe theories and practical guidance for developing software products. The Information Technology Infrastructure Library (ITIL) presents a set of detailed practices for IT Service Management (ITSM) that align IT services with business objectives. All these practices provide useful guidance to developers for building software and services assets systematically. Are these models going to be helpful when building AI services in this new cognitive era as well? In this article, I’d argue that there is a need to develop an AI Service Life Cycle Management (AISLM) methodology.

Now that we are well into the cognitive era, or AI era, where developers are actively building cognitive/AI applications (e.g., chat bots, personal digital assistants, doctors’ assistants, radiology assistants, legal assistants, health and wellness coaches etc.) using AI building block services such as Conversation service, Speech-to-Text service, Text-to-Speech service, Image Recognition service, and Natural Language Understanding services (such as Sentiment, Emotion and Tone Analysis services, Concepts and Entity Recognition services etc.), I am convinced that we need to develop best practices around building AI applications and AI services as well. In this blog series, I will present my thoughts on the areas where we must evolve best practices for building AI services and AI applications. Along the way, I will share some of the practices that are working well for us. Hopefully, we can have a good debate over these topics and evolve good practices for AI Service Life Cycle Management (AISLM) and AI Application Life Cycle Management (AIALM) over time! Well, there they are! That’s two more acronyms for us to remember!

In the last article, I discussed the similarities and differences between traditional software-as-a-service application development and building machine-learnt services. Building on those ideas, in this article, I present a point of view on AI Service Life Cycle Management and best practices for scaling AI Operations.

First, let me introduce what we mean by AI Service Life Cycle Management and AI Operations, and why one should care about scaling AI Operations.

Four Stages of AI Service Development for Enterprises

We argue that a typical AI service should go through four stages of development before becoming useful and viable for a particular customer and user:

1) Base model development and enrichment

2) Industry/Domain adaptation

3) Customer adaptation, and

4) User adaptation

Four-layered AI development model for Enterprises

While the base model provides the foundations for an AI service, industry/domain adaptation and customer adaptation pave the way to market viability for AI models. Why so? If we don’t partition the problem space, it is too huge! It is everything under the sun! Imagine trying to build an English Speech-to-Text AI service that can understand all types of English accents (American, UK, Australian, African, Russian, Indian, Chinese etc.), all industries’ vocabularies, and all use cases in those industries! In addition, when you take that model to a particular company, it has to work right away without any fine-tuning! That is not possible today without unlimited budgets and data access! The amount of data needed to train such a large system is too great, it takes too long to acquire, and it costs more than AI service development companies can afford! In essence, trying to be the best at everything and the best for everyone is too costly! Narrow the problem and you may have a better chance! Let’s take a look at what we mean.

1) Base model development and enrichment: A base model is what an AI service vendor typically prepares and makes available as a service in the general domain. For example, general-purpose AI services like sentiment analyzer, speech-to-text, and image recognition services are typically trained using domain-independent data. These base models are trained with data from multiple, diverse, publicly available or licensed data sources to ensure broad coverage. The advantage offered by these base models is that they can get developers and companies started on any dataset, and they provide average-to-good accuracy depending on the type of data. The accuracy of these services usually ranges from, say, 75% to 85%, give or take 5% to 10%. Enriching these base models is a continuous activity that involves collecting, cleansing and preparing more diverse training data, including payload data from the service invocations, and fine-tuning the model parameters to put out better models. We call this the base model development and enrichment process.

2) Domain adaptation: While the base models are necessary, they are often insufficient to get the job done when tested on specific domains. For example, a Speech-to-Text service that is trained on broadcast news may not work well on speech samples from the banking and insurance domains, as there just aren’t enough occurrences of specialized words and their pronunciations in the base training data. Similarly, a Tone Analyzer service trained on general-purpose data like tweets, blogs, and news articles may not perform well when applied to the customer support domain. As a specific example, take a look at the sentence ‘I got a ticket’. It could be perceived as a sad statement by the base model, because getting a ticket is usually associated with the ticket one gets for violating the rules of the road. Since such a ticket often comes with a fine for the driving violation, it is considered a sad event in one’s life. In the customer care domain, however, ‘getting a ticket’ is a common first step in opening a problem report. As you can see, the same sentence ‘I got a ticket’ should be classified as a neutral tone statement in the customer support domain, while it should be classified as a sad tone in general-purpose domains. So, essentially, the models need to be adapted to the domains that they are expected to work in. In order to make this happen, one must train a domain-specific model. This can be done by mixing a good chunk of the base model data with domain-specific data and building a new model by adjusting the model parameters. This is how AI vendors can make domain-specific services available. For instance, a Speech-to-Text service trained to recognize the Australian accent and to decode banking terms to meet the needs of banking clients in Australia, or a Tone Analyzer service trained to better understand the sensibilities of the customer care domain, are examples of domain-adapted versions of the Speech-to-Text and Tone Analyzer services respectively.
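The data-mixing step described above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not a recipe: the function name `mix_training_data`, the `base_fraction` parameter and its default of 0.5 are all hypothetical, and the right ratio of base to domain data is something one would have to find by experiment. The sketch covers only the preparation of the mixed dataset, not the retraining itself.

```python
import random


def mix_training_data(base, domain, base_fraction=0.5, seed=0):
    """Combine all domain examples with a random sample of base examples.

    base_fraction is the share of the final dataset drawn from base data;
    0.5 is an arbitrary illustrative default, not a recommendation.
    """
    if not 0.0 <= base_fraction < 1.0:
        raise ValueError("base_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Number of base examples needed so they make up base_fraction of the mix,
    # capped by how many base examples are actually available.
    n_base = round(len(domain) * base_fraction / (1.0 - base_fraction))
    n_base = min(n_base, len(base))
    mixed = list(domain) + rng.sample(list(base), n_base)
    rng.shuffle(mixed)
    return mixed


# Toy (text, tone) examples, made up for illustration.
base = [
    ("I got a speeding ticket", "sad"),
    ("What a great day", "joy"),
    ("The train was late again", "sad"),
]
domain = [
    ("I got a ticket for my login issue", "neutral"),
    ("My ticket was closed yesterday", "neutral"),
]
mixed = mix_training_data(base, domain, base_fraction=0.5)
print(len(mixed), "examples in the domain-adapted training set")
```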

3) Customer adaptation: Often, even the domain-adapted models don’t perform at the levels of accuracy needed in enterprises. Each company has its own vocabulary, policies, business processes, products, offers and terminology. When building cognitive applications, companies would like the AI services they use to understand their own company’s specifics better. For instance, an insurance company may want the Speech-to-Text service to recognize the specific plans that the company offers, such as ‘ABC Inc’s insurance plan’. Similarly, a company may like Tone Analyzer to be further adapted to its domain. For example, if a customer is asking for a quote on ABC Inc’s insurance plan, ABC Inc may consider it an excited/joyful tone, as the customer is showing interest in its plan. Under normal circumstances, a general-purpose/base model would treat such sentences as neutral in tone, whereas a customer may want to override that by teaching the model to classify such statements as excited to suit its domain better.

4) User modeling and adaptation: Last but not least, in some domains and companies, further personalization of models might be needed. For example, in the retail industry, user clicks, interests, and preferences are gathered through various means, and user models are built to personalize and customize interactions for individual users, making those interactions more relevant to them. User models can be constructed, with user consent, from social media data, Web click data, and other transactional data sources. A more detailed view on user modeling for personalizing interactions is available in this talk I gave in 2017: https://www.youtube.com/watch?v=MlDcMyPZfIU&t=3s.

The key takeaway from this four-layered AI service development model is to have clarity on the purpose of the AI service. Which market is the AI service supposed to do well in? Base/Industry/Customer/User? Trying to be the best at everything and for everyone might be too costly and too hard! The key insight is to narrow the problem to achieve the desired results. For example, Google recently announced very special-purpose chat bots for making restaurant reservations. They chose such a narrow domain to show success because, with the present state of the art, that is the only way to perform at the levels of accuracy that are needed.

Learning accelerates when you narrow the domains

You might ask, how does narrowing the domains help achieve better accuracy of the models? Well, let’s take a look at the learning curves below. A learning curve plots the performance of an AI model as the training data size increases. Typically, the more the training data, the higher the accuracy! (There are exceptions, but for now, go with me on this one please!) Depending on the type of service involved, achieving higher levels of accuracy might need large amounts of training data. However, by narrowing the domains, we can ride different learning curves that help us get to the desired levels of accuracy faster with smaller amounts of data. We can scale from building credible base models, to market-viable industry/domain models, to useful and usable customer models, to relevant and personalized user models faster by narrowing the scope of the problem and domains.

Learning Curves: As the domain narrows, learning accelerates
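For readers who have not computed a learning curve before, here is a small self-contained sketch. Everything in it is an assumption made for illustration: the data is synthetic one-dimensional data, the "model" is a toy nearest-centroid classifier, and the function names are made up. It shows only how a curve is measured (accuracy on a fixed test set as the training set grows); it does not reproduce the curves in the figure or the effect of narrowing a domain.

```python
import random


def make_data(n, seed):
    """Synthetic 1-D two-class data: class 0 around 0.0, class 1 around 2.0."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        data.append((rng.gauss(2.0 * label, 1.0), label))
    return data


def train_centroids(train):
    """Toy nearest-centroid 'model': the mean feature value per class."""
    sums, counts = {}, {}
    for x, y in train:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def accuracy(centroids, test):
    """Fraction of test points whose nearest centroid matches their label."""
    correct = 0
    for x, y in test:
        predicted = min(centroids, key=lambda c: abs(x - centroids[c]))
        correct += predicted == y
    return correct / len(test)


def learning_curve(train, test, sizes):
    """Return (training size, test accuracy) pairs for growing training subsets."""
    return [(n, accuracy(train_centroids(train[:n]), test)) for n in sizes]


train, test = make_data(500, seed=1), make_data(500, seed=2)
for n, acc in learning_curve(train, test, sizes=[5, 20, 100, 500]):
    print(f"{n:4d} training examples -> accuracy {acc:.2f}")
```

To compare a broad domain with a narrow one, one could run the same measurement on two datasets and see which curve reaches the target accuracy with fewer examples.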

In the next blog article, I will discuss the AI life cycle management topic. Watch out for it!

Acknowledgements: The ideas in this blog article were shaped by various discussions I have had with many colleagues at IBM. Many thanks to all of them. Specifically, I’d like to thank Beth Smith, John Schumacher, Worknesh Belay, Vibha Sinha, Ruchir Puri and Donna Romer for lively discussions that helped refine and solidify these ideas and points of view.

Declare your biases

Rama Akkiraju — Tue, 20 Feb 2018 23:26:38 GMT

Photo by Joshua Earle on Unsplash

This month there has been a lot of discussion in the AI world on biases in Machine Learning models, spurred by this New York Times article on bias in AI systems. Every time the topic of machine learning models and biases comes up, invariably everyone points out how important it is to ensure that the training data is representative and unbiased. While that is an important point to note, I’m afraid it is not very actionable for practitioners who are building machine learning models if we don’t provide any prescriptive guidance on how to ensure that the training data is ‘representative’ and ‘unbiased’. How can a data scientist building a machine learning model ensure that the training data she is working with is unbiased? What does it mean to be ‘unbiased’ anyway? ‘Unbiased’ in what ‘scope’? Representative of what ‘scope’? Who defined that ‘scope’ for her? How should that ‘scope’ be measured? She needs some guidance and tools to answer these questions.

I’d like to make two points on this topic in this article:

First, I’d argue that there is no such thing as an ‘unbiased’ Machine Learning model. So, instead of striving to be unbiased, a machine learning model must state its biases openly.

Second, minimizing biases has to start with creating test datasets rather than with training datasets.

Let me elaborate.

State your biases: If one were to attempt to collect unbiased training datasets in a particular domain to build a robust machine learning model, ideally one would have to collect enough representative samples of data from the domain that the model is trying to learn. How does one go about modeling that domain and mapping out its contours so that one can sample enough data from the space that the domain represents? Physicists, mathematicians and statisticians like to explain the phenomena in our world from the point of view of models. Models are good tools for explaining what’s happening around us in general terms, if not at every specific occurrence of a phenomenon. Distributions such as the Gaussian, lognormal, exponential, Laplace, and Gamma are often used to represent occurrences in the real world. They are approximations, but they serve a good purpose in helping us reason about things. So, we have tools at our disposal to figure out which distribution might be a good approximation of the domain that we are trying to model. Once the distribution is identified, we can again use tools to see whether we have enough samples to represent that model of the world reasonably. However, here lies the problem. More often than not, domains are not evenly distributed, meaning that not all phenomena occur at the same frequency. Some phenomena are harder to observe than others because they occur less frequently. A corollary is that certain types of data are much more difficult to collect than others because there simply aren’t enough of them to go around. During the data acquisition process, organizations have to deal with budget and time constraints. Rarely do organizations have the unlimited budgets and time needed to collect the most comprehensive, representative datasets that could avoid biases completely. One can, at best, mitigate biases with careful planning (I will discuss this in the next point). Therefore, I’d argue that it is more practical for a machine learning model to declare its biases than to pretend that it is unbiased or that it can ever be fully unbiased. How can we do this? Well, one way is to be open about the scope, coverage, type of data and sources of data that a model is trained on. I know this gets into revealing too much of one’s secret sauce to the whole world. Organizations don’t like to do this, usually for good reasons. However, here is an analogy that might help us rationalize it. When a new drug is released to the market, the Food and Drug Administration (FDA) (or the analogous body in a different country) mandates that the ingredients used in making that medicine be declared on the drug label. Drug companies would rather not do it if they could help it, but it helps patients understand what they are getting. Maybe machine learning models ought to be treated like these new medicines released to the market. If we tell users what the models are trained on, and what their innate biases are, they will know what to expect and won’t hold a model accountable for something that it was not trained on!
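As a thought experiment, such an ‘ingredients label’ for a model could be as simple as a small structured record. The sketch below is purely illustrative: the `ModelLabel` class, its fields and the example values are hypothetical and not an existing standard or product feature.

```python
from dataclasses import dataclass, field


@dataclass
class ModelLabel:
    """Hypothetical 'ingredients label' declaring what a model was trained on."""
    name: str
    intended_scope: str
    data_types: list
    data_sources: list
    known_gaps: list = field(default_factory=list)

    def render(self):
        """Return the label as plain text, one declared item per line."""
        lines = [
            f"Model: {self.name}",
            f"Intended scope: {self.intended_scope}",
            "Data types: " + ", ".join(self.data_types),
            "Data sources: " + ", ".join(self.data_sources),
            "Known gaps: " + (", ".join(self.known_gaps) or "none declared"),
        ]
        return "\n".join(lines)


# All values below are made up for illustration.
label = ModelLabel(
    name="tone-analyzer-example",
    intended_scope="English customer-support conversations",
    data_types=["chat transcripts", "support emails"],
    data_sources=["licensed support logs (illustrative)"],
    known_gaps=["non-English text", "spoken dialogue"],
)
print(label.render())
```

The interesting design question is not the format but which fields should be mandatory; the ‘known gaps’ line is where a model would declare its biases.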

Start with test datasets: In software engineering, after many years of trial and error and iteration, best practices have evolved on how to build robust software with minimal defects. One such best practice states that one must start by building test cases for the software before writing any code. Once the expected behavior of the software is defined by means of detailed requirements, business analysts must write the test cases. Developers, then, are supposed to write corresponding unit and system test cases first. Software is then designed and developed to meet those requirements and to pass the test cases. Passing the test cases is how one measures whether the written software meets the requirements or not. Test case coverage is a very important software development metric in building good quality software systems. When building large commercial software, teams of software testers are employed to write test cases and to test the software from all angles. It seems that when building machine learning models, we have somehow forgotten these basic principles of software development. The onus is often on the data scientists building the machine learning models to ensure that they train the model with ‘unbiased’ data. Whatever happened to writing test cases first in the world of building machine learning models? Whatever happened to creating a test team? A test dataset in the machine learning world can be thought of as a test case in traditional software development. Just as test teams are an integral part of a software development organization, test teams that create test datasets should be an integral part of AI systems development organizations. Similar to ‘test coverage’ metrics in software engineering, we need to define, measure and monitor ‘test coverage’ for machine learning models. Leaving this to the data scientists who build machine learning models is not enough. A good software project never relies on developer test cases alone to release commercial software. It must be tested by independent testers.
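One very simple way to think about ‘test coverage’ for a test dataset is the share of required categories that have at least some test examples. The sketch below assumes that requirements can be written down as a set of categories, which is itself an open question; the function name `dataset_coverage`, the `min_examples` parameter and the sample categories are hypothetical.

```python
from collections import Counter


def dataset_coverage(required, examples, min_examples=1):
    """Return (coverage, gaps) for a test dataset.

    required: categories the model is expected to handle (the 'requirements').
    examples: iterable of (text, category) pairs in the test dataset.
    """
    counts = Counter(category for _, category in examples)
    # A required category is a gap if it has fewer than min_examples examples.
    gaps = sorted(c for c in required if counts[c] < min_examples)
    coverage = 1.0 - len(gaps) / len(required)
    return coverage, gaps


required = {"billing", "outage", "password reset", "cancellation"}
test_set = [
    ("I was charged twice", "billing"),
    ("The site is down", "outage"),
    ("Refund my last invoice", "billing"),
]
coverage, gaps = dataset_coverage(required, test_set)
print(f"coverage={coverage:.2f}, missing={gaps}")
# prints: coverage=0.50, missing=['cancellation', 'password reset']
```

A real metric would need to go beyond category counts (for example, covering variations in phrasing within each category), but even this crude measure makes gaps visible to a test team.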

Once the test datasets are created, it is the job of data managers (please refer to my previous article on new roles in machine learning systems for a definition of who a data manager is) to collect training data that has the desired ‘coverage’ to train a machine learning model. At this point, we need good metrics, algorithms and tools to measure various aspects of the training and test datasets: how close they are to each other, what the gaps are, and in what areas the gaps lie. Unsupervised machine learning algorithms themselves (such as topic modeling and clustering) can be put to use here to measure the distance between training and test datasets on various dimensions and to understand where gaps need to be bridged in the training datasets. Based on the noted gaps, data managers can iterate until they reach a threshold of desired coverage, or a threshold of distance between the test and training datasets. Clearly, these are ideas and concepts at this time. We need to drive more research on these topics to build the methods, processes, and tools that institutionalize this type of disciplined approach to building training and test datasets.
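As a small illustration of measuring the distance between training and test datasets, the sketch below compares their word distributions using the Jensen-Shannon divergence. Note that this is a deliberately simpler stand-in for the topic modeling and clustering approaches mentioned above, chosen so the example stays self-contained; the function names and the sample sentences are made up.

```python
import math
from collections import Counter


def word_distribution(texts):
    """Relative frequency of each lowercased, whitespace-separated token."""
    counts = Counter(word for text in texts for word in text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2): 0 for identical, 1 for disjoint."""
    def kl(a, m):
        return sum(pa * math.log2(pa / m[w]) for w, pa in a.items() if pa > 0)

    vocab = set(p) | set(q)
    # m is the midpoint distribution between p and q.
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def largest_gaps(train_dist, test_dist, top=3):
    """Words most over-represented in the test set relative to the training set."""
    gaps = {w: test_dist[w] - train_dist.get(w, 0.0) for w in test_dist}
    return sorted(gaps, key=gaps.get, reverse=True)[:top]


train_texts = ["the train was late", "what a great day", "the weather is nice"]
test_texts = ["my ticket was closed", "please reset my password"]
train_dist = word_distribution(train_texts)
test_dist = word_distribution(test_texts)
print(f"distance = {js_divergence(train_dist, test_dist):.2f}")
print("under-covered words:", largest_gaps(train_dist, test_dist))
```

A data manager could track a number like this over iterations and stop collecting data once it falls below an agreed threshold; what that threshold should be is an open question.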

But wait a minute! I still haven’t addressed the original problem of creating test datasets that cover the scope of the machine learning model. I have simply argued for distributing the problem of creating test datasets to multiple people (which is a good start anyway! Diversity often ensures good coverage). We still need a good ‘coverage’ metric to measure the distance between the scope covered by the collected test datasets and that of the ‘requirements’. Okay, so now we are on to a different point. How does one represent requirements for a machine learning model? Well, these are all critical questions we must ask and answer for ourselves. I don’t have answers to all of them yet, but I’m certainly thinking about them, and I know several of you are as well.

One thing is clear to me though! We need to be more methodical about building machine learning models than we currently are. There is a lot to learn from software engineering practices, and quality management in manufacturing. We don’t need to reinvent those wheels. We need to find suitable interpretations here so we can build on those best practices.