Sequential Evaluation to Help Develop Better Chatbots

Continuously improving classification quality is crucial for chatbot projects that rely on it. How can we validate the choice of algorithm and, at the same time, understand and improve data quality?

The challenge is to select a classification algorithm that not only works well for many (100–200) classes but also maintains reliable, consistent performance over time. We have worked before on better evaluation of dataset snapshots, and have recently been looking into evaluation over the long life of dataset development and maintenance.

We introduce one way to evaluate chatbot performance using historical data. The implementation can be found at https://github.com/jobpal/nex-cv.

How to use sequential evaluation with nex-cv

The proposed evaluation can be used to monitor the performance of a specific algorithm over time. Training the chatbot continuously on live incoming data is an important feature, because it allows the algorithm to show live improvements to clients. Sequential evaluation needs time information to reconstruct the real scenario in which new (or updated) data is provided to the trained model. Test data is selected using this time information, so that at each step it is drawn from the period already covered by the training data. This allows “playing back” the real process of a chatbot being trained, and allows comparing a classification algorithm not just at one point in time but over the reconstructed lifetime of a dataset. At each step, the following performance metrics are reported (a sketch of the replay loop follows this list):

  • Run time (in seconds)
  • Accuracy
  • Weighted and macro F1 scores
  • Precision/recall per class
  • Results on N randomly selected examples (for debugging)
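
Below is a minimal sketch of this replay loop. It is an illustration, not the nex-cv API: the column names ("timestamp", "text", "label"), the number of steps, and the train_model callable (raw texts and labels in, fitted classifier out) are assumptions made for the example.

```python
import time

from sklearn.metrics import accuracy_score, classification_report, f1_score


def sequential_evaluation(df, train_model, n_steps=10, debug_n=20):
    """Replay a labelled dataset in timestamp order, re-training and scoring at each step."""
    df = df.sort_values("timestamp").reset_index(drop=True)
    step_size = max(1, len(df) // n_steps)
    reports = []
    for step in range(1, n_steps + 1):
        seen = df.iloc[: step * step_size]            # data available by this step
        # test split drawn from the period already covered by the training data
        test = seen.sample(frac=0.2, random_state=step)
        train = seen.drop(test.index)

        start = time.time()
        model = train_model(train["text"], train["label"])
        run_time = time.time() - start                # run time in seconds

        pred = model.predict(test["text"])
        reports.append({
            "step": step,
            "run_time_s": run_time,
            "accuracy": accuracy_score(test["label"], pred),
            "f1_weighted": f1_score(test["label"], pred, average="weighted"),
            "f1_macro": f1_score(test["label"], pred, average="macro"),
            # per-class precision/recall, keyed by class label
            "per_class": classification_report(
                test["label"], pred, output_dict=True, zero_division=0),
            # N randomly selected examples with their predictions, for debugging
            "debug_examples": test.assign(predicted=pred)
                                  .sample(min(debug_n, len(test)), random_state=step)
                                  .to_dict("records"),
        })
    return reports
```

Each entry in the returned list corresponds to one point on the reconstructed timeline, so the metrics can be plotted directly against the growth of the dataset.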

Depending on the training algorithm, different decisions have to be made when new data arrives (with or without new classes) or when old data is updated (e.g., a ground-truth label is changed). Does the algorithm perform a full re-training in every case, or only under some of these conditions? A sketch of this decision logic follows.
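
The sketch below assumes a scikit-learn-style model that exposes fit and, optionally, partial_fit; the bookkeeping around known classes and relabelling is an illustrative assumption, not how nex-cv or any particular chatbot backend handles it.

```python
def update_model(model, full_X, full_y, new_X, new_y, known_classes, relabelled=False):
    """Decide between a full re-training and a cheaper incremental update."""
    new_classes = set(new_y) - set(known_classes)
    if new_classes or relabelled or not hasattr(model, "partial_fit"):
        # the class set changed, or old ground-truth labels were edited:
        # re-train from scratch on the full (corrected) dataset
        model.fit(full_X, full_y)
    else:
        # append-only data with already-known classes: an incremental update is enough
        model.partial_fit(new_X, new_y)
    return model
```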

Online classifiers are a natural choice when training on data streams, but they come at the cost of forgetting old knowledge. Sequential evaluation can be used in this context to evaluate the online classifier’s ability to remember old data (sketched after the figure below). Based on 50k questions asked of one chatbot over one year, the figure below shows an online classifier’s accuracy (bars) and its run time on an AWS r5.xlarge instance (blue line). The dark grey bars indicate steps at which one or more classes were added, so that a full re-training was needed.

Sequential evaluation on a company FAQ dataset
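
One way such a "forgetting" check could look in practice is sketched below: train incrementally in time order while repeatedly scoring the model on a fixed probe drawn from the earliest period. The HashingVectorizer/SGDClassifier pipeline and the step count are illustrative choices, not the setup behind the figure above.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score


def forgetting_curve(texts, labels, n_steps=10):
    """texts/labels are assumed to be in chronological order."""
    vec = HashingVectorizer(n_features=2 ** 18)
    clf = SGDClassifier()
    classes = np.unique(labels)

    step = max(1, len(texts) // n_steps)
    # fixed "old knowledge" probe: the earliest batch of questions
    old_X, old_y = vec.transform(texts[:step]), labels[:step]

    scores = []
    for i in range(n_steps):
        batch_texts = texts[i * step:(i + 1) * step]
        batch_labels = labels[i * step:(i + 1) * step]
        clf.partial_fit(vec.transform(batch_texts), batch_labels, classes=classes)
        # accuracy on the earliest batch shows how much old data is still handled correctly
        scores.append(accuracy_score(old_y, clf.predict(old_X)))
    return scores
```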

Furthermore, data insights can be derived from analysing the sequential evaluation results. Concept drift, for instance, is a hard-to-detect problem that can emerge over time for different reasons: e.g., a new reviewer with a different understanding of what a class means will apply the “ground truth” labels differently. The sequential evaluator helps give a more in-depth view of a classification algorithm’s performance, including how much changes in the dataset impact measures of classifier performance. One simple way to surface candidate drift from the evaluation output is sketched below.
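
Assuming the reports structure produced by the sequential_evaluation sketch above, one heuristic is to flag classes whose per-class score drops sharply between consecutive steps. Such drops do not prove drift, but they point a reviewer at the right classes and time periods.

```python
def flag_possible_drift(reports, metric="f1-score", drop_threshold=0.2):
    """Return (step, class, drop) triples where a per-class score fell sharply
    between consecutive evaluation steps -- a candidate sign of concept drift
    (or of a new reviewer applying labels differently)."""
    summary_keys = {"accuracy", "macro avg", "weighted avg"}
    flags = []
    for prev, curr in zip(reports, reports[1:]):
        shared = (set(prev["per_class"]) & set(curr["per_class"])) - summary_keys
        for cls in shared:
            drop = prev["per_class"][cls][metric] - curr["per_class"][cls][metric]
            if drop > drop_threshold:
                flags.append((curr["step"], cls, round(drop, 3)))
    return flags
```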
