Common metrics for evaluating natural language processing (NLP) models
You can’t train a good model if you don’t have the right evaluation metric, and you can’t explain your model if you don’t understand the metric you’re using. So, here’s a list of common metrics which are used for ML and NLP models, along with their definitions and common applications. I’ve always had a difficult time remembering these from charts and confusion matrices, so I thought a verbal explanation might work better.
Accuracy: The fraction of predictions the model gets right out of all the predictions it makes. Best used when the output variable is categorical or discrete. For example, how often a sentiment classification algorithm is correct.
Recall: The percent of true positives identified out of all actual positive cases (true positives plus false negatives). Particularly helpful when identifying positives is more important than overall accuracy. For example, if identifying a cancer that is prevalent 1% of the time, a model that always spits out “negative” will be 99% accurate, but will have 0% recall.
Precision: The percent of true positives out of combined true and false positives, that is, out of everything the model flagged as positive. In the example with a rare cancer that is prevalent 1% of the time, if a model creates totally random predictions (50/50), it will have 50% accuracy (50/100), 1% precision (0.5/50), and 50% recall (0.5/1).
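To make the rare-cancer arithmetic concrete, here is a minimal sketch in plain Python. The function names and the 200-patient dataset are made up for illustration; precision is taken as TP / (TP + FP) and recall as TP / (TP + FN), and both fall back to 0.0 when the denominator is zero, which is a convention I am assuming here rather than a universal rule.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels where 1 means positive."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    # Correct predictions over all predictions.
    return (tp + tn) / (tp + fp + fn + tn)


def precision(tp, fp):
    # Of everything flagged positive, how much really was positive?
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    # Of everything that really was positive, how much did we find?
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical data: 200 patients, 1% prevalence (2 positives, 198 negatives).
y_true = [1, 1] + [0] * 198
# A coin-flip model at its expected outcome: half of each group flagged positive.
y_coin = [1, 0] + [1] * 99 + [0] * 99
# A model that always predicts negative.
y_never = [0] * 200

coin = confusion_counts(y_true, y_coin)
never = confusion_counts(y_true, y_never)

print("coin flip:", accuracy(*coin), precision(coin[0], coin[1]), recall(coin[0], coin[2]))
print("always negative:", accuracy(*never), recall(never[0], never[2]))
```

The coin-flip model lands at roughly 50% accuracy, 1% precision, and 50% recall, while the always-negative model is 99% accurate and finds none of the positives.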
F1: Combines precision and recall into a single metric that captures both completeness and exactness. Calculated as (2 * Precision * Recall) / (Precision + Recall). Often reported together with accuracy, and useful in sequence-labeling tasks, such as entity extraction, and retrieval-based question answering.
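The formula above is short enough to write out directly. This is a small sketch; the name `f1_score_from` is mine, and returning 0.0 when both inputs are zero is an assumed convention.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention when both are zero
    return (2 * precision * recall) / (precision + recall)


# A model with 1% precision and 50% recall gets a low F1,
# because the harmonic mean is pulled toward the weaker of the two.
print(f1_score_from(0.01, 0.5))
print(f1_score_from(0.8, 0.8))
```

Because it is a harmonic mean, F1 stays low unless precision and recall are both reasonably high.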
AUC: Area Under Curve, usually the area under the ROC curve, which plots the true positive rate against the false positive rate as the prediction threshold is varied. Used to measure the quality of a model independent of any single prediction threshold; the underlying curve can also help in choosing a prediction threshold for a classification task.
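One way to see what ROC AUC measures is its pairwise reading: it equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way by brute force; the function name and the toy scores are illustrative, and a real project would more likely call a library routine.

```python
def roc_auc_pairwise(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy example: 3 of the 4 positive/negative pairs are ordered correctly.
labels = [0, 0, 1, 1]
model_scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc_pairwise(labels, model_scores))
```

No threshold appears anywhere in the calculation, which is why AUC is described as threshold-independent.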
MRR: Mean Reciprocal Rank. Evaluates ranked responses by how high the first correct one appears. For each query, take the reciprocal of the rank of the first correct result, then average across queries. Used widely in information-retrieval tasks, including article search and e-commerce search.
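Here is a minimal sketch of MRR, assuming each query's results are given as a list of 0/1 relevance flags in ranked order. Both the input format and the function name are my own choices for illustration, and a query with no correct result contributes 0.

```python
def mean_reciprocal_rank(ranked_relevance):
    """Average of 1/rank of the first relevant result for each query."""
    total = 0.0
    for flags in ranked_relevance:
        reciprocal = 0.0  # stays 0 if nothing relevant was retrieved
        for rank, is_relevant in enumerate(flags, start=1):
            if is_relevant:
                reciprocal = 1.0 / rank
                break
        total += reciprocal
    return total / len(ranked_relevance)


# Three hypothetical queries: first hit at rank 2, at rank 1, and no hit at all.
queries = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
print(mean_reciprocal_rank(queries))  # (1/2 + 1 + 0) / 3
```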
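MAP: Mean average precision. For each query, precision is computed at every rank where a relevant result appears and then averaged; MAP is the mean of that value across queries. Used in information-retrieval tasks.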
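A small sketch of mean average precision, using the same kind of 0/1 relevance lists in ranked order. One simplifying assumption to flag: average precision here is divided by the number of relevant results that were actually retrieved, which matches the usual definition only when every relevant item appears somewhere in the list. The function names are illustrative.

```python
def average_precision(relevance):
    """Mean of precision@k taken at each rank k that holds a relevant result."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(queries):
    """Average the per-query average precision across all queries."""
    return sum(average_precision(q) for q in queries) / len(queries)


# Query 1: relevant results at ranks 1 and 3 -> (1/1 + 2/3) / 2
# Query 2: relevant result at rank 2        -> (1/2) / 1
ranked = [[1, 0, 1], [0, 1]]
print(mean_average_precision(ranked))
```

Unlike MRR, this rewards a ranking for placing all of the relevant results high, not just the first one.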
RMSE: Root mean squared error, a very common way to capture a model’s performance in a real-value prediction task. Good way to ask “How far off from the answer am I?” Calculates the square root of the mean of the squared errors across all data points. Used in numerical prediction, such as temperature, stock market price, or position in Euclidean space.
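The calculation reads almost word for word as code. A minimal sketch with made-up numbers (the function name is mine):

```python
import math


def rmse(y_true, y_pred):
    """Square root of the mean of the squared errors."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Errors are 1, 0, and -2, so the squared errors are 1, 0, and 4.
actual = [3.0, 5.0, 2.0]
predicted = [2.0, 5.0, 4.0]
print(rmse(actual, predicted))  # sqrt(5 / 3)
```

Squaring before averaging means one large miss hurts more than several small ones.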
MAPE: Mean absolute percentage error. Used when the output variable is continuous; it is the average of the absolute percentage error across all data points. Often used alongside RMSE to test the performance of regression models.
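A quick sketch of the calculation, with illustrative names and numbers. Note that the percentage error is undefined when a true value is zero; this version simply raises an error in that case, which is one possible way to handle it.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, returned as a percentage."""
    if any(t == 0 for t in y_true):
        raise ValueError("percentage error is undefined when a true value is 0")
    ratios = [abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)]
    return 100.0 * sum(ratios) / len(ratios)


# Off by 10 on a true value of 100, and by 20 on a true value of 200:
# both are 10% errors, so the mean is 10%.
print(mape([100.0, 200.0], [110.0, 180.0]))
```

Because errors are scaled by the true value, MAPE is easier to compare across targets with different magnitudes than RMSE is.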
BLEU: The cheese that tastes like it sounds. Also, bilingual evaluation understudy. Captures the amount of n-gram overlap between the output sentence and the reference ground-truth sentence. Has many variants, and is mainly used in machine translation tasks. Has also been adapted to text-to-text tasks such as paraphrase generation and summarization.
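To show what n-gram overlap means in practice, here is a deliberately simplified sentence-level BLEU against a single reference. It uses clipped n-gram precision for n = 1 and 2, a geometric mean, and a brevity penalty. The function names are mine, there is no smoothing, and real BLEU implementations typically go up to 4-grams and score whole corpora, so treat this as a sketch rather than a reference implementation.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, reference, n):
    """Share of candidate n-grams found in the reference, clipped by reference counts."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


def simple_bleu(candidate, reference, max_n=2):
    """Geometric mean of clipped precisions times a brevity penalty."""
    precisions = [clipped_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # no smoothing in this sketch
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Penalize candidates that are shorter than the reference.
    if len(candidate) > len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(candidate))
    return brevity * math.exp(log_mean)


output = "the cat sat on the mat".split()
truth = "the cat is on the mat".split()
# Unigram precision is 5/6, bigram precision is 3/5, and the lengths match.
print(simple_bleu(output, truth))
```

The clipping step is what stops a degenerate output like "the the the the" from scoring well just by repeating a common reference word.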
METEOR: A metric for the quality of generated text that combines unigram precision and recall in a harmonic mean, with recall weighted more heavily. Sort of a more robust BLEU. Allows synonyms and stemmed words to be matched with the reference word. Mainly used in machine translation.
ROUGE: Like BLEU and METEOR, compares the quality of generated text to a reference text. Measures recall. Mainly used for summarization tasks, where it’s important to evaluate how many of the reference words a model can recall (recall = % of true positives out of true positives plus false negatives).
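As a rough illustration of the recall idea, here is a sketch of ROUGE-N recall against a single reference: the share of reference n-grams that also show up in the generated text. The function name is mine, and full ROUGE has further variants (such as a longest-common-subsequence version) that this does not cover.

```python
from collections import Counter


def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also appear in the candidate."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Each reference n-gram can be matched at most as often as the candidate has it.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


summary = "the cat sat on the mat".split()
gold = "the cat is on the mat".split()
print(rouge_n_recall(summary, gold, n=1))  # 5 of 6 reference unigrams recovered
print(rouge_n_recall(summary, gold, n=2))  # 3 of 5 reference bigrams recovered
```

The only real difference from BLEU-style precision is the denominator: the count comes from the reference rather than from the generated text.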
Perplexity: Measures how confused an NLP model is, derived from cross-entropy in a next-word prediction task. Lower is better. Used to evaluate language models, and in language-generation tasks, such as dialog generation.
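The link to cross-entropy is easy to sketch: perplexity is the exponential of the average negative log-probability the model assigned to each correct next word. The function name and the probabilities below are made up, and natural log is assumed throughout.

```python
import math


def perplexity(token_probs):
    """exp of the average negative log-probability of the correct tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(cross_entropy)


# A model that gives every correct word probability 1/4 is as unsure
# as if it were picking uniformly among 4 options.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
# A more confident model scores lower.
print(perplexity([0.9, 0.8, 0.95]))
```

A handy way to read the number: a perplexity of k means the model is, on average, about as uncertain as a uniform choice among k words.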
Of course you can find plenty more, but that’s a fairly good list when we’re talking NLP. Thanks for reading, and follow me on twitter — @SaladZombie