# A Visual, Layman’s Guide to Language Models in NLP

(This is a crosspost from the official Surge AI blog. If you need help with data labeling and NLP, say hello!)

# Introduction

Language models are a core component of NLP systems, from machine translation to speech recognition. Intuitively, you can think of language models as answering: “How English is this phrase?”

# Intuition: Thinking Like an Alien

Imagine you’re an alien whose spaceship crashes into Earth. You’re far from home, and you need to blend in until the rescue team arrives. You want to pick up some food, maybe watch Squid Game to learn about human culture, and so you need to learn how to speak like an earthling first.

In other words, you need to form a language model of earthling-speak: a model that will:

• Assign a high probability to phrases a human might actually say. (For example: “Two cheeseburgers”)
• Assign a low probability to unintelligible responses that lead to fear, confusion, and a call to the Men in Black. (For example: “Fries Santa cheese dirt hello”)
So you hide outside the nearest Shake Shack and listen to what customers order:

• Customer 1: “Two cheeseburgers”
• Customer 2: “Two cheeseburgers”
• Customer 3: “Two cheeseburgers”
• Customer 4: “Fries”
• Customer 5: “The daily special”
You build your first robot, Robot A, which memorizes every full phrase it hears and turns those counts into probabilities:

• P(“two cheeseburgers”) = 3 / 5 (since “two cheeseburgers” was uttered in 3 out of the 5 interactions)
• P(“fries”) = 1 / 5 (since “fries” was uttered in 1 out of the 5 interactions)
• P(“the daily special”) = 1 / 5 (since “the daily special” was uttered in 1 out of the 5 interactions)
• P(anything else) = 0
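This count-and-divide recipe is simple enough to sketch in a few lines of Python. The sketch below is illustrative, not code from the post, and it assumes five overheard orders consistent with the probabilities above:

```python
from collections import Counter

# The five orders overheard outside Shake Shack (assumed, for illustration).
observations = [
    "two cheeseburgers",
    "two cheeseburgers",
    "two cheeseburgers",
    "fries",
    "the daily special",
]

counts = Counter(observations)
total = sum(counts.values())

def robot_a_probability(phrase):
    """P(phrase) = how often the exact phrase was overheard."""
    return counts[phrase] / total

print(robot_a_probability("fries"))              # 0.2
print(robot_a_probability("cheeseburgers two"))  # 0.0 (never heard verbatim)
```

Note that any phrase not heard word-for-word gets probability zero, no matter how sensible it is.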
Your second robot, Robot B, assumes word order doesn’t matter: every reordering of a phrase it hears counts as an observation too. Its list looks like this:

• Customer 1: “two cheeseburgers”, “cheeseburgers two”
• Customer 2: “two cheeseburgers”, “cheeseburgers two”
• Customer 3: “two cheeseburgers”, “cheeseburgers two”
• Customer 4: “fries”
• Customer 5: “the daily special”, “the special daily”, “daily the special”, “daily special the”, “special the daily”, “special daily the”
That gives Robot B 13 equally weighted strings in total, so:

• P(“two cheeseburgers”) = P(“cheeseburgers two”) = 3 / 13
• P(“fries”) = 1 / 13
• P(“the daily special”) = P(“the special daily”) = P(“special daily the”) = P(“special the daily”) = P(“daily special the”) = P(“daily the special”) = 1 / 13
• P(anything else) = 0
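Robot B’s bag-of-words trick can be sketched the same way: expand every overheard phrase into all of its word orderings, then count. Again an illustrative sketch, under the same assumed five orders:

```python
from collections import Counter
from itertools import permutations

# The same five overheard orders as before (assumed, for illustration).
observations = [
    "two cheeseburgers",
    "two cheeseburgers",
    "two cheeseburgers",
    "fries",
    "the daily special",
]

# Robot B treats every reordering of a heard phrase as equally valid.
strings = Counter()
for phrase in observations:
    for perm in permutations(phrase.split()):
        strings[" ".join(perm)] += 1

total = sum(strings.values())  # 13 strings in all

def robot_b_probability(phrase):
    return strings[phrase] / total

print(robot_b_probability("cheeseburgers two"))  # 3/13 ≈ 0.23
print(robot_b_probability("daily the special"))  # 1/13 ≈ 0.08
```

A two-word phrase contributes 2 orderings, a one-word phrase contributes 1, and a three-word phrase contributes 3! = 6, which is where the 13 comes from.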

# Evaluating Language Models

One question, then, is: which of your robots performs better? Remember that “two cheeseburgers” and “cheeseburgers two” sound equally valid to an uninformed alien!

# Human Evaluation

One approach is human evaluation. Because your robots are trying to imitate human language, why not ask humans how good the imitations are? So you stand outside Shake Shack, and every time a customer approaches, you ask your robot to generate an output and the customer to evaluate it. If the customer thinks it’s a good, human-like response, they assign it a score of +1; otherwise, they score it 0.

• With the second customer: Robot A says “fries” (score: 1), while Robot B says “cheeseburgers two” (score: 0). A: 1.0, B: 0.0.
• With the third customer: Robot A says “fries” (score: 1), while Robot B says “daily the special” (score: 0). A: 1.0, B: 0.0.

Another approach would be to evaluate the outputs against a downstream, real-world task. In our alien situation, your goal is to get food from Shake Shack, so you could measure whether or not your robots help you achieve that goal.

• Robot A goes up to the counter again. This time it says “fries”. The cashier understands and hands it a fresh bag of fries. Success again! A: 1.0.
• Next, Robot B goes up to the counter and says “cheeseburgers two”. The cashier doesn’t understand, so it gets nothing. Failure! B: 0.0.
• Robot B tries again with “the daily special”. The cashier understands this time and hands it the Tuesday Taco. Success! B: 1.0.
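A natural way to turn trials like these into a single number is a success rate: the fraction of attempts that actually produced food. A small sketch, where the trial outcomes mirror the bullets above (the exact bookkeeping is illustrative):

```python
# Did each utterance get food from the cashier? (Illustrative outcomes.)
robot_a_trials = [True, True]    # "fries" worked both times
robot_b_trials = [False, True]   # "cheeseburgers two" failed; "the daily special" worked

def success_rate(trials):
    return sum(trials) / len(trials)

print(success_rate(robot_a_trials))  # 1.0
print(success_rate(robot_b_trials))  # 0.5
```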

# Intrinsic Evaluation and Perplexity

Human evaluations and task-based evaluations are often the most robust way to measure your robots’ performance. But sometimes you want a quicker, dirtier way of comparing language models: maybe you don’t have the means to get humans to score your robots’ outputs, and you can’t risk blowing your cover as an alien with a bad response at Shake Shack.
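The standard intrinsic metric is perplexity: intuitively, how “surprised” a model is by the utterances it actually observes, where lower is better. A minimal sketch of the usual definition follows; the probabilities plugged in come from the Robot A example (assuming counts consistent with the ones above):

```python
import math

def perplexity(probabilities):
    """Exponentiated average negative log-probability that the model
    assigns to each observed utterance; lower is better."""
    n = len(probabilities)
    return math.exp(-sum(math.log(p) for p in probabilities) / n)

# Robot A's probabilities for the five overheard orders:
# three "two cheeseburgers" at 3/5 each, plus "fries" and
# "the daily special" at 1/5 each (an illustrative assumption).
print(perplexity([3/5, 3/5, 3/5, 1/5, 1/5]))  # ≈ 2.59
```

A model that assigned probability 1 to every observed utterance would have a perplexity of 1 (no surprise at all), while a model that spread probability thinly over many word orderings, as Robot B does, would be more perplexed by what humans actually say.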

# Summary

In summary, this post provided an overview of a few key concepts surrounding language models:

• We described a way to train language models: by observing language and turning these observations into probabilities.
• We discussed three approaches to evaluating the quality of language models: human evaluation (did the robot responses sound natural to a human?), downstream tasks (did the robot responses lead to actual food?), and intrinsic evaluations (how perplexed were the robots by the human utterances?).

Founder at Surge AI: data labeling and human infrastructure to power NLP. https://www.surgehq.ai Former Google, Facebook, Twitter, Dropbox, MIT.
