The Mathematics of Language: Being a Writer in a World Dominated by GenAI

Leigh-Anne Wells (vd Veen)
6 min read · Apr 19, 2024


Generative AI (GenAI) is everywhere! It has dominated headlines since November 2022, when OpenAI’s ChatGPT was first launched, propelling the world as we knew it into a new era, a new word order — as described in the article “How to Thrive as a (Technical) Writer in a World Where LLMs Think They Can Do Better.”

“Word order” or “world order”? Surely the phrase is “world order” and not “word order”?

Well… yes, the correct phrase is “world order” and not “word order.”

However, in the context of Generative AI (GenAI) and large language models (LLMs), the emphasis is on the correct order of words and how they are structured to form sentences, which affects both meaning and grammatical correctness. These models must understand and apply the rules of syntax and grammar so that the sentences they form are correct, make sense, and communicate effectively.

The emphasis on “word order” in the context of AI highlights the complexity and sophistication of language processing technologies. These models analyze vast amounts of text data to learn how words combine to convey meaning, adapting to nuances that can vary dramatically across different contexts.

The Mathematics of Language

GenAI’s effectiveness — particularly in the context of LLMs — is closely linked to the mathematics of language. This relationship is rooted in how these models process and generate language based on mathematical principles.

We need to dive into mathematical and machine-learning constructs like vector embeddings to understand how LLMs construct sentences.

What on earth are vector embeddings, and how do they relate to language?

Is there even something as crazy as the mathematics of language?

Surely, language is as far from math as the earth is from the moon. Well, under normal circumstances the idea of a “mathematics of language” might seem far-fetched, but it’s a critical aspect of how LLMs function.

Let’s dive into this concept with our first port of call being:

What are Vector Embeddings?

As you might know, one of my favorite places to search for information is academic literature. To this end, I found a fantastic tool — Elicit.com — that searches more than 125 million academic papers every time I ask it a question. This time, I asked the tool what vector embeddings are, and it came back with this answer:

“Vector embeddings are a powerful tool in natural language processing, enabling the encoding of word relationships in a vector space.”

In other words, vector embeddings are data representations that express the relationships between words as vectors, or lists of numbers. This mathematical representation allows computers to handle and process natural language data efficiently. By mapping words into a high-dimensional space, vector embeddings capture not only the words themselves but also the semantic relationships between them.

There you have it.

But wait, what are vectors?

In the simplest terms, vectors are lines with arrows that point from one location to another in space, as the following image describes:

Each vector has both direction and length (magnitude). In other words, a vector is a way to describe the movement needed to get from one point to another.
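To make this concrete, here is a minimal sketch in Python (the 2-D components are invented for illustration) showing how a vector’s length and direction are computed from its components:

```python
import math

# A 2-D vector as a pair of components: the movement needed
# to get from point (0, 0) to point (3, 4).
vx, vy = 3.0, 4.0

# Length (magnitude) of the arrow via the Pythagorean theorem.
magnitude = math.hypot(vx, vy)

# Direction as the angle above the x-axis, in degrees.
direction = math.degrees(math.atan2(vy, vx))

print(magnitude)             # 5.0
print(round(direction, 1))   # 53.1
```

The same idea scales to hundreds of dimensions; the arrows just become impossible to draw.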

But what do directional lines have to do with mathematics and computing?

The succinct answer is that they can represent data points in multidimensional spaces, which are used for data analysis and calculation… and to convert language into lists of numbers.

While vectors might be difficult for non-mathematicians to grasp — or is it just me? — it is worth noting that they are essentially tools for describing physical quantities in the real world or abstract quantities in data analysis. Breaking them down into their directional components helps visualize and solve a wide range of problems.

Thanks, ChatGPT.

Now, I guess the question is — is this fact, or is ChatGPT hallucinating?

But, I digress… yet again.

Converting a Sentence into a Vector Embedding

Converting a sentence into a vector embedding is quite complex, so let’s simplify it using the following example sentence:

“The brown fox jumped over the lazy dog.”

1. Tokenize the sentence into individual words or tokens:

The first step is to split the sentence into individual words or tokens as follows:

[“the”, “brown”, “fox”, “jumped”, “over”, “the”, “lazy”, “dog”]
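A minimal tokenizer for this step might look like the sketch below (a simple regular-expression split; production tokenizers, such as the subword tokenizers used by LLMs, are considerably subtler):

```python
import re

def tokenize(sentence: str) -> list[str]:
    """Lowercase the sentence and split it into word tokens,
    dropping punctuation."""
    return re.findall(r"[a-z']+", sentence.lower())

tokens = tokenize("The brown fox jumped over the lazy dog.")
print(tokens)
# ['the', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
```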

2. Convert each token into a vector:

There are several methods to do this, the simplest being “One-Hot Encoding,” where each word is represented as a vector of zeros except for a single 1 at the position corresponding to that word in the vocabulary. However, this method does not capture semantic meaning.

Semantic what?

Semantic meanings, or relationships between words, are the associations that exist between words. For instance, the word “brown” in this sentence is related to the word “fox”: it describes the fox’s color.
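Here is a minimal sketch of the one-hot scheme described above (sorting the vocabulary alphabetically is just my illustrative choice). Notice that every pair of distinct one-hot vectors is equally far apart, which is exactly why the scheme carries no semantic information:

```python
# Build a vocabulary from the tokens, then represent each word as a
# vector of zeros with a single 1 at that word's vocabulary index.
tokens = ["the", "brown", "fox", "jumped", "over", "the", "lazy", "dog"]
vocab = sorted(set(tokens))
# vocab == ['brown', 'dog', 'fox', 'jumped', 'lazy', 'over', 'the']

def one_hot(word: str) -> list[int]:
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(one_hot("brown"))  # [1, 0, 0, 0, 0, 0, 0]
print(one_hot("fox"))    # [0, 0, 1, 0, 0, 0, 0]
```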

It is possible to capture these relationships with more advanced vectorization models known as word embeddings, where pre-trained models like Word2Vec, GloVe, and FastText provide sophisticated representations that encode the semantic relationships between words.

3. Vectorize each Sentence:

Once we have converted each word/token into a vector embedding, the next step is to combine all the word vectors into a single sentence vector. This can be done in several ways: simple averaging, summing, or more complex operations that consider the sentence’s syntax and grammar.

Often, pre-trained models like BERT or GPT — from libraries like Hugging Face’s Transformers — are used to encode sentences directly into vectors. These models take the whole sentence’s context into account, creating a dense vector representation.

How do Vector Embeddings Relate to Language?

In the context of language, vector embeddings serve several functions:

1. Semantic Similarity:

In semantic similarity, words with similar meanings are placed close together in the vector space. For instance, the words “brown” and “blue” are placed closer together than “brown” and “apple.”
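“Closeness” in a vector space is usually measured with cosine similarity. A minimal sketch, again with invented toy vectors (the two color words point in similar directions; “apple” points elsewhere):

```python
import math

# Toy embeddings (invented numbers): "brown" and "blue" are both
# colors, so their vectors point in similar directions.
vectors = {
    "brown": [0.9, 0.8, 0.1],
    "blue":  [0.8, 0.9, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

print(round(cosine(vectors["brown"], vectors["blue"]), 3))   # high
print(round(cosine(vectors["brown"], vectors["apple"]), 3))  # low
```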

2. Context Awareness:

Modern vector embedding techniques, like those used in LLMs, look at the context in which a word appears. For example, the word “bank” will have a different vector when used as part of the phrase “river bank” versus “savings bank.”

3. Language Modeling:

Vector embeddings are fundamental in building models that predict the probability of a sequence of words. This is crucial for tasks like text completion, translation, and generating text automatically.
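The prediction task itself can be sketched without any neural network at all. Here is a tiny bigram model that estimates the probability of the next word by counting adjacent pairs in a toy corpus; LLMs replace the raw counts with vector embeddings and neural networks, but the underlying task is the same:

```python
from collections import Counter, defaultdict

# Toy corpus: just our example sentence.
corpus = "the brown fox jumped over the lazy dog".split()

# Count how often each word follows each other word.
bigrams = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    bigrams[current][nxt] += 1

def p_next(current: str, nxt: str) -> float:
    """Estimated probability that `nxt` follows `current`."""
    counts = bigrams[current]
    return counts[nxt] / sum(counts.values())

# "the" is followed once by "brown" and once by "lazy".
print(p_next("the", "brown"))  # 0.5
print(p_next("brown", "fox"))  # 1.0
```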

Is there a Mathematics of Language: Fact or Fiction?

Fact. There is absolutely a mathematics of language.

Beyond vector embeddings, mathematical models can describe and predict linguistic phenomena. For instance, mathematical linguistics uses mathematical concepts, statistical models, algorithms, and other methods to study and analyze the structure, form, and patterns of natural languages, including syntax, semantics, and phonology. Computational linguistics, on the other hand, sits at the intersection of language and mathematical modeling, aiming to understand and generate human language.

So, while language and mathematics might seem like distant relatives, they are intimately connected in the realm of artificial intelligence and linguistics. The mathematics of language, through tools like vector embeddings, has been transformative, enabling the development of sophisticated LLMs and machine-learning models that can understand, interpret, and generate human language with remarkable accuracy.

Conclusion: The Firecrab Take

Where does this leave the technical writer?

Is the technical writer — or any copywriter, for that matter — consigned to the scrap heap as a white elephant?

Or does the technical writer have a role to play in an era dominated by GenAI?

In theory, the answer to the last question can only be an emphatic yes: all copywriters have a role to play in today’s GenAI-dominated landscape. In practice, however, I’m not sure this is the case; anecdotal evidence seems to suggest otherwise.

Therefore, to gain a definitive answer to these questions, I turned to GenAI itself — or rather to GenAI in its familiar guise, ChatGPT. And this is what ChatGPT had to say:

I guess the question remains: If we don’t have to convince GenAI, who do we have to convince that technical writers still have a role in producing content for all SaaS companies — especially tech startups?
