What is Entropy?

Andrew Batutin
10 min read · Jun 24, 2024


Epigraph:

Ludwig Boltzmann, who spent much of his life studying statistical mechanics, died in 1906, by his own hand. Paul Ehrenfest, carrying on the work, died similarly in 1933. Now it is our turn to study statistical mechanics.

Intro

Entropy is a notoriously difficult concept to understand. To begin with, it has half a dozen definitions that vary depending on the field and the person speaking.

Most of us have heard that:

  • Entropy is a measure of system disorder
  • Entropy is a measure of surprise
  • Entropy always increases with time

The trick is that all these statements are very true and, at the same time, very different.

Throughout the 19th and 20th centuries, the concept of entropy was reinvented and reinterpreted in the contexts of thermodynamics, information theory, and quantum mechanics.

In this blog post, we will take a closer look at:

  • Statistical Mechanics Entropy — the classic physics textbook definition of entropy
  • Information Theory Entropy — the interpretation of entropy for basic data types such as strings and bytes
  • Embeddings Entropy — the interpretation of entropy for text vector embeddings that fuel the current GenAI revolution

The goal is to shed some light on how the same concept of disorder is defined in different fields, making it easier to adopt the notion of entropy in the emerging AI world.

Statistical Mechanics Entropy

Entropy, a term that has been used in science since the 19th century, has undergone numerous formulations and reinterpretations across various fields. It all started with thermodynamics, which gave birth to statistical mechanics and to one of the first reasonably well-formulated definitions of entropy.

In Boltzmann’s formula (yes, the one who killed himself), entropy is defined as:

S = k * ln(W)

Where:

  • S is the entropy of the system
  • k is the Boltzmann constant
  • ln is the natural logarithm function
  • W is the number of microstates corresponding to the macrostate of the system

This formula relates the microscopic properties of a system (the number of possible arrangements of its components) to its macroscopic thermodynamic properties (entropy). It provides a statistical interpretation of entropy, linking it to the disorder or randomness of a system.

According to Boltzmann’s formula, a system with a higher number of possible microstates will have higher entropy, while a system with a lower number of microstates will have lower entropy.
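To make the formula concrete, here is a minimal sketch in Python (the toy coin system and the function name are just illustrations, not anything from the statistical mechanics literature):

import math

k_B = 1.380649e-23  # Boltzmann constant, in J/K

def boltzmann_entropy(num_microstates):
    # S = k * ln(W)
    return k_B * math.log(num_microstates)

# A system of N coins where every head/tail configuration is a microstate: W = 2^N
print(boltzmann_entropy(2 ** 10))   # 10 coins  -> ~9.6e-23 J/K
print(boltzmann_entropy(2 ** 100))  # 100 coins -> 10x larger, since ln(2^N) grows linearly in N

More microstates means more entropy, but only logarithmically: raising W to the tenth power only multiplies S by ten.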

Information Theory Entropy

In the 20th century, Claude Shannon built upon these ideas in his groundbreaking work on information theory, introducing the concept of information entropy as a measure of uncertainty or randomness in a message.

The core idea of information theory is that the “informational value” of a communicated message depends on the degree to which the content of the message is surprising. If a highly likely event occurs, the message carries very little information. On the other hand, if a highly unlikely event occurs, the message is much more informative.

– (Source)

Shannon’s work provided a mathematical framework for quantifying information and its relationship to entropy, laying the foundation for the modern understanding of information theory and its applications in various fields, including computer science, cryptography, and telecommunications.

In information theory, the entropy of a discrete random variable X, which can take on possible values {x₁, x₂, …, xₙ} with corresponding probabilities {p(x₁), p(x₂), …, p(xₙ)}, is defined as:

H(X) = -Σ p(xᵢ) * log₂(p(xᵢ))

Where:

  • H(X) is the entropy of the random variable X
  • Σ denotes the sum over all possible values of X
  • p(xᵢ) is the probability of X taking the value xᵢ
  • log₂ is the base-2 logarithm
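For example, a fair coin flip with p(heads) = p(tails) = 0.5 has entropy H = -(0.5 * log₂(0.5) + 0.5 * log₂(0.5)) = 1 bit, while a coin that always lands heads has entropy 0: the outcome carries no surprise at all.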

OK, let's move on to some examples to get a better understanding of information entropy.

Calculating the Entropy of a String

Here is a simple method to calculate the entropy of a string based on the frequency of characters within the text. For the purpose of this demonstration, we will treat characters as the atomic units of information.

To begin, let’s define a Python function called calculate_entropy that takes a string as input and returns its entropy.

import math
from collections import Counter


def calculate_entropy(text):
    # Calculate the frequency of each character in the text
    char_freq = Counter(text)

    # Calculate the entropy
    entropy = 0
    text_length = len(text)
    for freq in char_freq.values():
        prob = freq / text_length
        entropy += prob * math.log2(prob)

    return -entropy

The calculate_entropy function works as follows:

  1. We use Counter(text) to create a dictionary-like object that stores the frequency of each character in the input string.
  2. We iterate over the frequency values obtained from char_freq.values().
  3. For each frequency value, we calculate the probability of the character by dividing its frequency by the total length of the string.
  4. We then update the entropy variable by adding the product of the probability and the log2 of the probability.

Finally, we return the negative value of entropy as the calculated entropy of the string.

Comparing Entropy Values

By applying the calculate_entropy function to different strings, we can compare their entropy values and gain insights into the randomness or predictability of the character distribution within the text.

Let’s consider a few examples:
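Here is a sketch of such a comparison, reusing the calculate_entropy function defined above. The sample strings are short placeholders (the figures quoted below were computed on longer texts, so the exact numbers from this snippet will differ):

samples = {
    "simple_english": "The quick brown fox jumps over the lazy dog",
    "scientific_text": "Entropy quantifies the number of microstates consistent with a macrostate",
    "random_string": "ucN7dzPEBmciMUAMqpfM",
    "same_char": "aaaaaaaaaaaaaaaaaaaa",
}

for name, text in samples.items():
    print(f"{name}: {calculate_entropy(text):.2f}")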

As we can see, the entropy values for the simple English text and the more complex scientific text are relatively close (4.34 and 4.45, respectively). This is because we calculated entropy based on character frequency, which doesn’t capture the semantic complexity of the text.

However, there is a notable difference between the entropy of random strings and that of English text. The random string has a significantly higher entropy (6.36) compared to the English texts, indicating its unpredictability and lack of structure.

As expected, a string consisting of a single repeated character such as `a` has the lowest possible entropy: 0. It is the most predictable string of all.

Information Entropy and Predictability

Information entropy tells us how random and unpredictable a message is. In the context of language, it can provide insights into the predictability of the next word or character in a sequence.

For example, if you are given the text “Today is a great,” guessing the next word won’t be too difficult. Most likely, it would be “day.” The low entropy of the English language allows for such predictability.

On the other hand, if you are given a random string like “ucN7dzPEBmciMUAMqpfM,” guessing the next symbol becomes a much harder task. The high entropy of random strings indicates their lack of predictability and structure.

That is why information entropy is used in encryption and compression.

Embeddings Entropy

While the previous discussion on character-based entropy is interesting, it may not be directly applicable to modern Retrieval-Augmented Generation (RAG) services.

In a typical RAG system, users expect to receive answers in the form of short paragraphs. Pushing the information entropy to the level of paragraphs is not practical, as almost all paragraphs are unique, making the entropy calculation less meaningful.

Moreover, in RAG systems, content is often represented using embeddings rather than characters or words. To calculate the entropy of text embeddings, we need to move from text to a vector representation of the content.

One approach to calculating the entropy of embeddings is as follows:

  • Get a binary representation of the embeddings. To keep things simple, we use binary embeddings directly; Cohere provides binary embeddings that reportedly maintain 99.99% of search quality.
  • Calculate the entropy of each embedding component (each column, across all documents).
  • Compute the average entropy over all embedding components.

Let’s visualize how binarized embeddings would look:
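As a rough illustration (toy numbers, not real model output), a binarized embedding is just a row of 0s and 1s per document, typically obtained by thresholding the continuous values:

import numpy as np

# Toy continuous embeddings: 4 documents x 8 dimensions
# (real embeddings have hundreds or thousands of dimensions)
rng = np.random.default_rng(0)
continuous = rng.normal(size=(4, 8)).round(2)

# Simple sign-thresholding: positive components become 1, the rest 0
binary = (continuous > 0).astype(int)
print(binary)  # one row of 0s and 1s per document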

From the perspective of linear algebra, the approach of calculating entropy based on the binary representation of embeddings can be interpreted as a measure of embedding diversity:

  • The entropy of embeddings can be seen as a measure of the diversity or spread of the embedding vectors in the high-dimensional space.
  • Higher entropy indicates a more diverse set of embeddings, where the binary components are more evenly distributed.
  • Lower entropy suggests a more homogeneous or clustered set of embeddings, where certain binary patterns are more prevalent (the toy sketch below shows both cases).
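To make this intuition concrete, here is a toy sketch (made-up 4x4 binary matrices, not real embeddings) comparing the average per-column entropy of a diverse set against a clustered one:

import math
import numpy as np
from collections import Counter

def column_entropy(column):
    # Shannon entropy of one binary component across all documents
    counts = Counter(column)
    n = len(column)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Diverse: 0s and 1s are evenly mixed in every component
diverse = np.array([[0, 1, 0, 1],
                    [1, 0, 1, 0],
                    [0, 1, 1, 0],
                    [1, 0, 0, 1]])

# Clustered: almost every document shares the same bit pattern
clustered = np.array([[1, 1, 0, 0],
                      [1, 1, 0, 0],
                      [1, 1, 0, 0],
                      [1, 0, 0, 0]])

print(np.mean([column_entropy(diverse[:, j]) for j in range(4)]))    # 1.0
print(np.mean([column_entropy(clustered[:, j]) for j in range(4)]))  # ~0.2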

It’s important to note that this is not the only way to calculate the entropy of embeddings. Other methods may involve:

  • Using continuous embeddings instead of binary representations.
  • Applying dimensionality reduction techniques like PCA or t-SNE before entropy calculation.
  • Employing more advanced entropy estimation methods, such as nearest neighbor-based approaches.

Calculating the Entropy of an Embedding

Here is sample code showing how to calculate the entropy of a binary embedding:

import math
from collections import Counter

import cohere
import numpy as np
from sentence_transformers import util


def calculate_binary_entropy(embeddings):
    # Apply the calculate_entropy function to each column in the embeddings array
    entropies = np.apply_along_axis(calculate_entropy, 0, embeddings)
    # Calculate the average entropy
    avg_entropy = np.mean(entropies)

    return avg_entropy


def calculate_entropy(column):
    # Calculate the frequency of each value (0 or 1) in the column
    char_freq = Counter(column)

    # Calculate the entropy
    text_length = len(column)
    entropy = -sum(freq / text_length * math.log2(freq / text_length) for freq in char_freq.values())

    return entropy


def similarity_score(documents):
    # Calculate pairwise cosine similarity between embedding vectors
    # using util.cos_sim from sentence_transformers
    documents = documents.astype(float)
    s_acc = []
    for i in range(len(documents)):
        for j in range(i + 1, len(documents)):
            cosine_scores = util.cos_sim(documents[i], documents[j])
            s_acc.append(cosine_scores)
    av_s_acc = sum(s_acc) / len(s_acc)
    return av_s_acc


def entropy_cohere_gen(documents):
    co = cohere.Client()
    embeddings = np.unpackbits(np.asarray(co.embed(texts=documents,
                                                   model="embed-english-v3.0",
                                                   input_type="search_document",
                                                   embedding_types=["ubinary"]
                                                   ).embeddings.ubinary, dtype='uint8'),
                               axis=-1).astype("int")

    return calculate_binary_entropy(embeddings), similarity_score(embeddings)

The calculate_binary_entropy(embeddings) function takes an array of embeddings as input and performs the following steps:

  • Applies the calculate_entropy function to each column of the embeddings array using np.apply_along_axis.
  • The calculate_entropy function calculates the frequency of each value (0 or 1) in the column using the Counter class from the collections module.
  • It then computes the entropy of the column using the Shannon formula: for each value, the probability (its frequency divided by the column length) is multiplied by the base-2 logarithm of that probability, and the products are summed.
  • The sum is negated to obtain the final entropy.
  • The average entropy of all columns is calculated using np.mean.

The similarity_score(documents) function calculates the pairwise cosine similarity scores between the embedding vectors using util.cos_sim from the sentence_transformers library:

  • Converts the embedding vectors to float type.
  • Calculates the cosine similarity score for each pair of documents and appends the scores to a list.
  • Computes the average of the scores and returns it.

We are calculating the cosine similarity to see how it relates to the entropy of embeddings.

The entropy_cohere_gen(documents) function serves as the main function to generate embeddings for the documents and calculate their binary entropy and similarity score:

  • Creates a cohere.Client object.
  • Uses the embed method of the client to generate embeddings for the documents.
  • Unpacks the embeddings and converts them to integer type.

  • Returns the binary entropy and similarity score of the embeddings by calling the calculate_binary_entropy and similarity_score functions, respectively.

Examining Embedding Entropy

Let’s run some tests with the following texts:

  • doc_max_entr is a list of random strings, which should have the highest entropy.
  • documents_simp_text is a list of simple English sentences, which should have average entropy.
  • documents_same_text is a list of identical strings, whose entropy should be 0.

doc_max_entr = [
    "0RErSHV0IkOiN6aJbrsOSq9LLpCuF01M",
    "HzgK8UQzUttFu568fRKLwCxdwjI79wyR",
    "Xa3OF6pKMJgdsAmfIkitDxZ50zQFRxJV",
    "uEjmxIfcV731CRHBZDUIEgMNzQjP4HMB",
    "kCohsvGhaMfIMddF1Z6wSvMrKfFUcORo",
    "cYKPvYMYVHmD2vUF5t5DvTdtS3xrK2zj",
    "3LTGLJiDfp8SvaRvyVZsTsvR2Dw244Xe",
    "Lzpy2gvjbstJlVWkmrTj1MZVlXAYNe9n",
    "eCkC9bDf95PRTWeyLnHmGyveL35froU7",
    "yjBFo9g2JLX2OeV5bJEDW0wtv2IsZHKj",
]

documents_simp_text = [
    "Alan Turing was an English mathematician",
    "Albert Einstein was a German-born theoretical",
    "Isaac Newton was an English polymath active",
    "Marie Curie was a Polish and naturalised",
]

documents_same_text = [
    "Alan Turing was an English mathematician, computer scientist, logician, cryptanalyst, philosopher and theoretical biologist.",
    "Alan Turing was an English mathematician, computer scientist, logician, cryptanalyst, philosopher and theoretical biologist.",
    "Alan Turing was an English mathematician, computer scientist, logician, cryptanalyst, philosopher and theoretical biologist.",
    "Alan Turing was an English mathematician, computer scientist, logician, cryptanalyst, philosopher and theoretical biologist.",
]
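With the three document sets defined, a minimal way to run the comparison might look like the following sketch (it assumes a Cohere API key is configured; the exact printed values will depend on the model output):

for name, docs in [("doc_max_entr", doc_max_entr),
                   ("documents_simp_text", documents_simp_text),
                   ("documents_same_text", documents_same_text)]:
    entropy, similarity = entropy_cohere_gen(docs)
    print(f"{name}: entropy={entropy:.3f}, similarity={float(similarity):.3f}")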

Results:

Observations:

  • The identical texts have the lowest entropy (0) and the highest cosine similarity (1), which is expected, as they are the easiest to predict.
  • Surprisingly, the random strings (doc_max_entr) have lower entropy than the simple texts (documents_simp_text).

What does this mean?

  • Entropy is a measure of randomness: the more random the data, the higher the entropy, and the less random the data, the lower the entropy.
  • However, that is not what we observe here, which is counterintuitive.

Intuition:

  • Entropy can also be seen as a measure of surprise.
  • Checking the similarity scores: more similar documents have a higher similarity score and lower entropy, while less similar documents have a lower similarity score and higher entropy.
  • For embeddings, entropy therefore seems to indicate how surprising, or how different from each other, the embeddings are.
  • If you have a sequence like 1, 2, 3, and then you get 10, it’s surprising.
  • If you have a sequence like 1, 2, 3, and then you get 4, it’s not surprising.

Hypothesis:

Entropy for embeddings is a measure of how surprising the embeddings are and how different they are from each other.

Entropy is inversely correlated with similarity: the more similar the embeddings, the lower their entropy.

Embedding Entropy as Measure of Content Diversity

Entropy of embeddings can be a useful metric to measure the diversity of content in a knowledge base or a set of documents. However, it is important to note that high entropy does not necessarily indicate the quality or usefulness of the content for Retrieval-Augmented Generation (RAG) systems.

While entropy provides insights into the randomness and surprisingness of embeddings, it does not directly correlate with the relevance or correctness of the information contained in the documents.

A diverse set of embeddings with high entropy may still lack the necessary information to answer specific questions or generate meaningful responses. And vice versa — a uniform set of content with low entropy does not mean it will perform badly in a RAG setting.

The Next Frontier in Entropy — AI Question-Answering Systems

As we have explored the concept of entropy across various domains, from statistical mechanics to information theory and embeddings, it has become clear that this powerful notion can shed light on the inner workings of complex systems. With the rapid advancements in artificial intelligence, particularly in question-answering systems, we stand at the precipice of a new frontier in understanding entropy.

The application of entropy to AI question-answering systems opens up exciting possibilities for quantifying and optimizing the performance of these systems, providing valuable insights into the diversity, relevance, and coherence of the generated responses.

In the upcoming article, we will embark on an exciting journey to investigate the entropy of AI question-answering systems.
