Forget lorem ipsum generators, generate random English text

Muhammad Saqib Ilyas
14 min read · Dec 16, 2023

One of the most recommended practice projects for JavaScript enthusiasts is a lorem ipsum text generator. But I say, why stop there? Why not generate random English text? AI chat bots are all the rage these days, right?

In this blog, we’ll put together our own random English text generator. It won’t be as good as the modern-day chat bots based on deep neural networks, but that’s not our aim. Fundamentals are key. How did these modern neural network based text generators evolve? One of their predecessors is the Markov model based text generator.

Here’s what our page looks like when you first visit it:

Our page when viewed at first

The user chooses a file, optionally changes the number of paragraphs to generate, then clicks the “Generate” button. Here’s what our page looks like once random text has been generated:

Our page with random generated text

What is a Markov model?

A Markov model, or a Markov chain, is a state diagram: it has nodes representing all the unique states that the system could be in. In our case, each unique English word in our dictionary is a state. A Markov model also has directed edges between states. An edge from a state a to a state b is labeled with the probability of the word b following immediately after the word a.

Training a Markov model

Let’s take the first sentence of the above paragraph as an example. Its Markov model has several states, one for each word: “a”, “markov”, “model”, “or”, “chain” and so on. Pick the word “markov”, and find all the words that follow it. There are two: “model” and “chain”. Each follows “markov” once in that sentence, so they are equally probable as the next word after “markov”. This results in a state diagram like the following:

An example Markov model

That’s how we train a Markov model for a given text dataset. Create a distinct state for each unique word. Insert an edge from each state to every word that follows it in the dataset, and label the edge with the probability of the second word appearing immediately after the first.
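To make the training step concrete, here is a minimal sketch that counts word pairs and turns the counts into edge probabilities. The function name transitionProbabilities() is made up for this illustration, and the example sentence has its punctuation removed so that “model” and “chain” come out as clean words.

```javascript
// Hypothetical helper (not part of the project code): estimates the
// probability of each next word, given the current word, from raw text.
function transitionProbabilities(text) {
  const words = text.toLowerCase().split(/\s+/).filter(Boolean)
  const counts = new Map() // word -> Map(next word -> number of occurrences)
  for (let i = 0; i < words.length - 1; i++) {
    const current = words[i]
    const next = words[i + 1]
    if (!counts.has(current)) counts.set(current, new Map())
    const followers = counts.get(current)
    followers.set(next, (followers.get(next) || 0) + 1)
  }
  const probabilities = new Map() // word -> Map(next word -> probability)
  for (const [word, followers] of counts) {
    let total = 0
    for (const count of followers.values()) total += count
    const row = new Map()
    for (const [next, count] of followers) row.set(next, count / total)
    probabilities.set(word, row)
  }
  return probabilities
}

const example = transitionProbabilities('a markov model or a markov chain is a state diagram')
console.log(example.get('markov')) // "model" and "chain", each with probability 0.5
```

In this sentence, the word “a” is followed by “markov” twice and by “state” once, so its edges carry the probabilities 2/3 and 1/3.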

Next, we create a cumulative distribution function (CDF) for each state. In the above example, the CDF would look something like the following:

CDF for the words that follow the word “Markov”

Generating text

Once we create the CDFs for the words in the dictionary, we can generate text using the following approach:

  • Pick a random initial word and put it in the generated text.
  • Pick a random number between 0 and 1.
  • Look for the above random number on the y-axis of the CDF for the initial word, and draw a horizontal line from it to the CDF curve.
  • Wherever the horizontal line intersects the CDF curve, drop a vertical line and see which word it falls at. This word is the next word in the generated text.

In our example above, if we draw a random number less than or equal to 0.5, then the next word is “model”. Otherwise, the next word is “chain”.
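The lookup described above can be sketched in a few lines. The helper names buildCdf() and sampleFromCdf() are assumptions made for this illustration. The random number is passed in as an argument so that the behaviour at 0.5 is easy to check.

```javascript
// Hypothetical helpers illustrating CDF based sampling.
function buildCdf(probabilities) {
  // probabilities: array of [word, probability] pairs that sum to 1
  const cdf = []
  let cumulative = 0
  for (const [word, probability] of probabilities) {
    cumulative += probability
    cdf.push({ word, cumulative })
  }
  return cdf
}

function sampleFromCdf(cdf, u) {
  // Return the first word whose cumulative probability reaches u
  for (const entry of cdf) {
    if (u <= entry.cumulative) return entry.word
  }
  // Guard against floating point round-off leaving the total slightly below 1
  return cdf[cdf.length - 1].word
}

const markovCdf = buildCdf([['model', 0.5], ['chain', 0.5]])
console.log(sampleFromCdf(markovCdf, 0.3)) // model
console.log(sampleFromCdf(markovCdf, 0.8)) // chain
console.log(sampleFromCdf(markovCdf, Math.random())) // either word, equally likely
```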

What if the word “model” appeared twice following the word “markov”, while the word “chain” followed the word “markov” once? The probability of the edge between “markov” and “model” would be 2/3, while that of the edge between “markov” and “chain” would be 1/3.

One implementation approach for training the model and generating text could be to put the next words in an array with duplicates. For example, the words following “markov” would be [“model”, “model”, “chain”]. Now, we generate a random integer from the set {0, 1, 2}. If the random integer is either 0 or 1, the next word is “model”, otherwise it is “chain”. The probability of the word “model” following “markov” is still 2/3, while that of the word “chain” following “markov” is 1/3, but we never have to deal with fractional numbers, which is somewhat of a relief.
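Here is a small sketch of that idea. The function pickNextWord() is a name chosen for this illustration. It accepts an optional random number source, which is an assumption added here so that the function can be tried with fixed values.

```javascript
// Hypothetical helper: picks one entry from an array that may contain duplicates.
function pickNextWord(nextWords, random = Math.random) {
  const index = Math.floor(random() * nextWords.length)
  return nextWords[index]
}

const followers = ['model', 'model', 'chain']
const tally = { model: 0, chain: 0 }
for (let i = 0; i < 30000; i++) {
  tally[pickNextWord(followers)]++
}
console.log(tally) // roughly two thirds "model" and one third "chain"
```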

The number of words in a dataset may be very large. The redundancy in our approach uses a lot of extra space. As an optimization, we could store the unique words in a separate array, while building the Markov model with arrays holding the indices of the next words, rather than the next words themselves. For instance, if the indices of the words “model” and “chain” are 424 and 653 respectively, then our next word array for the word “markov” is [424, 424, 653]. If the redundancy is high, this scheme would save memory.
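A rough sketch of the index-based layout follows. The function trainIndexed() and the names vocabulary and nextIndices are assumptions made for this illustration. In this toy input the indices are 0, 1 and 2 rather than 424 and 653.

```javascript
// Hypothetical sketch: store each unique word once and refer to it by index.
function trainIndexed(words) {
  const vocabulary = []         // index -> word
  const indexOfWord = new Map() // word -> index
  const nextIndices = []        // index -> array of next word indices

  function lookup(word) {
    if (!indexOfWord.has(word)) {
      indexOfWord.set(word, vocabulary.length)
      vocabulary.push(word)
      nextIndices.push([])
    }
    return indexOfWord.get(word)
  }

  for (let i = 0; i < words.length; i++) {
    const current = lookup(words[i])
    if (i < words.length - 1) {
      nextIndices[current].push(lookup(words[i + 1]))
    }
  }
  return { vocabulary, nextIndices }
}

const indexed = trainIndexed(['markov', 'model', 'markov', 'model', 'markov', 'chain'])
console.log(indexed.vocabulary)     // [ 'markov', 'model', 'chain' ]
console.log(indexed.nextIndices[0]) // [ 1, 1, 2 ]
```

To turn an index back into a word during generation, we would look it up with indexed.vocabulary[index].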

Enough theory! Let’s implement this.

The HTML

We create a simple web form in HTML as follows:

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Read Text File with JavaScript</title>
<link rel="stylesheet" href="style.css">
</head>
<body>
<section class="centered">
<h2>Random English text generator</h2>
<form action="" class="train">
<label for="fileInput">Select a text file with some English text: </label>
<input type="file" name="inputfile" id="fileInput" />
<label for="paras">Number of paragraphs: </label>
<input name="numpara" type="number" id="paras" value="5">
<button type="button" id="submit" class="btn" disabled>Generate</button>
</form>
</section>

<section id="fileContents">The randomly generated text will appear here</section>

<script src="app.js"></script>
</body>
</html>

We link to a stylesheet in the head section. We’ll work on this later. The body section contains two section elements. The first one starts with an h2 heading, followed by a form for the user to interact with, and the second one is to display the randomly generated text. We give the first section a class of centered since we’d like to center it horizontally on the page eventually. We give an ID of fileContents to the second section to be able to select it in JavaScript later.

We create a form element in the first section element. It has an input element of type file. Its purpose is to allow the user to select a text file with which we’ll train the Markov model. We also provide an input element of type number, which is a convenient way to enter the number of paragraphs of text to generate. We set its default value to five. Finally, we have a button element of type button to allow the user to generate the text based on the trained model. Since the text can’t be generated until the model has been trained, we initially set the button to disabled.

The code

Now, let’s write the code to train the model:

const fileButton = document.getElementById('fileInput')
const genButton = document.getElementById('submit')
const txtElement = document.getElementById('fileContents')

fileButton.addEventListener('change', handleFileSelect)
genButton.addEventListener('click', generateRandomText)

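// A dictionary without a prototype, so that words such as "constructor"
// do not collide with properties inherited from Object.prototype
let markovModel = Object.create(null)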

function handleFileSelect(event) {
const fileInput = event.target;
const file = fileInput.files[0];

if (file) {
const reader = new FileReader();

reader.onload = function (e) {
const fileContents = e.target.result;
const words = fileContents.split(/\s+/)
for (let i = 0 ; i < words.length ; i++) {
const currentWord = words[i].toLowerCase()
if (!markovModel[currentWord]) {
markovModel[currentWord] = []
}
if (i < words.length - 1) {
const nextWord = words[i + 1].toLowerCase()
markovModel[currentWord].push(nextWord)
}
}
genButton.disabled = false
};
reader.readAsText(file);
} else {
alert('No file selected.');
}
}

We start by acquiring objects representing the file selection input, the text generation button, and the element where the generated text will be displayed. We attach a change event handler to the file input and a click event handler to the button, using named functions for both. We also define an empty dictionary to serve as the Markov model. Each unique word in our dataset will be a key in this dictionary. The value corresponding to a key will be the array of next words.

Next, we implement the file selection change handler to train the Markov model. First, we acquire the File object for the first file selected by the user. Then, we check if the user actually made a selection. If so, we create a FileReader object and define an onload event handler on it. This function is called when the FileReader object is done reading the file. We then make a call to the readAsText() method of the FileReader object, passing it the File object selected by the user. In the event handler, we acquire the text contents of the file in the variable fileContents. We split the text on runs of whitespace (space, tab, new line) using the \s+ pattern.

Next, we iterate over the words array using a for loop. We store the lowercase version of the current word in the variable named currentWord. We make this conversion so that words with different case (for example “the” and “The”) are treated the same. We check if this word already exists in the Markov model. If not, we insert currentWord as a new key in the dictionary, with an empty array as the value. Then, whether or not the key was new, we read the next word into the variable nextWord and append it to the next words array for currentWord. Before doing so, we check the value of i. If we are at the last word in the array, there is no next word to look for. That is what the second if statement does.

Once all words have been processed, we enable the generate text button by setting its disabled property to false. Finally, if the user didn’t select any file, we show an alert message and the function returns.
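If you’d like to try the training loop without the file input, it can be pulled out into a plain function. The sketch below is a variation on the handler above, not the project code. The name trainModel() is made up, it uses a Map instead of a plain object, and the filter(Boolean) call is an addition that drops the empty strings which split() produces when the text starts or ends with whitespace.

```javascript
// Hypothetical standalone version of the training loop.
function trainModel(text) {
  const model = new Map() // word -> array of next words (with duplicates)
  const words = text.toLowerCase().split(/\s+/).filter(Boolean)
  for (let i = 0; i < words.length; i++) {
    if (!model.has(words[i])) {
      model.set(words[i], [])
    }
    if (i < words.length - 1) {
      model.get(words[i]).push(words[i + 1])
    }
  }
  return model
}

const tiny = trainModel('The cat sat. The dog sat.\n')
console.log(tiny.get('the'))  // [ 'cat', 'dog' ]
console.log(tiny.get('sat.')) // [ 'the' ]
```

Notice that punctuation stays attached to the words, so “sat.” is a different key from “sat”. The same is true of the handler above.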

Generating a random sentence

Now that our Markov model is trained, let’s use it to generate a random sentence.

function generateText(numWords) {
const keysArray = Object.keys(markovModel)
const initialWordIndex = Math.floor(Math.random()*keysArray.length)
const initialWord = keysArray[initialWordIndex]
let currentWord = initialWord
let result = currentWord
for (let i = 0 ; i < numWords - 1 ; i++) {
const potentialNextWords = markovModel[currentWord]
if (!potentialNextWords || potentialNextWords.length === 0) {
break
}
const nextWordIndex = Math.floor(Math.random()*potentialNextWords.length)
const nextWord = potentialNextWords[nextWordIndex]
result = result + ' ' + nextWord
currentWord = nextWord
}
return result
}

We define a function named generateText(), which accepts the number of words that the sentence should have. Every sentence starts with a word, and we pick the starting word randomly from the dictionary. We use the Object.keys() method to obtain an array with all the words in the Markov model. We generate a random integer between 0 and n-1, where n is the length of the words array, and use this index to look up a random word from the dictionary. We store this random word in the variable initialWord, and copy it into a variable named currentWord. We also copy it to the result string, which represents the randomly generated sentence.

We then have a for loop that generates the remaining words of the requested sentence. Each time in the loop, we acquire the next words corresponding to currentWord, pick one at random, and append it to the result. We then update currentWord before the next iteration of the for loop. Note that if there are no next words corresponding to currentWord, we simply break out of the for loop. This means that we weren’t able to generate a sentence with the requested number of words. If this were a hard requirement, we could use a different strategy. For instance, we could pick a next word at random from the dictionary.
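The alternative strategy mentioned above could look something like the following sketch. The function generateSentence() is a hypothetical variant, not the project code. It takes the model and a random number source as parameters, and it assumes that the model has at least one word in it.

```javascript
// Hypothetical variant of generateText(): on a dead end, jump to a random
// dictionary word instead of ending the sentence early.
function generateSentence(model, numWords, random = Math.random) {
  const keys = Object.keys(model)
  const pick = (items) => items[Math.floor(random() * items.length)]
  let currentWord = pick(keys)
  const sentence = [currentWord]
  while (sentence.length < numWords) {
    const candidates = model[currentWord]
    if (candidates && candidates.length > 0) {
      currentWord = pick(candidates)
    } else {
      // Dead end: fall back to any word in the dictionary
      currentWord = pick(keys)
    }
    sentence.push(currentWord)
  }
  return sentence.join(' ')
}

const toyModel = { a: ['b'], b: [] }
console.log(generateSentence(toyModel, 4)) // always four words long
```

The trade-off is that the jump produces a word pair that never occurred in the training text.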

Generating a random paragraph

Now that we have generated a random sentence, generating a random paragraph should be easy.

function generateRandomParagraph() {
let result = '<p>'
const numSentences = 3 + Math.floor(Math.random()*8)
for (let j = 0 ; j < numSentences ; j++) {
const numWords = 5 + Math.floor(Math.random()*16)
result = result + generateText(numWords) + '. '
}
result = result + '</p>'
return result
}

We define a function named generateRandomParagraph(). We start by initializing a string with the <p> HTML tag in it. A paragraph is nothing but a bunch of sentences, right? How many sentences, though? I read somewhere that a typical paragraph has between three and ten sentences in it. So, let’s generate a random integer between three and ten. Since three is the lower limit, we use 3 as a constant additive term along with a random integer. Since we have already added three, the random integer should range between zero and seven. That’s why we do a Math.floor(Math.random()*8). We now have the number of sentences in this paragraph in the variable numSentences. Next, we have a for loop that runs numSentences times, calling generateText(). But generateText() expects the number of words in the sentence as an argument. So, for each sentence we generate a random number of words. I’m going with a sentence having between five and twenty words, which is why we add 5 to Math.floor(Math.random()*16). We append the sentence returned by generateText() to the result string and terminate the sentence with a period.

Once we’re done generating all the sentences, we mark the end of the paragraph with a </p> HTML tag. Finally, we return the result string.
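The two range calculations follow the same pattern, which can be captured in a small helper. The name randomIntInclusive() and the optional random parameter are assumptions made for this sketch.

```javascript
// Hypothetical helper: random integer between min and max, both inclusive.
function randomIntInclusive(min, max, random = Math.random) {
  // There are (max - min + 1) possible values, for example 10 - 3 + 1 = 8
  return min + Math.floor(random() * (max - min + 1))
}

console.log(randomIntInclusive(3, 10)) // number of sentences, 3 to 10
console.log(randomIntInclusive(5, 20)) // number of words, 5 to 20
```

Since Math.random() never returns exactly 1, the largest value the floor can produce is max - min, so the result never exceeds max.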

Generating the required number of paragraphs

All that needs to be done is to generate as many paragraphs of text as the user requested in the HTML form.

function generateRandomText() {
const numParas = document.getElementById('paras')
const numParasInt = parseInt(numParas.value, 10)
let result = ''
for (let i = 0 ; i < numParasInt ; i++) {
result = result + generateRandomParagraph()
}
txtElement.innerHTML = result
}

We acquire an object corresponding to the input element for the number of paragraphs to generate. We convert its value to a base 10 integer. We initialize an empty string for the text. We call the generateRandomParagraph() function as many times as requested by the user while concatenating each paragraph into the result string. Finally, we assign the randomly generated text to the section element that is meant for this purpose.
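One thing to keep in mind is that parseInt() returns NaN if the number field is empty, in which case the loop doesn’t run and no text is generated. If you’d rather fall back to a default, here is one possible sketch. The function parseParagraphCount() and the fallback of five paragraphs are assumptions made for this illustration.

```javascript
// Hypothetical helper: parse the paragraph count, with a fallback for bad input.
function parseParagraphCount(rawValue, fallback = 5) {
  const count = parseInt(rawValue, 10)
  if (Number.isNaN(count) || count < 1) {
    return fallback
  }
  return count
}

console.log(parseParagraphCount('7'))  // 7
console.log(parseParagraphCount(''))   // 5
console.log(parseParagraphCount('-2')) // 5
```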

Some CSS

Let’s touch up the looks of the web page a little bit.

*, ::before, ::after {
margin: 0;
padding: 0;
box-sizing: border-box;
}

body {
font-family: 'Arial', sans-serif;
background-color: #f4f4f4;
}

h2 {
margin: 2rem 0;
text-align: center;
}

form {
max-width: 600px;
margin: 20px auto;
background-color: #fff;
padding: 20px;
border-radius: 8px;
box-shadow: 0 0 10px rgba(0, 0, 0, 0.1);
}

label {
display: block;
margin-bottom: 8px;
font-size: 1.2rem;
}

input {
padding: 8px;
margin-bottom: 16px;
box-sizing: border-box;
}

button {
background-color: #4caf50;
color: #fff;
padding: 10px 20px;
border: none;
border-radius: 4px;
cursor: pointer;
}

button:disabled {
background-color: #ccc;
cursor: not-allowed;
}

#fileContents {
max-width: 600px;
margin: 20px auto;
background-color: #fff;
padding: 20px;
border-radius: 8px;
box-shadow: 0 0 10px rgba(0, 0, 0, 0.1);
}

#fileContents p {
line-height: 1.5;
margin-bottom: 1.2rem;
}

We start with a basic CSS reset to get rid of browser-specific styling in terms of padding, margin, and box sizing. We define a default font family and a background color on the body element. We center the h2 heading horizontally using the text-align property, and space it away from the content above and below with a bit of top and bottom margin.

Next up, we style the form. We set its maximum width to 600 pixels. We give it a bit of margin on the top and bottom and center it horizontally. We give it a white background color. We give it a bit of padding to space the contained elements away from the edges. We also give it rounded corners and a box shadow.

By default the labels are laid out in a row one after the other. To set them on separate lines, we set the display property to block. We add a bit of margin below the labels and increase the font size.

We improve how the input elements look by adding a padding inside them. This makes the text spaced away from the edges. We also give a bit of margin below the input elements.

We change the background and text color for the button. We add a bit of padding to the button so that the button text is spaced from its edges. We get rid of the border, and give the button rounded edges. Finally, to give a visual clue to the user that this is an active element, we set the cursor property to pointer.

We use the :disabled pseudo-class to change the button’s look when it is disabled. In this state, we set the background color to a gray and the cursor property to not-allowed.

To style the section for the generated text, we use an ID selector and give the element the same maximum width as the form. We add a 20-pixel margin on the top and bottom and choose auto for the right and left so that it is horizontally centred. We give it a white background color and a bit of padding on all sides so that the text is spaced from the edges. We also give it rounded corners and a box shadow.

We style the p elements in the generated text section to have a greater line height than default for comfortable reading. We also give a bit of margin below each p element.

Now, our page should look something like the following once the random text is generated:

Our page after a bit of styling

If you are bothered by the sentences starting with lowercase letters, here are a couple of minor changes. First, in the generateText() function, change the initialization of the result string to:

let result = '<span class="first">' + currentWord + '</span>'

We enclose the first word in a span element and give it a class of first. Next, we define this class to capitalize the first letter of the word in CSS:

.first {
text-transform: capitalize;
}
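If you’d rather keep the markup out of the generated string, the first letter could be capitalized in JavaScript instead. This is only a sketch of an alternative, and the function name capitalizeFirst() is made up for it.

```javascript
// Hypothetical alternative to the CSS approach: capitalize in JavaScript.
function capitalizeFirst(word) {
  if (word.length === 0) {
    return word
  }
  return word.charAt(0).toUpperCase() + word.slice(1)
}

console.log(capitalizeFirst('markov')) // Markov
```

With this approach, the result string would be initialized with capitalizeFirst(currentWord) and no span element would be needed.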

That’s all folks!

That concludes our project. Please do play around with the code. Try to change things. For instance, what would you change to have the next word arrays hold the index of the next words rather than the words themselves? How would you make the page responsive?

You may download the code from this GitHub repository. If you prefer copy-paste, here’s the HTML:

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Read Text File with JavaScript</title>
<link rel="stylesheet" href="style.css">
</head>
<body>
<section class="centered">
<h2>Random English text generator</h2>
<form action="" class="train">
<label for="fileInput">Select a text file with some English text: </label>
<input type="file" name="inputfile" id="fileInput"/>
<label for="paras">Number of paragraphs: </label>
<input name="numpara" type="number" id="paras" value="5">
<button type="button" id="submit" class="btn" disabled>Generate</button>
</form>
</section>

<section id="fileContents">The randomly generated text will appear here</section>

<script src="app.js"></script>
</body>
</html>

Here’s the CSS:

*, ::before, ::after {
margin: 0;
padding: 0;
box-sizing: border-box;
}

body {
font-family: 'Arial', sans-serif;
background-color: #f4f4f4;
}

h2 {
margin: 2rem 0;
text-align: center;
}

form {
max-width: 600px;
margin: 20px auto;
background-color: #fff;
padding: 20px;
border-radius: 8px;
box-shadow: 0 0 10px rgba(0, 0, 0, 0.1);
}

label {
display: block;
margin-bottom: 8px;
font-size: 1.2rem;
}

input {
padding: 8px;
margin-bottom: 16px;
}

button {
background-color: #4caf50;
color: #fff;
padding: 10px 20px;
border: none;
border-radius: 4px;
cursor: pointer;
}

button:disabled {
background-color: #ccc;
cursor: not-allowed;
}

#fileContents {
max-width: 600px;
margin: 20px auto;
background-color: #fff;
padding: 20px;
border-radius: 8px;
box-shadow: 0 0 10px rgba(0, 0, 0, 0.1);
}

#fileContents p {
line-height: 1.5;
margin-bottom: 1.2rem;
}

.first {
text-transform: capitalize;
}

Here’s the JavaScript:

const genButton = document.getElementById('submit')
const fileButton = document.getElementById('fileInput')
const txtElement = document.getElementById('fileContents')
genButton.addEventListener('click', generateRandomText)
fileButton.addEventListener('change', handleFileSelect)

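// A dictionary without a prototype, so that words such as "constructor"
// do not collide with properties inherited from Object.prototype
let markovModel = Object.create(null)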

function handleFileSelect(event) {
const fileInput = event.target;
const file = fileInput.files[0];

if (file) {
const reader = new FileReader();

reader.onload = function (e) {
const fileContents = e.target.result;
const words = fileContents.split(/\s+/)
for (let i = 0 ; i < words.length ; i++) {
const currentWord = words[i].toLowerCase()
if (!markovModel[currentWord]) {
markovModel[currentWord] = []
}
if (i < words.length - 1) {
const nextWord = words[i + 1].toLowerCase()
markovModel[currentWord].push(nextWord)
}
}
genButton.disabled = false
};

// Read the file as text
reader.readAsText(file);
} else {
alert('No file selected.');
}
}

function generateText(numWords) {
const keysArray = Object.keys(markovModel)
const initialWordIndex = Math.floor(Math.random()*keysArray.length)
const initialWord = keysArray[initialWordIndex]
let currentWord = initialWord
let result = '<span class="first">' + currentWord + '</span>'
for (let i = 0 ; i < numWords - 1 ; i++) {
const potentialNextWords = markovModel[currentWord]
if (!potentialNextWords || potentialNextWords.length === 0) {
break
}
const nextWordIndex = Math.floor(Math.random()*potentialNextWords.length)
const nextWord = potentialNextWords[nextWordIndex]
result = result + ' ' + nextWord
currentWord = nextWord
}
return result
}

function generateRandomParagraph() {
let result = '<p>'
const numSentences = 3 + Math.floor(Math.random()*8)
for (let j = 0 ; j < numSentences ; j++) {
const numWords = 5 + Math.floor(Math.random()*16)
result = result + generateText(numWords) + '. '
}
result = result + '</p>'
return result
}

function generateRandomText() {
const numParas = document.getElementById('paras')
const numParasInt = parseInt(numParas.value, 10)
let result = ''
for (let i = 0 ; i < numParasInt ; i++) {
result = result + generateRandomParagraph()
}

txtElement.innerHTML = result
}

Muhammad Saqib Ilyas

A computer science teacher by profession. I love teaching and learning programming. I like to write about frontend development, and coding interview preparation