Fun with text generation pt. 1: Markov models in awk

Marcin Tustin
Published in Build and Learn
3 min read · Nov 11, 2019

Inspired by Really fast Markov chains in ~20 lines of sh, grep, cut and awk, this series explores text generation, first with Markov models, then with generative adversarial networks.

The very specific inspiration for this installment was the realization that I could write an even shorter implementation in pure awk.

What are Markov models?

Wikipedia has the details, but in this particular example, these are the important elements:

  1. A Markov process is one where the probability of transitioning from one state (a word) to another (the next word) depends only on the current state; and
  2. we “learn” the probability distribution of transitions from one word to another by observing the transitions in the underlying text we’re simulating (made concrete in the sketch just after this list).
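For instance, here is a minimal sketch of that learning step, using a made-up six-word sentence: for each word, it records every word observed to follow it.

BEGIN {
    # build successor lists for a toy sentence
    n = split("the cat sat on the mat", words, " ")
    for (i = 1; i < n; i++)
        nextwords[words[i]] = nextwords[words[i]] " " words[i + 1]
    for (w in nextwords)
        print w " ->" nextwords[w]
}

Here “the” is followed by “cat” once and “mat” once, so drawing uniformly from its list picks each with probability 1/2.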

A quick implementation

# hmm.awk: requires gawk, because it uses RT (the text that matched RS)
BEGIN {
    # if RS was passed empty on the command line, split on runs of non-alphanumerics
    if (!RS) { RS = "[^[:alnum:]]+" }
    if (!outlen) { outlen = 100 }   # number of words to generate
    srand(systime())
    prevword = ""
}
# choose an initial word: prevrt still holds the separator seen before this
# word, so this prefers words that open a sentence (or the very first word)
(rand() < 0.1) && !firstword && prevrt ~ /[!?.]|^$/ { firstword = $0 }
# form the transition matrix: append this word to the previous word's
# successor list; duplicates are kept, which encodes the frequencies
{
    nextwords[prevword] = nextwords[prevword] " " $0
    prevword = $0
    prevrt = RT
}
END {
    ORS = " "   # put spaces between words
    word = firstword
    for (i = 1; i < outlen; ++i) {
        print word
        n = split(nextwords[word], potential_successors, " ")
        word = potential_successors[int(rand() * n) + 1]
    }
    ORS = "\n"  # have a final newline
    print word
}

The most salient thing about this model is how simple the probability model is: we sample uniformly from the list of observed successors, duplicates included, so each transition’s probability is just its observed relative frequency. There is also no attempt to model hidden states; we simply build the Markov chain based on what we observe in the input text.
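To see what that amounts to, take the toy text “a b a c a b”: the successor list for “a” is “b c b”, and sampling uniformly from that list, as the END block does, should come up with “b” about two thirds of the time. A minimal sketch, with the successor list hard-coded:

BEGIN {
    srand()
    # observed successors of "a" in the toy text "a b a c a b"
    n = split("b c b", successors, " ")
    for (i = 1; i <= 10000; i++)
        count[successors[int(rand() * n) + 1]]++
    for (w in count)
        printf "%s: %.2f\n", w, count[w] / 10000
}

The duplicates in the list are what carry the frequency information; deduplicating it would genuinely flatten the distribution.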

How well does it work?

$ gawk -f hmm.awk -v outlen=12 -v RS=$'[[:space:]"%=|-]+' goedel.u8
The FORTUNE PROGRAM If you should have been possible. This Fortune Cookie Directed by

$ gawk -f hmm.awk -v outlen=27 -v RS=$'[[:space:]"%=|-]+' goedel.u8
Chambers and before you contribute 30 fortunes or .cshrc. file. Do not looking at selected theaters and display work. This still begs the curvature of the fortune is just

$ gawk -f hmm.awk -v outlen=27 -v RS=$'[[:space:]"%=|-]+' goedel.u8
it now before the curvature of the pigeonhole principle. (3) I read today's fortune. In which you by the Inanities. There is soon to misconstrue all ) [to which

The input was the Goedel fortune file.

Now, on Wuthering Heights:

$ gawk -f hmm.awk -v RS="" ~/wuthering.txt
The harm had never looked out of fever and poor thing as to answer Shall you have not comprehend it were smooth away as one night as I were invisible it was digging Linton slowly drew her he demanded who had left him out of this roof Then he replied because he Put your gruel cold blue in law on endeavouring to pass such marks of Gimmerden Sough And what they be dark skinned gipsy in the kitchen Set two feeble as your wife s affair except a Project Gutenberg tm work or is scarcely conscious of course of it will you

$ gawk -f hmm.awk -v RS="" -v outlen=20 ~/wuthering.txt
with a concourse thickening outside It was mad or to thrust forward on an evil disposition to remind myself directly No but

$ gawk -f hmm.awk -v RS="" -v outlen=26 ~/wuthering.txt
The poor For additional terms than my hands in Christendom and then he descended to knock him again Thear agean em properly but to discourse as soft with

So…it’s OK.

Where next?

The most obvious next step is to use a hidden Markov model (in short, the assumption is that there are also hidden Markov states, which in this case would correspond to the grammar).

The other thing we haven’t automated or measured in any way is how good the generated text is. For that, we will need to develop a discriminator model.


Data Engineering and Lean DevOps consultant — email marcin.tustin@gmail.com if you’re thinking about building data systems