123 BYTES PERL MARKOV CHAIN

TLDR: As a part of a funny online competition called Nano-NaNoGenMo we managed to craft the 123 bytes Perl script for a Markov Chain text generation; 139 bytes in total with a shell command (and even shorter if we allow some input file tweaking, see the remark at the very bottom of the post).

Aleksey Tikhonov
Dec 12, 2019 · 6 min read
Image for post
Image for post

The rest of the post gives some clues on the process of optimization — from a clean and shiny 752-bytes rosetta-code implemetation to the end and beyond.

COMPETITION AND RULES

But this year it had a twist — the narratology enthusiast from MIT, Nick Montfort, decided to run a spin-off competition, Nano-NaNoGenMo (aka NNNGM), with only one additional rule — the program should have 256 bytes at most (with the possibility of using any of Project Gutenberg files as input).

I’ve discovered NNNGM just in 3 days before the end but it was so tempting so I decided to join the competition. Next, I thought what is the simplest way to generate plausible texts is to use Markov chains and in my opinion, Perl is one of the best choices for code obfuscation projects (as I mentioned before) so it was going to be a Perl implementation of the Markov chain text generator, despite the fact I didn’t write a line of Perl code last decade.

INITIAL CODE AND LOGIC

use strict;
use warnings;

my $file = shift || 'alice_oz.txt';
my $n = shift || 3;
my $max = shift || 200;

sub build_dict {
my ($n, @words) = @_;
my %dict;
for my $i (0 .. $#words - $n) {
my @prefix = @words[$i .. $i+$n-1];
push @{$dict{join ' ', @prefix}}, $words[$i+$n];
}
return %dict;
}

sub pick1 { $_[rand @_] }

my $text = do {
open my $fh, '<', $file;
local $/;
<$fh>;
};

my @words = split ' ', $text;
push @words, @words[0..$n-1];
my %dict = build_dict($n, @words);
my @rotor = @words[0..$n-1];
my @chain = @rotor;

for (1 .. $max) {
my $new = pick1(@{$dict{join ' ', @rotor}});
shift @rotor;
push @rotor, $new;
push @chain, $new;
}

print join(' ', @chain) . "\n";

To keep track of the following changes it’s useful to understand the basic logic steps of this code:

  • let’s assume we have an input file with the text all in one line (without line breaks),
  • first, we read all the words from input file into list @words,
  • append the very first $n words to the end of the list (otherwise it can preliminary stop whenever the last $n words occasionally appear as a current key),
  • build a hash %dict with each $n words in a row as keys and lists of possible next words as values,
  • starting from the first $n words as a seed, take the next one as a random choice from the list from %dict by last seen $n words as a key, stored in @rotor , accumulate all generated words in @chain and print them at the end.

Note, this version has 752 bytes length.

OPTIMIZATION CHRONICLE

324 bytes:

  • got rid of argumets parsing and fixed the input filename to a.txt,
  • replaced constants with inline values (fixed them to $max=200, $n=3),
  • removed some unnecessary syntax sugar, spaces and tabs,
  • renamed functions and vars to 1-char names: pick1() -> z(), build_dict() -> b(), @words -> @w, %dict -> %d, @chain -> @c, $new -> $e, @rotor -> @r.

Let’s see what we had now:

sub b{my($n,@w)=@_;%r;
for $i(0..$#w-$n){@p=@w[$i..$i+$n-1];
push @{$r{join' ',@p}},$w[$i+$n];}
return%r;}
sub z{$_[rand@_]}
$t=do{open$h,'<','a.txt';local$/;<$h>;};
@w=split' ',$t;
push@w,@w[0..2];
%d=b(3,@w);
@r=@w[0..2];
@c=@r;
for(1..200){
$e=z(@{$d{join' ',@r}});
shift@r;
push@r,$e;
push@c,$e}
print join(' ',@c)."\n";

225 bytes:

  • removed unnecessary\n symbols,
  • put b()(previously known as build_dict()) inline at the only place where it was called from,
  • put a file-reading code inline into the split’s second argument,
  • pulled the last print inside the chain iteration cycle to remove the longjoin construction.
sub z{$_[rand@_]}@w=split' ',do{open$h,'<a.txt';<$h>};push@w,@r=@w[0..2];%d=do{for$i(0..$#w-3){@p=@w[$i..$i+2];push@{$s{join' ',@p}},$w[$i+3];}%s};print"@r ";for(1..200){$e=z(@{$d{join' ',@r}});shift@r;push@r,$e;print$e." ";}

As you may notice, here I had 225 bytes so I already met the goal of NNNGM to squeeze it into 256 bytes. But as Hunter Thompson once said — “Anything worth doing, is worth doing right.”

So I decided to continue and asked for help an old friend of mine, s0me0ne (Самуан Ункновн), who is known as a crazy Perl hacker since the previous century. Together we continued exercises.

Since we decided to play with Perl interpreter’s command-line arguments, from this point I’ll give two byte counters as a progress indicator — the size of the Perl code payload as well as the total size of the code including shell command.

199 / 214 bytes:

  • switched to reading from stdin instead of opening the file explicitly,
  • removed the cycle var $i, since it was unused anyway,
  • used print"$e " instead of print$e." ";.
perl -e'sub z{$_[rand@_]}@w=split" ",<>;push@w,@r=@w[0..2];%d=do{for(0..$#w-3){@p=@w[$_..$_+2];push@{$s{join" ",@p}},$w[$_+3];}%s};print"@r ";for(1..200){$e=z(@{$d{join" ",@r}});shift@r;push@r,$e;print"$e "}'<a.txt

181 / 196 bytes:

perl -e'sub z{$_[rand@_]}@w=split" ",<>;push@w,@r=@w[0..2];%d=do{for(0..$#w-3){push@{$s{"@w[$_..$_+2]"}},$w[$_+3];}%s};print"@r ";for(1..200){$e=z(@{$d{"@r"}});shift@r;push@r,$e;print"$e "}'<a.txt

177 / 192 bytes:

perl -e'sub z{$_[rand@_]}@w=split" ",<>;push@w,@r=@w[0..2];%d=do{push@{$s{"@w[$_..$_+2]"}},$w[$_+3]for(0..$#w-3);%s};print"@r ";for(1..200){push@r,z(@{$d{"@r"}});shift@r;print"$r[-1] "}'<a.txt

145 / 160 bytes:

perl -e'sub z{$_[rand@_]}$/=" ";@w=<>;push@w,@r=@w[0..2];push@{$s{"@w[$_..$_+2]"}},$w[$_+3]for(0..$#w-3);for(1..200){push@r,z(@{$s{"@r"}});print shift@r}'<a.txt

136 / 151 bytes:

Also, I realized what by the rules of the competition we need to print 50K+ words, not 200. Assuming our input file is long enough, I used its length in words as a target number of words to generate:

perl -e'$/=$";@w=<>;push@w,@r=@w[0..2];push@{$s{"@w[$_..$_+2]"}},$w[$_+3]for(0..$#w-3);for(@w){push@r,$s{"@r"}->[rand@{$s{"@r"}}];print shift@r}'<a.txt

127 / 145 bytes:

Plus, we found a couple of microoptimizatons:

  • it’s possible to omit -> when dereferencing an array in Perl (at least in Perl 5.26+),
  • it’s possible to replace $#w-3 with @w-4 having the same behavior.
perl -040e'@w=<>;push@w,@r=@w[0..2];push@{$s{"@w[$_..$_+2]"}},$w[$_+3]for(0..@w-3);for(@w){push@r,$s{"@r"}[rand@{$s{"@r"}}];print shift@r}'<a.txt

Now we reached an important imaginary border — 127 bytes of payload, less than half of the initial goal. Is it possible to make it any better?

123 / 139 bytes:

perl -ae'push@F,@r=@F[0..2];push@{$s{"@F[$_..$_+2]"}},$F[$_+3]for(0..@F-4);for(@F){push@r,$s{"@r"}[rand@{$s{"@r"}}];print$".shift@r}'<a.txt

Aaand… that’s it for now! :)
If you have any good ideas on how to squeeze it more — please write me some comments:)

Additional remark:

If we assume we can tweak the input file a little, we can copy it’s first three words to the end, like
(cat a.txt; cut -d’ ‘ -f1–3 a.txt; ) > b.txt
so now we don’t need the first push anymore, so we could have 116/132 bytes solution:

perl -ae'@r=@F[0..2];push@{$s{"@F[$_..$_+2]"}},$F[$_+3]for(0..@F-4);for(@F){push@r,$s{"@r"}[rand@{$s{"@r"}}];print$".shift@r}'<b.txt

But I do not think it’s fair, so the final result has 123/139 bytes code.

Altsoph’s blog

Random notes on people and machines

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store