EXCERPT

The Cryptographer’s Toolbox

From Secret Key Cryptography by Frank Rubin

Manning Publications

Published in

CodeX

16 min readOct 19, 2022

This article covers:

The rating system used for ciphers
Substitution ciphers
Transposition ciphers
Fractionation, breaking letters into smaller units
Pseudorandom number generators

Take 35% off Secret Key Cryptography by entering fccrubin into the discount code box at checkout at manning.com.

Secret Key ciphers are built from a few basic elements. You can think of these as the tools of the trade. To build a strong cipher you want all of these tools in your toolbox. That does not mean you should use every element in every cipher. That could lead to excess complexity without any improvement in security. Your cipher would be slower, with no added benefit. This article covers substitution, transposition, fractionation and random numbers.

Before discussing the elements, let’s talk about strength. The strength of a cipher is measured in bits. Each bit represents one binary choice. If there were a cipher where each ciphertext could represent just one of two possible plaintexts, then that cipher would have a strength of 1 bit. For example,

0 = We lost.

1 = We won.

The size of the key is a limiting factor in determining the strength of a cipher. If a cipher uses 64-bit keys, then its strength can be no more than 64 bits, but the strength can be less if the cipher is weak.

Rating System

In order to give you a general feel for the strengths of the ciphers, I rate ciphers on a One to Ten scale. These are my personal ratings, based on my experience and my analysis of how much effort is required to break the cipher using the best techniques I know, and how the ciphers compare to one another and to historical ciphers that were or were not broken in practice. I give much of the analysis in the section preceding each rating.

One indicates a cipher that can be broken by a beginner with no training using only paper and pencil and moderate effort.
Two indicates a cipher that can be broken by an experienced amateur or hobbyist using only paper and pencil.
Three is a cipher that a skilled amateur cryptographer can breach with hand methods.
Four or Five means that a computer, a trained cryptographer, or both are needed.
From Six to Nine indicates how much computing power an expert opponent would need.
Ten denotes a cipher that will hold up against a national cryptographic agency with legions of trained cryptographers using today’s largest supercomputers.

Sometimes I go outside the scale. Zero means that the cipher can be understood without needing paper or pencil, such as Pig Latin or GNITIRW EHT SDROW SDRAWKCAB. An Eleven rating means that the cipher will stand up to potential future ultracomputers far stronger than quantum computers or supercomputers as we currently conceive them.

By seeing how I rate different ciphers, you can get the gist of how to rate ciphers that you see elsewhere, or which you may invent yourself. Each rating is only an estimate, not a guarantee of strength.

Substitution

The first tool in the cryptographer’s toolbox is substitution. One unit is substituted for another unit in a text. The plaintext units can be single letters, pairs of letters or longer blocks. The ciphertext units can be letters, blocks of letters, blocks of digits or letter-digit combinations. When all units are single letters, the cipher is called simple substitution or monoalphabetic. In computer cryptography the units can be bits, bytes, or blocks of bits or bytes of any length. This section gives a quick glimpse.

One of the oldest known substitution ciphers is the Caesar cipher, used and possibly invented by Julius Caesar, where each letter of the alphabet is replaced by the letter 3 positions later. In modern use, this may be any fixed number of positions earlier or later. The Caesar cipher is rated One.

There is no requirement that all plaintext units have the same length. Suppose that the cipher takes letters of the alphabet and substitutes 2-digit pairs. There are only 26 letters of the alphabet, but 100 possible digit pairs. This means there are 74 extra pairs the cryptographer can use for some other purpose. One approach, which has been used for hundreds of years, is to provide substitutes for common letter pairs, such as TH, ER, ON, AS and NT, and possibly short words like THE and AND, in addition to the single letters. The plaintext units would then be 1, 2 or 3 letters long. This makes the frequencies of the digit pairs more uniform. Since differences in letter frequencies can be used for solving ciphers, making the frequencies more uniform makes the cipher stronger.

Another approach is to use the extra pairs to provide additional substitutes for some common letters. This is called homophonic substitution. For example, you might provide 10 substitutes for E, 8 substitutes for T, and so forth. The multiple substitutes for a given letter are called homophones. This is analogous to the way the homophones F and PH both represent the same sound in English. Providing multiple substitutes makes the frequencies of the 100-digit pairs more even. Naturally, both approaches, letter pairs and homophonic substitution, can be combined to get even more uniform frequencies for the digit pairs. In other words, these methods prevent an opponent from using frequency analysis.

Huffman Codes

In a computer context the ciphertext units can be strings of bits. A good example is Huffman coding, developed by David A. Huffman in 1952 when he was a student at MIT. I won’t cover the method for optimizing the set of codes, I will just give the general concept as an example of a variable-length binary code. In Huffman coding, the most frequent letters get short codes, while rarer letters get long codes, based on an underlying letter frequency table. Consequently, fewer bits are required to express the message. This is called text compression.

The most frequent letters in English are E and T, which each occur about 1/8 of the time. Since 8=23 we use 3 bits to represent E and T. We can arbitrarily choose any 3-bit values, say E=100 and T=111. I call this method Mixed Huffman. The next most frequent are A, O, I, N, S, R, H. These each occur roughly 1/16 of the time, so we use 4 bits for each of them. We can use any 4-bit codes, except codes starting with 100 or 111 which have already been used. The next group of letters are D, L, U, C, M, F, Y which each occur about 1/32 of the time, so 5-bit codes are needed. And so forth.

Here is a set of Mixed Huffman codes I created based on counts of 150,000 letters of English text. Other languages vary. Huffman codes have the prefix property, namely no code is a prefix of any longer code. For instance, if abcd is a code, then abcde could not also be a code for any choice of binary digits a,b,c,d,e. The prefix property was first described by mathematician Emil Leon Post in 1920.

E 100     D 00000     P 010010     
         T 111     L 01000     B 010011     
         A 0001    U 10110     V 110101     
         O 0010    C 10111     K 1101000    
         I 0011    M 11000     X 11010011   
         N 0101    F 11001     Q 110100101
         S 0110    Y 11011     J 1101001000
         R 0111    W 000010    Z 1101001001
         H 1010    G 000011

Using these code groups, the word STYLE would be encoded as 0110 111 11011 01000 100. Rewriting this in groups of 4 bits gives 0110 1111 1011 0100 0100 which is hexadecimal 6FB44.

Although it is nearly impossible for Emily to identify the code groups for individual letters in a ciphertext, Emily can search for longer repeated strings of bits. These will represent common letter pairs, called bigrams, letter triples, called trigrams, or words. For example, any given 10-bit string should appear about once every 210, or 1024, times. If a 10‑bit string appears 20 or more times out of 1024 strings, then it almost certainly represents the word THE, which is by far the most common word in English. If you have identified the word THE in a text, then you can look for extensions like THERE or THESE, which are easy to spot because of the repeated E. Mixed Huffman is rated Three.

Transposition

The second major cryptographic tool is transposition, changing the order of the characters in a message. The simplest method is route transposition. The letters of the message are written into a rectangle in one order, and read out in a different order. This section gives a quick look at this in practice.

For example, the message THERE IS NO LOVE AMONG THIEVES, which has 25 letters, is written into this 5×5 grid from left to right across the rows, and read out from top to bottom down the columns. The leftmost column in this grid is TIOOI when read from top to bottom.

T H E R E     Plaintext:  THERE IS NO LOVE AMONG THIEVES
         I S N O L
         O V E A M     Ciphertext:  TIOOI HSVNE ENEGV ROATE ELMHS
         O N G T H
         I E V E S

Common routes for writing the letters into the grid, and for reading the letters out of the grid, include going straight across the rows, either left or right, straight up or down the columns, alternating left and right across the rows, alternating up and down the columns, diagonally starting at any corner, diagonally in alternating directions, or spiral clockwise or counter clockwise. Route transposition ciphers are rated One.

Fractionation

Fractionation is the division of characters into smaller units. We have already seen one way, representing a character as a binary number. Each bit of that binary number can be manipulated as a separate unit, substituted or transposed. This section introduces fractionation.

A classical way of representing a letter as two digits is the Polybius Square, invented in the second century BCE by the Greek historian Polybius. Here is a 5×5 square using a mixed alphabet with the keyword SAMPLE. Notice that the letters I and J share one cell in order to make the 26-letter alphabet fit into the 25-cell grid.

1  2  3  4  5
         1  U  V  W  X  Y     A mixed Polybius Square using
         2  Z  S  A  M  P     the keyword SAMPLE
         3  L  E  B  C  D
         4  F  G  H  IJ  K
         5  N  O  Q  R  T

Since A is on row 2 in cell 3, it is represented by 23. B is 33, C is 34, and so on, through Z represented by 21. I and J are both represented by 44. These digits can then be substituted, transposed and regrouped in various ways. Pairs of digits can be turned back into letters using this grid, or another Polybius Square arranged in a different mixed order.

A modern version would replace each character with its hexadecimal representation in ASCII or UTF-8 code. Thus A=41, B=42, C=43, through Z=5A. These hexadecimal digits can similarly be substituted, transposed, regrouped and turned back into bytes.

A fun example is Fractionated Morse invented by M. E. Ohaver in 1910. Ohaver always went by M.E. because he disliked his first name, which was Merle.

In Fractionated Morse the letters are taken in groups of a fixed size, say 7, and replaced by their Morse Code equivalents, using / as a letter separator. Then the lengths of the code groups are reversed, and the resized groups are turned back into letters.

E X    A  M  P    L    E   Plaintext
         ·/-··-/·-/--/·--·/·-··/·   Morse equivalents
         1 4    2  2  4    4    1   Code group lengths
         1 4    4    2  2  4    1   Lengths in reverse order
         ·/-··-/·---/·-/-·/·-··/·   Regrouped Morse
        E X    J    A  N  L    E   Equivalent letters

Morse Code was invented by Alfred Vail in 1840 and named for his employer, Samuel F. B. Morse.

This cipher has several obvious weaknesses. Since it uses the standard Morse alphabet, the only key is the length of the letter groups, which can be guessed in just a few tries. Plaintext letters are often replaced by themselves. There are 30 different Morse code groups, but only 26 letters, so four extra characters are needed. Ohaver used Germanic ä, ë, ö and ü. Fractionated Morse is rated One.

These problems can be partially fixed with two changes: (1) Use only the Morse groups of lengths 1, 3 and 4. There are 26 such groups, perfectly fitting the 26-letter alphabet. (2) Scramble the order of the alphabet, or, equivalently, scramble the order of the Morse code groups. I call this enhanced version FR-Actionated Morse. For example, using the keyword MIXEDALPHBT to mix the alphabet, with the Morse groups in standard order, you get:

M ·     P -·-    Y ·-··   N -·--  
         I -     H --·    Z ·-·-   O --··
         X ···   B ---    C ·--·   Q --·-
         E ··-   T ····   F ·---   R ---·
         D ·-·   U ···-   G -···   S ----
         A ·--   V ··-·   J -··-  
         L -··   W ··--   K -·-·

Even with these improvements, FR-Actionated Morse is rated only Two.

Random Number Generators

A random number generator can be anything that produces a sequence of numbers in some given range. The numbers might be single bits, 8-bit bytes, decimal digits, or numbers in any other desired range. For example, numbers in the range 0 to 25, corresponding to the 26 letters of the alphabet, are useful for some cryptographic purposes. This section introduces the topic.

It is important to recognize that there is no such thing as a “random number.” You cannot say that 51 is a random number, while 52 is not, or vice versa. You can, however, say that the sequence 51, 52, 53, 54, … is not random. That sequence is completely predictable. Randomness is a property of the sequence, or of the generator, not of the individual numbers in the sequence. It is more accurate to say “a random sequence of numbers” than “a sequence of random numbers.”

The generator might be a physical process, such as cosmic rays, the pinging of a Geiger counter, precise timing of computer keystrokes, a flag fluttering in a stiff breeze, spray from crashing waves or people rushing to catch trains. Most physical sources are not fast enough for cryptographic purposes, however the sequence of numbers might be stored in a computer file for later use.

The generator could also be a mathematical function or computer program which produces a number each time it is called. Random numbers produced by mathematical algorithms are called pseudorandom numbers to distinguish them from true random numbers. They are considered weaker than true random numbers because an opponent who determines a portion of the random sequence may be able to calculate the preceding and following numbers and thus read the message. True random numbers can never be produced by a mathematical function.

One big difference between pseudorandom sequences and true random sequences is that pseudorandom sequences eventually repeat, while true random sequences never repeat. The number of terms before the sequence repeats is called its period. The sequence 3,1,9,2,4,3,1,9,2,4,3,1,9,2,4,3,… for example, has a period of 5, shown underlined. In general, the longer the period, the stronger the cipher.

Simply because a sequence of numbers is random does not mean that the numbers are equally probable. For example, suppose you are observing the colors of cars crossing a busy bridge. The colors are random, but certain colors are much more common than others. White, black, silver and red are far more common than orange, fuchsia or chartreuse. Similarly, in the game of craps, if the dice are fair, then each throw is random, yet throwing a 7 is six times as likely as throwing a 12.

Let’s assume that any random number generator produces numbers with equal probabilities (this is not always the case). This is called a uniform distribution, or an equiprobable distribution. With a good random number generator, pairs and triples, and so on, of generated numbers will also have uniform probabilities, perhaps going as far as octuples or beyond.

Chained Digit Generator

Let me end this section with a sample pseudorandom number generator which can easily be done using paper and pencil. No computer required. Let’s call it the Chained Digit generator. Begin by writing any 7-digit decimal number. These 7 digits are called the seed, or initial value, or initialization vector. They can be considered the key, or a part of the key for any cipher which incorporates this generator. To get the first pseudorandom digit you simply add the first and last digits. You append this new digit to the sequence, and black out the first digit. So, starting with 3920516 we add 3+6 to get 9.

Any time the sum exceeds 9 we drop the tens digit. That is, the addition is done modulo 10. To get the second pseudorandom digit we repeat the process. Here 9+9 gives 18. We drop the tens digit to get 8.

This process can be repeated to get as many pseudorandom decimal digits as desired.

39205169800562199940232...

The resulting pseudorandom sequence is 9800562199940232...

Notice that if all of the digits in the seed are even, then all of the generated digits will be even. Likewise, if all of the digits are divisible by 5, namely 0 or 5, then all of the generated digits will be divisible by 5. In that case the period could be no more than 128 since there are 7 digits in the seed, and there are only 27=128 possible combinations of 0’s and 5’s. Since such seeds cannot produce long periods they are called unqualified. For the Chained Digit generator a qualified seed must contain at least one odd digit and one digit that is not a multiple of 5. For example, 2222225 is a qualified seed, but 2222222 and 5555555 are not qualified. With a qualified 7-digit seed the period will always be 2,480,437.

This generator has behavior typical of homemade pseudorandom number generators. There are 107 possible 7-digit seeds. If you start with any seed, the generator will cycle through some sequence of numbers until it produces that seed again, so the set of 7-digit numbers is partitioned into several discrete cycles, each with its own period. If you choose a qualified seed, then the cycle will always have the maximum possible period of 2,480,437 numbers. There are 4 separate cycles of this length, plus several much shorter cycles produced by the unqualified seeds.

The behavior is similar for seeds of other sizes. Even when the maximal cycle is very short there is often a high probability of getting a maximal cycle because there can be many maximal cycles. This table shows the probability of getting a cycle of a given length using a qualified seed:

The table shows that seed lengths of 5 and 8 digits are not safe. They produce a large percentage of very short cycles. Seed lengths of 7 and 10 digits are best because you are always guaranteed a long period.

This random number generator is strictly a demo model, just to show what can be achieved using simple hand methods. It is not suitable for high-security work.

Useful Combinations, Wasteful Combinations

The 4 basic techniques of this article can be used in myriad combinations, which I explore in my forthcoming book, Unbreakable Cryptography. However, it is important to recognize right at the outset that not every combination is beneficial. Some combinations add work without adding strength.

Consider an idea that some beginners try. They perform a simple substitution on a message, then a second simple substitution on the resulting text, then a third, and so forth, for 5, 10, even 100 rounds. This is a waste of effort. Performing two simple substitutions is the same as performing one, but with a different mixed alphabet, so performing many simple substitutions does not add any strength. Here is an illustration. The two substitutions are mixed with the keys FIRST and SECOND. The third substitution is equivalent to performing the first followed by the second.

ABCDEFGHIJKLMNOPQRSTUVWXYZ    First substitution
         XYZFIRSTABCDEGHJKLMONPQUVW
         ABCDEFGHIJKLMNOPQRSTUVWXYZ    Second substitution
         UVWXYZSECONDABFGHIJKLMPQRT
         ABCDEFGHIJKLMNOPQRSTUVWXYZ    Equivalent
         QRTZCIJKUVWXYSEONDAFBGHLMP

Let’s try an example. If we encipher the word EXAMPLE using the first substitution the result is IUXEJDI. If IUXEJDI is enciphered with the second substitution the result is CLQYOXC. You can verify for yourself that enciphering EXAMPLE with the equivalent substitution yields CLQYOXC.

Performing one encipherment and then a second encipherment is called composing the two encipherments. The previous example shows that composing two simple substitutions just produces another simple substitution. If the first encryption uses a code, then following the code with a cipher is called superencipherment. The most common form of superencipherment is non-carrying addition, or addition modulo 10, which works like this:

12155 12155 12155 12155   Superencipherment key
 61587 02954 70069 53028   Plaintext code groups
 73632 14009 82114 65173   Superenciphered code groups

Bazeries Type 4 Cipher

Let’s look at the opposite case. Let’s look at a cipher where using a substitution step followed by a very simple transposition produces a cipher that is much stronger.

The cipher was proposed by the brilliant, irascible and vituperative French cryptographer Étienne Bazeries in 1898. I do not know what Bazeries named this cipher. I call it Bazeries Type 4 because it was the last of 4 ciphers that he proposed to the diplomatic Bureau de Chiffre during the 1890s. It can easily be done by hand.

The Bazeries Type 4 consists of a simple substitution followed by a simple transposition, which I call a piecewise reversal. The transposition reverses short pieces of the text according to a key which is a sequence of small integers. Here is an example using the keyword BAZERIS to mix the substitution alphabet, and the key 4,2,3 for the transposition.

ABCDEFGHIJKLMNOPQRSTUVWXYZ   Plaintext alphabet
 HGFDCBAZERISYXWVUTQPONMLKJ   Ciphertext alphabet
 THEQUICKBROWNFOXJUMPSOVERTHELAZYDOG   Message plaintext
 PZCUOEFIGTWMXBWLROYVQWNCTPZCSHJKDWA   After substitution
 4    2  3   4    2  3   4    2  3   4    2  3   Transposition key
 PZCU OE FIG TWMX BW LRO YVQW NC TPZ CSHJ KD WA
 UCZP EO GIF XMWT WB ORL WQVY CN ZPT JHSC DK AW  Final ciphertext
 UCZPE OGIFX MWTWB ORLWQ VYCNZ PTJHS CDKAW  Ciphertext grouped by fives

This type of transposition can be used to strengthen many different types of ciphers, so it deserves a name of its own. Let’s call it Piecewise Reversal. The stronger Piecewise Transposition cipher can mix forward and reverse sections of text, like this example using the numeric key 3, 4, -3, 2.

3   4    -3  2   3   4    -3  2   3   4    -3  2  
 THE QUIC KBR OW  NFO XJUM PSO VE  RTH ELAZ YDO GS  Plaintext
 EHT CIUQ KBR WO  OFN MUJX PSO EV  HTR ZALE YDO SG  Transposed

The cryptographers at the Bureau de Chiffre were unable to solve any of the sample messages that Bazeries provided. Despite considerable effort, the messages remained unsolved for 40 years until renowned architect, and amateur cryptographer, Rosario Candela solved them and wrote a book about how he did it (The Military Cipher of Commandant Bazeries, Cardanus Press: New York, 1938).

Candela, however, was unable decrypt the messages directly. Instead, he identified and exploited a weakness in the way Bazeries generated the substitution alphabet from a key. If Bazeries had used a stronger method for mixing the cipher alphabet, Candela could not have deciphered the messages. Consequently, Bazeries Type 4 with a well-mixed alphabet is rated Five. Pretty good for combining two methods which are each rated One.

Here’s another historic tidbit: Candela was a graduate of Columbia School of Architecture, so he planned to publish his book at Columbia University Press. William F. Friedman, then the dean of American cryptologists, got wind of this, and secretly blocked the publication. This again attests to the strength of the Bazeries Type 4 cipher.

That’s all for this article. Thanks for reading.