## Bluebells, 2

I didn’t give you enough information in the preceding post to find the grid reference to the bluebell wood. Here is some more, which should be more than enough. This is a part of the substitution table (rows and columns numbered from 0 to 5):

 4 2 3 4 1 3 2 0 5 2 4

Now you can do the decoding and find the bluebell wood, and rush off to see the bluebells at their best.

This is a cipher system for encrypting a plaintext message which is a string of characters taken from an alphabet of size q, say. The key is a random string of characters from the same alphabet, of the same length as the original message. It is important for the operation of the system that the key is truly random; as with the idealised monkeys, each character is equally likely to occur in any given position in the string, and characters in different positions are independent. So, if the length of the strings is n, then each possible key has probability 1/qn of occurring.

According to stories I have heard, the one-time pads used in the Cold War were produced by employees of the CIA, KGB, or whoever, sitting in a secret room tossing coins. Since both sender and receiver must have a copy of the key, two copies are made; one is given to the agent going behind enemy lines, and the other is held securely in the agency’s headquarters. Readers of Peter Wright’s book Spycatcher will know how it works. Each character must only be used once; the pad is on paper easy to dispose of, and each sheet is destroyed once it is finished. (The book tells of finding one-time pads in suspected Russian agents’ rooms in London, copying them and reassembling them, so that the agents’ communications could be read.)

The other essential ingredient is a substitution table, a square table whose rows and columns are indexed by the alphabet, and whose entries are alphabet symbols. The encryption is done one symbol at a time: find the row indexed by the plaintext symbol and the column indexed by the key symbol, and the corresponding entry is the ciphertext symbol.

What properties should the table have?

• There should not be a symbol repeated in any column. For, if symbol s occurs in column c in both rows r1 and r2, then the receiver of a message having ciphertext symbol s in a position corresponding to key symbol c will not know whether the plaintext symbol should be r1 or r2; so unique decryption is not possible.
• There should not be a symbol repeated in any row. Such a repeat does not prevent decryption, but “leaks” information about the plaintext to an interceptor. For if a symbol is repeated in row r, then some symbol s is missing, and the interceptor knows that the ciphertext symbol s cannot be produced by the plaintext symbol r.

In other words, the substitution table should be a Latin square.

Shannon’s Theorem shows that a one-time pad as specified is “unbreakable”, in the sense that if the key is uniformly random (and is kept out of the interceptor’s hands) and the substitution table is a Latin square, then interception of the ciphertext gives no information, and indeed the ciphertext and some plaintext give no information about the rest of the plaintext. For, with these assumptions, a short calculation using Bayes’ theorem shows that, whatever prior distribution the interceptor has put on the possible content of the message, the posterior distribution (after interception) is equal to the prior distribution.

Shannon’s Theorem does not make any assumptions about the Latin square: any Latin square of order q will do. Now there are two possible views:

• Since it doesn’t matter, use a very simple Latin square, for example addition (mod q); this will reduce the chance of errors by the agent, who may be doing the encryption in difficult circumstances.
• Choose a random Latin square, and change it regularly; this will confuse the interceptor even if some keys are captured.

In this context, it is interesting to note that, in the article “Japanese Army Air Force Codes at Bletchley Park and Delhi”, by Alan Stripp, in the book Code Breakers: The Inside Story of Bletchley Park (edited by F. H. Hinsley and Alan Stripp), an example is given of a substitution table supposedly used
in the Japanese Army Air Force cipher J6633. This cipher had a codebook for translating common words in the world of naval warfare into strings of four digits, and the entire message (over the alphabet {0,1,2,…,9}) encrypted with a one-time pad; they followed the second option above, and changed the encryption table regularly. However, the example shown is not a Latin square!

Now we have a method (discovered by Jacobson and Mathews) for choosing a random Latin square of any size, which I will discuss in another post.

Shannon’s Theorem shows that, if the ciphertext and substitution table are known, but the code remains secret, then the plaintext is completely secure. This suggests a problem, if we turn the assumptions on their head, as I did with my post on bluebells:

Suppose that the ciphertext and key are known, but the substitution table (a random Latin square) is secret. How much information about the plaintext can be deduced?

It turns out that this is a very entertaining question to analyse. Clearly, if n = 1, then no information is leaked: one entry in a random Latin square, in a specified position, is a random symbol. The longer the message, the more information can be obtained; long ciphers can be broken. Let us see why.

Suppose the ciphertext has length n, much greater than q. Since the key is random, each symbol will occur close to n/q times. If we look in the positions where a given symbol occurs in the key, we have a long string of characters taken from random positions of the plaintext and encrypted with a simple substitution cipher (the substitution corresponding to the unknown column of the random Latin square). This can be attacked by the standard methods of frequency analysis, discovered by Arab cryptographers a millennium ago. Moreover, there will be about n/q2 positions where the same symbol occurs twice in succession in the key; this gives information about digrams if n is significantly larger than q2. Similarly we can get information about trigrams if the message is sufficiently long.

In usual frequency analysis, we get a few letters, and have no way of building up more except by trying to guess words in the plaintext. But there is another method available here: Sudoku! Once we have a few letters in each column of the substitution table, then the Latin square rules may begin to force other entries, until the entire table is built and the cipher is broken.

The example at the start of this post is meant to give an indication of how this works. Eight characters over an alphabet of length 6 are not enough for serious frequency analysis. (You might protest that frequency analysis doesn’t apply to grid references, but in fact I gave you the approximate location, so you could guess some digits of the plaintext and get some entries in the table, so the process is quite similar.) So imagine that frequency analysis of a longer message has produced the entries shown in the substitution table; the whole table can now be easily determined.