To recap, in the last part we covered the definition of the running key cipher and looked at a simple mechanism that you might use to try and break it: a known plaintext attack. This time we’ll cover the next most advanced method you can use to try and effect a break of a piece of running key ciphertext: n-gram analysis.
For those readers unfamiliar with the fields of cryptanalysis and information retrieval, we’ll briefly cover what an n-gram is. Essentially it is just a list of n symbols, where symbols can be letters, words, or whatever else you are interested in analyzing. However, in typical usage the term implies that a probability that somehow reflects the likelihood of observing that symbol list is attached to each such list.
In our case we are interested in character based n-grams, simply because the running key cipher is character based, and so example bigrams (n = 2) would include “ot”, “ke” and “re”. Likewise, we can consider trigrams (n = 3) such as “the”, “foo” and “ter”, and even higher order n-grams such as the pentagrams (n = 5) “erenc” and “licks” or hexagrams (n = 6) like “okingh”. Of course, to be useful, these strings have to have their attached probability. As we are interested in analyzing typical English text, these probabilities will typically be estimated from the frequency of n-gram occurrence in some publicly available corpus such as newspaper archives or the internet itself.
Before we look at how we can use n-grams for cryptanalysis, let’s take a look at some code to create them. Let’s begin with a class to represent the n-gram itself:
So far so simple. We will also need a class to analyze and contain a collection of n-grams:
That’s a fair chunk of code! However, most of it is pretty simple. The real core is the Analyse method, which just counts the frequency of n-gram occurrence ready to be turned into a probability when the n-grams are enumerated later. It takes as arguments a StreamReader, containing the corpus we wish to analyze and the length of n-grams we are interested in.
Now, before we continue I feel that I should point out the most glaring flaw in this implementation: it deals very poorly with data sparsity. Any n-gram which does not occur in the corpus will be assigned a probability of zero. This is probably undesirable since almost every string is at least possible in practice, if not necessarily particularly likely! The process of estimating probabilities for unseen n-grams is called smoothing, and the methods you can use to do it vary from the simple (initialize all n-gram counts to 1 instead of 0) to the complicated (Good-Turing smoothing). However, I have to say that in practice I’ve not found a huge difference in outcomes when I use the simple scheme versus any more complicated one.
Despite this flaw we will plunge forward with our use of n-grams. The fundamental algorithm we will be using for the cryptanalysis is based on a so-called maximum likelihood criterion. What this means is that for every chuck of ciphertext we will try and decipher it in a bruteforce way with every possible keytext, and we will then use our n-grams to estimate the likelihood of this deciphering by taking the product of the probabilities of the resulting plaintext block and the guessed key fragment (which we are assuming to be normal English text). The deciphering which has the highest such product is considered to be the most likely one and is shown to the user. A simple implementation of this algorithm is shown below:
Notice that this program assumes that you have a copy of Frankenstein on your drive to use as a corpus. Obviously in reality you will want to use a somewhat larger text than this, composed of the concatenation of many texts. I have found Project Gutenberg to be a great source for text you can use to construct such a corpus without too much trouble.
The output from this program is rather user unfriendly and it certainly won’t give you the plaintext in a straightforward way. I have found it most useful for suggesting possible plaintext fragments that can then be used as a starting point for a known-plaintext search, which I described in the last post. Recall that the plaintext we suggested from last time was:
And we enciphered this horrible secret with the running key:
So let’s take a look at what our n-gram based analysis program spits out as possible plaintext fragments when it is fed the corresponding ciphertext, with a BLOCK_SIZE of 4 as in the example code:
After analysis on offset 0 'Plain': TOBEECOUNIGHEAUTDWITLLSTOVERDING 'Key': VENTSHOUARINLACETUATWHICENIGALON After analysis on offset 1 'Plain': ANDDREATSOFAIOUSCOVEENCEBEFOATIO 'Key': SBUTSYOUHAPPSIDEOURDONTOHISPTILT After analysis on offset 2 'Plain': BEENVOUROJECETTEWILLERBEIGHT 'Key': NTSWHATIALLYSEDMMEWHWEREERWA After analysis on offset 3 'Plain': ESSAATTHRISEJECTTHISEOFTEDAT 'Key': TERCOUGHDHISOSOPTAKIREDTTATI
Although this example is really too small and our corpus too sparse to give very helpful results, you can see immediately that the analyzer has got at least some things right: for instance, you can see the string EAT occurs at the appropriate position in the plaintext output for blocks offset by 1. It is worth noting that there is no guarantee that the plaintext will be output in the ‘Plain’ field and the keytext in the ‘Key’ field: they could equally be swapped around, due to the commutativity of adding characters (which is essentially addition in modular arithmetic to modulus 26) and the fact that we have assumed that the governing n-gram distributions behind both texts are the same.
Although this example might have been somewhat disappointing, I’ve had fairly good with n-gram analysis on larger ciphertexts (where alignments of common character sequences in the plain and key texts are more common) where I’ve used a more complete corpus. However, we can apply even more technology to the problem: next time I’ll try and show how we can add hidden Markov models into the mix to resolve some of the more fundamental problems with the crude maximum likelihood approach shown above.