# Breaking The Running Key Cipher: Part 2

This is the second post in a series of three covering the running key cipher. You can find the first part here.

To recap, in the last part we covered the definition of the running key cipher and looked at a simple mechanism that you might use to try and break it: a known plaintext attack. This time we'll cover the next most advanced method you can use to try and effect a break of a piece of running key ciphertext: n-gram analysis.

For those readers unfamiliar with the fields of cryptanalysis and information retrieval, we'll briefly cover what an n-gram is. Essentially it is just a list of n symbols, where the symbols can be letters, words, or whatever else you are interested in analyzing. In typical usage, however, each such list also carries a probability that somehow reflects the likelihood of observing that list of symbols.

In our case we are interested in character-based n-grams, simply because the running key cipher is character based, and so example bigrams (n = 2) would include "ot", "ke" and "re". Likewise, we can consider trigrams (n = 3) such as "the", "foo" and "ter", and even higher order n-grams such as the pentagrams (n = 5) "erenc" and "licks" or hexagrams (n = 6) like "okingh". Of course, to be useful, each of these strings must have its probability attached. As we are interested in analyzing typical English text, these probabilities will typically be estimated from the frequency of n-gram occurrence in some publicly available corpus such as newspaper archives or the internet itself.

Before we look at how we can use n-grams for cryptanalysis, let's take a look at some code to create them. Let's begin with a class to represent the n-gram itself:

```
class NGram
{
    public string Text;
    public double Probability;

    public NGram(string text, double probability)
    {
        this.Text = text;
        this.Probability = probability;
    }
}
```

So far so simple. We will also need a class to analyze and contain a collection of n-grams:

```
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

class NGramFrequencies
{
    private Dictionary<string, int> occurrences;
    private int totalOccurrences;

    private NGramFrequencies(Dictionary<string, int> occurrences, int totalOccurrences)
    {
        this.occurrences = occurrences;
        this.totalOccurrences = totalOccurrences;
    }

    public double ProbabilityOf(string nGram)
    {
        int nGramOccurrence;
        if (this.occurrences.TryGetValue(nGram, out nGramOccurrence))
        { return this.ProbabilityOf(nGramOccurrence); }
        else
        { return 0.0; }
    }

    private double ProbabilityOf(int occurrence)
    { return (double)occurrence / this.totalOccurrences; }

    public IEnumerable<NGram> NGrams
    {
        get
        {
            foreach (KeyValuePair<string, int> pair in this.occurrences)
            { yield return new NGram(pair.Key, ProbabilityOf(pair.Value)); }
        }
    }

    public static NGramFrequencies Analyse(StreamReader reader, int length)
    {
        Dictionary<string, int> occurrences = new Dictionary<string, int>();
        int totalOccurrences = 0;
        LinkedList<char> accumulator = new LinkedList<char>();

        while (!reader.EndOfStream)
        {
            char nextCharacter;
            // Skip anything that isn't a letter, normalizing to upper case
            if (TryNormalizeCharacter((char)reader.Read(), out nextCharacter))
            {
                accumulator.AddLast(nextCharacter);
                if (accumulator.Count == length)
                {
                    StringBuilder nGram = new StringBuilder(length);
                    foreach (char value in accumulator)
                    { nGram.Append(value); }
                    string nGramString = nGram.ToString();

                    int currentOccurrences;
                    if (occurrences.TryGetValue(nGramString, out currentOccurrences))
                    { occurrences[nGramString] = currentOccurrences + 1; }
                    else
                    { occurrences[nGramString] = 1; }

                    totalOccurrences++;
                    accumulator.RemoveFirst();
                }
            }
        }

        return new NGramFrequencies(occurrences, totalOccurrences);
    }

    private static bool TryNormalizeCharacter(char input, out char output)
    {
        if (Char.IsLetter(input))
        {
            output = Char.ToUpper(input);
            return true;
        }
        else
        {
            output = '?';
            return false;
        }
    }
}
```

That's a fair chunk of code! However, most of it is pretty simple. The real core is the Analyse method, which just counts the frequency of n-gram occurrence, ready to be turned into a probability when the n-grams are enumerated later. It takes as arguments a StreamReader containing the corpus we wish to analyze, and the length of the n-grams we are interested in.

Now, before we continue I feel that I should point out the most glaring flaw in this implementation: it deals very poorly with data sparsity. Any n-gram which does not occur in the corpus will be assigned a probability of zero. This is probably undesirable since almost every string is at least possible in practice, if not necessarily particularly likely! The process of estimating probabilities for unseen n-grams is called smoothing, and the methods you can use to do it vary from the simple (initialize all n-gram counts to 1 instead of 0) to the complicated (Good-Turing smoothing). However, I have to say that in practice I've not found a huge difference in outcomes when I use the simple scheme versus any more complicated one.
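To make the simple scheme concrete, here is a minimal standalone sketch of add-one (Laplace) smoothing. It is not part of the classes above: the class and method names, the alphabet-size parameter and the toy counts are all mine, purely for illustration.

```csharp
using System;
using System.Collections.Generic;

class SmoothingSketch
{
    // Estimate P(nGram) with every count incremented by one, so unseen
    // n-grams receive a small non-zero probability instead of zero.
    // possibleNGrams is the number of distinct n-grams of this length
    // (e.g. 26 * 26 = 676 for bigrams over A-Z).
    public static double SmoothedProbabilityOf(
        Dictionary<string, int> occurrences, int totalOccurrences,
        int possibleNGrams, string nGram)
    {
        int count;
        occurrences.TryGetValue(nGram, out count); // count is 0 if unseen
        return (double)(count + 1) / (totalOccurrences + possibleNGrams);
    }

    static void Main()
    {
        // Toy bigram counts, invented for the example
        var counts = new Dictionary<string, int> { { "TH", 30 }, { "HE", 20 } };
        Console.WriteLine(SmoothedProbabilityOf(counts, 50, 26 * 26, "TH"));
        Console.WriteLine(SmoothedProbabilityOf(counts, 50, 26 * 26, "QZ")); // unseen, yet non-zero
    }
}
```

Note that the denominator grows by the number of possible n-grams, so the smoothed probabilities still sum to one over all n-grams of that length.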

Despite this flaw we will plunge forward with our use of n-grams. The fundamental algorithm we will be using for the cryptanalysis is based on a so-called maximum likelihood criterion. What this means is that for every chunk of ciphertext we will try to decipher it in a brute-force way with every possible keytext, and we will then use our n-grams to estimate the likelihood of this deciphering by taking the product of the probabilities of the resulting plaintext block and the guessed key fragment (which we are assuming to be normal English text). The deciphering which has the highest such product is considered to be the most likely one and is shown to the user. A simple implementation of this algorithm is shown below:

```
using System;
using System.IO;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        string cipherText = "OSOXWJCONZOUPAWXWQIMHSAVSIMXDTBTHFOB";
        NGramFrequencies[] nGramFrequencies = AnalyseCorpus();

        const int BLOCK_SIZE = 4;
        NGramFrequencies frequencies = nGramFrequencies[BLOCK_SIZE - 1];

        for (int offset = 0; offset < BLOCK_SIZE; offset++)
        {
            StringBuilder plainTextOutput = new StringBuilder();
            StringBuilder keyTextOutput = new StringBuilder();
            for (int a = offset; a < cipherText.Length - (BLOCK_SIZE + offset); a += BLOCK_SIZE)
            {
                string cipherTextBlock = cipherText.Substring(a, BLOCK_SIZE);

                // Find the deciphering maximizing P(plaintext block) * P(key fragment)
                double maximumProbability = 0.0;
                string bestPlainText = null;
                string bestKeyText = null;
                foreach (NGram nGram in frequencies.NGrams)
                {
                    string plainText = Decipher(cipherTextBlock, nGram.Text);
                    double probability = frequencies.ProbabilityOf(plainText) * nGram.Probability;

                    if (probability >= maximumProbability)
                    {
                        maximumProbability = probability;
                        bestPlainText = plainText;
                        bestKeyText = nGram.Text;
                    }
                }

                plainTextOutput.Append(bestPlainText);
                keyTextOutput.Append(bestKeyText);
            }

            Console.WriteLine("After analysis on offset {0}", offset);
            Console.WriteLine("'Plain': {0}", plainTextOutput);
            Console.WriteLine("'Key': {0}", keyTextOutput);
        }
    }

    static string Decipher(string cipherText, string keyText)
    {
        StringBuilder output = new StringBuilder(cipherText.Length);
        for (int a = 0; a < cipherText.Length; a++)
        { output.Append((char)('A' + ReduceMod(cipherText[a] - keyText[a]))); }

        return output.ToString();
    }

    static int ReduceMod(int value)
    {
        if (value < 0)
        { return value + 26; }
        else
        { return value % 26; }
    }

    static NGramFrequencies[] AnalyseCorpus()
    {
        // The corpus file name is illustrative: any large plain-text file will do
        using (StreamReader corpusStream = new StreamReader("Frankenstein.txt"))
        {
            NGramFrequencies[] nGramFrequencies = new NGramFrequencies[6];
            for (int a = 1; a <= nGramFrequencies.Length; a++)
            {
                nGramFrequencies[a - 1] = NGramFrequencies.Analyse(corpusStream, a);

                // Rewind the corpus so it can be re-read for the next n-gram length
                corpusStream.BaseStream.Seek(0, SeekOrigin.Begin);
                corpusStream.DiscardBufferedData();
            }

            return nGramFrequencies;
        }
    }
}
```

Notice that this program assumes that you have a copy of Frankenstein on your drive to use as a corpus. Obviously in reality you will want to use a somewhat larger text than this, composed of the concatenation of many texts. I have found Project Gutenberg to be a great source for text you can use to construct such a corpus without too much trouble.

The output from this program is rather user-unfriendly and it certainly won't give you the plaintext in a straightforward way. I have found it most useful for suggesting possible plaintext fragments that can then be used as a starting point for a known-plaintext search, which I described in the last post. Recall that the example plaintext from last time was:

`IENJOYEATINGWHILESUSPENDEDUPSIDEDOWN`

And we enciphered this horrible secret with the running key:

`GOBOILYOURBOTTOMSYOUSONSOFSILLYPERSONS`

So let's take a look at what our n-gram based analysis program spits out as possible plaintext fragments when it is fed the corresponding ciphertext, with a BLOCK_SIZE of 4 as in the example code:

```After analysis on offset 0
'Plain': TOBEECOUNIGHEAUTDWITLLSTOVERDING
'Key': VENTSHOUARINLACETUATWHICENIGALON
After analysis on offset 1
'Plain': ANDDREATSOFAIOUSCOVEENCEBEFOATIO
'Key': SBUTSYOUHAPPSIDEOURDONTOHISPTILT
After analysis on offset 2
'Plain': BEENVOUROJECETTEWILLERBEIGHT
'Key': NTSWHATIALLYSEDMMEWHWEREERWA
After analysis on offset 3
'Plain': ESSAATTHRISEJECTTHISEOFTEDAT
'Key': TERCOUGHDHISOSOPTAKIREDTTATI```

Although this example is really too small and our corpus too sparse to give very helpful results, you can see immediately that the analyzer has got at least some things right: for instance, the string EAT occurs at the appropriate position in the plaintext output for blocks offset by 1. It is worth noting that there is no guarantee that the plaintext will be output in the 'Plain' field and the keytext in the 'Key' field: they could equally well be swapped around, due to the commutativity of adding characters (which is essentially addition modulo 26) and the fact that we have assumed that the same n-gram distribution governs both texts.

Although this example might have been somewhat disappointing, I've had fairly good results with n-gram analysis on larger ciphertexts (where alignments of common character sequences in the plain and key texts are more common) where I've used a more complete corpus. However, we can apply even more technology to the problem: next time I'll try to show how we can add hidden Markov models into the mix to resolve some of the more fundamental problems with the crude maximum likelihood approach shown above.

# In Defence Of Clipperz

My last post, which detailed my switch to Mac, got a lot of attention from an unexpected audience: designers of password manager programs! Not only did I get a welcoming comment from the co-author of the wonderful 1Password, but I also received an email from Marco Barulli of Clipperz, an online password manager which I mentioned only briefly. Marco quite rightly wanted to point out that some of the things I said about his product could potentially be misleading, and so in the interests of accuracy I'm reproducing the meat of the exchange (with his permission!) here in public:

I would like to thank you for the kind mention of Clipperz.

Also, with regard to the lack of integration with other password managers, I would like to point out the following features:

You can easily move to Clipperz all your passwords and access them even if you are offline.

(Ideally you shouldn't need any longer a software-based password manager)

I hope you will give Clipperz a chance!

As you can see, Marco was extremely courteous, despite my negative attitude towards something he has doubtless put a lot of hard work into! I had based my statement of "miniscule password limit" purely on my experience that it tended to slow to a crawl when more than around 100 entries were present in it (I have at least twice this many passwords!): this is due to the fact that the application has to do some pretty intensive encryption and decryption in Javascript, of all things! However, it seems that Clipperz has been working to improve this and that my experience may no longer be valid. Marco says:

I myself have more than 100 cards and both Firefox and Safari are doing a decent job encrypting and decrypting on my old G4 PowerBook. I'm sure that your new MacPro (what a beautiful beast!) will crunch your cards with ease.

Finally, he mentioned some of the goodies Clipperz users can look forward to in the future:

New features under development:

• iPhone version
• Tags & search
• More intuitive interface
• Sharing

It looks like exciting times are ahead! However, I think I shall hold out for My1Password as a web based solution to my password management needs simply due to the convenience of integration between my desktop and web password databases. Whatever Marco may say, I still believe you can get some valuable ease of use gains by having a piece of desktop software that can hook into all the disparate applications that you need to use passwords with on a daily basis.

However, if you don't share my point of view, or all the things you need passwords for are web based, I really can't do any better than to recommend Clipperz! It is very functional and polished, and is a remarkable technical achievement, both in all that it manages to accomplish simply with Javascript and in its novel zero-knowledge architecture, which means that only you can get to your passwords! Finally, I should have done this earlier, but I've now made a donation to the Clipperz team for how helpful they were in solving my cross-platform password problems in the past: this is just the kind of innovative software development that deserves our support!

# Breaking The Running Key Cipher: Part 1

I love challenges of all kinds, both algorithmic and cryptographic. There is just something inherently compelling about pitting your wits alone against an intellectual challenge, free to apply any and all means to beat it. I recently took part in a crypto challenge where we were meant to break a text encoded with the running key cipher, which I hadn't encountered before, and I came up with a rather satisfying solution that I thought might be good to share here.

For those of you who aren't familiar with the running key cipher, it's very simple. Say you have some text you would love to be kept secret:

`IENJOYEATINGWHILESUSPENDEDUPSIDEDOWN`

But still feel the need to communicate it to someone, with whom you have agreed some (usually well known) key beforehand, say:

`GOBOILYOURBOTTOMSYOUSONSOFSILLYPERSONS`

We just "add" the two texts together, with the letters being used as numbers in the obvious way: A = 0, B = 1, ..., Z = 25. This means that e.g. C + B = 2 + 1 = 3 = D. Furthermore, we do this modulo 26, so that e.g. Z + B = A. Additionally, since in our case our key is a bit longer than the secret message, we will just not use the last couple of letters of the key. This means that our encoded message is:

`OSOXWJCONZOUPAWXWQIMHSAVSIMXDTBTHFOB`
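If you want to play along at home, the "addition" described above can be sketched as a tiny standalone helper (the class and method names here are just for illustration; they are not part of the analysis code shown later in this series):

```csharp
using System;
using System.Text;

class EncipherSketch
{
    // Add key letters to plaintext letters modulo 26, with A = 0 ... Z = 25.
    // Assumes both inputs are upper-case A-Z and the key is at least as
    // long as the plaintext (extra key letters are simply unused).
    public static string Encipher(string plainText, string keyText)
    {
        StringBuilder output = new StringBuilder(plainText.Length);
        for (int a = 0; a < plainText.Length; a++)
        { output.Append((char)('A' + ((plainText[a] - 'A') + (keyText[a] - 'A')) % 26)); }
        return output.ToString();
    }

    static void Main()
    {
        Console.WriteLine(Encipher(
            "IENJOYEATINGWHILESUSPENDEDUPSIDEDOWN",
            "GOBOILYOURBOTTOMSYOUSONSOFSILLYPERSONS"));
        // Prints OSOXWJCONZOUPAWXWQIMHSAVSIMXDTBTHFOB
    }
}
```

Deciphering is just the same loop with a subtraction (reduced into the range 0-25) instead of an addition.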

The receiver can just subtract their known key to retrieve the original text. Wow, great! Since we never reuse the key at any point in that message, and assuming we never use the key again, it must constitute a one time pad! This means that we must have perfect security, right? Ah, grasshopper, if only it were so simple! Actually, this is not very secure at all, because both our key and our message exhibit distinctly non-random distributions of letters. A simple example of this can be seen if you imagine that we subtract just the letter E (the most commonly used in English) from every letter in the intercepted ciphertext. We would expect considerably more than 1/26 of the resulting letters to come from either the key or the plaintext. If we do this with our example, we obtain:

```IENJOYEATINGWHILESUSPENDEDUPSIDEDOWN
GOBOILYOURBOTTOMSYOUSONSOFSILLYPERSONS
KOKTSFYKJVKQLWSTSMEIDOWROEITZPXPDBKX
K    K         K    K  K      KP```

I've labelled each resulting letter with a K or a P depending on whether it came from the key text or the plain text. As you can see, most of the labelled letters come from the key: a plaintext letter is only revealed where the key letter is E, and the key contains very few E characters.

This is all very well, but it's clearly a fairly crude attack. If we want to have a chance of actually breaking the cipher we need some more heavyweight methods. As far as I know, there are three major techniques to choose from:

• Known plaintext attacks
• Crude N-gram analysis
• Hidden Markov model N-gram analysis

The last one is what I actually used to break the challenge, and the one that I think is just plain coolest, but we need to build up to it as it's pretty scary :-). So, let's take a look at the other techniques first:

Known Plaintext Attack
This is definitely the easiest attack to employ if you know something about the message being sent. For instance, if you, as the evil attacker, happen to know that the sender is a passionate gourmand, you might guess that EATING will appear somewhere in the key or plaintext. Knowing this, you can exhaustively subtract EATING from the ciphertext at every offset. If we do this for our example text, we obtain:

```
Text offset  Text deciphered with key
0            KSVPJD
1            OOEOWW
2            KXDBPI
3            TWQUBH
4            SJJGAT
5            FCVFMI
6            YOURBO
7            KNGGHJ
8            JZVMCU
9            VOBHNQ
10           KUWSJR
11           QPHOKQ
12           LADPJK
13           WWEODC
14           SXDIVG
15           TWXAZB
16           SQPEUM
17           MITZFU
18           EMOKNP
19           IHZSIM
20           DSHNFC
21           OACKVG
22           WVZAZR
23           RSPEKX
24           OITPQN
25           EMEVGV
26           IXKLON
27           TDATGB
28           ZTILUZ
29           PBAZSI
30           XTOXBV
```
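The exhaustive subtraction that produced the table above can be sketched as a small standalone program (the class and method names are mine, purely for illustration; the arithmetic is the same modulo-26 letter subtraction described earlier):

```csharp
using System;
using System.Text;

class CribDragSketch
{
    // Subtract the crib from the ciphertext at the given offset, modulo 26,
    // assuming all characters are upper-case A-Z.
    public static string Decipher(string cipherText, int offset, string crib)
    {
        StringBuilder output = new StringBuilder(crib.Length);
        for (int a = 0; a < crib.Length; a++)
        {
            int value = (cipherText[offset + a] - crib[a]) % 26;
            if (value < 0) { value += 26; }
            output.Append((char)('A' + value));
        }
        return output.ToString();
    }

    static void Main()
    {
        string cipherText = "OSOXWJCONZOUPAWXWQIMHSAVSIMXDTBTHFOB";
        string crib = "EATING";

        // Try the crib at every offset where it fits
        for (int offset = 0; offset + crib.Length <= cipherText.Length; offset++)
        { Console.WriteLine("{0} {1}", offset, Decipher(cipherText, offset, crib)); }
        // Offset 6 yields YOURBO, a readable fragment of the key text
    }
}
```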

Most of those look like gibberish, but one stands out as eminently readable English text: YOURBO. This is, of course, part of the key text. The diligent cryptanalyst will then typically be able to use further guesses (perhaps inspired by the discovered piece of text) to reveal more of the message and key. For instance, he might try every word beginning with BO to try and extend the known plaintext rightwards (though this is a rather difficult proposition, since there are at least 368 possibilities according to my dictionary). A small aside here about my methodology: I found that the best workflow combines a suitable dictionary (I use 5desk) with grep in a Cygwin Bash shell. You can then make queries such as the one below, which counts the number of words in the dictionary beginning with "bo".

```
$ grep '^bo' 5desk.txt | wc -l
368
```

Note that it is not strictly possible to tell whether the English words you recover are part of the key or the plain text. This is because of the symmetry of the modular addition we used to perform the ciphering step. However, it is usually possible to work it out from your knowledge of the message's probable contents. Once you know which text stream is part of the key, a Google search will usually allow you to recover the entire thing from your fragmentary knowledge, assuming that the key is a public text (as is typical: Bible extracts are popular). This will let you break the code!

However, what if you don't know anything about the message, or, like me, you're just a bit rubbish at making good guesses as to how to extend your known fragment of text? We probably need to employ something a little more sophisticated, like one of the N-gram analyses I mentioned earlier. However, as this blog entry is already getting overlong I shall leave exposition of these rather more clever techniques for a future entry: watch this space, especially those of you whom I have just bored senseless by explaining the absolute basics.