| Group | 1,2,3 | 4,5,6 | 7,8 |
| key length | 3 | 5 | 8 |
Is the percentage that survived roughly 1 / key length?
Those that came from repeated trigrams in the plain text
Those that are present in the cipher, but not in the plain text (so they came from the polyalphabetic nature of the Vigenere encryption).
We will see a way to exploit the information above to help decide the key length of a Vigenere-encrypted cipher later. Now, we'll explore another new notion that plays a key role in ciphertext-only Vigenere attacks: the notion of index of coincidence.

Given a source of text (like plain old English, or affine-encrypted English, etc.), the index of coincidence is defined to be the probability that, when you reach into the source text and select two characters at random, those two characters are the same. If a source consists of random characters, where each character, a to z, appears about 1/26 of the time, then the index of coincidence (IC) is 1 / 26 = 0.03846 or so (the first character can be anything, and the second is the same character 1 time in 26). On the other hand, everyday English is known to have an IC of about 0.065. That difference is large enough to exploit.
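To make the definition concrete, here is a minimal Python sketch. The function names are hypothetical, and two choices are assumptions not stated in the notes: for an idealized source the IC is computed as the sum of the squared character probabilities, and for a finite sample the two characters are drawn from two different positions (without replacement), with only the letters a to z counted.

```python
from collections import Counter


def ic_from_probabilities(probs):
    """IC of an idealized source: sum of squared character probabilities."""
    return sum(p * p for p in probs)


def ic_from_text(text):
    """IC of a finite sample: chance that two distinct positions hold the same letter."""
    # Assumption: ignore case and anything that is not a letter a-z.
    letters = [c for c in text.lower() if "a" <= c <= "z"]
    n = len(letters)
    if n < 2:
        raise ValueError("need at least two letters")
    counts = Counter(letters)
    # Ordered pairs of distinct positions with equal letters, over all ordered pairs.
    return sum(k * (k - 1) for k in counts.values()) / (n * (n - 1))


# A source where each of the 26 letters is equally likely.
uniform = ic_from_probabilities([1 / 26] * 26)
print(round(uniform, 5))  # 0.03846
```

Calling `ic_from_text` on a long stretch of English should give a value noticeably above the uniform figure, though the exact number depends on the sample.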
Here is a quick and dirty way to estimate the IC for a source of text (assuming, that is, that you have a fair bit of text). Take two independent Strings of text from the source, say s1 and s2; independent means that the characters in s1 have no influence on the characters in s2. For English, independent means that the Strings are located "not too close together" in the source text, say at least 10 characters apart. We've seen that certain digrams and trigrams appear much more often than others, so if you know one character of English, you can tell a fair bit about the two or three characters that follow it, but you'd be hard pressed to know much about which character will appear 10 characters later (other than that it has about a 0.13 chance of being an "e", etc.).

The two strings should have the same length, say N characters. Line them up one above the other, and count the number of positions, say x, where the two characters above one another are the same. The IC should be about x / N. Here are the first several characters from a few sentences back:
WewillseeawaytoexploittheinformationabovetohelpdecidethekeylengthofaVigenereencrypted
cipherlaterNowwellexploreanothernewnotionthatplaysakeyroleinciphertextonlyVigenereatt
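The line-up-and-count procedure can be sketched in Python as follows. The function name `count_matches` is hypothetical, and ignoring upper versus lower case is an assumption; the two strings are the 85-character samples shown above.

```python
def count_matches(s1, s2):
    """Count positions where two equal-length strings hold the same letter."""
    if len(s1) != len(s2):
        raise ValueError("strings must have the same length")
    # Assumption: compare letters without regard to case.
    return sum(a == b for a, b in zip(s1.lower(), s2.lower()))


s1 = "WewillseeawaytoexploittheinformationabovetohelpdecidethekeylengthofaVigenereencrypted"
s2 = "cipherlaterNowwellexploreanothernewnotionthatplaysakeyroleinciphertextonlyVigenereatt"

x = count_matches(s1, s2)
n = len(s1)
print(x, n, round(x / n, 5))  # 7 85 0.08235
```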
We got 7 matches in 85 characters, giving an estimate of 7 / 85 = 0.08235 (which is high, but noticeably different from 0.038). In theory, I didn't need to take different sentences to estimate this IC; I could just as well have taken the top string, deleted the first 10 or 15 characters, and lined it up with "itself". Let's see how ICs vary with different sources.
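The "line it up with itself" variant can be sketched the same way. This is only an illustration under stated assumptions: the function name `shifted_ic_estimate` and its `shift` parameter are hypothetical, non-letters are dropped, and case is ignored.

```python
def shifted_ic_estimate(text, shift):
    """Estimate the IC by comparing a text with a copy of itself moved over by `shift`."""
    # Assumption: keep letters only and ignore case.
    letters = [c for c in text.lower() if c.isalpha()]
    if not 0 < shift < len(letters):
        raise ValueError("shift must be between 1 and the number of letters minus 1")
    top = letters[:-shift]    # the original string, trimmed to the overlap
    bottom = letters[shift:]  # the same string with its first `shift` letters deleted
    matches = sum(a == b for a, b in zip(top, bottom))
    return matches / len(top)


# Toy check on a string that repeats every 3 letters.
print(shifted_ic_estimate("abcabcabcabc", 3))  # 1.0
print(shifted_ic_estimate("abcabcabcabc", 1))  # 0.0
```

The toy string is deliberately artificial; on real English text with a shift of 10 or 15, the estimate would be expected to land near the English IC, with some scatter for short samples.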
Use the IC applet to get these estimates.
| n | 1 | 2 | 5 | 10 | 20 |
| IC | | | | | |
| k = | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| IC | | | | | | | | | | | | | |
| k = | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
| IC | | | | | | | | | | | | | |
| k = | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| IC | | | | | | | | | | | | |
| k = | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
| IC | | | | | | | | | | | | | |