# What is the Voynich Manuscript about?

This is the Voynich Manuscript:

It consists of a little over two hundred pages of vellum, covered in an unknown script. Most pages are richly illustrated with non-existent plants, astronomical diagrams and little bathing women. After at least a hundred years of research, the current state of knowledge of the VMs can best be summed up as follows: we don’t know. We don’t know what it is and we don’t know what it says. We know it was probably written in the 15th century, and that it’s probably from Italy, but that’s about it. It could be a coded text, it could be a dead or invented language. It could well be a hoax: somebody trying to make some money from an old stock of vellum by filling it with enticing scribbles and trying to flog it to some gullible academic.

Given that we know so little, it seems a little presumptuous to ask what it’s about. Surely we would need to translate the thing first. Well, perhaps not.

# Level statistics

In *Level statistics of words: finding keywords in literary texts and symbolic sequences* (abstract, PDF), the authors provide an interesting statistical approach to unsupervised analysis of text that might hold a lot of promise for computational attacks on the VMs.

The approach works as follows. Imagine a book where every occurrence of a particular word has been highlighted. Consider the distribution of the highlights across the pages. If the word is a common word, like “it” or “the”, we will see frequent highlights, in roughly equal numbers on each page. For less common words, the number of highlights per page will decrease, but the expected number will still be the same on every page. What we are interested in here is the words for which one page contains a mass of highlights, and two pages on, there are no highlights at all. These are, so goes the theory, the words with high aboutness. They are the words that the text (or at least the section with many highlights) is about.

Think of the book as a line, indicating the continuous string of words, where we mark each occurrence of the word of interest with a colored dot. If the words are spread out randomly, we will see the occurrences of the words spread out relatively evenly along the line. For words with high aboutness, we will see the words clustering together, as if they’re particles, attracting each other. Here is an illustration from the paper:

The test then simply boils down to computing how unlikely it is to see the given clustering pattern under the assumption that the sequence was produced randomly, or to be more specific, under the assumption that the distances between successive occurrences were sampled from a geometric distribution. The more unlikely this is, the more aboutness we assign to the word.
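To make the idea concrete, here is a minimal Python sketch of a $\sigma_\text{nor}$-style score. It is my own illustration under stated assumptions, not the authors' code: the function name `sigma_nor` and its arguments are made up for this example. It takes the gaps between successive occurrences of a word, divides their standard deviation by their mean, and then divides by $\sqrt{1-p}$, which is the value that ratio takes for a geometric distribution with success probability $p$. It does not compute $C$.

```python
import math

def sigma_nor(positions, n_tokens):
    """Spread of the gaps between occurrences, relative to a random text.

    positions: sorted token indices at which the word occurs.
    n_tokens:  total number of tokens in the text.
    """
    if len(positions) < 3:
        raise ValueError("need at least three occurrences (two gaps)")
    p = len(positions) / n_tokens  # relative frequency of the word
    if p >= 1:
        raise ValueError("the word cannot make up the whole text")

    # Distances between successive occurrences.
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    mean = sum(gaps) / len(gaps)
    variance = sum((g - mean) ** 2 for g in gaps) / len(gaps)

    # Standard deviation over mean; for geometric gaps this is sqrt(1 - p).
    sigma = math.sqrt(variance) / mean
    return sigma / math.sqrt(1 - p)

# Evenly spread occurrences versus two tight clusters (toy positions).
even = sigma_nor([0, 10, 20, 30, 40], n_tokens=100)
clustered = sigma_nor([0, 1, 2, 3, 100, 101, 102, 103], n_tokens=200)
```

Under these assumptions a value near 1 is what a randomly scattered word would give, values well above 1 indicate clustering, and values below 1 indicate occurrences that are more regular than random.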

Does it work? Here are the top fifteen words for the concatenated Alice in Wonderland and Alice through the Looking Glass (together roughly the same size as the VMs):

$C$ and $\sigma_\text{nor}$ are two slightly different notions of relevance, based on the principle described above. We notice that there are certainly reasonable “subject” words shown. The word “alice” itself, what the book is nominally about, is not shown, and indeed does not get a positive score. This is an important key for interpreting the results: what we are getting is more akin to a list of keywords than the single subject of the book, that is, topics that are discussed in specific sections of the book.

For more examples, the authors provide some results online for various texts from Project Gutenberg.

Why is this such an exciting technique for the VMs? Because all we’ve used is the large-scale structure of the book. Imagine translating The Hound of the Baskervilles to German. The words, of course, would change completely. So would the grammar, and probably the style. But the level statistics, the occurrences, at least of the salient words, would remain largely the same. The word Hund in the German version would occur on different pages, and with a different frequency than the word Hound in the English, but the basic clustering behavior would remain the same. So, even if we can’t read German, we can at least get a reasonable idea of which words have high aboutness.

Cutting to the chase, here are the top 50 Voynichese words with the highest aboutness for $C$ and for $\sigma_\text{nor}$:

For this experiment, I used the Takahashi EVA transcription. Anything between curly brackets was replaced by a space, and any sequence of multiple special characters (!, *, %) was collapsed into one.
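As a rough sketch, the two cleaning steps could be written with regular expressions as below. The function name `clean_eva_line` is hypothetical, and I am assuming that "collapsed into one" means keeping the first character of each run of special characters. The sample string is made up for illustration and is not taken from the transcription.

```python
import re

def clean_eva_line(line):
    """Apply the two cleaning steps to one line of transcription text."""
    # Replace anything between curly brackets (brackets included) by a space.
    line = re.sub(r"\{[^{}]*\}", " ", line)
    # Collapse a run of special characters (!, *, %) into its first character.
    # Keeping the first character is an assumption about the intended rule.
    line = re.sub(r"([!*%])[!*%]+", r"\1", line)
    return line

# Made-up example string, not real transcription data.
cleaned = clean_eva_line("qo{x}kedy!!*dar")
```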

What does this tell us? Well, not much, but a little. If the VMs is about anything, some parts are likely about “shedy” and about “qokeedy”. More interestingly, note that most of the top words in the Alice example were nouns. That suggests that if these words refer to anything, they probably refer to concepts of some sort. Most of them perhaps function like nouns. Perhaps. I’ll finish up with some thoughts on how we can expand on this, but first, let’s look at another experiment the authors of this method did.

# Aboutness at the character level

For their second experiment, the authors removed all whitespace from a book, and performed the analysis for all character strings, up to a particular length. I couldn’t reproduce their results with their algorithm as they describe it, but I came up with something that performs similarly. Here are the top substrings for our concatenated Alice, with all whitespace removed:

On balance, it seems like most phrases detected start and end at a word boundary, and most complete phrases are indicative of what a local part of the text is about. I’ve highlighted where phrases start and end with a string that was a word (of more than three characters) in the original text. The highlighting doesn’t work perfectly, but it gives a decent impression, which we can use to interpret the VMs results.

“umpty”, incidentally, provides a good intuition for where the method fails. The character’s name “humpty dumpty” has high aboutness (C=22.28), but every time the string “umpty” occurs, it occurs again almost immediately, just one character after the first occurrence ends (even though the string is relatively rare overall). This is such a non-random level of clustering that “umpty” gets a very high C-score.
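For readers who want to experiment, the substring enumeration step of a character-level analysis could be sketched as follows. This covers only the bookkeeping (strip whitespace, record where each substring up to a maximum length starts). It is not the authors' algorithm and not necessarily how the numbers in this post were produced. The name `substring_positions` and its parameters are hypothetical.

```python
from collections import defaultdict

def substring_positions(text, max_len=8, min_count=3):
    """Map each substring (up to max_len characters) to its start offsets.

    Whitespace is removed first, so substrings may span word boundaries.
    Substrings occurring fewer than min_count times are dropped.
    """
    stripped = "".join(text.split())
    positions = defaultdict(list)
    for start in range(len(stripped)):
        for length in range(1, max_len + 1):
            if start + length > len(stripped):
                break
            positions[stripped[start:start + length]].append(start)
    return {s: pos for s, pos in positions.items() if len(pos) >= min_count}

# The "umpty" effect in miniature: two occurrences, back to back.
found = substring_positions("humpty dumpty", max_len=5, min_count=2)
```

Each list of positions can then be scored in the same way as word positions. Note that the number of candidate substrings grows with both the text length and `max_len`, so a real run would want a tighter `min_count`.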

The character level analysis of the VMs shows some interesting differences:

There’s a lot to note here.

Firstly, the C scores are bigger than any observed in a natural language text of this size. This indicates a much more clustered non-randomness in how substrings occur within the text. It was known already that there was more structure to the VMs text than to natural language, but it’s interesting to see that this leads to more clustering. Even if the method by which the text was composed is random, this suggests a strongly non-stationary process. In short, you can tell by the local properties of the text which part of the book you’re in (better than you can with a natural language book). If it is generated text, the generator either has a very sophisticated long-term memory, or the author reconfigured the generator between chapters.

Secondly, we are getting a lot more results. In both experiments, I started with the 200 substrings with the highest C score. For any pair $(s, l)$ of these, where $s$ contained $l$, I removed $l$ if its score was more than 2.5 below that of $s$; otherwise I removed $s$. I think we can hypothesize that the VMs text is somehow less “transitive” with its substrings than natural language. For instance, if “dormouse” occurs in a clustered way throughout Alice, then we can assume that most of its substrings (ormouse, rmous, dormo) will follow exactly the same pattern. Only a rare few (mouse, use) are words meaning something different, and are likely to show a different clustering pattern. In Voynichese, it seems, a random substring of a specific word is much less likely to follow the clustering behavior of its superstring. We can see this from the top two entries: “hedyqo” has a high score, but nowhere near as high as its substring “edyqo”. In other words, “edyqo” occurs many more times, and in many different words than just in “hedyqo” alone.

Quite what the Voynich “words” mean, and what role substrings play in the text is an open question. Hopefully, I’ll come back to that in a later post.

# A promising method

Of course, it doesn’t do us much good to know that the VMs is about “shedy” if we don’t know what that means. But it does paint a promising picture of a more structured approach to cracking the VMs. Let’s start by stating our assumptions:

1. The VMs contains, in some form, possibly encoded, meaningful text in natural language (the plaintext).
2. The words of the VMs, i.e. strings separated by whitespace, map to plaintext words, phrases, or at least broad concepts.
3. Two occurrences of the same Voynichese word, more likely than not, mean the same thing.

There are certainly plausible scenarios for these assumptions to be violated, but as assumptions go in Voynich-land, these are pretty light-weight.

Given these assumptions, we know the level statistics are likely to be informative, and we can take the lists above as roughly “correct”. But that’s not where their usefulness ends. Remember that the words with the highest aboutness are likely to be nouns. Similarly, parts of speech like verbs and articles are likely to have very uniform distributions. However noisy they are, the level statistics and similar features allow us to make an informed guess about the parts of speech of different Voynich words.

Imagine the following experiment:

• Take a book in language A, and tag the words with a set of simplified POS tags, like {noun, verb, numeral, other}.
• For each word, collect a series of measurements in the vein of level statistics: features that are largely invariant to cross-language translation.
• Train a simple (probabilistic) classifier to classify words by their POS tags.
• Take a book in language B, collect the features for each word and use the classifier to tag the words.
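The steps above could be sketched roughly as follows. This is a toy illustration under my own assumptions, not a tested pipeline: the feature set (log relative frequency plus a $\sigma_\text{nor}$-style gap spread), the helper `word_features`, and the hand-rolled `TinyGaussianNB` classifier are all hypothetical choices made for this example.

```python
import math
from collections import defaultdict

def word_features(tokens, min_count=3):
    """Map each word to (log relative frequency, normalized gap spread)."""
    positions = defaultdict(list)
    for i, tok in enumerate(tokens):
        positions[tok].append(i)
    feats = {}
    for word, pos in positions.items():
        if len(pos) < min_count or len(pos) == len(tokens):
            continue  # too rare to measure, or the only word in the text
        gaps = [b - a for a, b in zip(pos, pos[1:])]
        mean = sum(gaps) / len(gaps)
        sd = math.sqrt(sum((g - mean) ** 2 for g in gaps) / len(gaps))
        p = len(pos) / len(tokens)
        feats[word] = (math.log(p), (sd / mean) / math.sqrt(1 - p))
    return feats

class TinyGaussianNB:
    """Minimal Gaussian naive Bayes, written out to stay dependency-free."""

    def fit(self, X, y):
        by_class = defaultdict(list)
        for x, label in zip(X, y):
            by_class[label].append(x)
        self.params = {}
        for label, rows in by_class.items():
            cols = list(zip(*rows))
            means = [sum(c) / len(c) for c in cols]
            # Small constant keeps the variance strictly positive.
            variances = [sum((v - m) ** 2 for v in c) / len(c) + 1e-6
                         for c, m in zip(cols, means)]
            self.params[label] = (math.log(len(rows) / len(X)), means, variances)
        return self

    def predict_proba(self, x):
        log_post = {}
        for label, (log_prior, means, variances) in self.params.items():
            ll = log_prior
            for v, m, s2 in zip(x, means, variances):
                ll += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
            log_post[label] = ll
        top = max(log_post.values())
        z = sum(math.exp(v - top) for v in log_post.values())
        return {label: math.exp(v - top) / z for label, v in log_post.items()}

# Toy feature vectors standing in for tagged words from language A.
train_X = [(0.0, 0.0), (0.1, 0.1), (5.0, 5.0), (5.1, 4.9)]
train_y = ["verb", "verb", "noun", "noun"]
model = TinyGaussianNB().fit(train_X, train_y)
probs = model.predict_proba((5.0, 5.0))
```

In a real run, `train_X` would come from `word_features` applied to the tagged book in language A, and `predict_proba` would be applied to the features of each word in language B. Sorting the words by their highest class probability gives the list of words the classifier is most certain about.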

Even if we filter out everything but the 100 words the classifier is most certain about, it would still provide great insight into the grammatical structure of the VMs. Considering how many cyphers were cracked just by identifying a single numeral in the cyphertext, a provisional POS-tagger seems like a great luxury.