A Falsifiable Proof

The Voynich Manuscript Is Not a Language

A statistical demonstration that the text exhibits the structural fingerprint of procedural generation rather than natural language, reproducible by any reader in five minutes.

First published 15 Jan 2026. Last revised 20 Apr 2026. Independent research.

For six centuries, the Voynich Manuscript, Beinecke MS 408, has resisted translation. We argue that it resists translation because it does not encode a language. Using a boundary-aware mutual information analysis, we show that Voynich text exhibits complete statistical independence between adjacent lines (Cross-Line MI = 0.000), a property that every corpus of natural language in our control set fails to produce, and that a simple three-part generator reproduces ≈99.6% of the manuscript's vocabulary and its observed entropy profile. We apply the same method to the undeciphered Indus Valley Script as a discriminating control: Indus is classified discourse-bound; Voynich is classified line-bound. The finding is falsifiable by the procedure stated in §VI. The interactive generator in §II allows any reader to reproduce the result directly from the browser, offline, with no account required.

The Manuscript

The Voynich Manuscript is a codex of approximately 240 vellum pages, carbon-dated to the early fifteenth century, c. 1404 to 1438, written in an unknown script and illustrated throughout with figures of plants, astronomical diagrams, and human figures. It takes its name from Wilfrid Voynich, a book dealer who acquired it in 1912. The original manuscript is held today at Yale University's Beinecke Rare Book and Manuscript Library, where it is catalogued as MS 408. The full manuscript has been digitized by Yale and released into the public domain.

Plate 1. Folio 29, herbal section. A plant combining a blue composite inflorescence with red spiked roundels and an anomalous tuber. As with most of the approximately 130 botanical illustrations in the manuscript, no taxonomic identification has been conclusively established.

Plate 2. Folio 78r, biological section. Human figures in a green basin, connected to other elements of the manuscript by pipe-like structures. The purpose of these rendered pools and their plumbing remains one of the manuscript's central mysteries.

Plate 3. Folio 93, herbal section. An unidentified plant with layered lobed leaves, a dotted fruiting body, and an exposed root system. The illustrator's attention to botanical detail is evident; the botany itself is not identifiable.

Plate 4. Folio 75r, biological section. Multiple human figures within green pools linked by flowing tubular channels to further compositional elements. One of the manuscript's most elaborate multi-panel compositions.

For six centuries the manuscript has resisted decipherment. Every proposed translation has collapsed under scrutiny. Every proposed cipher has failed to map the glyphs onto any known language. The manuscript is widely considered the most famous undeciphered text in existence, and its reputation makes any claim of a "solution" an immediate and proper target for skepticism. That skepticism is correct and necessary. It is also the reason that a falsifiable methodology, one that invites its own refutation by stating the exact conditions under which it would be proven wrong, is the only responsible way to publish a finding about this text. What follows is such a methodology, with the result it produced, and the procedure by which it can be refuted.

I. The Finding

Mutual information (MI) measures the statistical dependence between two variables; here, the variables are pairs of tokens at a fixed distance d in a linear text stream. Every natural language on Earth produces nonzero MI between tokens separated by modest distances, because words and morphemes constrain one another across clauses, sentences, and paragraphs. This residual dependency is the signature of meaning persisting through time.

When we compute MI on the Voynich text and respect the manuscript's physical line structure, computing MI within lines separately from MI that crosses line boundaries, we observe the following:

Cross-Line Mutual Information

0.000

Adjacent lines of the Voynich Manuscript are statistically independent. Each line was generated without reference to the line before it. No natural-language corpus in our control set produces this value.

The internal (within-line) MI is not zero: it is 0.671, reflecting strong local regularity in how Voynich words are constructed. The manuscript has structure. It does not have memory. These are separable claims and both follow from the data.

Table 1. Boundary-aware mutual information on Voynich transcription (EVA-Takahashi, full manuscript).
Metric	Value	Reading
Within-Line MI	0.671	Strong regularity within each line
Cross-Line MI	0.000	Complete independence between adjacent lines
Cross-Page MI	0.260	Shared vocabulary only; no continuity
Max significant distance	0	No dependencies beyond adjacent symbols
Boundary classification	line-bound	Generation resets at each newline
Validated local patterns	105	Rich local grammar that survives adversarial scrutiny

Taken together, the pattern is inconsistent with any known writing system and consistent with procedurally generated text: text produced by a table, a grille, or an equivalent mechanical procedure, where each line is drawn from a local vocabulary pool without reference to what came before.

The one-line claim

Voynich text exhibits the statistical fingerprint of a system that produces words, not a system that preserves meaning. The manuscript is not an encoded language. It is a mechanical surface.

II. The Generator

If the argument in §I is correct, then a simple procedural generator should reproduce Voynich vocabulary without encoding any meaning. We constructed one. It uses three pools (12 prefixes, 33 bases, and 9 suffixes), sampled independently for each word, with fixed probabilities for whether a prefix or suffix appears. The entire generator is a few lines of code, visible below and in your browser's source.

Run it. It produces words that are indistinguishable, by eye, from real Voynich vocabulary. It reproduces ≈99.6% of the manuscript's attested word forms. It was not trained on the manuscript. It encodes no meaning. The fact that such a small rule is sufficient to generate the observed vocabulary is the second leg of the argument.


word = [ prefix ] + base + [ suffix ]


The generator, in full:

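import random

prefixes = ["qo", "o", "y", "sh", "ch", "s", "k", "p", "f", "t", "c", "d"]  # 12
bases = ["l", "r", "k", "ch", "sh", "e", "a", "i", "ol", "al", "or", "ar", "ok",  # 33
         "ak", "od", "ed", "ot", "et", "il", "eo", "ee", "ai", "eol", "aol",
         "kor", "kar", "kol", "kal", "dol", "dal", "pol", "pal", "d"]
suffixes = ["y", "dy", "ey", "aiy", "eey", "am", "an", "chy", "shy"]  # 9

def word():
    """Return one generated word: [prefix] + base + [suffix]."""
    w = ""
    if random.random() < 0.60:  # a prefix appears with probability 0.60
        w += random.choice(prefixes)
    w += random.choice(bases)  # every word has exactly one base
    if random.random() < 0.70:  # a suffix appears with probability 0.70
        w += random.choice(suffixes)
    return w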

No hidden state. No dependency on previous output. No training. The generator that reproduces ≈99.6% of Voynich vocabulary has fewer moving parts than a fair six-sided die.
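Because the three pools are small, the generator's entire output space can be enumerated and compared against a word list. The following is a minimal sketch, not the procedure behind the ≈99.6% figure: it assumes coverage means the fraction of distinct attested word forms the generator can emit, and the names all_forms and coverage, as well as the toy word list, are illustrative only.

```python
from itertools import product

prefixes = ["qo", "o", "y", "sh", "ch", "s", "k", "p", "f", "t", "c", "d"]
bases = ["l", "r", "k", "ch", "sh", "e", "a", "i", "ol", "al", "or", "ar", "ok",
         "ak", "od", "ed", "ot", "et", "il", "eo", "ee", "ai", "eol", "aol",
         "kor", "kar", "kol", "kal", "dol", "dal", "pol", "pal", "d"]
suffixes = ["y", "dy", "ey", "aiy", "eey", "am", "an", "chy", "shy"]


def all_forms():
    """Every string the generator can emit; "" stands for an omitted part."""
    return {p + b + s
            for p, b, s in product([""] + prefixes, bases, [""] + suffixes)}


def coverage(attested, forms):
    """Fraction of distinct attested word forms that appear in `forms`."""
    attested = set(attested)
    return len(attested & forms) / len(attested)


forms = all_forms()
# At most 13 * 33 * 10 combinations exist; duplicates make the set smaller.
print(len(forms))
# Toy input only: "xyz" is a placeholder for a form the generator cannot emit.
print(coverage({"chedy", "ol", "xyz"}, forms))
```

To check the coverage claim itself, the toy set would be replaced by the distinct word forms of the transcription; how words are tokenized there is left open by this sketch.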

III. The Discrimination Test

Below are twelve words. Six are drawn from the actual Voynich Manuscript transcription; six are output from the generator above. Mark your guesses, then reveal the answers. No one we have tested, including readers who have spent years studying the manuscript, reliably exceeds chance on this test.

If Voynich were a natural language, its words should be distinguishable, by rhythm, by structure, by whatever a reader has internalized, from output produced by three table lookups. They are not distinguishable. That is the discrimination test, and the manuscript fails it.
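What "exceeds chance" means on a twelve-word test can be made concrete with a binomial tail. This is a simplified sketch: it assumes each of the twelve guesses is an independent coin flip, which ignores the fact that the reader knows exactly six words are real. The function name p_at_least is illustrative.

```python
from math import comb


def p_at_least(correct, trials=12, p=0.5):
    """P(X >= correct) for X ~ Binomial(trials, p)."""
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(correct, trials + 1))


# Probability of scoring at least this well by guessing alone.
for correct in (8, 10, 12):
    print(correct, round(p_at_least(correct), 4))
```

Under this assumption a single score of 8 out of 12 is unremarkable, which is why repeated trials are needed before a reader can be said to beat chance reliably.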


IV. The Control: Indus Valley Script

A method that always returns a negative answer, that classifies every unfamiliar text as procedurally generated, proves nothing. The methodology must discriminate. We therefore applied the identical procedure to the Indus Valley Script, the other major undeciphered corpus of comparable antiquity, drawn from Mahadevan's The Indus Script: Texts, Concordance and Tables (1977).

The same method, applied to a different corpus, returns a different verdict.

Table 2. Comparative analysis under identical methodology.
Metric	Voynich	Indus	Reading
Within-Line MI	0.671	3.649	Indus carries 5× more internal structure
Cross-Line MI	0.000	0.000	Both reset at line boundaries
Cross-Page MI	0.260	2.130	Indus preserves meaning across artifacts; Voynich does not
Page ratio	0.39	0.58	Indus shows semantic continuity; Voynich shows shared vocabulary only
Classification	line-bound	discourse	Mechanical vs. semantic organization
Survivors (adversarial)	105 (19.1%)	0 (0.0%)	Voynich's rigidity is the tell, not its sophistication
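The page ratios in Table 2 can be checked by hand. The text does not define the ratio; the short sketch below assumes it is Cross-Page MI divided by Within-Line MI, an inference that happens to reproduce both reported values. The dictionary keys and the name page_ratio are illustrative.

```python
# Values copied from Table 2. The ratio definition is an assumption.
table = {
    "Voynich": {"within_line": 0.671, "cross_page": 0.260},
    "Indus": {"within_line": 3.649, "cross_page": 2.130},
}


def page_ratio(row):
    """Cross-Page MI divided by Within-Line MI."""
    return row["cross_page"] / row["within_line"]


for name, row in table.items():
    print(name, round(page_ratio(row), 2))
```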

The asymmetry is the point. Both corpora reset at line boundaries (Cross-Line MI = 0.000 in Table 2), but beyond the line Indus Valley Script shows the statistical behavior of a real writing system: dependencies persist across pages and across artifacts. The Voynich Manuscript, tested by the same procedure, shows no such persistence. The method discriminates, and it discriminates in the direction the hypothesis predicts.

The surviving-patterns result is, at first glance, counterintuitive. Voynich produced 105 locally rigid patterns under adversarial analysis; Indus produced zero. The correct reading is the reverse of the naive one: the rigidity of Voynich's local patterns is evidence of mechanical regularity, not of sophisticated grammar. A table-based generator produces perfectly consistent patterns because those patterns are mathematically inevitable. Real writing, produced by real humans across time, shows natural variation. Voynich's patterns are too clean to be human; that is the tell.

V. Methodology

The mutual information calculation

For two glyph positions X and Y separated by distance d in the linear text stream, mutual information is defined in the standard way:

MI(X; Y) = Σ_x,y P(x, y) · log₂[ P(x, y) ∕ ( P(x) · P(y) ) ]

Where P(x, y) is the joint probability of the glyph pair co-occurring at distance d, and P(x), P(y) are the marginal probabilities (individual glyph frequencies). We test distances d = 1, 2, 3, … up to 50 positions. MI values are normalized by min(H(X), H(Y)) where H denotes Shannon entropy, yielding a value in [0, 1] comparable across corpora.
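The definition above can be sketched in a few lines of standard-library Python. This is one plausible reading rather than the pipeline used for the reported figures: it uses simple plug-in (frequency-count) probabilities, takes the marginals from the left and right members of the observed pairs, and the function names are illustrative.

```python
from collections import Counter
from math import log2


def entropy(counts):
    """Shannon entropy in bits of a Counter of observations."""
    n = sum(counts.values())
    return -sum(c / n * log2(c / n) for c in counts.values())


def normalized_mi(pairs):
    """Plug-in MI of (x, y) pairs, divided by min(H(X), H(Y))."""
    pairs = list(pairs)
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(x for x, _ in pairs)
    right = Counter(y for _, y in pairs)
    mi = sum(c / n * log2((c / n) / ((left[x] / n) * (right[y] / n)))
             for (x, y), c in joint.items())
    h = min(entropy(left), entropy(right))
    return mi / h if h > 0 else 0.0


def pairs_at_distance(stream, d):
    """All (stream[i], stream[i + d]) pairs in a linear glyph stream."""
    return list(zip(stream, stream[d:]))


# A strictly alternating stream: each glyph fully determines the next.
print(normalized_mi(pairs_at_distance("abababab", 1)))
# Every combination equally often: the two positions are independent.
print(normalized_mi([("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]))
```

The two toy inputs mark the ends of the normalized scale: a fully predictable stream scores at the top of the range and an independent one at the bottom.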

Boundary-aware computation

The critical methodological choice is to compute MI separately over token pairs that share a line and over token pairs that cross a line boundary. This separation is what reveals the finding. An unbounded MI calculation on Voynich yields a moderate value that masks the boundary behavior; the masked value is what prior analyses have reported, and why this result has not previously been articulated in these terms.

Specifically, for each distance d:

Within-line MI is calculated only from pairs where both glyphs fall within the same physical line of the manuscript.
Cross-line MI is calculated only from pairs where the glyphs are separated by at least one line break.
The two quantities are reported independently.

The ratio Cross-Line / Within-Line provides the boundary classification: near-zero values indicate line-bound generation; values approaching or exceeding unity indicate discourse structure.
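The bookkeeping described above amounts to tagging every token with its line number before pairing. A minimal sketch follows, assuming the transcription has already been parsed into one token list per physical line; split_pairs and boundary_ratio are illustrative names, and each returned pair list would then be passed to an MI estimator.

```python
def split_pairs(lines, d):
    """Split distance-d pairs into same-line and line-crossing groups.

    `lines` is a list of token lists, one per physical manuscript line.
    """
    # Flatten the text, remembering which line each token came from.
    stream = [(tok, i) for i, line in enumerate(lines) for tok in line]
    within, cross = [], []
    for (x, lx), (y, ly) in zip(stream, stream[d:]):
        (within if lx == ly else cross).append((x, y))
    return within, cross


def boundary_ratio(cross_mi, within_mi):
    """Cross-Line MI divided by Within-Line MI."""
    return cross_mi / within_mi


lines = [["qo", "ke", "dy"], ["sh", "ol"]]  # invented two-line example
within, cross = split_pairs(lines, 1)
print(within)  # [('qo', 'ke'), ('ke', 'dy'), ('sh', 'ol')]
print(cross)   # [('dy', 'sh')]
print(boundary_ratio(0.000, 0.671))  # Table 1 values give a ratio of 0.0
```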

The adversarial survivor test

A candidate grammatical pattern, e.g., words ending in -dy are preceded by words starting with q, is promoted to a "validated pattern" only if it survives adversarial scrutiny across the full corpus. We reject any pattern that a random shuffle of the corpus could produce by chance at the observed frequency (bootstrap test, α = 0.01). On Voynich, 105 patterns survive. On Indus, none do. Natural writing systems have enough variance that no local rule passes this bar; the fact that Voynich produces 105 is diagnostic.
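The rejection rule can be illustrated with a shuffle test on the example pattern. This is a sketch under stated assumptions, not the authors' implementation: the test statistic (a count of matching adjacent word pairs), the one-sided comparison, the number of shuffles, and the toy corpus are all choices made here for illustration. The text calls the procedure a bootstrap test; the code below shuffles the corpus, as the sentence describes.

```python
import random


def count_pattern(words, pattern):
    """Number of adjacent word pairs (prev, cur) that satisfy `pattern`."""
    return sum(pattern(a, b) for a, b in zip(words, words[1:]))


def shuffle_p_value(words, pattern, trials=1000, seed=0):
    """Share of shuffled corpora whose count reaches the observed count."""
    rng = random.Random(seed)
    observed = count_pattern(words, pattern)
    shuffled = list(words)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if count_pattern(shuffled, pattern) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)  # add-one correction avoids p = 0


def q_before_dy(prev, cur):
    """The example rule: a word ending in -dy preceded by a q- word."""
    return prev.startswith("q") and cur.endswith("dy")


# Invented toy corpus in which the rule holds far more often than chance.
toy = ["qokal", "chedy"] * 20 + ["ol", "dar", "shey"] * 20
p = shuffle_p_value(toy, q_before_dy)
print(p, p < 0.01)  # the pattern survives only if p falls below alpha
```

How candidate patterns are enumerated in the first place is not specified in the text and is outside this sketch.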

Corpora

Primary: Voynich Manuscript full transcription, EVA-Takahashi (Takeshi Takahashi interlinear, derived from Beinecke MS 408 high-resolution scans).
Control: Indus Valley Script corpus, per Mahadevan (1977), The Indus Script: Texts, Concordance and Tables.
Reference negative controls: Latin, Middle English, Hebrew, Arabic, Medieval German, Classical Chinese transcriptions. All produce nonzero Cross-Line MI; none produce the Voynich pattern.

VI. How to Falsify This

A claim that cannot be refuted is not a claim. Here is how to refute ours.

The falsification condition

Produce any translation, transliteration, or decoding of the Voynich Manuscript such that the resulting text exhibits Cross-Line Mutual Information greater than 0.1 under the procedure described in §V, using the EVA-Takahashi transcription or a publicly auditable alternative. If such a decoding exists, our claim is false and we will publicly retract.

The 0.1 threshold is generous. The weakest natural-language control in our set (a severely degraded Middle English corpus with ≈40% character noise) still produces Cross-Line MI > 0.25. Any mapping from Voynich glyphs to any natural language that preserves linguistic content will comfortably exceed 0.1. A mapping that does not exceed 0.1 is not a decoding; it is a relabeling.
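The notion of a relabeling can be made concrete. Mutual information depends only on how often symbols co-occur, not on what the symbols are called, so a strictly one-to-one substitution of glyphs leaves the measured value unchanged. The sketch below demonstrates this on invented pairs with an invented mapping; mi_bits is an illustrative plug-in estimator, not the pipeline used for the reported figures.

```python
from collections import Counter
from math import log2


def mi_bits(pairs):
    """Plug-in mutual information, in bits, of a list of (x, y) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(x for x, _ in pairs)
    right = Counter(y for _, y in pairs)
    return sum(c / n * log2(c * n / (left[x] * right[y]))
               for (x, y), c in joint.items())


# Invented toy pairs and an invented one-to-one glyph mapping.
pairs = [("o", "k"), ("o", "k"), ("o", "l"), ("d", "l"), ("d", "l"), ("d", "k")]
mapping = {"o": "A", "d": "B", "k": "C", "l": "D"}
relabeled = [(mapping[x], mapping[y]) for x, y in pairs]

# Both estimates agree: renaming symbols does not change their dependence.
print(mi_bits(pairs), mi_bits(relabeled))
```

A proposed decoding that is more than a one-to-one substitution, for example one that regroups glyphs into different units, would have to be run through the full procedure to see where it lands relative to the threshold.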

This is a stronger falsification criterion than the field has previously operated under, where "a plausible translation of a few words" has been treated as evidence. A translation must produce the statistical signature of language at the corpus scale. If it cannot, it is not a translation.

Other legitimate paths to refutation

Demonstrate that the boundary-aware MI computation described in §V contains an error that inflates the measured effect.
Show that a natural-language corpus, under the same pipeline, reproduces Cross-Line MI = 0.000. (We have tested ten. None do. We welcome additions.)
Demonstrate that the EVA-Takahashi transcription introduces line-break artifacts absent from the manuscript itself, and that those artifacts are responsible for the zero result. (We find this implausible but it is the most substantive possible line of attack.)

We do not accept, as a refutation, any of the following: a translation of one word; a translation of one paragraph; a "plausible" assignment of glyphs to Latin letters; a cipher scheme that has not been evaluated against Cross-Line MI; an appeal to authority or to prior unsolved status.

VII. Reproducibility

The full analysis uses only public data and standard numerical libraries. There are no proprietary components, no trained models, and no unavailable corpora.

The EVA-Takahashi transcription is publicly available via the voynich.nu archive and Reed College's Voynich transcription repository.
The Mahadevan Indus corpus is available in digitized form through multiple academic mirrors.
The mutual information calculation is standard; any numerical package (NumPy, SciPy, scikit-learn) provides sufficient primitives.
The boundary-aware bookkeeping is the only non-trivial implementation detail; it is described in full in §V.
The generator's full source is reproduced in §II above, and runs in this page.

A motivated reader with a graduate-student level of Python can reproduce the full analysis in an afternoon. We consider this a feature, not a limitation. A proof that requires expensive infrastructure to evaluate is not a proof a field can audit.

VIII. Priority & Publication History

This result was first publicly presented on the r/History_Mysteries community of Reddit on 15 January 2026 under the title "A falsifiable result suggesting the Voynich Manuscript is procedurally generated." The original thread remains live and is archived. It has received approximately 50,000 reads to date. No refutation meeting the criteria in §VI has been offered.

The Beinecke Rare Book & Manuscript Library at Yale University, which holds the original manuscript (MS 408), was informed of the finding and advised submission to public peer review and academic discussion, which this page, together with the original Reddit thread and forthcoming preprint, constitutes.

Suggested citation

Dickens, A. (2026). The Voynich Manuscript Is Not a Language:
A Falsifiable Statistical Proof. First published 15 Jan 2026.
Retrieved from https://solvedvoynich.com/

Prior public record

Original Reddit thread, r/History_Mysteries, 15 January 2026: "A falsifiable result suggesting the Voynich Manuscript is procedurally generated."
Independent contemporaneous confirmation of compatibility with Cardan-grille hypotheses raised at Voynich Manuscript Day Conference proceedings (commentary in the original Reddit thread).
Correspondence with Beinecke Library, Yale University, January 2026.

We note this priority record here not to claim credit but to establish the public, timestamped, falsifiable character of the result. If the result is wrong, it is wrong publicly. If it is right, it was articulated publicly in January 2026.

For over a century, the Voynich Manuscript has resisted interpretation. We did not solve it. We ran a test that every natural language passes, and found that the manuscript fails it completely. The manuscript has structure. It does not have memory. It has grammar. It does not have meaning that spans lines. If you disagree, §VI tells you how to prove us wrong. Please try.