Solid Rock Hebrew Bible with Cantillation

Mikra according to the Masora accents mapped onto the Solid Rock Hebrew Bible

What are cantillation marks?

Cantillation marks (Hebrew: te'amim, טעמים) are the accent signs in the Hebrew Bible that indicate how each word is chanted during synagogue reading. They also function as a system of punctuation, marking phrase divisions and clause boundaries. Traditionally, each word (or group of words joined by maqaf) carries at least one cantillation accent.

In Unicode, cantillation marks occupy the range U+0591–U+05AF and appear as small marks above or below the consonant letters, distinct from the vowel points (nikkud, U+05B0–U+05BD).
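As a minimal sketch of how that range can be used, the following separates a word's cantillation marks from the rest of its characters. The function names `is_cantillation` and `split_accents` are hypothetical and chosen for illustration; they are not taken from the project's source.

```rust
/// True if `c` lies in the cantillation range cited above (U+0591–U+05AF).
fn is_cantillation(c: char) -> bool {
    matches!(c, '\u{0591}'..='\u{05AF}')
}

/// Splits a pointed Hebrew word into (text without cantillation, cantillation marks).
/// Letters and vowel points stay in the first part, in their original order.
fn split_accents(word: &str) -> (String, Vec<char>) {
    let mut base = String::new();
    let mut accents = Vec::new();
    for c in word.chars() {
        if is_cantillation(c) {
            accents.push(c);
        } else {
            base.push(c);
        }
    }
    (base, accents)
}
```

For example, a word made of nun, tsere, munach (U+05A3) and yod splits into the three-character base plus a single accent.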

The two source editions

Solid Rock Hebrew Bible (SRHB) is a TEI XML critical edition by Joey McCollum with more than 2,500 textual adjustments. It includes consonants and vowel points but does not include cantillation marks.

Mikra according to the Masora (MapM) is a Hebrew Bible text available on Hebrew Wikisource that includes full cantillation marks along with consonants and vowel points.

Both editions draw on important manuscript traditions, but each follows its own editorial methodology and makes its own textual decisions. This project uses MapM as the source for cantillation accents and maps them onto the Solid Rock text word by word.

The transposition algorithm

The process runs in four stages:

1. Import. Both texts are parsed and loaded into a SQLite database. Solid Rock is parsed from TEI XML files (via the quick-xml crate). MapM is parsed from a JSON export. Each word is stored with its book, chapter, verse, and position number.
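To make the stored shape concrete, here is a rough in-memory sketch of one word record and a per-verse grouping. It stands in for the SQLite tables, whose actual schema is not described here; `WordRecord`, `VerseKey`, and `group_by_verse` are hypothetical names, and only the four stored fields (book, chapter, verse, position) come from the description above.

```rust
use std::collections::BTreeMap;

/// One imported word (hypothetical layout, not the project's schema).
#[derive(Debug, Clone, PartialEq)]
struct WordRecord {
    book: String,
    chapter: u32,
    verse: u32,
    /// Position of the word within its verse, assumed 1-based here.
    position: u32,
    text: String,
}

/// (book, chapter, verse) identifies one verse.
type VerseKey = (String, u32, u32);

/// Groups words by verse and orders each verse's words by position,
/// which is the shape a position-by-position comparison needs.
fn group_by_verse(words: Vec<WordRecord>) -> BTreeMap<VerseKey, Vec<WordRecord>> {
    let mut verses: BTreeMap<VerseKey, Vec<WordRecord>> = BTreeMap::new();
    for w in words {
        verses
            .entry((w.book.clone(), w.chapter, w.verse))
            .or_default()
            .push(w);
    }
    for list in verses.values_mut() {
        list.sort_by_key(|w| w.position);
    }
    verses
}
```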

2. Normalize for comparison. Before words are compared, both texts are normalized as follows:

Strip all cantillation marks (U+0591–U+05AF) and punctuation
Strip meteg (U+05BD) — MapM includes it, Solid Rock does not
Normalize kamatz katan (U+05C7) to regular kamatz (U+05B8)
Split MapM words joined by maqaf (U+05BE) into separate tokens
Filter out standalone paseq (U+05C0) tokens from MapM
Apply NFC normalization to resolve combining mark order differences
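The steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's code: the function names are hypothetical, "punctuation" is assumed to mean paseq (U+05C0), sof pasuq (U+05C3), and ASCII punctuation, and the NFC step is left out because the Rust standard library has no Unicode normalizer.

```rust
/// True if `c` lies in the cantillation range U+0591–U+05AF.
fn is_cantillation(c: char) -> bool {
    matches!(c, '\u{0591}'..='\u{05AF}')
}

/// Assumed punctuation set: paseq, sof pasuq, and ASCII punctuation.
fn is_punctuation(c: char) -> bool {
    c == '\u{05C0}' || c == '\u{05C3}' || c.is_ascii_punctuation()
}

/// Normalizes a single token for comparison (NFC step omitted, see above).
fn normalize_word(word: &str) -> String {
    word.chars()
        .filter(|&c| !is_cantillation(c)) // strip cantillation marks
        .filter(|&c| c != '\u{05BD}') // strip meteg
        .filter(|&c| !is_punctuation(c)) // strip punctuation
        .map(|c| if c == '\u{05C7}' { '\u{05B8}' } else { c }) // kamatz katan to kamatz
        .collect()
}

/// Splits a MapM verse on whitespace and maqaf (U+05BE), normalizes each token,
/// and drops tokens that end up empty, such as a standalone paseq.
fn tokenize_mapm(verse: &str) -> Vec<String> {
    verse
        .split(|c: char| c.is_whitespace() || c == '\u{05BE}')
        .map(normalize_word)
        .filter(|w| !w.is_empty())
        .collect()
}
```

Dropping empty tokens after normalization is one simple way to filter a standalone paseq: once the paseq itself is stripped, nothing is left of the token.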

3. Match and transpose. For each verse, Solid Rock words and MapM words are aligned by word position (1st word ↔ 1st word, 2nd ↔ 2nd, etc.). For each pair:

If the normalized forms match → Matched (confidence 1.0). Cantillation marks from the MapM word are applied to the Solid Rock word.
If the normalized forms differ → Mismatch (confidence 0.5). The original Solid Rock word is displayed without cantillation, since applying accents from a non-matching word would be misleading. These words await resolution from proper manuscript sources.
If no MapM word exists at that position → Not in MapM (confidence 0.0). The Solid Rock word is kept without cantillation.
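The three outcomes can be expressed as a small classification over one verse. The sketch below assumes both inputs are already normalized and ordered by position; `Status` and `classify_verse` are hypothetical names, while the confidence values are the ones listed above.

```rust
/// Outcome for one Solid Rock word.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Status {
    Matched,
    Mismatch,
    NotInMapm,
}

impl Status {
    /// Confidence values as listed in the text.
    fn confidence(self) -> f64 {
        match self {
            Status::Matched => 1.0,
            Status::Mismatch => 0.5,
            Status::NotInMapm => 0.0,
        }
    }
}

/// Compares each Solid Rock word with the MapM word at the same index.
/// Both slices are assumed to hold normalized forms for a single verse.
fn classify_verse(sr: &[&str], mapm: &[&str]) -> Vec<Status> {
    sr.iter()
        .enumerate()
        .map(|(i, sr_word)| match mapm.get(i) {
            Some(m) if m == sr_word => Status::Matched,
            Some(_) => Status::Mismatch,
            None => Status::NotInMapm,
        })
        .collect()
}
```

Because the comparison is purely positional, one extra token early in a MapM verse turns every later Solid Rock word in that verse into a mismatch, even where the words themselves agree.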

4. Apply accents to matched words. For words that match, the accent application works character by character:

Build a "skeleton" of the MapM word: each non-cantillation character paired with any cantillation marks that follow it.
Walk through the Solid Rock word and MapM skeleton in parallel.
When a Solid Rock character matches the corresponding MapM skeleton character, insert the associated cantillation marks after it.
When characters don't match, advance both pointers (best-effort alignment).
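A minimal sketch of this walk, following the four steps literally: `build_skeleton` and `apply_accents` are hypothetical names, and the handling of edge cases (marks before any base character are dropped, Solid Rock characters beyond the end of the skeleton are copied unchanged) is an assumption rather than documented behavior.

```rust
/// True if `c` lies in the cantillation range U+0591–U+05AF.
fn is_cantillation(c: char) -> bool {
    matches!(c, '\u{0591}'..='\u{05AF}')
}

/// Pairs each non-cantillation character of the MapM word with the
/// cantillation marks that directly follow it.
fn build_skeleton(mapm_word: &str) -> Vec<(char, Vec<char>)> {
    let mut skeleton: Vec<(char, Vec<char>)> = Vec::new();
    for c in mapm_word.chars() {
        if is_cantillation(c) {
            // Attach to the preceding character; a mark with no
            // preceding character is dropped (assumption).
            if let Some((_, marks)) = skeleton.last_mut() {
                marks.push(c);
            }
        } else {
            skeleton.push((c, Vec::new()));
        }
    }
    skeleton
}

/// Walks the Solid Rock word and the MapM skeleton in parallel, index by index.
/// On a character match the skeleton's marks are inserted after the character;
/// on a mismatch the character is copied and both positions advance.
fn apply_accents(sr_word: &str, mapm_word: &str) -> String {
    let skeleton = build_skeleton(mapm_word);
    let mut out = String::new();
    for (i, c) in sr_word.chars().enumerate() {
        out.push(c);
        if let Some((base, marks)) = skeleton.get(i) {
            if *base == c {
                out.extend(marks.iter());
            }
        }
    }
    out
}
```

Since both positions always advance together, a character present in only one of the two words (a meteg in MapM, for instance) shifts every later comparison, so marks after that point may not be transferred. That is the cost of the best-effort alignment.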

Results

Status	Words	Percentage
Matched — cantillation applied	283,868	92.8%
Mismatch — text differs, shown without cantillation	21,334	7.0%
Not in MapM	636	0.2%
Total	305,838	100.0%
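The percentages follow directly from the word counts, rounded to one decimal place. A quick check, with `percent` as a hypothetical helper:

```rust
/// Share of `count` in `total`, as a percentage rounded to one decimal place.
fn percent(count: u64, total: u64) -> f64 {
    (count as f64 / total as f64 * 1000.0).round() / 10.0
}

/// Word counts from the table: matched, mismatch, not in MapM.
const COUNTS: [u64; 3] = [283_868, 21_334, 636];

/// Sum of the three status counts.
fn total_words() -> u64 {
    COUNTS.iter().sum()
}
```

Here `total_words()` gives 305,838, and `percent` applied to each count reproduces the three figures in the table.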

Why do mismatches occur?

The two editions sometimes have different words at the same position in a verse. Common causes include:

Different word counts. One edition may split or join words differently (e.g. with maqaf), causing all subsequent words in the verse to be offset by one or more positions.
Textual variants. The editions sometimes preserve different readings of the same passage.
Orthographic differences. Plene vs. defective spelling, or different vowel traditions for the same word.

Improving mismatch resolution — for instance through fuzzy matching or context-aware word alignment — is an active area of development.

How to read the viewer

Click any word in the viewer to see a tooltip with:

Status and confidence score
SR Original — the word as it appears in Solid Rock (without cantillation)
Result — for matched words, the cantillated result; for mismatches, the original SR word
Notes — for mismatches, shows what both SR and MapM texts looked like at that position

Source code and data

accent-transpose-chirho — Rust source code for the transposition engine
Solid Rock Hebrew Bible — TEI XML critical edition
Mikra according to the Masora — Hebrew Wikisource

Updating the Solid Rock text

This project tracks the Solid Rock Hebrew Bible as a Git submodule pointing to the upstream repository at github.com/jjmccollum/solid-rock-hb.

When updates are pushed to that repository, we incorporate them by:

Pulling the latest submodule: git submodule update --remote solid-rock-hb
Re-importing: cargo run -- import-solid-rock-chirho
Re-running transposition: cargo run -- transpose-chirho
Re-exporting: cargo run -- export-chirho
Redeploying the site and PDF

Any corrections to the Solid Rock text (such as typo fixes or future edition updates) are picked up the next time this pipeline is run. The entire process from import to export takes under a minute.