Esoteric LLM: Fine-Tuning a 9B Model on Tarot, Kabbalah, and Alchemy for $2


Esoteric LLM is a useful experiment wrapped in an unusual domain: a 9-billion-parameter language model fine-tuned on 22 books about tarot, kabbalah, hermeticism, alchemy, the I Ching, astrology, and gnosticism. The full training run took about 11 hours, and the synthetic instruction dataset cost less than $2 to generate through an API. The occult angle makes the project memorable; the engineering lesson is more general: small domain-specific LLMs are becoming cheap enough to train as practical tools, not only as lab demos.

The Domain Is Odd, the Pattern Is Practical

The project asks what happens if a model reads a curated esoteric library and then learns to answer questions from it. The corpus contains roughly 11 million characters across 22 books; 21 files survived cleaning, while one scanned PDF was rejected because OCR quality was too poor. That is a small dataset by foundation-model standards, but it is large enough to teach a narrow model the vocabulary, authors, and cross-references of a specific field.

This is the part that matters beyond tarot. A general-purpose model can be given background context in every prompt, but fine-tuning changes the default behavior of the model itself. The model starts to speak the domain language without needing the entire library pasted into the prompt.

The Corpus: 22 Books Across Seven Traditions

The README lists canonical and research-heavy sources rather than generic New Age filler. The tarot set includes Waite, Crowley, and Liber 777. Alchemy includes Splendor Solis, Maier’s Atalanta Fugiens, and Jung. The I Ching uses the Wilhelm/Baynes translation, while the gnostic side includes Nag Hammadi texts and Pistis Sophia.

| Tradition | Example texts |
| --- | --- |
| Tarot | Waite, Crowley, Liber 777 |
| Astrology | Lilly, Rudhyar, Tarnas |
| Kabbalah | Sefer Yetzirah, Zohar excerpts |
| Hermeticism | Corpus Hermeticum, Emerald Tablet commentaries |
| Alchemy | Splendor Solis, Atalanta Fugiens, Jung |
| I Ching | Wilhelm/Baynes translation |
| Gnosticism | Nag Hammadi texts, Pistis Sophia |

After processing, the corpus became 3,035 chunks of roughly 1,000 tokens, split at paragraph boundaries. That detail is not cosmetic: paragraph-aware chunking gives the model coherent passages instead of chopped fragments, which is especially important in a domain built around dense references and terminology.
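To make the chunking idea concrete, here is a minimal sketch of paragraph-aware chunking. It is not the project's actual code: the function name, the blank-line paragraph separator, and the rough 4-characters-per-token estimate are all assumptions made for illustration, and a real pipeline would count tokens with the model's tokenizer.

```python
def chunk_paragraphs(text: str, max_tokens: int = 1000,
                     chars_per_token: float = 4.0) -> list[str]:
    """Greedily pack whole paragraphs into chunks of about max_tokens.

    Token counts are approximated as len(text) / chars_per_token,
    which is an assumption, not a real tokenizer.
    """
    budget = int(max_tokens * chars_per_token)  # character budget per chunk
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue  # skip empty paragraphs left by repeated blank lines
        # Close the current chunk if this paragraph would push it over budget.
        if current and size + len(para) > budget:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += len(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Synthetic text: ten paragraphs of a little over 900 characters each.
sample = "\n\n".join(f"Paragraph {i}. " + "x" * 900 for i in range(10))
chunks = chunk_paragraphs(sample, max_tokens=500)
print(len(chunks), [len(c) for c in chunks])
```

A paragraph is never split, so a single paragraph longer than the budget becomes its own oversized chunk. That trade-off favors coherent passages over strict length limits.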

Phase A: Continued Pre-Training Builds the Vocabulary

The first training phase is continued pre-training. The model reads the raw corpus without question-answer formatting, so the goal at this stage is domain exposure, not instruction following. The run took 1 hour and 40 minutes using 4-bit QLoRA with LoRA rank 64, alpha 32, bf16 precision, and a learning rate of 5e-5 on a cosine schedule.

Only 132 million parameters were trainable, equal to 1.39% of the full model. Loss moved from 2.2 to 2.0. That is not a dramatic transformation, but it is the right kind of first step: the model becomes less surprised by the domain before it is asked to answer questions about it.
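The reported numbers can be sanity-checked with a little arithmetic. The sketch below assumes the standard LoRA formulation, where each adapted linear layer gains two low-rank matrices and the update is scaled by alpha / rank. The 4096 x 4096 layer shape is hypothetical and is not taken from the project.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in linear layer.

    A has shape (rank, d_in) and B has shape (d_out, rank).
    """
    return rank * (d_in + d_out)


RANK, ALPHA = 64, 32           # values reported for the run
scaling = ALPHA / RANK         # standard LoRA scaling of the low-rank update

trainable = 132e6              # reported trainable parameters
fraction = 0.0139              # reported share of the full model
implied_total = trainable / fraction

# Hypothetical layer shape, for illustration only.
per_layer = lora_params(4096, 4096, RANK)

print(f"scaling factor: {scaling}")
print(f"implied total parameters: {implied_total / 1e9:.2f}B")
print(f"LoRA parameters for one 4096x4096 layer: {per_layer:,}")
```

The implied total of roughly 9.5 billion parameters is consistent with the article's description of a 9B model.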

Phase B: Instruction Tuning Costs Less Than $2

The second phase teaches the model to answer. The author generated 12,495 question-answer pairs through the DeepSeek V4 Flash API, roughly four pairs per corpus chunk. The total API cost was under $2, which is the headline number for the project and also the reason it is worth paying attention to.

Instruction tuning took 8 hours and 28 minutes. The run continued from the Phase A LoRA adapter instead of starting a fresh one. Loss dropped from 2.0 to 0.91, a 54% reduction, which suggests the model learned the instruction format rather than only absorbing terminology. The hard part here is not buying compute; it is creating a dataset that asks the right questions without polluting the model with weak answers.
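One way to keep weak answers out of a synthetic dataset is to validate every generated pair before it is written to disk. The sketch below is an assumption about how such a step could look, not the author's pipeline: the prompt wording, the JSON reply format, and the 40-character minimum are invented for illustration, and the API call itself is left out.

```python
import json


def build_qa_prompt(chunk: str, n_pairs: int = 4) -> str:
    """Build a generation prompt for one corpus chunk (wording is hypothetical)."""
    return (
        f"Read the passage below and write {n_pairs} question-answer pairs "
        "that can be answered from the passage alone. "
        'Reply with a JSON list of objects with keys "question" and "answer".\n\n'
        f"Passage:\n{chunk}"
    )


def parse_pairs(raw: str, min_answer_chars: int = 40) -> list[dict]:
    """Keep only well-formed pairs; drop malformed items and very short answers."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []  # an unparseable reply contributes nothing
    if not isinstance(items, list):
        return []
    kept = []
    for item in items:
        if not isinstance(item, dict):
            continue
        question = str(item.get("question", "")).strip()
        answer = str(item.get("answer", "")).strip()
        if question.endswith("?") and len(answer) >= min_answer_chars:
            kept.append({"question": question, "answer": answer})
    return kept


# Simulated model reply: one usable pair, one answer that is too short.
reply = json.dumps([
    {"question": "What does the passage say about the first stage?",
     "answer": "It describes the first stage as a dissolution of the starting material."},
    {"question": "Who wrote it?", "answer": "Unknown."},
])
pairs = parse_pairs(reply)
print(len(pairs))
```

At 12,495 pairs over 3,035 chunks the project averaged about 4.1 pairs per chunk, so a default of four pairs per prompt matches the reported ratio.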

Phase C: DPO Helps With Preference, Not Personality Replacement

The third phase was Direct Preference Optimization, aimed at giving the model a denser scholarly writing voice with precise references. The author wrote 30 seed examples and expanded them into 800 preferred responses using a local Qwopus-27B model running through llama.cpp. DPO took 1 hour and 22 minutes with beta 0.1, learning rate 5e-6, and reported accuracy in the 75-85% range.

The metrics looked fine, but the style transfer was weaker than expected. The reason is a useful warning: Phase B had 12,495 English instruction pairs, while the DPO examples were in Spanish. Eight hundred preference examples are unlikely to override the language and style prior established by more than twelve thousand instruction pairs. DPO can adjust preferences at the margin; it is not a clean personality transplant, tempting as that phrase is in model-tuning discussions.
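The two numbers reported for this phase, beta and accuracy, both come out of the DPO objective. Below is a minimal sketch of the per-pair loss, assuming the standard DPO formulation and assuming that "accuracy" means the usual reward accuracy: the share of pairs where the chosen response receives the higher implicit reward. The article does not say which trainer produced the metric, and the log-probabilities in the example are made-up numbers.

```python
import math


def dpo_pair(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> tuple[float, bool]:
    """Return (loss, is_accurate) for one preference pair.

    Inputs are summed log-probabilities of each response under the
    trained policy (pi_*) and the frozen reference model (ref_*).
    """
    reward_chosen = beta * (pi_chosen - ref_chosen)
    reward_rejected = beta * (pi_rejected - ref_rejected)
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    loss = math.log1p(math.exp(-margin))
    return loss, margin > 0


# Made-up log-probabilities: the policy raised the chosen response
# and lowered the rejected one relative to the reference model.
loss, accurate = dpo_pair(pi_chosen=-40.0, pi_rejected=-50.0,
                          ref_chosen=-42.0, ref_rejected=-45.0)
print(f"loss={loss:.3f} accurate={accurate}")
```

Under this definition, accuracy only says the model ranks the chosen response above the rejected one within each pair. It does not measure whether free-form generations adopt the target style, which is one way the metrics can look fine while the style transfer stays weak.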

The Real Cost Is Data Work, Not API Spend

The project spent less than $2 externally, but the model did not magically appear for $2. It needed a 24 GB GPU for training, enough RAM for CPU-side merging, cleaned books, prompt design for dataset generation, checkpoint recovery, and GGUF export. The low API cost is real; it just should not be confused with zero engineering cost.

| Stage | Time | Main result |
| --- | --- | --- |
| Continued pre-training | 1h 40min | Loss 2.2 -> 2.0 |
| Instruction tuning | 8h 28min | 12,495 Q&A pairs, loss 2.0 -> 0.91 |
| DPO alignment | 1h 22min | 800 preference pairs, 75-85% accuracy |
| Final model | ~11 hours | 5.3 GB Q4_K_M GGUF |

The final artifact is a 5.3 GB Q4_K_M GGUF model deployable through Ollama or llama.cpp. That makes it small enough for local use and specialized enough to be interesting. It is not a frontier model, but it is not trying to be one.

The Engineering Failures Are the Most Useful Part

The README documents several failures that make the project more useful than a clean success story. First, the LoRA merge for GGUF export could not fit both the base model and adapter into 24 GB of VRAM. The fix was a CPU merge: load everything into RAM, merge, and save. It took about 20 minutes and used around 125 GB of system memory.
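Merging an adapter is conceptually simple: each adapted weight matrix absorbs its low-rank update. The sketch below shows that arithmetic on tiny nested-list matrices, assuming the standard LoRA update W + (alpha / rank) * B @ A. It illustrates the operation only. The article does not say which tool the author used, and in practice a library such as PEFT performs this across every adapted layer through its `merge_and_unload` method.

```python
def merge_lora(W, A, B, alpha: float, rank: int):
    """Return W + (alpha / rank) * (B @ A) for nested-list matrices.

    Shapes: W is d_out x d_in, B is d_out x rank, A is rank x d_in.
    """
    scale = alpha / rank
    d_out, d_in = len(W), len(W[0])
    merged = [row[:] for row in W]  # copy so the base weights stay untouched
    for i in range(d_out):
        for j in range(d_in):
            delta = sum(B[i][k] * A[k][j] for k in range(rank))
            merged[i][j] += scale * delta
    return merged


# Toy example with rank 1; all values are arbitrary.
W = [[1.0, 0.0],
     [0.0, 1.0]]
A = [[1.0, 2.0]]       # rank x d_in
B = [[1.0], [3.0]]     # d_out x rank
merged = merge_lora(W, A, B, alpha=2.0, rank=1)
print(merged)
```

After the merge the adapter matrices are no longer needed at inference time, which is what allows the model to be exported as a single GGUF file.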

Second, generating DPO “chosen” answers from scratch caused hallucinated citations and dates. In a domain where author names, editions, and text references matter, confident false citations are worse than vague answers. The fix was rewrite mode: provide a factual short answer first, then ask the model to restyle only the prose.
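Rewrite mode can be paired with a cheap automated check. The sketch below is an assumption about how that might look, not the author's code: the prompt wording and the digit-based guard are invented for illustration, and the example facts are placeholders. The guard flags any rewritten answer that contains a number, such as a year, that was absent from the factual draft.

```python
import re


def build_rewrite_prompt(question: str, factual_answer: str) -> str:
    """Ask the model to restyle an existing answer without adding facts."""
    return (
        "Rewrite the answer below in a dense, scholarly voice. "
        "Do not add names, dates, titles, or citations that are not "
        "already present in the answer.\n\n"
        f"Question: {question}\n"
        f"Answer: {factual_answer}"
    )


NUMBER = re.compile(r"\d+")


def introduces_new_numbers(factual_answer: str, rewritten: str) -> bool:
    """True if the rewrite contains a number missing from the factual draft."""
    allowed = set(NUMBER.findall(factual_answer))
    return any(n not in allowed for n in NUMBER.findall(rewritten))


# Placeholder facts, not real bibliographic data.
draft = "Author X published Text Y in 1650."
good = "Text Y, issued by Author X in 1650, belongs to the same current."
bad = "Text Y, issued by Author X in 1650 and revised in 1672, belongs to it."
print(introduces_new_numbers(draft, good), introduces_new_numbers(draft, bad))
```

A check like this only catches invented numbers. Invented author names or titles would need a different comparison, for example against a list of sources known to be in the corpus.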

Third, instruction tuning crashed with an out-of-memory error at step 340 with max_seq=2048. The repair was to reduce max_seq to 1024 and double gradient accumulation to keep the effective batch size at 16. Nothing mystical happened; the GPU simply ran out of memory before the experiment ran out of ambition.
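The arithmetic behind that fix is worth spelling out. Effective batch size is the per-device batch multiplied by the number of gradient accumulation steps, so doubling accumulation keeps it constant only if the per-device batch is halved at the same time. The article gives the effective batch size of 16 but not the individual values, so the before and after settings below are hypothetical.

```python
def effective_batch(per_device_batch: int, grad_accum_steps: int,
                    num_devices: int = 1) -> int:
    """Number of sequences contributing to each optimizer step."""
    return per_device_batch * grad_accum_steps * num_devices


def tokens_per_micro_batch(per_device_batch: int, max_seq: int) -> int:
    """Upper bound on tokens held in memory for one forward/backward pass."""
    return per_device_batch * max_seq


# Hypothetical settings that reproduce the reported effective batch size of 16.
before = {"max_seq": 2048, "per_device_batch": 2, "grad_accum_steps": 8}
after = {"max_seq": 1024, "per_device_batch": 1, "grad_accum_steps": 16}

for name, cfg in (("before", before), ("after", after)):
    eff = effective_batch(cfg["per_device_batch"], cfg["grad_accum_steps"])
    tok = tokens_per_micro_batch(cfg["per_device_batch"], cfg["max_seq"])
    print(f"{name}: effective batch {eff}, up to {tok} tokens per micro-batch")
```

Under these assumed values the optimizer still sees 16 sequences per step, while the peak number of tokens per forward and backward pass drops by a factor of four. Activation memory scales with that second number, which is why the change avoids the crash.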

Why This Matters Outside Esotericism

The interesting direction is not “occult chatbots.” It is small-scale specialization. The same pipeline can be applied to local regulations, old technical manuals, internal documentation, legal archives, training material, or a tightly scoped professional library. A corpus of 11 million characters is not huge, but it can be enough when the domain is coherent and the desired behavior is narrow.

The practical bottleneck shifts from compute to data engineering. Corpus selection, OCR quality, chunking, synthetic question generation, target language choice, and DPO design determine the result more than the API bill. In this project, the paid dataset generation was the smallest line item; the more expensive resource was judgment.

Conclusion

Esoteric LLM does not prove that a small model has discovered hidden wisdom. It proves something more useful: an end-to-end fine-tuning pipeline for a 9B model can run on consumer hardware, produce a local GGUF file, and use an external dataset budget that rounds to two dollars.

The broader lesson is straightforward. If the corpus is compact, curated, and domain-specific, fine-tuning a small model is no longer reserved for research labs with enormous budgets. It is becoming an engineering workflow: prepare the texts, generate the instruction data, train carefully, and remember that DPO can polish behavior but cannot undo a mismatched SFT foundation.

Source: Esoteric LLM on GitHub.
