LLM Wikis Are Over-Engineered: I Replaced Mine With a Pure Python Compiler

TL;DR

I built a pipeline that compiles a folder of raw, messy text notes into a linked, linted markdown wiki. No LLM calls, no embeddings, no external APIs, standard library only.
The pipeline has four stages: a regex extractor, a graph builder that detects cross-references, a section-aware rewriter that preserves anything you write by hand, and a linter that checks its own output.
I hit two real bugs while building this: a graph builder that scaled badly, and a linter that silently undercounted orphan pages. Both are in this article as they actually happened, along with the fixes.
I benchmarked the full pipeline at three corpus sizes on two different machines (Linux and Windows) and checked whether the deterministic outputs actually matched across both. They did, exactly.
Full code, all 17 tests, and unrounded terminal output are included below so you can rerun everything yourself.

Why I wrote this

I tried building a Karpathy-style LLM wiki. Agent loops. Recursive LLM calls. Embeddings for everything.

The input was a folder of local markdown files I already had, sitting on my own disk.

And partway through, it hit me: I was paying tokens to reorganize text I already owned.

So I replaced the whole pipeline with a pure Python compiler.

This article walks through that system in full: turn a folder of raw, inconsistently formatted text notes into a linked, linted markdown wiki, with zero LLM calls, zero external APIs, and zero third-party dependencies. Every benchmark number below is real, was run on two different machines (a Linux container and my own Windows PC), and I’ve included the two real bugs I hit while building it.

If you’re searching for a pure Python markdown wiki compiler, a deterministic alternative to agent-based knowledge base tools, or a practical breakdown of building a local-first RAG alternative, this is that article.

Full code: https://github.com/Emmimal/wiki-compiler/

The compiler mindset

Here’s the reframe that the rest of this article is built on:

An agent decides what your wiki might look like. A compiler guarantees what it will look like.

[Figure: Stochastic agent pipelines versus deterministic compiler pipelines. The agent pipeline flows through LLM calls and rewrites; the compiler pipeline flows through a parser, a graph, and a rewrite step. While agentic workflows introduce variability through iterative LLM calls, compiler-based architectures provide consistent, reproducible outputs. Image by author.]

I wanted this wiki to be predictable. Unlike an LLM, which varies its output, a compiler gives you the same result every single time you run it. That consistency is essential for my personal reference notes. I’ve structured this system so that markdown pages act like object files. They’re generated from source and can be rebuilt at will. I keep the hand-edited content separate from the machine-generated sections. Here is how the pipeline is set up:

[Figure: The Compile Pipeline. The step-by-step system architecture of the markdown compilation pipeline, which turns unorganized raw notes into a structured, validated, fully linked personal wiki through automated extraction, link graph generation, section rewriting, and structural linting. Image by author.]

I broke this down into four stages. Each one handles a single deterministic task and can be tested on its own. I avoided any step that relies on a model to make a judgment call.

Why zero dependencies matters here, specifically

Everything in this codebase runs on the Python standard library alone. No sentence-transformers, no vector database, no HTTP client for an embedding API. That’s not a purity test for its own sake. It’s a direct consequence of the problem this pipeline solves.

Once you strip away the LLM calls, what’s actually left to do is text parsing, string manipulation, and graph traversal over an in-memory dictionary. These are exactly the kinds of problems re, os, and plain Python data structures were built for. Reaching for a heavier dependency here wouldn’t buy correctness; it would just buy install friction and one more thing that can break for reasons that have nothing to do with your actual notes. If you’ve ever had a pip install hang on Windows because a compiled wheel for a machine learning library wasn’t available for your Python version, you already know why “it just runs” is worth protecting.

The problem with agent-driven wikis

The idea of using an LLM to build and maintain a personal wiki isn’t new, and it isn’t mine. It gained serious traction after Andrej Karpathy described the pattern in a widely shared post, where he explained that he was spending less of his token budget generating code and more of it building structured, persistent knowledge bases out of his research notes. He followed up with a public “idea file” laying out the architecture in more depth, and explicitly compared the process to compilation: raw sources go in, a structured, cross-referenced wiki comes out, and the LLM is the thing doing the compiling [1][2].

I think that compilation framing is exactly right. I just don’t think an LLM needs to be the compiler.

Here’s the practical problem. If your raw source is already local, already text, and already deterministic, routing it through a probabilistic system to organize it introduces three costs that a parser or a compiler simply doesn’t have:

Cost: Every time you add a new document, an agent-driven wiki re-reads content, decides what changed, and rewrites pages. That’s token spend on organizational work, not synthesis. It adds up fast once your source folder has hundreds of files instead of a dozen.

Latency: Every read-decide-write cycle is a network round trip if you’re using a hosted model, and a real compute cost even if you’re running something local. For work that’s fundamentally about restructuring existing text, that latency has no reason to exist.

Non-determinism: This is the one that actually bit me. I ran the same folder through an early agent-based prototype twice and got two different link structures. Nothing had changed in the source files. The model just made slightly different judgment calls both times about what counted as related. For code, that’s often charming. For the thing you’re using as your source of truth, it’s a problem.

None of this means LLMs are the wrong tool for knowledge work in general. It means they’re the wrong tool for the specific part of the job that’s actually deterministic: taking known inputs and producing a known, reproducible structure. That part is a parsing problem, not a reasoning problem.

Step 1: The regex metadata extractor

Real note folders are a mess. Some files use a # Header, some use a bare uppercase line, and some don’t have any header at all. Metadata like “created:” or “aliases:” might be missing, or hidden in the middle of the file. Any extractor expecting consistent formatting breaks the moment it hits a real-world file, so I built mine to handle the mess.

It checks for a # header first. If that fails, it looks for a bare capitalized line. If it still finds nothing, it defaults to the filename. It scans for metadata fields wherever they happen to exist rather than requiring them to be in a specific spot.

This means the pipeline doesn’t crash on a malformed note; it just works with what it finds. It’s the most boring part of the project, but it cleans up the bulk of the mess, which is why I spent the most time getting this stage to actually hold up.
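The fallback chain described above can be sketched roughly as follows. This is a minimal illustration, not the code from the repository: the regexes, the function names (`extract_metadata`, `_bare_uppercase_title`), and the returned dictionary shape are all assumptions made for the example.

```python
import re
from pathlib import Path

# Hypothetical patterns: the article describes the behaviour, not the exact regexes.
HASH_HEADER_RE = re.compile(r"^#[ \t]+(.+?)[ \t]*$", re.MULTILINE)
META_RE = re.compile(
    r"^(created|aliases):[ \t]*(.+?)[ \t]*$", re.MULTILINE | re.IGNORECASE
)


def _bare_uppercase_title(text):
    """Return the first non-empty line as a title if it is all caps, else None."""
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        # Only the first non-empty line is a header candidate; skip "KEY: value" lines.
        if stripped.isupper() and ":" not in stripped:
            return stripped.title()
        return None
    return None


def extract_metadata(text, filename):
    """Title fallback chain: '# Header' -> bare uppercase line -> filename."""
    match = HASH_HEADER_RE.search(text)
    if match:
        title = match.group(1)
    else:
        fallback = Path(filename).stem.replace("_", " ").title()
        title = _bare_uppercase_title(text) or fallback
    # Metadata fields are collected wherever they appear in the file.
    meta = {key.lower(): value for key, value in META_RE.findall(text)}
    return {
        "title": title,
        "created": meta.get("created"),
        "aliases": meta.get("aliases"),
    }
```

A note with no recognizable header and no metadata still yields a usable record, because the filename is always available as a last resort.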

Step 2: The Graph Builder

This stage handles the links between notes. I hit a performance wall here.

My first version ran a separate regex for every entity against every other file. It was an O(n^2) approach. With 100 files, it was fine. With 1,000 files, it took 4.4 seconds. At 5,000 files, it took 107 seconds. I had assumed that not calling an API meant the code would be fast by default. I was wrong. The algorithm matters more than the lack of network calls.

I replaced the pairwise regex matching with a word-indexed phrase matcher. Now, I tokenize each file once. As I move through the tokens, I use a dictionary lookup to check only for entity names that start with the current word. Instead of testing every entity name against every single position, I only check the candidates that could actually be a match. It turns a process that grew quadratically into one that scales far more efficiently as the corpus expands.
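To make the word-indexed idea concrete, here is a small sketch of one way to build it. The tokenizer, the index layout, and the names `build_first_word_index` and `find_mentions` are assumptions for illustration; the repository's matcher may differ in details such as case handling or punctuation.

```python
import re
from collections import defaultdict

# Assumed tokenizer: lowercase alphanumeric runs. The real one may differ.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    return TOKEN_RE.findall(text.lower())


def build_first_word_index(entity_names):
    """Map each entity's first token to the full token tuples starting with it."""
    index = defaultdict(list)
    for name in entity_names:
        tokens = tuple(tokenize(name))
        if tokens:
            index[tokens[0]].append((tokens, name))
    return index


def find_mentions(text, index):
    """Return the set of entity names whose token sequence appears in text."""
    tokens = tokenize(text)
    found = set()
    for pos, token in enumerate(tokens):
        # Only entities whose first word equals the current token are candidates.
        for candidate, name in index.get(token, ()):
            if tuple(tokens[pos:pos + len(candidate)]) == candidate:
                found.add(name)
    return found


names = ["Attention Mechanism", "Learning Rate Schedule", "Gradient Descent"]
index = build_first_word_index(names)
mentions = find_mentions(
    "A common mistake is tuning Attention Mechanism without first "
    "checking Learning Rate Schedule.",
    index,
)
```

The index is built once per compile, and each file is tokenized once, so the per-token work depends on how many entity names share a first word rather than on the total number of entities.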

The results changed drastically. The 1,000-file run dropped from 4.4 seconds to under 50 milliseconds. The 5,000-file run went from 107 seconds to less than a second. These specific numbers are from early testing in my Linux development environment, before I ever ran the pipeline on Windows, so don’t expect them to match the Windows numbers later in this article exactly; the benchmark section further down has the real Windows figures. Numbers matter, so here is the actual progression:

Approach	100 files	1,000 files	5,000 files
Naive pairwise regex (first version)	~46 ms	~4,400 ms	~107,000 ms
Combined regex alternation (intermediate)	~12 ms	~597 ms	~14,000 ms
Word-indexed phrase matcher (final)	~2 ms	~33 ms	~492 ms

The middle row of my benchmark is worth noting, because it was my first attempt at a fix and it failed. I tried combining every entity name into one big alternation pattern (Name1|Name2|Name3...). While this reduced the number of regex objects, the underlying engine still had to check the full list at every single character position it scanned. With 5,000 entities, that’s still effectively quadratic behavior wearing a linear disguise. The word-indexed matcher was the only version that actually fixed the complexity, rather than just hiding it behind a faster constant factor.

Here’s what the resulting graph actually looks like for three real entities from my test corpus:

[Figure: Bidirectional mention graph excerpt with three knowledge nodes (Attention Mechanism, Learning Rate Schedule, and Gradient Descent), showing the direct relationship between raw markdown backlink syntax and the parsed graph. Image by author.]

Detection here is strictly lexical, not semantic. It only matches if the exact name appears in the text. It doesn’t know that “the company” and “my employer” might be the same thing. That’s a real limitation, and I’ll get into the actual cost of it later in this article.

Step 3: The section-aware rewriter

This stage writes the actual markdown. It isn’t an abstract syntax tree parser. It doesn’t walk any trees. It just does targeted string replacement between specific ## Heading tags, so I’m calling it a section-aware rewriter.

I didn’t want the compiler to just nuke every file and start from scratch. I needed a way to keep anything I scribbled into a page’s Notes section across a recompile. The logic is dead simple: before it writes a page, it checks if a file already exists on disk. If it does, it grabs whatever is under that page’s ## Notes heading. Then the compiler-owned sections (Metadata, Related, Referenced By, and Body) get wiped and regenerated from source to keep them accurate, while the Notes content gets written back in as-is.
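A rough sketch of that preserve-then-regenerate step is below. It works on strings rather than files to stay self-contained, and the names `extract_notes` and `render_page`, the regex, and the exact page layout are assumptions for illustration rather than the repository's implementation.

```python
import re

NOTES_PLACEHOLDER = "_(add your own notes here -- preserved on recompile)_"
# Captures everything after '## Notes' up to the next '## ' heading or end of text.
NOTES_RE = re.compile(
    r"^## Notes[ \t]*\n(.*?)(?=^## |\Z)", re.MULTILINE | re.DOTALL
)


def extract_notes(existing_page):
    """Return the hand-written Notes body of an existing page, or None if absent."""
    match = NOTES_RE.search(existing_page)
    if not match:
        return None
    return match.group(1).strip() or None


def render_page(title, metadata, related, referenced_by, body, existing_page=None):
    """Regenerate compiler-owned sections; carry the Notes section over unchanged."""
    notes = extract_notes(existing_page) if existing_page else None
    lines = [f"# {title}", "", "## Metadata"]
    lines += [f"- {key}: {value}" for key, value in metadata.items()]
    lines += ["", "## Related"] + [f"- [[{name}]]" for name in related]
    lines += ["", "## Referenced By"] + [f"- [[{name}]]" for name in referenced_by]
    lines += ["", "## Body", body.strip(), ""]
    lines += ["## Notes", notes or NOTES_PLACEHOLDER, ""]
    return "\n".join(lines)


# First compile, a manual edit to Notes, then a recompile with changed source.
meta = {"created": "2026-02-27"}
first = render_page("Attention Mechanism", meta, ["Learning Rate Schedule"], [], "Old body.")
edited = first.replace(NOTES_PLACEHOLDER, "My hand-written note.")
second = render_page(
    "Attention Mechanism", meta, ["Learning Rate Schedule"],
    ["Gradient Descent"], "New body.", existing_page=edited,
)
```

In this sketch, `second` carries the hand-written note forward while the Body and Referenced By sections reflect the new source, which is the behavior the recompile test checks.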

I didn’t just trust that this would work; I tested it. I manually added a note to a generated page, tweaked the source, and ran the compiler. I confirmed the note stayed put while the other sections updated. I reran the same test on my Windows machine a few days later, and it worked exactly the same way.

Here’s what that actually looks like, start to finish, on one real entity from my test corpus:

RAW SOURCE (raw_notes/attention_mechanism.txt)
------------------------------------------------
ATTENTION MECHANISM
created: 2026-02-27

A common mistake is tuning Attention Mechanism without first
checking Learning Rate Schedule.

This section needs a cleaner example before it's considered final.

COMPILED OUTPUT (compiled_wiki/attention_mechanism.md)
------------------------------------------------
# Attention Mechanism

## Metadata
- created: 2026-02-27
- aliases: none
- source: raw_notes/attention_mechanism.txt

## Related
- [[Learning Rate Schedule]]

## Referenced By
- [[Gradient Descent]]

## Body
A common mistake is tuning Attention Mechanism without first
checking Learning Rate Schedule.

This section needs a cleaner example before it's considered final.

## Notes
_(add your own notes here -- preserved on recompile)_

Nothing in the raw file told the compiler that Gradient Descent references this page. That link got added automatically, because Gradient Descent’s own raw note happens to mention “Attention Mechanism” in its body text, and the graph builder caught it. That’s the whole pipeline in one concrete example: messy input in, structured and cross-referenced output out, with zero manual linking.
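The Referenced By section is just the outgoing mention graph turned around. A minimal sketch of that inversion, assuming the graph is held as a dictionary from page slug to the set of slugs it mentions (the function name `build_backlinks` and the slug format are hypothetical):

```python
from collections import defaultdict


def build_backlinks(outgoing):
    """Invert slug -> set of mentioned slugs into slug -> sorted list of referrers."""
    incoming = defaultdict(set)
    for source, targets in outgoing.items():
        for target in targets:
            # A page mentioning its own name is not a backlink.
            if target != source:
                incoming[target].add(source)
    # Sorting keeps the rendered 'Referenced By' section stable across runs.
    return {slug: sorted(incoming.get(slug, ())) for slug in sorted(outgoing)}


outgoing = {
    "attention_mechanism": {"learning_rate_schedule"},
    "gradient_descent": {"attention_mechanism"},
    "learning_rate_schedule": set(),
}
backlinks = build_backlinks(outgoing)
```

Sorting both the keys and the referrer lists is a small design choice that matters for determinism: set iteration order is not something to rely on when the output has to be byte-identical between runs.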

Step 4: The Linter (and the Second Bug)

The linter is simple: it walks the output, flags broken links, and calls out pages that nobody else is pointing to. People might mistake this for an LLM “judgment” step, but it’s not. It’s a dumb, structural check with fixed rules.

My first version had a big bug. I’m mentioning it because it’s the kind of thing you’ll miss unless you write a test that actually probes the logic. The linter counted incoming links by scanning every [[link]] in the file. Here’s the problem: the Referenced By section also contains [[links]]. Those links track pages that point to the current file, not pages the file points to itself. My linter was counting those as outgoing links, which blew up the count for every page.

The result? On my 100-file test corpus, with 13 confirmed orphans, the buggy linter reported zero. It wasn’t just undercounting; it told me the wiki was perfectly linked when it wasn’t. I’d have shipped that error if I hadn’t double-checked the logic against a second source of truth.

The fix was to restrict the count to the Related section only. That’s the only place where real outgoing edges actually live.

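# Scan only the Related section; Referenced By holds backlinks, not outgoing edges.
related_text = _extract_section(text, "Related")
for match in LINK_RE.finditer(related_text):
    target_slug = _slugify(match.group(1))
    # Only count targets that are tracked in incoming_count.
    if target_slug in incoming_count:
        incoming_count[target_slug] += 1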

After that change, the linter’s orphan count matched the graph builder’s exactly. Every time. It held up regardless of the corpus size. I added a regression test for this, named after the bug, so it can never silently creep back in.

The Full Test Suite

I’ve got 17 tests, using only stdlib unittest, covering every stage plus the full end-to-end pipeline. Here’s a representative slice:

test_linter_does_not_miscount_referenced_by ... ok
test_human_notes_preserved_across_recompile ... ok
test_recompile_is_idempotent_on_compiler_owned_sections ... ok
test_deterministic_output ... ok

Ran 17 tests in 0.020s
OK

That test_linter_does_not_miscount_referenced_by is the regression test for the bug I just walked through. It’s the ugliest name in the file, but it’s the most important one.

I didn’t just pick 17 because it felt right; the structure matters. Each stage has its own isolated tests using hand-built Entity objects instead of the synthetic generator. I did this so a failure points to exactly one stage. I don’t want to spend an hour debugging a full pipeline just to find a one-line typo.

The full-pipeline tests at the bottom are different. They catch integration problems that unit tests can’t see. The idempotency test is a good example: it recompiles the same corpus twice and makes sure the output is byte-identical both times. If the rewriter had accidentally introduced any non-determinism, like a rogue timestamp, that test would have screamed at me immediately.
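One way to express a byte-identical check like that, sketched with stand-in compilers rather than the real pipeline. Everything here is hypothetical: `toy_compile` and `leaky_compile` are placeholders, and the build counter simulates the kind of run-to-run drift a timestamp would introduce.

```python
import hashlib
import itertools
import tempfile
from pathlib import Path


def snapshot(directory):
    """Map each file's relative path to the SHA-256 of its bytes."""
    root = Path(directory)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def toy_compile(notes, out_dir):
    """Stand-in compiler: one page per note, written in sorted order, no timestamps."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for slug in sorted(notes):
        page = f"# {slug}\n\n## Body\n{notes[slug]}\n"
        (out / f"{slug}.md").write_text(page, encoding="utf-8")


_build_counter = itertools.count()


def leaky_compile(notes, out_dir):
    """Same as toy_compile, plus a value that changes on every run."""
    toy_compile(notes, out_dir)
    stamp = f"build: {next(_build_counter)}\n"
    (Path(out_dir) / "index.md").write_text(stamp, encoding="utf-8")


def is_idempotent(compile_fn, notes):
    """Compile twice into the same directory and compare byte-level snapshots."""
    with tempfile.TemporaryDirectory() as tmp:
        compile_fn(notes, tmp)
        first = snapshot(tmp)
        compile_fn(notes, tmp)
        second = snapshot(tmp)
    return first == second


notes = {"gradient_descent": "Mentions Attention Mechanism.", "attention_mechanism": "Body."}
```

Hashing whole files rather than comparing parsed sections is deliberate in this sketch: it catches any drift at all, including whitespace and ordering changes that a structural comparison might forgive.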

I’d rather have 17 tests that each fail for one specific, obvious reason than one big integration test that fails and leaves me guessing which of the four stages actually broke.

The Benchmark: Two Machines, Same Numbers

I ran the full pipeline at three different scales, both in a Linux container and on my local Windows 10 machine, using the same seed to keep the source material identical.

Terminal output, raw:

Files	Extract	Graph	Rewrite	Lint	Compile total	Full pipeline	Orphans
100	22.8 ms	3.1 ms	59.4 ms	86.0 ms	85.4 ms	171.4 ms	13
1,000	261.5 ms	47.1 ms	605.5 ms	883.9 ms	914.1 ms	1,798.0 ms	133
5,000	1,398.4 ms	625.6 ms	3,446.7 ms	6,972.5 ms	5,470.6 ms	12,443.1 ms	644

FULL PIPELINE TIME BY STAGE AT 5,000 FILES (12.44s total)
   ============================================================

   extract  [==]                                    1.40s  (11%)
   graph    [=]                                      0.63s  ( 5%)
   rewrite  [=======]                                3.45s  (28%)
   lint     [==============]                         6.97s  (56%)

   ============================================================

Lint is the most expensive stage by far. At 5,000 files, it costs more than the extract, graph, and rewrite stages combined.

That surprised me. Lint has the lightest logic in the whole pipeline: it just opens each file once to regex-scan two small sections. The bottleneck isn’t the code; it’s the disk I/O. Lint hits every file with a fresh read, and Windows is significantly slower than Linux here. Windows Defender likely contributes to this, checking every file as it opens, though I haven’t verified this directly against Defender’s own logs.

The graph builder has zero disk I/O, so it is the cheapest stage at every corpus size, but its growth isn’t linear.

Going from 100 to 1,000 files, a 10x jump in data, took 15.2x the time. Jumping from 1,000 to 5,000 files (another 5x increase) took 13.3x the time. That’s the real, measurable cost of the word-indexed matcher having to check more candidate names per token as the entity list grows, and it means the scaling is clearly worse than linear.
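Those ratios come straight from the Graph column of the benchmark table, and the arithmetic is easy to recheck. The snippet below also estimates an empirical exponent k for time growing like n^k between adjacent sizes; that exponent is my own addition for illustration, not a figure from the original benchmark output.

```python
import math

# Graph-stage timings copied from the benchmark table (milliseconds).
GRAPH_MS = {100: 3.1, 1_000: 47.1, 5_000: 625.6}


def scaling_steps(timings_ms):
    """For each adjacent pair of sizes, return (small, large, size_ratio, time_ratio, k)."""
    sizes = sorted(timings_ms)
    steps = []
    for small, large in zip(sizes, sizes[1:]):
        size_ratio = large / small
        time_ratio = timings_ms[large] / timings_ms[small]
        # Exponent k in time ~ n**k between the two sizes; k == 1 would be linear.
        exponent = math.log(time_ratio) / math.log(size_ratio)
        steps.append((small, large, size_ratio, time_ratio, exponent))
    return steps


for small, large, size_ratio, time_ratio, exponent in scaling_steps(GRAPH_MS):
    print(f"{small} -> {large}: {size_ratio:.0f}x files, "
          f"{time_ratio:.1f}x time, k = {exponent:.2f}")
```

The exponent works out to roughly 1.2 for the first step and roughly 1.6 for the second, so the growth is superlinear and gets steeper with size, though it stays well below the quadratic behavior of the first version.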

The orphan counts (13, 133, 644) stayed identical across every single run, regardless of the OS. That’s not a coincidence. It’s the whole point of building this as a compiler rather than an agent: the outputs are deterministic. They don’t move. Only the wall-clock time shifts, which is just a reflection of the hardware and OS, not the algorithm.

These orphan numbers are connectivity stats from a synthetic, seeded corpus, not a measure of “quality.” I verified them independently; the graph builder and the linter calculate orphan status through completely different code paths, and they agreed at every scale I tested.

What this means for real-world use: at 5,000 notes, a full recompile takes about 12 seconds on standard Windows hardware, with zero token cost and zero network calls. If you’re operating at the scale of most personal knowledge bases (a few hundred to a thousand or so notes), a full recompile finishes in under two seconds.

Where this breaks

Unstructured or wildly inconsistent source data. My extractor handles two header styles and some optional metadata because that’s what my test corpus required. If you throw a folder of chaotic, garbled text or multi-language notes with zero structure at it, regex alone isn’t going to cut it. You’d need a significantly more sophisticated extraction layer.

Semantic linking. This is the big one. My graph builder uses exact name matching. If one note talks about “gradient descent” and another describes “the optimization step” without using the literal phrase, they won’t link. Nothing in this pipeline understands meaning. That’s the honest boundary of what a deterministic compiler can do. If I were to extend this, I’d bolt a semantic layer on as a clearly separated enhancement; I wouldn’t fold it into the core deterministic path.

The framing isn’t that LLMs are the wrong tool for building a personal wiki. It’s that they’re the wrong tool for the 90 percent of the job that’s purely mechanical, and arguably the right tool for the 10 percent that requires actually understanding what the text means rather than just matching how it’s spelled.

Closing thought

If your input is deterministic, your pipeline should be too. Otherwise, you’re just adding randomness where none existed.

That’s the argument of this whole piece. Not every knowledge management problem needs an agent loop. Sometimes, you just need a parser, a graph, and a linter that tells you the truth about your own output, including the parts where it’s wrong.

The full source code, including the generator, extractor, graph builder, rewriter, linter, benchmark harness, and all 17 tests, is at: https://github.com/Emmimal/wiki-compiler/

Resources and citations

This article references the following primary sources. Quoted material is limited to short phrases under fifteen words per source, in keeping with standard fair-use practice for commentary and technical writing; I’d encourage reading the originals directly rather than relying on my summary of them.

[1] Andrej Karpathy, original post describing LLM-driven personal knowledge bases, X, April 2026. https://x.com/karpathy/status/2039805659525644595

[2] Andrej Karpathy, “LLM Wiki” idea file (GitHub Gist), April 2026, describing the wiki-as-compiled-artifact pattern referenced throughout this piece. https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f

All code, benchmark numbers, and test results in this article are my own, generated by running the included codebase directly. No proprietary datasets, copyrighted text, or third-party code were used in building or benchmarking this system.
