Reconstructing the Table of Contents a PDF Forgot to Ship, So RAG Can Scope by Section

This is the document parsing companion in Enterprise Document Intelligence, the series that builds an enterprise RAG system from four bricks. It extends Article 5 (document parsing) on one table: toc_df, the document's section structure, which Article 5 fills from the PDF's native outline (PyMuPDF's doc.get_toc) when there is one. This part is about the case where there isn't, reconstructing that structure from what the document still shows on the page.

*Where this companion sits: it extends Article 5 (document parsing), inside Part II (the four bricks), reconstructing the table of contents when the PDF ships none. Image by author*

Open NIST FIPS 202, the SHA-3 standard (a US Government work, public domain, see the NIST copyright statement), and turn to page seven. There's a clean table of contents: section titles on the left, page numbers on the right. Now open the same file in any PDF viewer and look at the bookmarks pane. Empty. The contents page is ink on a page, not structure the machine can use. The author wrote a perfectly good table of contents, and the file shipped without exposing it.

Article 5 (document parsing) and Article 5B (the relational data model) leaned on doc.get_toc(), the PDF's native outline, to fill toc_df. It's exact when it exists. It often doesn't. Plenty of real documents (papers exported straight from LaTeX, contracts printed to PDF, government standards) carry a printed contents page but no outline. For these, toc_df comes back empty, even though the document is telling you its structure in plain sight on page seven.

That structure isn't a nicety. Retrieval scopes by section (Article 7). The chunker cuts on heading boundaries (Article 5B). Summarization walks the document section by section. Every one of those steps reads toc_df. When it's empty, retrieval falls back to scanning every page, the chunker splits on blind page breaks, and the answer loses the document's own structure. So the question this article answers is narrow and practical: when the file ships no outline but prints a contents page, how do you turn that page back into a toc_df?

One thing up front, because it's easy to conflate. This is about documents that have a contents page. A document with no contents page at all (a paper that just opens with "1. Introduction", a five-page memo, an export that stripped every heading) is a different problem. Recovering a skeleton from the body of an unstructured document is summarization, a separate intent that builds the map from the chunks rather than reading one off a page. Here we only ever read a contents page the document already has.

1. Two halves: read the entries, then find their real pages

It helps to separate two things a contents page gives you. The first is a list of sections with titles and a hierarchy: what the document is about, in what order. The second is a map from each section to where it physically starts in the file. The native outline hands you both for free. Reading a printed contents page hands you the first directly, but the second only as printed labels, which are not physical pages. The two halves have different failure modes, so the rest of this article keeps them separate: first read the entries, then align them to physical pages.

In: a PDF whose doc.get_toc() returns nothing but that prints a contents page. Out: a toc_df with the same shape Article 5B defined (level, title, start_page, end_page, breadcrumb), so everything downstream keeps working unchanged.

The contents page comes in two flavours, and they cost different amounts to read.

2. Three cases, by ascending cost

*The cascade tries each case in turn and stops at the first that yields a usable TOC. Image by author.*

Each case has a detection step and an extraction step, and falls through to the next when it fails or returns too little.

Case 1, native outline. Handled in Article 5 by build_toc_df. Free, exact, hierarchical. When it works there's nothing to do. We recap it only to set the cost baseline.
Case 2, contents page with links. No outline, but an early page lists titles as links pointing inside the file. The link target is the physical page, so this case skips the alignment problem entirely.
Case 3, contents page without links. A page that looks like a printed contents (titles, dot leaders, right-aligned page numbers) but carries no links. The page numbers it prints are labels in the document's own numbering, not physical pages, so this case needs the alignment step.

All of this lives in a module of its own, separate from the native path so Article 5 stays readable. The entry point is reconstruct_toc_df.

3. Follow the links

Case 2 is the lucky one. Some documents have no outline but do ship a clickable contents page. The NIST Cybersecurity Framework is one: page two lists every section as a link that jumps into the document. PyMuPDF exposes these links per page, and each internal link carries its target page directly.

In: the PDF (links are not in line_df, so this reader opens the file). Out: entries with a title and the physical target page, already resolved.

The detection is a density check: a page with five or more internal links is a navigation page, not a body page with the odd footnote link. The extraction joins each link's clickable rectangle back to the text under it, then strips the leaders and the trailing page label.

import fitz   # PyMuPDF

# clean() and text_under_rect() are helpers from the same module, not shown here.
def extract_toc_from_links(pdf_path, min_links=5):
    """The contents page is the page carrying the most internal links."""
    doc = fitz.open(pdf_path)
    best = []
    for page in doc:
        entries = []
        for link in page.get_links():
            if link["kind"] != fitz.LINK_GOTO:        # internal jump only
                continue
            label = clean(text_under_rect(page, link["from"]))
            if label:
                entries.append({"title": label,
                                "start_page": link["page"] + 1,  # target page, 0-based -> 1-based
                                "level": 1})
        if len(entries) >= min_links and len(entries) > len(best):
            best = entries                            # richest link page wins
    return best
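
The label-cleaning step is described above but not shown. A minimal sketch of what it might look like, assuming the text under a link rectangle arrives as a raw string; `clean_link_label` and its pattern are hypothetical stand-ins, not the module's actual helper. It removes the page label only when a dot-leader run precedes it, so a title that legitimately ends in a number is left alone.

```python
import re

# Hypothetical pattern: a run of three or more dots/ellipses (possibly spaced),
# followed by an optional printed page label, at the end of the string.
LEADER_AND_LABEL = re.compile(r"\s*[.…](?:\s*[.…]){2,}\s*\d{0,4}\s*$")


def clean_link_label(raw: str) -> str:
    """Collapse whitespace, then drop dot leaders and the trailing page label."""
    text = " ".join(raw.split())          # newlines and double spaces -> single space
    return LEADER_AND_LABEL.sub("", text).strip()


print(clean_link_label("1.1 Overview ........ 12"))
print(clean_link_label("Appendix A … … … 47"))
```

The trade-off is deliberate: an entry printed as a title and a bare number with no leaders keeps its number, which is safer for audit than silently truncating a title.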

Run it on the Framework and the recovered contents are clean:

*Every title resolved to a real page, no LLM, no guesswork. Image by author*

Put the detector's output next to the page it read and you can check it by eye. The Framework's contents page lists each section, then a List of Figures and a List of Tables; the detector recovers all three groups, titles and target pages matching line for line.

*Left, the document's own contents page; right, what the detector returns. Image by author*

This is the case to hope for. It's deterministic, it's exact, and the page mapping is solved by the document itself. The catch is that most documents that lack a native outline also lack clickable links, which takes us to the harder case.

4. Read the printed contents page, then find its real pages

Case 3 is the common one: a printed table of contents with no links behind it, a page headed "Contents" or "Table of contents", a column of titles, a column of page numbers, often joined by dot leaders. FIPS 202 has exactly this. A human reads it at a glance. Parsing it has two distinct steps, and the second is the one people skip.

4.1 Detecting and reading the contents page

First, find the contents page. The signal that actually separates a contents page from prose is dot-leader density: several lines of the shape Some title .......... 42. A keyword like "contents" raises confidence but isn't required, and on its own is a weak signal (a sentence can say "table of contents"). The reader works on line_df alone, so it's engine-agnostic.
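
As an illustration of the density signal, here is a minimal sketch. It assumes the lines are available as a plain dict of page number to line strings (a stand-in for line_df, whose columns aren't shown here), and the thresholds `min_leader_lines` and `min_ratio` are illustrative guesses rather than the module's values.

```python
import re

# A line counts as a contents entry when a leader run is followed by a short number.
LEADER_LINE = re.compile(r"[.…](?:\s*[.…]){2,}\s*\d{1,3}\s*$")


def find_contents_pages_sketch(lines_by_page, min_leader_lines=5, min_ratio=0.3):
    """Return the pages whose lines are dense in dot leaders.

    lines_by_page: {page_number: [line text, ...]} (assumed stand-in for line_df).
    """
    hits = []
    for page, lines in sorted(lines_by_page.items()):
        non_empty = [ln for ln in lines if ln.strip()]
        if not non_empty:
            continue
        leader_lines = sum(1 for ln in non_empty if LEADER_LINE.search(ln))
        # Require both an absolute count and a share of the page's lines.
        if leader_lines >= min_leader_lines and leader_lines / len(non_empty) >= min_ratio:
            hits.append(page)
    return hits


pages = {
    1: ["Some Standard", "Cover page"],
    2: ["Contents",
        "1 Introduction ........ 1",
        "2 Terms ........ 3",
        "3 Method .... 7",
        "4 Results .... 17",
        "5 Appendix ........ 19"],
    3: ["The table of contents lists every section.", "See section 2... for details."],
}
print(find_contents_pages_sketch(pages))
```

Page 3 mentions "table of contents" and even contains an ellipsis, but it has no leader-plus-number lines, so only the density test on page 2 passes.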

In: line_df. Out: entries with a title and a displayed_page, the page number as printed on the line.

import re
# "Introduction ......... 12"             "Introduction       12"
# \s* before the leaders keeps the first dot out of the captured title
DOTTED   = re.compile(r"^(.*?\S)\s*[.…](?:[.…\s]){2,}(\d{1,3})$")
TRAILING = re.compile(r"^(.{2,70}?\S)\s{2,}(\d{1,3})$")

def extract_toc_from_contents(line_df):
    entries = []
    for page in find_contents_pages(line_df):    # pages dense in dot leaders
        for line in lines_of(line_df, page):
            m = DOTTED.match(line) or TRAILING.match(line)
            if m:
                title, label = m.group(1).strip(), int(m.group(2))
                entries.append({"title": title,
                                "displayed_page": label,      # printed label
                                "level": infer_level(title)}) # "2.3.1" -> 3
    return entries
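
The level inference is only hinted at by the comment "2.3.1" -> 3. One plausible sketch, assuming levels come purely from the dotted section number at the start of the title; `infer_level_sketch` is a hypothetical name, and treating unnumbered entries as top level is an assumption of this example.

```python
import re

# Leading section number such as "2", "2.3" or "2.3.1", optionally followed by a dot.
NUMBERING = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?(?:\s|$)")


def infer_level_sketch(title: str) -> int:
    """Depth of the leading section number; unnumbered titles count as level 1."""
    m = NUMBERING.match(title)
    if not m:
        return 1                       # e.g. "Foreword", "Appendix A" (assumption)
    return m.group(1).count(".") + 1   # "2.3.1" has two dots -> level 3


for t in ("2.3.1 Padding rules", "1.2 Scope", "4. Conclusion", "Appendix A"):
    print(t, "->", infer_level_sketch(t))
```

Documents that number appendices as "A.1" or rely on indentation alone would need a richer rule than this.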

4.2 The label isn't the page

Here is the subtlety. The contents page says Introduction .... 1. Page 1 of the file is the cover, not the introduction. Front matter (a cover, a foreword and the contents page itself) sits in front, so the printed label and the physical page live in different numbering spaces. Open the file to the physical page that the label names and you land several pages early, every time.

So a printed page number is a label, and it goes into displayed_page. Mapping it to the physical start_page is a second step. The cheap version assumes one constant offset: physical = displayed + shift. To find the shift, sample a handful of titles and try every plausible offset, keeping the one under which the most titles actually appear on their shifted page.

def infer_page_shift(line_df, entries, max_shift=40):
    """Best constant offset: physical_page = displayed_label + shift."""
    page_text = {p: text_of(line_df, p) for p in pages(line_df)}
    sample = [(e["displayed_page"], norm(e["title"])) for e in entries][:20]
    best_shift, best_score = 0, -1
    for shift in range(-max_shift, max_shift + 1):
        hits = sum(1 for label, title in sample
                   if title in page_text.get(label + shift, ""))
        if hits > best_score:              # most titles land where predicted
            best_score, best_shift = hits, shift
    return best_shift

*Printed labels 1, 2, 4, 7 map to physical pages 4, 5, 7, 10 once the front-matter shift is found. Image by author*
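
To make the offset search concrete, here is a self-contained toy run on the same numbers as the figure. The page texts and titles are made up for illustration, and `best_shift` is a simplified stand-in that takes a plain dict instead of line_df.

```python
def best_shift(page_text, entries, max_shift=40):
    """page_text: {physical_page: lowercased text}; entries: [(printed_label, title)].

    Returns (shift, hits) for the offset under which most titles appear
    on page label + shift.
    """
    best, best_hits = 0, -1
    for shift in range(-max_shift, max_shift + 1):
        hits = sum(1 for label, title in entries
                   if title.lower() in page_text.get(label + shift, ""))
        if hits > best_hits:
            best, best_hits = shift, hits
    return best, best_hits


# Toy document: three pages of front matter, then the body.
page_text = {
    1: "cover",
    2: "foreword",
    3: "contents introduction 1 scope 2 method 4 results 7",
    4: "introduction this document ...",
    5: "scope the scope of ...",
    6: "scope continued",
    7: "method we proceed ...",
    8: "...",
    9: "...",
    10: "results the results ...",
}
entries = [(1, "Introduction"), (2, "Scope"), (4, "Method"), (7, "Results")]

shift, hits = best_shift(page_text, entries)
print(shift, hits)                                   # 3 4
print([label + shift for label, _ in entries])       # [4, 5, 7, 10]
```

Note that the contents page itself (page 3) contains every title, yet it can satisfy at most one entry per candidate shift, so it never outvotes the true offset here. The hit count is worth keeping: a best shift that only a minority of titles agree on is a hint that the offset is not constant.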

The same thing happens on a real document. FIPS 202 prints its contents page on physical pages 7 and 8, and its body numbering starts well after the front matter. Run the detection and the alignment on it and the inferred shift comes out at +8: the introduction the contents page calls page 1 actually starts on physical page 9.

*Eight pages of front matter, so every printed label lands eight pages later in the file. Image by author*

Side by side with the page it read, the two columns are the whole point. The label column reproduces what the contents page prints; the page column is where each section actually starts in the file.

*Left, the document's own contents page; right, what the detector returns, label and physical page. Image by author*

A constant shift covers the common case. When numbering restarts partway through (an appendix that resets to 1, inserted plates), the offset isn't constant, and the fallback is content matching: locate each title's real page by fuzzy-matching its text against the body, keeping the pages monotonically non-decreasing. align_toc_df runs the shift first and falls back to content matching, so Case 3 hands the same physical start_page downstream as Case 2.
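
The content-matching fallback can be sketched with the standard library's difflib. This is an illustration under assumptions, not the module's implementation: `align_by_content` is a hypothetical name, the body lines are passed as a dict of page to line strings, the contents pages are assumed to be excluded by the caller, and the 0.85 similarity threshold is a guess.

```python
from difflib import SequenceMatcher


def _norm(text):
    return " ".join(text.lower().split())


def align_by_content(titles, lines_by_page, min_ratio=0.85):
    """Locate each title's physical page by fuzzy line matching.

    Titles are processed in TOC order and a match may never fall before the
    previous one, which keeps the pages monotonically non-decreasing.
    Returns one page number (or None when nothing matches) per title.
    """
    pages = sorted(lines_by_page)
    floor = pages[0] if pages else 0
    result = []
    for title in titles:
        target, found = _norm(title), None
        for page in pages:
            if page < floor:
                continue                       # never go back before the last match
            best = max((SequenceMatcher(None, target, _norm(ln)).ratio()
                        for ln in lines_by_page[page]), default=0.0)
            if best >= min_ratio:
                found = page
                break
        result.append(found)
        if found is not None:
            floor = found
    return result


body = {
    3: ["1 Introduction", "This document describes ..."],
    4: ["more text"],
    5: ["2 Methods", "We proceed as follows."],
    6: ["Appendix A  Notation", "Symbols used."],
    7: ["Appendix B Examples"],
}
toc_titles = ["1 Introduction", "2 Methods", "Appendix A Notation", "Appendix B Examples"]
print(align_by_content(toc_titles, body))
```

Matching whole lines rather than whole pages keeps a heading from being confused with a passing mention of the same words in running text.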

When the printed contents page is too irregular for the patterns (a two-column layout, titles that wrap, leaders rendered as ragged whitespace), the LLM extractor takes over with a typed schema, reading the first pages and returning the same entry shape. That is a tool of last resort for this case, not the default, because a clean printed contents page is cheap to read and the LLM isn't. The LLM here still only reads the contents page; it never invents a structure for a document that has none.

5. The LLM disposes, it doesn't detect

Both detection methods are heuristics, and heuristics make mistakes: a link rectangle that swept up two titles, a contents line the patterns split wrong, a numbering that looks off. The reflex with an LLM is to hand it the whole document and ask for a TOC. That's the expensive, least auditable option. The better division of labour is the inverse: the heuristic proposes a TOC, and the LLM only checks whether it holds together.

from pydantic import BaseModel

class TocCoherenceVerdict(BaseModel):       # typed structured output
    is_coherent: bool
    issues: list[str]

SYSTEM = ("A heuristic already proposed this TOC. Do NOT detect structure. "
          "Judge only: is the numbering consistent (no unexplained skips), "
          "are the page numbers non-decreasing, does the hierarchy form a "
          "sensible tree?")

def check_toc_coherence(toc_df):
    # one line per entry: "[start_page] <indent by level>title"
    view = "\n".join(f"[{r.start_page}] {'  ' * (r.level - 1)}{r.title}"
                     for r in toc_df.itertuples())
    return llm_parse(input=[{"role": "system", "content": SYSTEM},
                            {"role": "user", "content": view}],
                     text_format=TocCoherenceVerdict, label="toc.coherence")

This is faster, cheaper, and more auditable than full-LLM extraction, and it degrades gracefully: if the LLM is unavailable, the heuristic TOC is still usable with a confidence penalty.
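
The graceful degradation can be sketched as a thin wrapper around whatever checker is plugged in. Everything here is an assumption made for illustration: the function name, the returned keys, and the confidence values (1.0 when the check passes, a 0.25 penalty when it could not run, a larger one when it fails) are not taken from the module.

```python
from types import SimpleNamespace


def apply_coherence_check(toc_rows, checker, base=1.0, penalty=0.25):
    """Run an optional coherence checker and attach a confidence to the TOC.

    checker: callable returning an object with .is_coherent and .issues;
    it may raise when the LLM is unavailable.
    """
    try:
        verdict = checker(toc_rows)
    except Exception as exc:                     # no LLM: keep the heuristic TOC
        return {"checked": False, "confidence": base - penalty,
                "issues": [f"coherence check skipped: {exc}"]}
    if verdict.is_coherent:
        return {"checked": True, "confidence": base, "issues": []}
    # Checked and found wanting: lower confidence further and keep the reasons.
    return {"checked": True, "confidence": base - 2 * penalty,
            "issues": list(verdict.issues)}


def llm_down(_rows):
    raise ConnectionError("LLM endpoint unreachable")


print(apply_coherence_check([], llm_down))
print(apply_coherence_check([], lambda _rows: SimpleNamespace(is_coherent=True, issues=[])))
```

The TOC itself is never discarded by the wrapper; the caller decides what a low confidence means downstream.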

6. One uniform toc_df, whatever fired

The point of the cascade is that downstream code never learns which case ran. Whether the TOC came from links, a printed contents page or the LLM, it leaves through the same canonicaliser and arrives as the toc_df Article 5B defined, with two columns added: displayed_page (the printed label, for audit) and source (which method fired).
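
How end_page and breadcrumb are derived is defined in Article 5B and not repeated here. As a rough sketch of one plausible rule, assuming start_page is already physical: a section ends on the page before the next entry at the same or a shallower level, and the breadcrumb is the chain of ancestor titles. `finalize_rows` is a hypothetical name and works on plain dicts rather than a DataFrame.

```python
def finalize_rows(entries, source, last_page):
    """Add end_page, breadcrumb and source to entries given in reading order.

    entries: dicts with title, level, start_page (displayed_page optional).
    last_page: physical page count of the document.
    """
    rows, trail = [], []
    for i, entry in enumerate(entries):
        end = last_page
        for nxt in entries[i + 1:]:
            if nxt["level"] <= entry["level"]:
                # The next sibling or ancestor-level entry closes this section.
                end = max(entry["start_page"], nxt["start_page"] - 1)
                break
        trail = trail[: entry["level"] - 1] + [entry["title"]]
        rows.append({"level": entry["level"],
                     "title": entry["title"],
                     "start_page": entry["start_page"],
                     "end_page": end,
                     "breadcrumb": " > ".join(trail),
                     "displayed_page": entry.get("displayed_page"),
                     "source": source})
    return rows


toy = [
    {"title": "1 Introduction", "level": 1, "start_page": 4},
    {"title": "1.1 Scope", "level": 2, "start_page": 5},
    {"title": "2 Method", "level": 1, "start_page": 7},
]
for row in finalize_rows(toy, source="links", last_page=12):
    print(row["start_page"], row["end_page"], row["breadcrumb"])
```

The `max` guard covers two sections that start on the same page, where a naive "next start minus one" would end a section before it begins.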

DETECTORS = {"links":         extract_toc_from_links,     # Case 2
             "contents_text": extract_toc_from_contents,  # Case 3
             "llm":           extract_toc_by_llm}         # hard layout

def reconstruct_toc_df(pdf_path):
    for method in ("links", "contents_text", "llm"):    # ascending cost
        # simplified call: the Case 3 reader works on line_df (section 4.1)
        entries = DETECTORS[method](pdf_path)
        if not entries:
            continue                                     # fall through
        toc_df = canonicalize(entries, source=method)   # one shape out
        if method == "contents_text":
            toc_df = align_to_physical_pages(toc_df)     # label -> page
        return toc_df
    return empty_toc_df()       # no contents page -> summarization's job

Calling it is one import and one line. The returned frame is the same toc_df Article 5B defined, plus a source column that records which case fired.

# NIST FIPS 202 prints a contents page but ships no native outline:
# Case 3 fires (contents_text), the label-to-page alignment runs, source="contents_text".

toc_df = reconstruct_toc_df("data/nist/NIST.FIPS.202.pdf")

toc_df.head()              # title, level, start_page, end_page, displayed_page, source
toc_df["source"].iloc[0]   # "links" | "contents_text" | "llm"  -- which case fired

Run it across the two worked examples and the cascade routes each to the cheapest method that works, while the caller sees one toc_df every time.

*Links for the linked contents page, text patterns for the printed one. Image by author*

7. How well does it work?

It's worth checking the reconstruction against ground truth. Take documents that do carry a native outline, hide it, run the contents-page methods, and score the result against the native TOC. scripts/eval_toc_vs_native.py does this: recall (native entries recovered), precision (reconstructed entries that are real), and the share of matched entries whose start page lands within one page of the native one.
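
The three metrics can be written down in a few lines. This is a sketch of the definitions as described above, not the contents of scripts/eval_toc_vs_native.py: it assumes entries are matched on a normalised title, and the function name and toy data are invented for the example.

```python
def _norm_title(title):
    return " ".join(title.lower().split())


def score_against_native(native, rebuilt, page_tolerance=1):
    """native, rebuilt: lists of (title, start_page) pairs.

    recall    = matched / native entries
    precision = matched / rebuilt entries
    page_ok   = share of matched entries within page_tolerance pages
    """
    native_pages = {_norm_title(t): p for t, p in native}
    rebuilt_pages = {_norm_title(t): p for t, p in rebuilt}
    matched = native_pages.keys() & rebuilt_pages.keys()
    recall = len(matched) / len(native_pages) if native_pages else 0.0
    precision = len(matched) / len(rebuilt_pages) if rebuilt_pages else 0.0
    close = sum(1 for t in matched
                if abs(native_pages[t] - rebuilt_pages[t]) <= page_tolerance)
    page_ok = close / len(matched) if matched else 0.0
    return {"recall": recall, "precision": precision, "page_ok": page_ok}


native_toc = [("1 Introduction", 9), ("2 Terms", 10), ("3 Method", 15), ("4 Results", 25)]
rebuilt_toc = [("1 introduction", 9), ("2 Terms", 11), ("3 Method", 18), ("Figure 1", 12)]
print(score_against_native(native_toc, rebuilt_toc))
```

In the toy data, three of four native entries are recovered, one rebuilt entry is spurious, and one matched entry lands three pages off, so recall and precision are both 0.75 and two of the three matches fall within tolerance.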

*The link reader is near-exact (the link target is authoritative); the text-pattern reader is softer, since reading a printed page and aligning labels is genuinely harder. Image by author*

The link case is near-exact because the link target is authoritative; the text case is softer because reading a printed page and aligning labels is genuinely harder. Notice that the link reader's recall swings with the document (86% on SP 800-30r1, 45% on SP 800-207, where many entries are not links), while its precision stays high: what it does recover, it places correctly. Neither method is magic, and the coherence check is there to catch the misses.

Conclusion

A PDF with no native outline isn't a dead end as long as it prints its own contents page. Case 1 reads the outline the file ships. Case 2 follows clickable links and gets the physical page for free. Case 3 reads the printed contents page, then does the step most people skip, mapping the printed label to the real page. The cascade tries them cheapest first and stops at the first that works, the LLM checks coherence rather than doing the detection, and everything leaves as the same toc_df. A document that prints no contents page at all is a different problem, summarization, which builds the structure from the body. Article 7 (retrieval) picks that toc_df back up to scope answers by section.

Earlier in the series: