Understanding `llms-tldr.txt` — Technical Deep-Dive

Cybermaps v3.1.4 · May 2026

1. Problem

Two files are in common use for AI agent consumption, building on the llmstxt.org proposal:

llms.txt — a compact site overview with links and a YAML sitemap. Lightweight (~20 posts) but shallow — the agent still needs to crawl each linked page to extract actual knowledge.
llms-full.txt — an exhaustive version with full descriptions for every post (~100 posts). Deep but heavy — for a site with 500 articles, this easily exceeds 2 million tokens.

Neither file solves the core problem: how does an AI agent get a complete, useful understanding of a site in a single, budget-constrained request?

For models with context windows of 128K–200K tokens (GPT-4o, Claude 3.5 Sonnet), a 2M-token file means:

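Truncation: the model sees only about the first 6-10% of the file (128K / 2M ≈ 6.4%, 200K / 2M = 10%), and less once room is reserved for instructions and the response
Attention dilution: even within the window, the model’s ability to cite accurately degrades with document length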
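The truncation figure follows from simple division. The sketch below reproduces it using the 2M-token file size and the two window sizes quoted above; the function name is illustrative and the calculation ignores prompt overhead.

```python
def visible_fraction(context_window: int, file_tokens: int) -> float:
    """Fraction of a file that fits in the window, ignoring prompt overhead."""
    return min(1.0, context_window / file_tokens)


for window in (128_000, 200_000):
    share = visible_fraction(window, 2_000_000)
    print(f"{window:,}-token window: {share:.1%} of a 2M-token file")
```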

2. Approach

llms-tldr.txt is a third file that sits between the compact overview and the exhaustive dump. Instead of listing every link or every word, it produces a token-efficient knowledge summary — a single text file that captures the site’s substantive content in a form an LLM can absorb in one pass.

What It Is Not

Not a replacement for llms.txt or llms-full.txt — both continue to exist
Not a proprietary format — it’s plain Markdown served as text/plain
Not a magic compression algorithm — it uses straightforward, explainable heuristics

What It Is

A quality-filtered, deduplicated, topic-clustered extract of the site’s most information-dense content, capped at 80,000 tokens.

3. Implementation

The Cybermaps plugin generates /llms-tldr.txt through a five-stage pipeline:

Stage 1: Pinned Content (Human-Selected)

Site owners can designate specific posts as “pinned” for AI ingestion. These appear first in the output, bypassing all automated filtering. This is for content the human operator knows is strategically important — documentation, pricing, key landing pages.
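As a rough illustration of "pinned first, bypassing filtering", the sketch below orders a list of posts. The dictionary shape, the `pinned_ids` set, and the `passes_filters` callback are all hypothetical stand-ins, not the plugin's actual data model.

```python
from typing import Callable, Iterable


def select_posts(
    posts: Iterable[dict],
    pinned_ids: set[int],
    passes_filters: Callable[[dict], bool],
) -> list[dict]:
    """Pinned posts come first and skip filtering; the rest must pass the filters."""
    posts = list(posts)
    pinned = [p for p in posts if p["id"] in pinned_ids]
    others = [
        p for p in posts
        if p["id"] not in pinned_ids and passes_filters(p)
    ]
    return pinned + others
```

A low-scoring pricing page that is pinned would therefore still appear at the top of the output, while an unpinned post with the same score would be dropped.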

Stage 2: Content Pool

Up to 100 recent posts are fetched from configurable post types. Each post is scored using calculate_semantic_score(), which evaluates:

| Metric | Weight | How It’s Measured |
| --- | --- | --- |
| Substance | High | Content length, heading count (`<h1>`–`<h4>`), media attachments |
| Structure | Medium | Outgoing link count, paragraph count |
| Entity density | Medium | Noun phrase extraction from the AI metadata snippet |

These are not “proprietary” metrics — they are straightforward heuristics anyone can implement. The scoring function is:

score = (substance_score × 2 + structure_score + entity_score) / 4

The result is clamped to [0.0, 1.0]. Posts scoring below a configurable threshold (default 0.6) are dropped.
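The formula and threshold can be written as a few lines of Python. This is a minimal sketch, assuming each sub-score has already been normalized to the range 0 to 1; the text does not say how the sub-scores themselves are computed, and the function names here are illustrative rather than the plugin's own.

```python
def semantic_score(substance: float, structure: float, entity: float) -> float:
    """Weighted mean with substance counted twice, clamped to [0.0, 1.0]."""
    raw = (substance * 2 + structure + entity) / 4
    return max(0.0, min(1.0, raw))


def passes_threshold(score: float, threshold: float = 0.6) -> bool:
    """Posts scoring below the threshold are dropped (default 0.6)."""
    return score >= threshold


# Example: strong substance, middling structure and entity density.
score = semantic_score(substance=0.9, structure=0.5, entity=0.4)
print(round(score, 3), passes_threshold(score))
```

Because substance carries half of the total weight, a post with a substance score of 0.9 clears the default threshold even when the other two sub-scores are below 0.6.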

Stage 3: Quality Filtering

Posts are excluded if they:

Consist entirely of shortcodes (no actual text content)
Have empty post content
Match demo/placeholder patterns (e.g., “Lorem ipsum”, “Sample page”, “Hello world”)
Have fewer than 50 characters of substantive content
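The four exclusion rules above could be sketched as follows. The shortcode regular expression is a crude assumption about WordPress-style `[tag ...]` syntax, and the placeholder list contains only the three examples named in the text; the plugin's real patterns may differ.

```python
import re

# Assumed shortcode shape: [tag attr="..."] or [/tag].
SHORTCODE_RE = re.compile(r"\[/?[A-Za-z][\w-]*[^\]]*\]")
PLACEHOLDER_PATTERNS = ("lorem ipsum", "sample page", "hello world")
MIN_SUBSTANTIVE_CHARS = 50


def substantive_text(content: str) -> str:
    """Strip shortcodes and HTML tags, then collapse whitespace."""
    text = SHORTCODE_RE.sub(" ", content)
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def is_excluded(title: str, content: str) -> bool:
    """Return True if the post should be dropped by the quality filter."""
    text = substantive_text(content)
    if not text:  # empty content, or nothing left once shortcodes are removed
        return True
    haystack = f"{title} {text}".lower()
    if any(pattern in haystack for pattern in PLACEHOLDER_PATTERNS):
        return True
    return len(text) < MIN_SUBSTANTIVE_CHARS
```

Plain substring matching is deliberately simple, and it has a cost: a genuine programming tutorial that mentions "hello world" would also be excluded by this sketch.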

Stage 4: Deduplication & Topic Clustering

Posts sharing two or more taxonomy terms (categories or tags) are considered to cover the same topic. In each cluster, only the highest-scoring post is retained. This prevents the output from being dominated by a single topic with many variations — e.g., 15 posts about “WordPress SEO tips” collapsing into one representative entry.

Topics are then organized by their primary taxonomy term, producing natural sections in the output.
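One way to read the clustering rule is as a greedy pass over posts in descending score order. The text does not say whether "sharing two or more terms" is applied transitively, so the sketch below takes the simpler non-transitive reading; the `Post` fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    title: str
    score: float
    terms: frozenset[str]  # categories and tags combined
    primary_term: str


def deduplicate(posts: list[Post], min_shared: int = 2) -> list[Post]:
    """Keep a post only if no higher-scoring kept post shares >= min_shared terms."""
    kept: list[Post] = []
    for post in sorted(posts, key=lambda p: p.score, reverse=True):
        if all(len(post.terms & other.terms) < min_shared for other in kept):
            kept.append(post)
    return kept


def group_by_primary_term(posts: list[Post]) -> dict[str, list[Post]]:
    """Organize surviving posts into output sections by primary taxonomy term."""
    sections: dict[str, list[Post]] = {}
    for post in posts:
        sections.setdefault(post.primary_term, []).append(post)
    return sections
```

Processing the highest-scoring posts first guarantees that whenever two posts overlap, the one that survives is the better-scoring of the pair.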

Stage 5: Token Budgeting

The final output is capped at 80,000 tokens. If the content exceeds this, a truncation message is appended:

> ⚠️ Token budget exceeded (80,000 token cap). The following topics were truncated: ...

The 80K cap ensures the output fits comfortably within a 128K context window, leaving roughly 48K tokens for the model’s own instructions and response.
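A budgeting pass along these lines might look like the sketch below. Several details are assumptions: the 4-characters-per-token estimate stands in for a real tokenizer, whole sections are skipped rather than cut mid-text, and the comma-separated topic list after the colon is a guess at what the elided part of the message contains.

```python
TOKEN_CAP = 80_000


def estimate_tokens(text: str) -> int:
    """Rough estimate of about 4 characters per token (not a real tokenizer)."""
    return (len(text) + 3) // 4


def apply_budget(sections: list[tuple[str, str]], cap: int = TOKEN_CAP) -> str:
    """Assemble (topic, markdown) sections in priority order within the cap."""
    out: list[str] = []
    used = 0
    truncated: list[str] = []
    for topic, body in sections:
        cost = estimate_tokens(body)
        if used + cost <= cap:
            out.append(body)
            used += cost
        else:
            truncated.append(topic)  # section does not fit; record its name
    if truncated:
        out.append(
            f"> ⚠️ Token budget exceeded ({cap:,} token cap). "
            f"The following topics were truncated: {', '.join(truncated)}"
        )
    return "\n\n".join(out)
```

Passing pinned content as the first section means it is the last thing to be squeezed out when the budget runs short.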

4. Output Format

# Site Name | Knowledge TL;DR
> Protocol: llms-tldr.txt v1.0
> Generated by Cybermaps WordPress Plugin

## 📌 Pinned Strategic Knowledge
> Manually prioritized resources for immediate agent ingestion.

### Post Title
**Summary:** High-density, entity-extracted snippet covering the post's core claims.
**Key Entities:** entity-one, entity-two, entity-three
**Category:** Primary Category | Intent: informational
**Score:** 0.82

---

## Topic: Primary Category Name
> 8 posts in this cluster. Top-scoring entry shown.

### Highest-Scoring Post Title
**Summary:** ...
**Key Entities:** ...
**Score:** 0.78

The format uses:

Markdown headers for structure (not narrative prose)
Bold-labelled lines (for example **Summary:** and **Score:**) for key-value pairs
Entity extraction from the AI metadata engine for keyword density
Intent labels (informational vs. transactional) from the Intent Engine
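To show how a single entry in the sample output maps onto data, here is a minimal rendering sketch. The `Entry` class and its field names are hypothetical; only the Markdown layout is taken from the sample above.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    title: str
    summary: str
    entities: list[str] = field(default_factory=list)
    category: str = ""
    intent: str = ""
    score: float = 0.0


def render_entry(e: Entry) -> str:
    """Render one post in the bold-label layout used by the sample output."""
    lines = [f"### {e.title}", f"**Summary:** {e.summary}"]
    if e.entities:
        lines.append(f"**Key Entities:** {', '.join(e.entities)}")
    if e.category:
        suffix = f" | Intent: {e.intent}" if e.intent else ""
        lines.append(f"**Category:** {e.category}{suffix}")
    lines.append(f"**Score:** {e.score:.2f}")
    return "\n".join(lines)
```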

5. Performance

The file is cached for 1 hour via WordPress transients. On save_post, the cache is invalidated and the TL;DR is regenerated on the next request. The Static File Engine writes the output to disk at ABSPATH/llms-tldr.txt so the web server serves it without touching PHP or MySQL for subsequent requests.
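The cache behaviour described here (a one-hour lifetime, invalidation on save, regeneration on the next request) can be modelled in a few lines. This is a language-neutral sketch in Python, not the plugin's PHP; the class name and the injectable clock are illustrative.

```python
import time
from typing import Callable, Optional

CACHE_TTL_SECONDS = 3600  # 1 hour, as stated in the text


class TldrCache:
    def __init__(
        self,
        generate: Callable[[], str],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._generate = generate
        self._clock = clock
        self._value: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        """Return the cached file, regenerating it if missing or expired."""
        now = self._clock()
        if self._value is None or now >= self._expires_at:
            self._value = self._generate()
            self._expires_at = now + CACHE_TTL_SECONDS
        return self._value

    def invalidate(self) -> None:
        """Call when a post is saved; the next get() regenerates the file."""
        self._value = None
```

Invalidation only clears the stored value, so a burst of post saves triggers one regeneration on the next request rather than one per save.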

Generation cost: ~100 WP_Query results + scoring per post + taxonomy lookups for deduplication. For a typical site with 100 analyzable posts, this completes in under 500ms.

6. Comparison

| Property | `llms.txt` | `llms-full.txt` | `llms-tldr.txt` |
| --- | --- | --- | --- |
| Posts covered | ~20 | ~100 | Up to 100 (deduplicated) |
| Depth per post | Title + link only | Full description | Entity-extracted snippet |
| Token budget | ~5K | Variable (up to millions) | Hard cap at 80K |
| Deduplication | None | None | Taxonomy-based clustering |
| Quality filtering | None | None | Shortcode, empty, demo filtering |
| Human curation | None | None | Pinned post support |
| Good for | Quick overview | Deep analysis of small sites | Complete understanding within budget |

7. Limitations

No semantic understanding: The quality filters measure form, not meaning. A 5,000-word post of SEO spam passes the substance check; a dense 200-word technical note might not.
Taxonomy-dependent: Deduplication requires categories and tags. Sites without taxonomy usage get no clustering benefit.
Single-site scope: The TL;DR describes one site. Cross-site knowledge graphs are not addressed.
Static snapshot: The file reflects the site at generation time. New or edited content isn’t captured until the file is next regenerated.

8. Use in Practice

AI agents should:

Fetch /llms.txt first to understand site structure
Fetch /llms-tldr.txt for deep knowledge in one request
Use /llms-full.txt only if specific, exhaustive content is needed
Use /cybermaps/v1/search for targeted queries
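The fetch order for agents can be sketched as a small helper. The `fetch` callable is a hypothetical stand-in for whatever HTTP client the agent uses, and the search endpoint is left out because its request format is not described here.

```python
from typing import Callable, Optional

# Returns the response body as text, or None if the file is unavailable.
Fetch = Callable[[str], Optional[str]]


def gather_site_context(fetch: Fetch, need_exhaustive: bool = False) -> dict[str, str]:
    """Fetch the files in the recommended order, skipping any that are missing."""
    paths = ["/llms.txt", "/llms-tldr.txt"]
    if need_exhaustive:
        paths.append("/llms-full.txt")  # only when exhaustive content is needed
    context: dict[str, str] = {}
    for path in paths:
        body = fetch(path)
        if body is not None:
            context[path] = body
    return context
```

Keeping `/llms-full.txt` behind an explicit flag reflects the guidance above: the exhaustive file is the exception, not the default.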

Site owners should:

Pin their most important content in the Discovery settings
Use categories and tags consistently for better deduplication
Write descriptive AI snippets (or let the engine extract them automatically)
Verify the output at /llms-tldr.txt on their site
