llms-tldr.txt Whitepaper
Understanding llms-tldr.txt — Technical Deep-Dive
Cybermaps v3.1.4 · May 2026
1. Problem
The llmstxt.org proposal defines two files for AI agent consumption:
- llms.txt: a compact site overview with links and a YAML sitemap. Lightweight (~20 posts) but shallow: the agent still needs to crawl each linked page to extract actual knowledge.
- llms-full.txt: an exhaustive version with full descriptions for every post (~100 posts). Deep but heavy: for a site with 500 articles, this easily exceeds 2 million tokens.
Neither file solves the core problem: how does an AI agent get a complete, useful understanding of a site in a single, budget-constrained request?
For models with context windows of 128K–200K tokens (GPT-4o, Claude 3.5 Sonnet), a 2M-token file means:
- Truncation — the model only sees the first 5-10% of the site
- Attention dilution — even within the window, the model’s ability to cite accurately degrades with document length
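As a quick check of the truncation figure, the visible share of a file is the ratio of window size to file size. The sketch below assumes the entire context window is available for the file, which overstates what a real request can use once instructions and the response are accounted for. The function name is illustrative.

```python
def visible_fraction(file_tokens: int, context_tokens: int) -> float:
    """Fraction of a file that fits in the context window, capped at 1.0."""
    return min(1.0, context_tokens / file_tokens)


FILE_TOKENS = 2_000_000  # the llms-full.txt example size used above

for window in (128_000, 200_000):
    share = visible_fraction(FILE_TOKENS, window)
    print(f"{window:>7} token window -> {share:.1%} of the file")
```

For the two window sizes named above this gives 6.4% and 10.0%, in line with the rough 5-10% range.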
2. Approach
llms-tldr.txt is a third file that sits between the compact overview and the exhaustive dump. Instead of listing every link or every word, it produces a token-efficient knowledge summary — a single text file that captures the site’s substantive content in a form an LLM can absorb in one pass.
What It Is Not
- Not a replacement for llms.txt or llms-full.txt; both continue to exist
- Not a proprietary format; it’s plain Markdown served as text/plain
- Not a magic compression algorithm; it uses straightforward, explainable heuristics
What It Is
A quality-filtered, deduplicated, topic-clustered extract of the site’s most information-dense content, capped at 80,000 tokens.
3. Implementation
The Cybermaps plugin generates /llms-tldr.txt through a five-stage pipeline:
Stage 1: Pinned Content (Human-Selected)
Site owners can designate specific posts as “pinned” for AI ingestion. These appear first in the output, bypassing all automated filtering. This is for content the human operator knows is strategically important — documentation, pricing, key landing pages.
Stage 2: Content Pool
Up to 100 recent posts are fetched from configurable post types. Each post is scored using calculate_semantic_score(), which evaluates:
| Metric | Weight | How It’s Measured |
|---|---|---|
| Substance | High | Content length, heading count (<h1>–<h4>), media attachments |
| Structure | Medium | Outgoing link count, paragraph count |
| Entity density | Medium | Noun phrase extraction from the AI metadata snippet |
These are not “proprietary” metrics — they are straightforward heuristics anyone can implement. The scoring function is:
score = (substance_score × 2 + structure_score + entity_score) / 4
Clamped to [0.0, 1.0]. Posts scoring below a configurable threshold (default 0.6) are dropped.
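The formula and threshold above can be sketched in a few lines. This is a Python illustration, not the plugin's PHP code. It assumes each sub-score has already been normalized to the range 0.0 to 1.0, which the text does not spell out, and the signature shown for calculate_semantic_score() is a guess.

```python
DEFAULT_THRESHOLD = 0.6  # default cut-off stated above


def calculate_semantic_score(substance: float, structure: float, entity: float) -> float:
    """Weighted mean with substance counted twice, clamped to [0.0, 1.0]."""
    raw = (substance * 2 + structure + entity) / 4
    return max(0.0, min(1.0, raw))


def passes_threshold(score: float, threshold: float = DEFAULT_THRESHOLD) -> bool:
    """Posts scoring below the threshold are dropped; equal or above are kept."""
    return score >= threshold


strong = calculate_semantic_score(1.0, 0.5, 0.5)   # (2.0 + 0.5 + 0.5) / 4 = 0.75
weak = calculate_semantic_score(0.75, 0.5, 0.25)   # (1.5 + 0.5 + 0.25) / 4 = 0.5625
print(strong, passes_threshold(strong))
print(weak, passes_threshold(weak))
```

Because substance carries half of the total weight, a post with a perfect substance score and middling structure and entity scores clears the default threshold, while a post that is moderately strong on all three does not.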
Stage 3: Quality Filtering
Posts are excluded if they:
- Consist entirely of shortcodes (no actual text content)
- Have empty post content
- Match demo/placeholder patterns (e.g., “Lorem ipsum”, “Sample page”, “Hello world”)
- Have fewer than 50 characters of substantive content
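The four exclusion rules above could be expressed roughly as follows. This is a hedged Python sketch. The regular expressions are simplified stand-ins for real shortcode and HTML parsing, the pattern list contains only the three examples named above, and the function names are hypothetical.

```python
import re

# Assumed pattern list: the text gives these three only as examples.
PLACEHOLDER_PATTERNS = ("lorem ipsum", "sample page", "hello world")
SHORTCODE_RE = re.compile(r"\[[^\[\]]*\]")  # rough match for [tag attr="x"] and [/tag]
HTML_TAG_RE = re.compile(r"<[^>]+>")
MIN_CHARS = 50


def substantive_text(content: str) -> str:
    """Strip shortcodes and HTML tags, then collapse whitespace."""
    text = SHORTCODE_RE.sub(" ", content)
    text = HTML_TAG_RE.sub(" ", text)
    return " ".join(text.split())


def is_excluded(title: str, content: str) -> bool:
    """Return True if the post should be dropped by the quality filter."""
    if not content.strip():
        return True  # empty post content
    text = substantive_text(content)
    if not text:
        return True  # nothing left once shortcodes and markup are removed
    haystack = f"{title} {text}".lower()
    if any(pattern in haystack for pattern in PLACEHOLDER_PATTERNS):
        return True  # demo or placeholder content
    return len(text) < MIN_CHARS  # too little substantive text


print(is_excluded("Gallery", '[gallery ids="1,2"]'))
print(is_excluded("Notes", "Short note."))
```

A plain substring match is blunt: a programming tutorial that legitimately mentions "Hello world" would be excluded by this sketch, so a real filter would likely need tighter patterns.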
Stage 4: Deduplication & Topic Clustering
Posts sharing two or more taxonomy terms (categories or tags) are considered to cover the same topic. In each cluster, only the highest-scoring post is retained. This prevents the output from being dominated by a single topic with many variations — e.g., 15 posts about “WordPress SEO tips” collapsing into one representative entry.
Topics are then organized by their primary taxonomy term, producing natural sections in the output.
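One simple way to realize this stage is a greedy pass over posts in descending score order. The Python sketch below is an assumption about how the clustering might work, not the plugin's actual algorithm. It treats the first taxonomy term of a post as its primary term and uses a hypothetical "Uncategorized" fallback, neither of which comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class Post:
    title: str
    score: float
    terms: tuple  # taxonomy terms; the first is assumed to be the primary term


def deduplicate(posts, min_shared=2):
    """Keep a post only if no higher-scoring kept post shares
    `min_shared` or more taxonomy terms with it."""
    kept = []
    for post in sorted(posts, key=lambda p: p.score, reverse=True):
        if all(len(set(post.terms) & set(k.terms)) < min_shared for k in kept):
            kept.append(post)
    return kept


def group_by_primary_term(posts):
    """Organize the surviving posts into sections keyed by primary term."""
    sections = {}
    for post in posts:
        primary = post.terms[0] if post.terms else "Uncategorized"
        sections.setdefault(primary, []).append(post)
    return sections


posts = [
    Post("SEO tips, part 1", 0.9, ("seo", "wordpress", "tips")),
    Post("SEO tips, part 2", 0.7, ("seo", "wordpress")),
    Post("Hardening logins", 0.8, ("security", "wordpress")),
]
survivors = deduplicate(posts)
print([p.title for p in survivors])
print(list(group_by_primary_term(survivors)))
```

In this example the second SEO post shares two terms with the higher-scoring first one and is dropped, while the security post shares only one term and survives. The greedy pass is not transitive: two posts that each overlap with a third, but not with each other, are handled according to score order alone.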
Stage 5: Token Budgeting
The final output is capped at 80,000 tokens. If the content exceeds this, a truncation message is appended:
> ⚠️ Token budget exceeded (80,000 token cap). The following topics were truncated: ...
The 80K cap ensures the output fits comfortably within a 128K context window, leaving room for the model’s own instructions and response.
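The budgeting step can be sketched as a running total over sections in priority order. The Python below makes two assumptions that the text does not confirm: tokens are estimated at about four characters each instead of being counted with a real tokenizer, and a section that does not fit is dropped whole and named in the notice.

```python
TOKEN_CAP = 80_000


def estimate_tokens(text: str) -> int:
    """Crude estimate of about 4 characters per token (an assumption)."""
    return (len(text) + 3) // 4


def apply_budget(sections, cap=TOKEN_CAP):
    """sections: list of (topic, text) pairs in priority order.
    Returns the assembled output, with a notice if anything was left out.
    The notice itself is not counted against the budget in this sketch."""
    parts, used, truncated = [], 0, []
    for topic, text in sections:
        cost = estimate_tokens(text)
        if used + cost <= cap:
            parts.append(text)
            used += cost
        else:
            truncated.append(topic)
    if truncated:
        parts.append(
            f"> ⚠️ Token budget exceeded ({cap:,} token cap). "
            f"The following topics were truncated: {', '.join(truncated)}"
        )
    return "\n\n".join(parts)


demo = [("Pinned", "a" * 40), ("SEO", "b" * 40), ("Security", "c" * 8)]
print(apply_budget(demo, cap=12))
```

With the default cap, the 48,000 tokens left in a 128K window remain available for the model's instructions and response.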
4. Output Format
# Site Name | Knowledge TL;DR
> Protocol: llms-tldr.txt v1.0
> Generated by Cybermaps WordPress Plugin
## 📌 Pinned Strategic Knowledge
> Manually prioritized resources for immediate agent ingestion.
### Post Title
**Summary:** High-density, entity-extracted snippet covering the post's core claims.
**Key Entities:** entity-one, entity-two, entity-three
**Category:** Primary Category | Intent: informational
**Score:** 0.82
---
## Topic: Primary Category Name
> 8 posts in this cluster. Top-scoring entry shown.
### Highest-Scoring Post Title
**Summary:** ...
**Key Entities:** ...
**Score:** 0.78
The format uses:
- Markdown headers for structure (not narrative prose)
- Bold field labels (**Summary:**, **Key Entities:**, **Category:**, **Score:**) for key-value pairs
- Entity extraction from the AI metadata engine for keyword density
- Intent labels (informational vs. transactional) from the Intent Engine
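To make the entry layout concrete, here is a small Python sketch that renders one entry in the shape shown in the sample output. The function name and parameters are hypothetical. Only the field labels and their order are taken from the sample.

```python
def render_entry(title, summary, entities, score, category=None, intent=None):
    """Render one post entry as Markdown lines in the sample's field order."""
    lines = [
        f"### {title}",
        f"**Summary:** {summary}",
        f"**Key Entities:** {', '.join(entities)}",
    ]
    if category is not None:
        label = f"**Category:** {category}"
        if intent is not None:
            label += f" | Intent: {intent}"
        lines.append(label)
    lines.append(f"**Score:** {score:.2f}")  # two decimals, as in the sample
    return "\n".join(lines)


print(render_entry(
    "Post Title",
    "High-density snippet covering the post's core claims.",
    ["entity-one", "entity-two", "entity-three"],
    0.82,
    category="Primary Category",
    intent="informational",
))
```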
5. Performance
The file is cached for 1 hour via WordPress transients. On save_post, the cache is invalidated and the TL;DR is regenerated on the next request. The Static File Engine writes the output to disk at ABSPATH/llms-tldr.txt so the web server serves it without touching PHP or MySQL for subsequent requests.
Generation cost: ~100 WP_Query results + scoring per post + taxonomy lookups for deduplication. For a typical site with 100 analyzable posts, this completes in under 500ms.
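The caching behavior described above (a one-hour lifetime, invalidation on save, lazy regeneration on the next request) follows a common pattern. The Python class below is a language-neutral sketch of that pattern under those stated assumptions. It does not use or imitate the WordPress transients API, and the class and method names are hypothetical.

```python
import time

CACHE_TTL_SECONDS = 3600  # 1 hour, as stated above


class TldrCache:
    """Holds one generated document and rebuilds it lazily when stale."""

    def __init__(self, generate, ttl=CACHE_TTL_SECONDS, clock=time.monotonic):
        self._generate = generate  # callable that builds the TL;DR text
        self._ttl = ttl
        self._clock = clock        # injectable so expiry can be tested
        self._value = None
        self._expires_at = 0.0

    def invalidate(self):
        """Call when a post is saved; the next get() regenerates."""
        self._value = None

    def get(self):
        now = self._clock()
        if self._value is None or now >= self._expires_at:
            self._value = self._generate()
            self._expires_at = now + self._ttl
        return self._value


cache = TldrCache(lambda: "# Site Name | Knowledge TL;DR")
print(cache.get())   # generated on first request
cache.invalidate()   # simulates a post being saved
print(cache.get())   # regenerated on the next request
```

Regenerating lazily keeps the save path cheap: the cost of rebuilding is paid by the first reader after a change, not by the editor who saves a post.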
6. Comparison
| Property | llms.txt | llms-full.txt | llms-tldr.txt |
|---|---|---|---|
| Posts covered | ~20 | ~100 | Up to 100 (deduplicated) |
| Depth per post | Title + link only | Full description | Entity-extracted snippet |
| Token budget | ~5K | Variable (up to millions) | Hard cap at 80K |
| Deduplication | None | None | Taxonomy-based clustering |
| Quality filtering | None | None | Shortcode, empty, demo filtering |
| Human curation | None | None | Pinned post support |
| Good for | Quick overview | Deep analysis of small sites | Complete understanding within budget |
7. Limitations
- Semantics-blind: The quality filters and scoring heuristics measure form, not meaning. A 5,000-word post of SEO spam passes the substance check; a dense 200-word technical note might not.
- Taxonomy-dependent: Deduplication requires categories and tags. Sites without taxonomy usage get no clustering benefit.
- Single-site scope: The TL;DR describes one site. Cross-site knowledge graphs are not addressed.
- Static snapshot: The file reflects the site at generation time. Real-time content isn’t captured until the next sync cycle.
8. Use in Practice
AI agents should:
- Fetch /llms.txt first to understand site structure
- Fetch /llms-tldr.txt for deep knowledge in one request
- Use /llms-full.txt only if specific, exhaustive content is needed
- Use /cybermaps/v1/search for targeted queries
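The fetch order for agents can be sketched as a small helper. This Python example uses only the standard library. The function names are hypothetical, and the search endpoint is left out because its request parameters are not specified here.

```python
from urllib.parse import urljoin
from urllib.request import urlopen


def http_get(url: str) -> str:
    """Fetch a URL and decode the body as UTF-8 text."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def load_site_knowledge(base_url, need_exhaustive=False, fetch=http_get):
    """Fetch the files in the recommended order, keyed by path.
    `fetch` is injectable so the ordering can be exercised without a network."""
    paths = ["/llms.txt", "/llms-tldr.txt"]
    if need_exhaustive:
        paths.append("/llms-full.txt")  # only when exhaustive content is needed
    return {path: fetch(urljoin(base_url, path)) for path in paths}


# Dry run with a stub fetcher that echoes the URL it was asked for.
for path, body in load_site_knowledge("https://example.com", fetch=lambda u: u).items():
    print(path, "->", body)
```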
Site owners should:
- Pin their most important content in the Discovery settings
- Use categories and tags consistently for better deduplication
- Write descriptive AI snippets (or let the engine extract them automatically)
- Verify the output at /llms-tldr.txt on their site