llms-tldr.txt Whitepaper

Understanding llms-tldr.txt — Technical Deep-Dive

Cybermaps v3.1.4 · May 2026

1. Problem

The llmstxt.org proposal defines two files for AI agent consumption:

  • llms.txt — a compact site overview with links and a YAML sitemap. Lightweight (~20 posts) but shallow — the agent still needs to crawl each linked page to extract actual knowledge.
  • llms-full.txt — an exhaustive version with full descriptions for every post (~100 posts). Deep but heavy — for a site with 500 articles, this easily exceeds 2 million tokens.

Neither file solves the core problem: how does an AI agent get a complete, useful understanding of a site in a single, budget-constrained request?

For models with context windows of 128K–200K tokens (GPT-4o, Claude 3.5 Sonnet), a 2M-token file means:

  1. Truncation: the model sees at most the first 6–10% of the file (128K / 2M ≈ 6.4%; 200K / 2M = 10%)
  2. Attention dilution: even within the window, the model’s ability to cite accurately degrades with document length
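
The truncation figure follows directly from the window and file sizes quoted above. A minimal sketch of the arithmetic (the function name is illustrative and not part of any plugin):

```python
def visible_fraction(context_window_tokens: int, file_tokens: int) -> float:
    """Upper bound on the share of a file that fits in a context window."""
    return min(1.0, context_window_tokens / file_tokens)


FILE_TOKENS = 2_000_000  # the 2M-token llms-full.txt from the example above

for window in (128_000, 200_000):
    share = visible_fraction(window, FILE_TOKENS)
    print(f"{window:>7} token window: {share:.1%} of the file")
```

This is an upper bound: the prompt and the response also consume part of the window, so the share that is actually visible is smaller.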

2. Approach

llms-tldr.txt is a third file that sits between the compact overview and the exhaustive dump. Instead of listing every link or every word, it produces a token-efficient knowledge summary — a single text file that captures the site’s substantive content in a form an LLM can absorb in one pass.

What It Is Not

  • Not a replacement for llms.txt or llms-full.txt — both continue to exist
  • Not a proprietary format — it’s plain Markdown served as text/plain
  • Not a magic compression algorithm — it uses straightforward, explainable heuristics

What It Is

A quality-filtered, deduplicated, topic-clustered extract of the site’s most information-dense content, capped at 80,000 tokens.

3. Implementation

The Cybermaps plugin generates /llms-tldr.txt through a five-stage pipeline:

Stage 1: Pinned Content (Human-Selected)

Site owners can designate specific posts as “pinned” for AI ingestion. These appear first in the output, bypassing all automated filtering. This is for content the human operator knows is strategically important — documentation, pricing, key landing pages.

Stage 2: Content Pool

Up to 100 recent posts are fetched from configurable post types. Each post is scored using calculate_semantic_score(), which evaluates:

| Metric | Weight | How It’s Measured |
|---|---|---|
| Substance | High | Content length, heading count (`<h1>` through `<h4>`), media attachments |
| Structure | Medium | Outgoing link count, paragraph count |
| Entity density | Medium | Noun phrase extraction from the AI metadata snippet |

These are not “proprietary” metrics — they are straightforward heuristics anyone can implement. The scoring function is:

score = (substance_score × 2 + structure_score + entity_score) / 4

The result is clamped to [0.0, 1.0]. Posts scoring below a configurable threshold (default 0.6) are dropped.
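
The formula can be sketched in a few lines. The function name comes from the text above, but the signature is an assumption, as is the premise that each sub-score is already normalized to the 0.0–1.0 range. The plugin itself runs on WordPress; Python is used here only for illustration.

```python
DEFAULT_THRESHOLD = 0.6  # default cut-off stated in the text


def calculate_semantic_score(substance: float, structure: float, entity: float) -> float:
    """Weighted mean with substance counted twice, clamped to [0.0, 1.0].

    Assumption: the three inputs are sub-scores already scaled to 0.0-1.0.
    """
    raw = (substance * 2 + structure + entity) / 4
    return max(0.0, min(1.0, raw))


def passes_threshold(score: float, threshold: float = DEFAULT_THRESHOLD) -> bool:
    """Posts scoring below the threshold are dropped; everything else is kept."""
    return score >= threshold


strong = calculate_semantic_score(0.9, 0.5, 0.4)  # (1.8 + 0.5 + 0.4) / 4
weak = calculate_semantic_score(0.5, 0.6, 0.7)    # (1.0 + 0.6 + 0.7) / 4
print(passes_threshold(strong), passes_threshold(weak))
```

Because substance carries half of the total weight, a post with strong structure and entity density but thin content can still fall below the default threshold, as the second example does.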

Stage 3: Quality Filtering

Posts are excluded if they:

  • Consist entirely of shortcodes (no actual text content)
  • Have empty post content
  • Match demo/placeholder patterns (e.g., “Lorem ipsum”, “Sample page”, “Hello world”)
  • Have fewer than 50 characters of substantive content
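
The four exclusion rules could be sketched as a single check. The regular expressions, the helper names, and the order in which the rules are applied are assumptions for illustration. In particular, the shortcode pattern is a simplification that strips any bracketed token starting with a letter.

```python
import re
from typing import Optional

# Simplified patterns (assumptions): real shortcode parsing is more involved.
SHORTCODE_RE = re.compile(r"\[/?[A-Za-z][\w-]*[^\]]*\]")
TAG_RE = re.compile(r"<[^>]+>")
DEMO_PATTERNS = ("lorem ipsum", "sample page", "hello world")
MIN_CHARS = 50  # minimum substantive length stated in the text


def substantive_text(content: str) -> str:
    """Strip shortcodes and HTML tags, then collapse whitespace."""
    text = SHORTCODE_RE.sub(" ", content)
    text = TAG_RE.sub(" ", text)
    return " ".join(text.split())


def exclusion_reason(title: str, content: str) -> Optional[str]:
    """Return why a post is excluded, or None if it passes all four rules."""
    if not content.strip():
        return "empty"
    text = substantive_text(content)
    if not text:
        return "shortcodes only"
    haystack = f"{title} {text}".lower()
    if any(pattern in haystack for pattern in DEMO_PATTERNS):
        return "demo content"
    if len(text) < MIN_CHARS:
        return "too short"
    return None


print(exclusion_reason("Gallery", '[gallery ids="1,2"] [contact-form-7 id="5"]'))
```

Returning a reason rather than a bare boolean makes it easier to audit why a given post is missing from the output.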

Stage 4: Deduplication & Topic Clustering

Posts sharing two or more taxonomy terms (categories or tags) are considered to cover the same topic. In each cluster, only the highest-scoring post is retained. This prevents the output from being dominated by a single topic with many variations — e.g., 15 posts about “WordPress SEO tips” collapsing into one representative entry.

Topics are then organized by their primary taxonomy term, producing natural sections in the output.
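
The text does not say how clusters are formed when overlaps chain across posts, so the sketch below makes a greedy assumption: posts are visited from highest to lowest score, and each post joins the first cluster whose top post shares at least two terms with it. Treating the first listed term as the primary term is also an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Post:
    title: str
    score: float
    terms: List[str]  # categories and tags; first entry treated as primary (assumption)


def cluster_by_terms(posts: List[Post], min_shared: int = 2) -> List[List[Post]]:
    """Greedy clustering: each cluster's first element is its highest-scoring post."""
    clusters: List[List[Post]] = []
    for post in sorted(posts, key=lambda p: p.score, reverse=True):
        for cluster in clusters:
            if len(set(post.terms) & set(cluster[0].terms)) >= min_shared:
                cluster.append(post)  # duplicate topic: counted, not emitted
                break
        else:
            clusters.append([post])  # no overlap found: starts a new cluster
    return clusters


def group_by_primary_term(clusters: List[List[Post]]) -> Dict[str, List[Post]]:
    """Keep only each cluster's top post, grouped under its primary term."""
    sections: Dict[str, List[Post]] = {}
    for cluster in clusters:
        top = cluster[0]
        primary = top.terms[0] if top.terms else "Uncategorized"
        sections.setdefault(primary, []).append(top)
    return sections


posts = [
    Post("SEO tips 2", 0.70, ["SEO", "tips"]),
    Post("SEO tips 1", 0.82, ["SEO", "WordPress", "tips"]),
    Post("Pricing", 0.75, ["Pricing", "plans"]),
    Post("SEO audit", 0.65, ["SEO", "audit"]),
]
for topic, kept in group_by_primary_term(cluster_by_terms(posts)).items():
    print(topic, [p.title for p in kept])
```

Here "SEO tips 2" collapses into "SEO tips 1" because the two share both "SEO" and "tips", while "SEO audit" shares only one term and survives as its own entry under the same section.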

Stage 5: Token Budgeting

The final output is capped at 80,000 tokens. If the content exceeds the cap, the overflow is dropped and a truncation notice listing the affected topics is appended:

> ⚠️ Token budget exceeded (80,000 token cap). The following topics were truncated: ...

The 80K cap ensures the output fits comfortably within a 128K context window, leaving roughly 48K tokens for the model’s own instructions and response.
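
A budgeting pass along these lines might look as follows. Two things here are assumptions rather than documented behavior: tokens are estimated at roughly four characters each instead of being counted with a real tokenizer, and whole topic sections are dropped from the first one that no longer fits.

```python
import math
from typing import List, Tuple

TOKEN_CAP = 80_000  # hard cap stated in the text
NOTICE = (
    "> \u26a0\ufe0f Token budget exceeded ({cap:,} token cap). "
    "The following topics were truncated: {topics}"
)


def estimate_tokens(text: str) -> int:
    """Rough heuristic (assumption): about four characters per token."""
    return math.ceil(len(text) / 4)


def apply_token_budget(
    sections: List[Tuple[str, str]], cap: int = TOKEN_CAP
) -> Tuple[str, List[str]]:
    """Keep sections in order until one overflows; list the rest in a notice.

    The notice itself is not counted against the cap in this sketch.
    """
    kept: List[str] = []
    truncated: List[str] = []
    used = 0
    for topic, text in sections:
        cost = estimate_tokens(text)
        if not truncated and used + cost <= cap:
            kept.append(text)
            used += cost
        else:
            truncated.append(topic)
    if truncated:
        kept.append(NOTICE.format(cap=cap, topics=", ".join(truncated)))
    return "\n\n".join(kept), truncated


output, dropped = apply_token_budget(
    [("Docs", "a" * 400), ("Blog", "b" * 400), ("News", "c" * 40)], cap=150
)
print(dropped)
```

Stopping at the first overflow, rather than squeezing in later sections that happen to be small, keeps the output order aligned with priority, so pinned and high-scoring content is never displaced by lower-ranked material.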

4. Output Format

# Site Name | Knowledge TL;DR
> Protocol: llms-tldr.txt v1.0
> Generated by Cybermaps WordPress Plugin

## 📌 Pinned Strategic Knowledge
> Manually prioritized resources for immediate agent ingestion.

### Post Title
**Summary:** High-density, entity-extracted snippet covering the post's core claims.
**Key Entities:** entity-one, entity-two, entity-three
**Category:** Primary Category | Intent: informational
**Score:** 0.82

---

## Topic: Primary Category Name
> 8 posts in this cluster. Top-scoring entry shown.

### Highest-Scoring Post Title
**Summary:** ...
**Key Entities:** ...
**Score:** 0.78

The format uses:

  • Markdown headers for structure (not narrative prose)
  • Bold-labelled key-value lines (**Summary:**, **Key Entities:**, **Score:**) for per-post fields
  • Entity extraction from the AI metadata engine for keyword density
  • Intent labels (informational vs. transactional) from the Intent Engine
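
To make the entry layout concrete, the sketch below renders one post in the shape of the sample output above. The function and its parameters are hypothetical; only the field labels and their order are taken from the sample.

```python
from typing import List, Optional


def render_entry(
    title: str,
    summary: str,
    entities: List[str],
    score: float,
    category: Optional[str] = None,
    intent: Optional[str] = None,
) -> str:
    """Render one post as a heading followed by bold-labelled key-value lines."""
    lines = [
        f"### {title}",
        f"**Summary:** {summary}",
        f"**Key Entities:** {', '.join(entities)}",
    ]
    if category is not None:
        label = f"**Category:** {category}"
        if intent is not None:
            label += f" | Intent: {intent}"
        lines.append(label)
    lines.append(f"**Score:** {score:.2f}")
    return "\n".join(lines)


print(render_entry(
    "Post Title",
    "High-density snippet covering the post's core claims.",
    ["entity-one", "entity-two", "entity-three"],
    0.82,
    category="Primary Category",
    intent="informational",
))
```

The category line is optional because the sample shows it on the pinned entry but not on the topic entry.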

5. Performance

The file is cached for 1 hour via WordPress transients. On save_post, the cache is invalidated and the TL;DR is regenerated on the next request. The Static File Engine writes the output to disk at ABSPATH/llms-tldr.txt so the web server serves it without touching PHP or MySQL for subsequent requests.
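
The caching behavior described above (a one-hour lifetime, invalidation on save, regeneration on the next request) can be modeled with a small in-memory cache. This is a language-neutral sketch, not the plugin's code: the class, its method names, and the injectable clock are assumptions, and the write to disk is left out.

```python
import time
from typing import Callable, Optional


class TldrCache:
    """Lazy cache with a fixed lifetime and explicit invalidation."""

    def __init__(
        self,
        generate: Callable[[], str],
        ttl_seconds: float = 3600,  # one hour, as stated in the text
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._generate = generate
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        """Return the cached output, regenerating it if missing or expired."""
        now = self._clock()
        if self._value is None or now >= self._expires_at:
            self._value = self._generate()
            self._expires_at = now + self._ttl
        return self._value

    def invalidate(self) -> None:
        """Drop the cached output; the next get() call regenerates it."""
        self._value = None


cache = TldrCache(lambda: "# Site Name | Knowledge TL;DR")
cache.get()         # first request generates the output
cache.invalidate()  # a post was saved
cache.get()         # next request regenerates
```

Invalidation only clears the stored value, so the cost of regeneration is paid by the next request rather than by the editor saving a post.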

Generation cost: ~100 WP_Query results + scoring per post + taxonomy lookups for deduplication. For a typical site with 100 analyzable posts, this completes in under 500ms.

6. Comparison

| Property | llms.txt | llms-full.txt | llms-tldr.txt |
|---|---|---|---|
| Posts covered | ~20 | ~100 | Up to 100 (deduplicated) |
| Depth per post | Title + link only | Full description | Entity-extracted snippet |
| Token budget | ~5K | Variable (up to millions) | Hard cap at 80K |
| Deduplication | None | None | Taxonomy-based clustering |
| Quality filtering | None | None | Shortcode, empty, demo filtering |
| Human curation | None | None | Pinned post support |
| Good for | Quick overview | Deep analysis of small sites | Complete understanding within budget |

7. Limitations

  • Semantics-blind: The quality filters and scoring rely on surface features such as length, headings, and links, not on meaning. A 5,000-word post of SEO spam passes the substance check; a dense 200-word technical note might not.
  • Taxonomy-dependent: Deduplication requires categories and tags. Sites without taxonomy usage get no clustering benefit.
  • Single-site scope: The TL;DR describes one site. Cross-site knowledge graphs are not addressed.
  • Static snapshot: The file reflects the site at generation time. Content published or changed afterwards isn’t captured until the file is regenerated on the next sync cycle.

8. Use in Practice

AI agents should:

  1. Fetch /llms.txt first to understand site structure
  2. Fetch /llms-tldr.txt for deep knowledge in one request
  3. Use /llms-full.txt only if specific, exhaustive content is needed
  4. Use /cybermaps/v1/search for targeted queries
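
The first three steps could be sketched as below. The function name and the injected `fetch` callable are assumptions made so the order can be exercised without network access. The search endpoint from step 4 is left out because its parameters are not described here.

```python
import urllib.request
from typing import Callable, Dict


def http_get(url: str) -> str:
    """Plain HTTP fetch using the standard library."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8")


def gather_site_knowledge(
    fetch: Callable[[str], str], base_url: str, need_exhaustive: bool = False
) -> Dict[str, str]:
    """Fetch the overview, then the TL;DR, then the full file only on request."""
    base = base_url.rstrip("/")
    docs: Dict[str, str] = {}
    for path in ("/llms.txt", "/llms-tldr.txt"):
        docs[path] = fetch(base + path)
    if need_exhaustive:
        docs["/llms-full.txt"] = fetch(base + "/llms-full.txt")
    return docs
```

A real agent would pass `http_get` as the fetcher, for example `gather_site_knowledge(http_get, "https://example.com")`, where the domain is a placeholder.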

Site owners should:

  1. Pin their most important content in the Discovery settings
  2. Use categories and tags consistently for better deduplication
  3. Write descriptive AI snippets (or let the engine extract them automatically)
  4. Verify the output at /llms-tldr.txt on their site