Why We Built llms-tldr.txt: Implementation & Logic

The llms.txt standard has been a quiet revolution. Give an AI agent one URL, and it learns your site’s structure. No crawling, no guessing, no scraping. Just a clean, machine-readable overview.

But two months into using it on real sites, we found the gap.

The Two-File Problem

llms.txt provides links and a YAML sitemap: optimized for navigation, but the agent still needs to visit each page to extract actual knowledge. It’s a map, not the territory.

llms-full.txt provides everything: full descriptions for every post. Ideal for a 10-page portfolio. For a site with 500 articles? The file can reach two million tokens. Many of today’s models have 128K–200K context windows. The file doesn’t fit. Either the model sees the first 10% and assumes it understands the whole site, or it tries to process everything and answer quality degrades: citations become fuzzy, facts blur together.

Neither file solves the core problem: how do you give an AI agent a complete, useful understanding of a large site in a single request?

What We Built

Cybermaps v3.1.4 generates a third file: /llms-tldr.txt. It sits between the compact overview and the exhaustive dump. Instead of every link or every word, it produces a token-efficient knowledge summary: quality-filtered, deduplicated, and capped at 80,000 tokens.

It’s not a compression algorithm. It’s five straightforward stages:

1. Human-Pinned Content First

Site owners can designate specific posts for priority AI ingestion. These appear at the top, bypassing all automated filtering. If your pricing page, API docs, or flagship case study is what you want AI agents to know about, you pin it.
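
The ordering rule can be sketched in a few lines. This is a minimal illustration, not the Cybermaps implementation: the `pinned` and `score` fields are hypothetical names, and sorting the remaining posts by score is an assumption.

```python
def order_for_output(posts, min_score=0.6):
    """Pinned posts come first and skip filtering; the rest must clear the threshold."""
    pinned = [p for p in posts if p.get("pinned")]
    # Non-pinned posts still go through the automated score threshold.
    rest = [p for p in posts if not p.get("pinned") and p["score"] >= min_score]
    rest.sort(key=lambda p: p["score"], reverse=True)  # assumed ordering
    return pinned + rest


posts = [
    {"title": "Old news", "score": 0.3},
    {"title": "Pricing", "score": 0.4, "pinned": True},
    {"title": "Gutenberg guide", "score": 0.84},
]
ordered = [p["title"] for p in order_for_output(posts)]
```

Here the pinned pricing page is kept even though its score is below the threshold, while the unpinned low scorer is dropped.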

2. Scoring for Substance

Up to 100 recent posts are fetched and each is scored on three dimensions:

  • Content substance: length, heading structure, media attachments
  • Structure quality: paragraph count, outgoing link count
  • Entity density: noun phrases extracted by the AI metadata engine

Posts scoring below 0.6 (configurable) are dropped. This eliminates thin content, placeholder pages, and posts that are mostly markup with no actual substance.
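
One way to picture the scoring is a weighted average of three sub-scores, each scaled to 0..1. The sketch below is illustrative only: the weights, the saturation points (800 words, 4 headings, and so on), and the `Post` fields are all assumptions, not values from Cybermaps.

```python
from dataclasses import dataclass


@dataclass
class Post:
    word_count: int
    headings: int
    media: int
    paragraphs: int
    outgoing_links: int
    entities: int  # noun phrases reported by a metadata step


def _cap(value, full_marks):
    """Scale a raw count to 0..1, saturating at `full_marks`."""
    return min(value / full_marks, 1.0)


def score(post: Post) -> float:
    substance = (_cap(post.word_count, 800) + _cap(post.headings, 4) + _cap(post.media, 2)) / 3
    structure = (_cap(post.paragraphs, 8) + _cap(post.outgoing_links, 5)) / 2
    density = _cap(post.entities, 10)
    # Hypothetical weights; they sum to 1 so the result stays in 0..1.
    return round(0.4 * substance + 0.3 * structure + 0.3 * density, 2)


THRESHOLD = 0.6  # the configurable cutoff mentioned above
thin = Post(word_count=120, headings=0, media=0, paragraphs=2, outgoing_links=0, entities=1)
solid = Post(word_count=1200, headings=5, media=2, paragraphs=12, outgoing_links=4, entities=9)
kept = [p for p in (thin, solid) if score(p) >= THRESHOLD]
```

With these made-up numbers, only the longer, well-structured post survives the cutoff.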

3. Quality Filtering

Before anything reaches the output, it passes four checks:

  • Not shortcode-only (shortcodes with no real text content around them)
  • Not empty
  • Not demo/placeholder content (“Lorem ipsum”, “Hello world”)
  • At least 50 characters of substantive text

These filters catch the content that humans skip and AI agents shouldn’t waste tokens on.
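
The four checks could look roughly like this. It is a sketch under stated assumptions: content is WordPress-style HTML with bracketed shortcodes, and the regex, the placeholder list, and the function name are illustrative rather than taken from the plugin.

```python
import re

# Matches WordPress-style shortcodes such as [gallery ids="1,2"] or [/gallery].
SHORTCODE = re.compile(r"\[/?[A-Za-z][\w-]*[^\]]*\]")
PLACEHOLDERS = ("lorem ipsum", "hello world")
MIN_CHARS = 50


def passes_quality_checks(raw: str) -> bool:
    text = SHORTCODE.sub("", raw)
    text = re.sub(r"<[^>]+>", "", text)  # drop HTML tags
    text = " ".join(text.split())        # collapse whitespace
    if not raw.strip():
        return False  # empty
    if not text:
        return False  # shortcode-only or markup-only
    if any(p in text.lower() for p in PLACEHOLDERS):
        return False  # demo/placeholder content
    return len(text) >= MIN_CHARS  # enough substantive text
```

A plain substring match on "hello world" would also reject a genuine tutorial that mentions the phrase, so a real filter probably needs something stricter than this.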

4. Deduplication by Topic

If you’ve published 15 articles about WordPress SEO, an AI agent doesn’t need all 15. It needs the best one.

Posts sharing two or more categories or tags are considered to cover the same topic. Only the highest-scoring post per topic cluster is retained. The rest are noted in a summary line: “8 posts in this cluster. Top-scoring entry shown.”
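
A greedy version of this rule is easy to sketch. The text above does not say whether "sharing" is transitive, so this example assumes the simplest reading: a post joins the first cluster whose top-scoring post shares at least two terms with it. The `terms` and `score` fields are hypothetical.

```python
def dedupe_by_topic(posts, min_shared=2):
    """Group posts by overlapping categories/tags; keep the best post per group."""
    clusters = []  # each cluster: {"top": best post, "size": number of posts}
    for post in sorted(posts, key=lambda p: p["score"], reverse=True):
        for cluster in clusters:
            if len(post["terms"] & cluster["top"]["terms"]) >= min_shared:
                cluster["size"] += 1  # covered by a higher-scoring post
                break
        else:
            clusters.append({"top": post, "size": 1})
    return clusters


posts = [
    {"title": "SEO basics", "score": 0.71, "terms": {"wordpress", "seo", "beginners"}},
    {"title": "SEO audit checklist", "score": 0.88, "terms": {"wordpress", "seo", "audits"}},
    {"title": "Custom blocks", "score": 0.84, "terms": {"wordpress", "gutenberg", "react"}},
]
clusters = dedupe_by_topic(posts)
lines = [f'{c["top"]["title"]}: {c["size"]} posts in this cluster' for c in clusters]
```

Because posts are visited in descending score order, the first post in each cluster is always its highest scorer.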

5. Token Budgeting

The output is capped at 80,000 tokens: 40% of a 200K context window (about 62% of a 128K one), leaving room for the model’s instructions and response. If content exceeds the cap, a truncation message lists the topics that were cut:

> ⚠️ Token budget exceeded (80,000 token cap). The following topics were truncated: topic-a, topic-b, topic-c
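
The budgeting step can be sketched as a running total. Two things here are assumptions: the token estimate (about four characters per token, a rough heuristic for English text, not a real tokenizer) and the choice to cut everything after the first section that does not fit.

```python
TOKEN_CAP = 80_000


def estimate_tokens(text: str) -> int:
    # Rough heuristic only; a real tokenizer would give different counts.
    return max(1, len(text) // 4)


def apply_budget(sections, cap=TOKEN_CAP):
    """sections: list of (topic, text) pairs in priority order."""
    kept, cut, used = [], [], 0
    for index, (topic, text) in enumerate(sections):
        cost = estimate_tokens(text)
        if used + cost > cap:
            # Assumed policy: this topic and all later ones are truncated.
            cut = [t for t, _ in sections[index:]]
            break
        kept.append(text)
        used += cost
    if cut:
        kept.append(
            f"> ⚠️ Token budget exceeded ({cap:,} token cap). "
            f"The following topics were truncated: {', '.join(cut)}"
        )
    return "\n\n".join(kept), used


sections = [("topic-a", "x" * 200), ("topic-b", "y" * 200), ("topic-c", "z" * 200)]
output, used = apply_budget(sections, cap=100)
```

With a cap of 100 and sections of roughly 50 tokens each, the first two topics fit and the third is listed in the truncation message.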

What the Output Looks Like

# Example Site | Knowledge TL;DR

## 📌 Pinned Strategic Knowledge
### Pricing
**Summary:** Our pricing starts at $29/month for the Starter plan...
**Key Entities:** pricing, plans, starter, professional, enterprise
**Category:** Product | Intent: transactional
**Score:** 0.91

---

## Topic: WordPress Development
> 12 posts in this cluster. Top-scoring entry shown.

### Building Custom Gutenberg Blocks in 2026
**Summary:** A practical guide to creating custom blocks using...
**Key Entities:** gutenberg, block-editor, react, registerBlockType
**Category:** Development | Intent: informational
**Score:** 0.84

The format uses Markdown headers for structure, bold labels for key-value pairs, and extracted entities in place of keyword-heavy prose. No narrative prose, no fluff.
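
An entry in this shape is simple to generate. The helper below is a hypothetical sketch that mirrors the sample output above; the function name and parameters are illustrative.

```python
def render_entry(title, summary, entities, category, intent, score):
    """Render one post as a Markdown block in the sample format shown above."""
    return "\n".join([
        f"### {title}",
        f"**Summary:** {summary}",
        f"**Key Entities:** {', '.join(entities)}",
        f"**Category:** {category} | Intent: {intent}",
        f"**Score:** {score:.2f}",
    ])


entry = render_entry(
    "Pricing",
    "Our pricing starts at $29/month for the Starter plan...",
    ["pricing", "plans", "starter", "professional", "enterprise"],
    "Product",
    "transactional",
    0.91,
)
```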

Does It Work?

On a test site with 150 published posts across 8 categories:

| Metric | llms-full.txt | llms-tldr.txt |
| --- | --- | --- |
| Token count | ~450,000 | ~28,000 |
| Unique topics covered | All (diluted) | 14 distinct clusters |
| Quality-filtered | No (includes thin content) | Yes (only scored ≥0.6) |
| Human-curated | No | 3 pinned posts at top |
| Fits in 128K window | No | Yes, with room to spare |

The TL;DR version covered every major topic on the site in under 30,000 tokens. The full version covered every post, but at roughly 450,000 tokens it has to be truncated for any model with a 128K–200K context window.

Use It

If you run Cybermaps, the file is already at https://yoursite.com/llms-tldr.txt. Enable the Discovery Hub in settings if it’s not already on. Pin your most important content. Verify the output.

If you’re building an AI agent, fetch /llms-tldr.txt after /llms.txt for deep knowledge in one request. Use /cybermaps/v1/search for targeted queries. Fall back to /llms-full.txt only if you need exhaustive coverage and have the context budget for it.
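
On the agent side, the fetch order might look like the sketch below. It uses only the Python standard library. The function names, the `need_exhaustive` and `context_budget` parameters, and the four-characters-per-token estimate are assumptions for illustration, and the search endpoint is left out because its request format is not described here.

```python
from urllib.parse import urljoin
from urllib.request import urlopen


def http_get(url, timeout=10.0):
    """Return the response body as text, or None if the request fails."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):  # network errors, HTTP errors, bad URLs
        return None


def load_site_knowledge(base_url, fetch=http_get, need_exhaustive=False, context_budget=0):
    docs = {}
    # Map first, then the knowledge summary.
    for path in ("/llms.txt", "/llms-tldr.txt"):
        body = fetch(urljoin(base_url, path))
        if body is not None:
            docs[path] = body
    # Only reach for the full dump when it is wanted and likely to fit.
    if need_exhaustive:
        full = fetch(urljoin(base_url, "/llms-full.txt"))
        if full is not None and len(full) // 4 <= context_budget:
            docs["/llms-full.txt"] = full
    return docs


# Offline usage with a stub in place of real HTTP requests.
pages = {
    "https://example.com/llms.txt": "map",
    "https://example.com/llms-tldr.txt": "summary",
    "https://example.com/llms-full.txt": "x" * 4000,
}
docs = load_site_knowledge("https://example.com", fetch=pages.get)
```

Passing `fetch` as a parameter keeps the sketch testable without network access; in real use the default `http_get` would be called instead.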

The implementation is open source. The heuristics are documented. If you want to implement llms-tldr.txt in your own platform, the technical deep-dive covers every stage.