Robots.txt Manager
Bot Registry
Cybermaps maintains a registry of 40+ known crawlers in CrawlerRegistry. Each bot is described by a BotMetadata entry:
new BotMetadata(
    name: 'GPTBot', // Human-readable name
    company: 'OpenAI', // Company behind the bot
    ua: 'GPTBot', // User-Agent substring for identification
    category: BotCategory::AI_TRAINING, // Category enum
    default: ['robots' => true, 'llm' => true], // Default permissions
    desc: 'Used by OpenAI for training data collection.' // Short description
)
Bot categories: AI Training, Search Engine, Social, Monitoring, Archive, Developer Tool, Security, SEO Tool, Other.
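To make the shape of a registry entry concrete, here is a minimal sketch of what BotMetadata and BotCategory could look like. Only the named arguments and the AI_TRAINING case come from the example above; the remaining case names, the property types, and the use of a readonly value object are assumptions, not the plugin's actual source. It requires PHP 8.1+.

```php
<?php
declare(strict_types=1);

// Hypothetical enum: only AI_TRAINING is confirmed by the example above;
// the other case names are guesses derived from the category list.
enum BotCategory: string
{
    case AI_TRAINING = 'AI Training';
    case SEARCH_ENGINE = 'Search Engine';
    case SOCIAL = 'Social';
    case MONITORING = 'Monitoring';
    case ARCHIVE = 'Archive';
    case DEVELOPER_TOOL = 'Developer Tool';
    case SECURITY = 'Security';
    case SEO_TOOL = 'SEO Tool';
    case OTHER = 'Other';
}

// Hypothetical immutable value object; parameter names mirror the
// named arguments used in the registry entry.
final class BotMetadata
{
    /** @param array{robots: bool, llm: bool} $default Default permissions */
    public function __construct(
        public readonly string $name,
        public readonly string $company,
        public readonly string $ua,
        public readonly BotCategory $category,
        public readonly array $default,
        public readonly string $desc = '',
    ) {
    }
}

$gptbot = new BotMetadata(
    name: 'GPTBot',
    company: 'OpenAI',
    ua: 'GPTBot',
    category: BotCategory::AI_TRAINING,
    default: ['robots' => true, 'llm' => true],
    desc: 'Used by OpenAI for training data collection.',
);

echo $gptbot->category->value, PHP_EOL; // AI Training
```

A backed enum keeps the human-readable label next to the case, so the same value can be shown in an admin screen without a separate lookup table.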
Bot Identification
The identify_bot() method lazily builds a hash map keyed by User-Agent substring on first access, so subsequent lookups are O(1). It is called on the template_redirect hook to determine whether the current request comes from a known crawler.
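The lazy-index idea can be sketched as follows. This is one possible approach, not the plugin's implementation: the BotIndex class is hypothetical, and splitting the User-Agent header into tokens so that each token is a constant-time array lookup is an assumption about how a substring-keyed map could be queried.

```php
<?php
declare(strict_types=1);

// Hypothetical helper illustrating a lazily built User-Agent index.
final class BotIndex
{
    /** @var array<string, string>|null Lowercased UA token => bot name */
    private ?array $index = null;

    /** @param array<string, string> $bots UA substring => bot name */
    public function __construct(private array $bots)
    {
    }

    public function identify_bot(string $userAgent): ?string
    {
        // Build the hash map once, on first access.
        if ($this->index === null) {
            $this->index = [];
            foreach ($this->bots as $token => $name) {
                $this->index[strtolower($token)] = $name;
            }
        }

        // Assumption: split the header into tokens such as "gptbot" or
        // "1.0", then probe the map once per token.
        $tokens = preg_split(
            '/[^a-z0-9._-]+/',
            strtolower($userAgent),
            -1,
            PREG_SPLIT_NO_EMPTY
        ) ?: [];

        foreach ($tokens as $token) {
            if (isset($this->index[$token])) {
                return $this->index[$token];
            }
        }

        return null; // Not a known crawler.
    }
}

$index = new BotIndex(['GPTBot' => 'GPTBot', 'Googlebot' => 'Googlebot']);
$ua = 'Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)';
echo $index->identify_bot($ua) ?? 'unknown', PHP_EOL;
```

The cost of building the map is paid only on the first call; every later request does one array probe per token instead of scanning all 40+ registry entries.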
Per-Bot Rules
Each bot can be individually configured:
- Allow/Disallow: Whether the bot can access the site at all
- LLM Access: Whether the bot appears in targeted_agents lists in ADP and LLMS manifests
- Crawl Delay: Seconds between requests (0-3600, 0 = no limit)
- Requests Per Minute (RPM): Rate limit for aggressive crawlers
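As a rough illustration of how the Allow/Disallow and Crawl Delay settings could translate into robots.txt output, the sketch below renders one group per bot. The function render_bot_group() and its parameters are hypothetical; the clamping range matches the 0-3600 limit above. Note that Crawl-delay is a non-standard directive that some crawlers ignore.

```php
<?php
declare(strict_types=1);

// Hypothetical renderer for a single per-bot robots.txt group.
function render_bot_group(string $ua, bool $allowed, int $crawlDelay = 0): string
{
    // Clamp to the documented range; 0 means "no limit".
    $delay = max(0, min(3600, $crawlDelay));

    $lines = ['User-agent: ' . $ua];
    $lines[] = $allowed ? 'Allow: /' : 'Disallow: /';

    // A delay is only meaningful for bots that may crawl at all.
    if ($allowed && $delay > 0) {
        $lines[] = 'Crawl-delay: ' . $delay;
    }

    return implode("\n", $lines) . "\n";
}

echo render_bot_group('GPTBot', false);
echo "\n";
echo render_bot_group('Bingbot', true, 10);
```

The requests-per-minute limit has no robots.txt equivalent, so it would have to be enforced at request time rather than rendered here.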
Global Settings
- Content-Signal Directives: Experimental. Adds directives like Content-Signal: opt-in to robots.txt. Disabled by default because some legacy parsers misinterpret these directives.
- Discovery Hub Block: Auto-generates a block at the bottom of robots.txt that links all sitemaps and AI endpoints:
    # Cybermaps Discovery Hub
    Sitemap: https://example.com/sitemap.xml
    Sitemap: https://example.com/sitemap-news.xml
    # AI Discovery: https://example.com/llms.txt
    # AI Manifest: https://example.com/.well-known/ai.json
- Global Crawl Delay: Default delay for bots without a per-bot override
- Takeover Mode: Replaces WordPress’s default robots.txt entirely. When disabled, Cybermaps only appends to the existing robots.txt
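The Discovery Hub block and the difference between takeover and append mode can be sketched like this. Both functions are hypothetical names introduced for illustration; how Cybermaps actually assembles its output is not shown in this document.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical builder for the Discovery Hub block.
 *
 * @param list<string>          $sitemaps    Sitemap URLs
 * @param array<string, string> $aiEndpoints Comment label => URL
 */
function build_discovery_hub(array $sitemaps, array $aiEndpoints): string
{
    $lines = ['# Cybermaps Discovery Hub'];

    foreach ($sitemaps as $url) {
        $lines[] = 'Sitemap: ' . $url;
    }

    // AI endpoints are emitted as comments, so parsers skip them.
    foreach ($aiEndpoints as $label => $url) {
        $lines[] = '# ' . $label . ': ' . $url;
    }

    return implode("\n", $lines) . "\n";
}

// Assumption: takeover returns only the generated text, while append mode
// keeps the existing robots.txt and adds the generated text after it.
function compose_robots_txt(string $existing, string $generated, bool $takeover): string
{
    if ($takeover || trim($existing) === '') {
        return $generated;
    }

    return rtrim($existing) . "\n\n" . $generated;
}

$hub = build_discovery_hub(
    ['https://example.com/sitemap.xml', 'https://example.com/sitemap-news.xml'],
    [
        'AI Discovery' => 'https://example.com/llms.txt',
        'AI Manifest' => 'https://example.com/.well-known/ai.json',
    ]
);

echo compose_robots_txt("User-agent: *\nDisallow: /wp-admin/\n", $hub, false);
```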
Caching
Robots.txt output is cached in a transient keyed by the request context. The cache is invalidated automatically when robots settings are saved.
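A minimal sketch of this caching pattern follows, using the WordPress transient functions get_transient(), set_transient(), and delete_transient(). The key prefix, the choice of scheme plus host as the "request context", the one-hour lifetime, and the key index used for invalidation are all assumptions. The in-memory stand-ins at the top exist only so the example runs outside WordPress.

```php
<?php
declare(strict_types=1);

// Stand-ins for the WordPress transient API, used only outside WordPress.
if (!function_exists('get_transient')) {
    $GLOBALS['demo_transients'] = [];
    function get_transient(string $key): mixed
    {
        return $GLOBALS['demo_transients'][$key] ?? false;
    }
    function set_transient(string $key, mixed $value, int $expiration = 0): bool
    {
        $GLOBALS['demo_transients'][$key] = $value;
        return true;
    }
    function delete_transient(string $key): bool
    {
        $existed = isset($GLOBALS['demo_transients'][$key]);
        unset($GLOBALS['demo_transients'][$key]);
        return $existed;
    }
}

const ROBOTS_KEY_INDEX = 'cybermaps_robots_keys'; // Hypothetical name.

// Assumption: the request context is the scheme plus the host.
function robots_cache_key(string $scheme, string $host): string
{
    return 'cybermaps_robots_' . md5($scheme . '://' . strtolower($host));
}

function get_robots_txt(string $scheme, string $host, callable $render): string
{
    $key = robots_cache_key($scheme, $host);
    $cached = get_transient($key);
    if (is_string($cached)) {
        return $cached; // Cache hit: skip rendering.
    }

    $output = $render();
    set_transient($key, $output, 3600);

    // Remember the key so every context can be flushed later.
    $keys = get_transient(ROBOTS_KEY_INDEX);
    $keys = is_array($keys) ? $keys : [];
    $keys[$key] = true;
    set_transient(ROBOTS_KEY_INDEX, $keys, 0);

    return $output;
}

// Call this from whatever code path saves the robots settings.
function flush_robots_cache(): void
{
    $keys = get_transient(ROBOTS_KEY_INDEX);
    foreach (is_array($keys) ? array_keys($keys) : [] as $key) {
        delete_transient((string) $key);
    }
    delete_transient(ROBOTS_KEY_INDEX);
}

echo get_robots_txt('https', 'example.com', fn (): string => "User-agent: *\nDisallow:\n");
```

Tracking the keys in an index is one way to invalidate every context at once; embedding a version number in the key and bumping it on save would be an alternative.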