Robots.txt Manager

Bot Registry

Cybermaps maintains a registry of more than 40 known crawlers in CrawlerRegistry. Each bot is described by a BotMetadata entry:

new BotMetadata(
    name:     'GPTBot',           // Human-readable name
    company:  'OpenAI',           // Company behind the bot
    ua:       'GPTBot',           // User-Agent substring for identification
    category: BotCategory::AI_TRAINING,  // Category enum
    default:  ['robots' => true, 'llm' => true],  // Default permissions
    desc:     'Used by OpenAI for training data collection.'
)

Bot categories: AI Training, Search Engine, Social, Monitoring, Archive, Developer Tool, Security, SEO Tool, Other.

Bot Identification

The identify_bot() method lazily builds a hash map keyed by User-Agent substring on first access, so subsequent lookups are O(1). It is called on the template_redirect hook to determine whether the current request comes from a known crawler.
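
The lazy index described here can be sketched roughly as follows. This is a minimal illustration, not the plugin's code: the class name BotLookupSketch, the identifyBot() method, and the step that splits the User-Agent header into tokens before probing the map are all assumptions, since the documentation does not say how substrings are matched against the hash map.

```php
<?php
declare(strict_types=1);

// Hypothetical, simplified stand-in for the registry lookup (names are assumptions).
final class BotLookupSketch
{
    /** @var array<string, string>|null lowercase UA token => bot name, built lazily */
    private ?array $index = null;

    /** @param array<string, string> $bots UA substring => bot name */
    public function __construct(private array $bots) {}

    public function identifyBot(string $userAgent): ?string
    {
        // Build the index once, on first access.
        if ($this->index === null) {
            $this->index = [];
            foreach ($this->bots as $ua => $name) {
                $this->index[strtolower($ua)] = $name;
            }
        }

        // Assumption: split the header into tokens so each check is a hash probe.
        $tokens = preg_split('/[^a-z0-9._-]+/', strtolower($userAgent), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($tokens as $token) {
            if (isset($this->index[$token])) {
                return $this->index[$token];
            }
        }
        return null; // Not a known crawler.
    }
}

$lookup = new BotLookupSketch(['GPTBot' => 'GPTBot', 'Googlebot' => 'Googlebot']);
$bot = $lookup->identifyBot('Mozilla/5.0 AppleWebKit/537.36 (compatible; GPTBot/1.2; +https://openai.com/gptbot)');
$human = $lookup->identifyBot('Mozilla/5.0 (X11; Linux x86_64) Firefox/120.0');
```

With this shape, the cost of building the index is paid only by the first request that needs it, and each later check is a handful of hash probes rather than a scan over every registered bot.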

Per-Bot Rules

Each bot can be individually configured:

Allow/Disallow: Whether the bot can access the site at all
LLM Access: Whether the bot appears in targeted_agents lists in ADP and LLMS manifests
Crawl Delay: Seconds between requests (0-3600, 0 = no delay)
Requests Per Minute (RPM): Rate limit for aggressive crawlers
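
As a rough illustration of how the Allow/Disallow and Crawl Delay settings above could translate into robots.txt output, here is a minimal sketch. The function render_bot_rule() and the array keys ua, allow, and crawl_delay are hypothetical and do not reflect the plugin's storage format. The requests-per-minute limit is left out because robots.txt has no standard directive for it and the documentation does not say how it is enforced.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical rule shape; the keys are assumptions made for this sketch.
 *
 * @param array{ua: string, allow: bool, crawl_delay: int} $rule
 */
function render_bot_rule(array $rule): string
{
    // Clamp to the documented 0-3600 second range.
    $delay = max(0, min(3600, $rule['crawl_delay']));

    $lines = ['User-agent: ' . $rule['ua']];
    // An empty Disallow permits everything; "Disallow: /" blocks the whole site.
    $lines[] = $rule['allow'] ? 'Disallow:' : 'Disallow: /';
    // A delay of 0 means no delay, so the directive is omitted.
    if ($rule['allow'] && $delay > 0) {
        $lines[] = 'Crawl-delay: ' . $delay;
    }
    return implode("\n", $lines) . "\n";
}

$blocked = render_bot_rule(['ua' => 'GPTBot', 'allow' => false, 'crawl_delay' => 0]);
$slowed  = render_bot_rule(['ua' => 'Bingbot', 'allow' => true, 'crawl_delay' => 5000]);
```

Crawl-delay is a widely seen but non-standard directive, so some crawlers ignore it.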

Global Settings

Content-Signal Directives: Experimental. Adds directives like Content-Signal: opt-in to robots.txt. Disabled by default because some legacy parsers misinterpret these directives.

Discovery Hub Block: Auto-generates a block at the bottom of robots.txt that links all sitemaps and AI endpoints:

# Cybermaps Discovery Hub
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml
# AI Discovery: https://example.com/llms.txt
# AI Manifest: https://example.com/.well-known/ai.json

Global Crawl Delay: Default delay for bots without a per-bot override
Takeover Mode: Replaces WordPress’s default robots.txt entirely. When disabled, Cybermaps only appends its output to the existing robots.txt.
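
The Discovery Hub block and the two Takeover Mode behaviours can be sketched together. The function names build_discovery_hub() and compose_robots() and their parameters are hypothetical, and the AI endpoints are written as comment lines only because the sample above shows them that way.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper that assembles the hub block.
 *
 * @param string[]              $sitemaps    Absolute sitemap URLs.
 * @param array<string, string> $aiEndpoints Label => absolute URL.
 */
function build_discovery_hub(array $sitemaps, array $aiEndpoints): string
{
    $lines = ['# Cybermaps Discovery Hub'];
    foreach ($sitemaps as $url) {
        $lines[] = 'Sitemap: ' . $url;
    }
    foreach ($aiEndpoints as $label => $url) {
        // Written as comments, mirroring the sample block in the documentation.
        $lines[] = '# ' . $label . ': ' . $url;
    }
    return implode("\n", $lines) . "\n";
}

/**
 * Assumed behaviour: takeover discards the existing output, otherwise the
 * generated text is appended after it.
 */
function compose_robots(string $existing, string $generated, bool $takeover): string
{
    if ($takeover) {
        return $generated;
    }
    return rtrim($existing, "\n") . "\n\n" . $generated;
}

$hub = build_discovery_hub(
    ['https://example.com/sitemap.xml'],
    ['AI Discovery' => 'https://example.com/llms.txt']
);
$default  = "User-agent: *\nDisallow: /wp-admin/\n";
$appended = compose_robots($default, $hub, false);
$replaced = compose_robots($default, $hub, true);
```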

Caching

Robots.txt output is cached in a transient keyed by the request context. The cache is invalidated automatically when robots settings are saved.
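
A minimal sketch of this caching pattern follows. Everything in it is an assumption for illustration: the classes TransientStoreSketch and RobotsCacheSketch, the cybermaps_robots_ key prefix, and the choice of host plus site visibility as the "request context". The in-memory store exists only so the example runs outside WordPress; inside a plugin, the core get_transient(), set_transient(), and delete_transient() functions would play that role.

```php
<?php
declare(strict_types=1);

// In-memory stand-in for the WordPress transient functions (assumption for this sketch).
final class TransientStoreSketch
{
    /** @var array<string, string> */
    private array $data = [];

    public function get(string $key): mixed
    {
        // Like get_transient(), return false on a miss.
        return $this->data[$key] ?? false;
    }

    public function set(string $key, string $value): void
    {
        $this->data[$key] = $value;
    }

    public function deleteByPrefix(string $prefix): void
    {
        foreach (array_keys($this->data) as $key) {
            if (str_starts_with($key, $prefix)) {
                unset($this->data[$key]);
            }
        }
    }
}

final class RobotsCacheSketch
{
    private const PREFIX = 'cybermaps_robots_'; // Hypothetical key prefix.

    public int $renders = 0; // Counts cache misses, for demonstration only.

    public function __construct(private TransientStoreSketch $store) {}

    public function output(string $host, bool $isPublic): string
    {
        // Assumed request context: host plus site visibility.
        $key = self::PREFIX . md5($host . '|' . ($isPublic ? '1' : '0'));

        $cached = $this->store->get($key);
        if ($cached !== false) {
            return $cached;
        }

        $this->renders++;
        $body = "User-agent: *\n" . ($isPublic ? "Disallow:\n" : "Disallow: /\n");
        $this->store->set($key, $body);
        return $body;
    }

    // Would be wired to the settings-save handler.
    public function onSettingsSaved(): void
    {
        $this->store->deleteByPrefix(self::PREFIX);
    }
}

$cache = new RobotsCacheSketch(new TransientStoreSketch());
$first  = $cache->output('example.com', true);
$second = $cache->output('example.com', true); // Served from the cache.
$cache->onSettingsSaved();                     // Invalidates every cached variant.
$third  = $cache->output('example.com', true); // Rendered again.
```

Keying by context means each variant of the output is rendered at most once between saves, and clearing by prefix removes all variants in one step.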