llms.txt vs robots.txt: What Each File Actually Controls in 2026

robots.txt has run the show for 30 years. llms.txt is the AI-era complement — not a replacement. A pragmatic guide to what each file does, why both belong at your site root, and how to write them.

Quick answer. robots.txt implements the Robots Exclusion Protocol, introduced in 1994 and standardized as RFC 9309 in 2022; it tells crawlers which pages they may or may not fetch. llms.txt is a 2024 proposal, with adoption widening through 2025 and 2026, that tells AI systems which pages are the canonical, authoritative answers to your brand’s core topics. It is a curated source list, not an access-control file. They’re complementary: robots.txt controls access; llms.txt curates authority. Both belong at your site root. Neither replaces the other.

Two files, two different jobs

| File | Year | Purpose | Format | Enforcement |
| --- | --- | --- | --- | --- |
| robots.txt | 1994 (standardized as RFC 9309 in 2022) | Tell crawlers which URLs they may fetch | Plain text directives (User-agent, Allow, Disallow, Sitemap) | Voluntary: crawlers can ignore it, but reputable ones don’t |
| llms.txt | 2024 proposal | Tell AI systems which pages are canonical authoritative answers per topic | Markdown with a curated link list per topic | Voluntary: AI systems may prioritize it but don’t rely on it exclusively |

Treating them as substitutes is the most common mistake we see in 2026 audits.

What robots.txt does

Three things:

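  1. Per-user-agent access control. “GPTBot is allowed everywhere; SemrushBot must wait 10s between requests; DotBot can’t crawl at all.” (The wait rule relies on Crawl-delay, a nonstandard directive that is not part of RFC 9309; some crawlers honor it, and Googlebot ignores it.)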
  2. Sitemap discovery. A Sitemap: directive at the bottom tells crawlers where your XML sitemap lives.
  3. Crawl-budget management. For very large sites, blocking low-value sections (faceted nav permutations, internal search results, admin paths) preserves crawl budget for important pages.
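
The first two jobs can be checked locally with Python's standard-library urllib.robotparser. This is a minimal sketch, assuming an illustrative robots.txt that mirrors the quoted example; the rules and the yourdomain.com URLs are placeholders, not a recommended policy.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only (assumption): they mirror the quoted example above.
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /

User-agent: SemrushBot
Crawl-delay: 10

User-agent: DotBot
Disallow: /

Sitemap: https://yourdomain.com/sitemap-index.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse from text, no network fetch

PAGE = "https://yourdomain.com/services/"  # placeholder URL

for agent in ("GPTBot", "SemrushBot", "DotBot"):
    print(
        agent,
        "allowed:", parser.can_fetch(agent, PAGE),
        "crawl delay:", parser.crawl_delay(agent),
    )

# site_maps() is available in Python 3.8+; it returns None if no Sitemap line exists.
print("sitemaps:", parser.site_maps())
```

Note that the parser only reports what the file asks for. Whether a given crawler honors Crawl-delay, or the file at all, is up to that crawler.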

What robots.txt does not do:

  • It doesn’t tell Google whether to index a page. Use a noindex meta tag or X-Robots-Tag header for that, and leave the page crawlable so the crawler can actually see the directive.
  • It doesn’t keep a page out of search results entirely: a blocked URL can still be indexed and shown if other sites link to it.
  • It doesn’t say anything about which pages are your best answers, only which are crawlable.

Modern robots.txt for 2026

A minimum-viable 2026 robots.txt explicitly allows AI crawlers (most cooperative AI systems honor the directive):

User-agent: *
Allow: /

User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Applebot-Extended
Allow: /

Sitemap: https://yourdomain.com/sitemap-index.xml
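
If you maintain this file across several properties, generating it from a list is less error-prone than hand-editing. The sketch below is one possible approach, not a standard tool: build_robots_txt and its parameters are hypothetical names, and the crawler list is simply the one used in the example above.

```python
# Crawler tokens taken from the example above; adjust to your own policy.
AI_CRAWLERS = [
    "GPTBot",
    "ClaudeBot",
    "PerplexityBot",
    "OAI-SearchBot",
    "Google-Extended",
    "Applebot-Extended",
]


def build_robots_txt(sitemap_url, allowed_agents=AI_CRAWLERS, blocked_agents=()):
    """Return robots.txt text with one explicit group per user agent.

    Hypothetical helper: it does not check for an agent listed as both
    allowed and blocked.
    """
    groups = ["User-agent: *\nAllow: /"]
    for agent in allowed_agents:
        groups.append(f"User-agent: {agent}\nAllow: /")
    for agent in blocked_agents:
        groups.append(f"User-agent: {agent}\nDisallow: /")
    groups.append(f"Sitemap: {sitemap_url}")
    # Blank lines between groups keep the file readable and easy to diff.
    return "\n\n".join(groups) + "\n"


print(build_robots_txt("https://yourdomain.com/sitemap-index.xml"))
```

Keeping the allow and block lists as data also makes the "decide explicitly" conversation with legal easier: the policy is two short lists instead of a file nobody reads.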

If you’re a B2B brand without explicit Allow lines for the major AI crawlers, your policy is ambiguous, and robots.txt is only half the check: some legal/compliance setups block these bots by default through CDN rules, regardless of what robots.txt says. That’s a silent invisibility problem.

What llms.txt does

llms.txt is a markdown file at your site root that lists your canonical pages by topic. Format example:

# YourBrand

> A one-paragraph description of what YourBrand is and the core authoritative topics it covers.

## Services

- [Enterprise SEO](/services/enterprise-seo/): What we do for Fortune 1000 brands
- [Technical SEO](/services/technical-seo/): The technical health audit + remediation discipline

## Methodology

- [The Agentic SEO Approach](/about/methodology/): Our operating philosophy and how AI agents support senior strategists

## Research

- [State of AI Search 2026](/research/state-of-ai-search/): Annual benchmark report

The format is short, opinionated, and curated. It’s not a sitemap — a sitemap is exhaustive. llms.txt is the 10-30 pages you’d hand someone who asked “what are the most authoritative pages on your site for the topics you cover?”
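
Because the format is plain markdown, it is easy to read programmatically. Here is a minimal sketch of a parser for the layout shown above (one title, one blockquote summary, sections of link bullets). The function name parse_llms_txt and the regex are assumptions for illustration; they cover this example's shape, not every file the proposal permits.

```python
import re

# Matches "- [Title](url): optional note" bullets, as in the example above.
LINK_RE = re.compile(
    r"^-\s*\[(?P<title>[^\]]+)\]\((?P<url>[^)]+)\)(?::\s*(?P<note>.*))?$"
)


def parse_llms_txt(text):
    """Parse an llms.txt-style document into title, summary, and sections."""
    doc = {"title": None, "summary": None, "sections": {}}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("## "):
            current = line[3:].strip()
            doc["sections"][current] = []
        elif line.startswith("# ") and doc["title"] is None:
            doc["title"] = line[2:].strip()
        elif line.startswith(">") and doc["summary"] is None:
            doc["summary"] = line[1:].strip()
        elif current is not None:
            match = LINK_RE.match(line)
            if match:
                doc["sections"][current].append(match.groupdict())
    return doc


SAMPLE = """\
# YourBrand

> A one-paragraph description of what YourBrand is.

## Services

- [Enterprise SEO](/services/enterprise-seo/): What we do for Fortune 1000 brands
- [Technical SEO](/services/technical-seo/): The technical health audit + remediation discipline

## Research

- [State of AI Search 2026](/research/state-of-ai-search/): Annual benchmark report
"""

doc = parse_llms_txt(SAMPLE)
total_links = sum(len(links) for links in doc["sections"].values())
print(doc["title"], list(doc["sections"]), total_links)
```

A count like total_links is a quick way to confirm the file has stayed a curated list and has not drifted toward a sitemap.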

Why llms.txt matters

AI systems that follow the proposal can use it as a trusted starting point when answering questions about your brand or its topics. Vendor support is mostly undocumented, so read these as possibilities, not guarantees:

  • ChatGPT’s web search may read it as part of its first-pass authority signal
  • Perplexity may use it to decide which of your pages to cite preferentially
  • Claude (Anthropic’s model behind Resocial’s strategy work) can read it when web-searching
  • Google has not publicly committed to weighting it in AI Overviews

Adoption is uneven: not every AI engine honors it yet, and the spec is still being formalized. But the cost is close to zero (a static text file) and the potential upside is better citation coverage for sites that ship it.

The “both” architecture

For Resocial-class sites in 2026:

| Layer | File | Purpose |
| --- | --- | --- |
| Access control | /robots.txt | “Yes you can crawl these. No you can’t crawl those.” |
| Discovery | /sitemap-index.xml (referenced from robots) | “Here are all my URLs.” |
| Authority curation | /llms.txt | “Here are my best 20 pages by topic.” |
| Deeper authority | /llms-full.txt (optional extended format) | “Here are my pages with full text, for AI systems that want to ingest in one fetch.” |

The combination tells AI systems:

  1. What you let them touch (robots.txt)
  2. What exists (sitemap)
  3. What’s canonical (llms.txt)
  4. The actual content (llms-full.txt for those that ingest it)

Common mistakes

  • Treating llms.txt as a sitemap. It’s not — keep it to your top 10-30 pages, curated by topic.
  • Forgetting Google-Extended in robots.txt. This is the user-agent token that controls whether Google’s AI models may use your content for training (separate from search indexing). Most brands want to allow it; some legal teams want to block it. Decide explicitly.
  • No llms.txt at all — leaving AI systems to figure out your authority signals from scratch. The 2026 floor for any serious B2B brand is having both files.
  • Blocking AI crawlers via CDN rules that override robots.txt. Check your Cloudflare / Akamai / Fastly bot management settings — they sometimes default to blocking AI bots, which silently kills citation visibility.
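
The CDN mismatch in the last bullet can be spot-checked by comparing what robots.txt permits with what the server actually returns for each bot's User-Agent. The sketch below is a rough heuristic under stated assumptions: find_silent_blocks, http_status, and the set of "blocked" status codes are hypothetical choices, the User-Agent strings are simplified to the bare token, and many bot-management products also check source IP ranges, so a clean result from your own machine does not prove the real crawler gets through.

```python
import urllib.error
import urllib.request
from urllib.robotparser import RobotFileParser

# Assumption: statuses we treat as "the network layer refused the bot".
BLOCK_STATUSES = {401, 403, 429, 503}


def http_status(url, agent):
    """Send a HEAD request with the given User-Agent and return the status."""
    req = urllib.request.Request(url, headers={"User-Agent": agent}, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def find_silent_blocks(robots_txt, url, agents, fetch_status=http_status):
    """Return (agent, status) pairs allowed by robots.txt but refused by the server."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    mismatches = []
    for agent in agents:
        if not parser.can_fetch(agent, url):
            continue  # blocked on purpose in robots.txt, so not "silent"
        status = fetch_status(url, agent)
        if status in BLOCK_STATUSES:
            mismatches.append((agent, status))
    return mismatches


# Offline demo: a fake fetcher standing in for a CDN that rejects ClaudeBot.
def fake_fetch_status(url, agent):
    return 403 if agent == "ClaudeBot" else 200


ROBOTS_TXT = "User-agent: *\nAllow: /\n"
print(find_silent_blocks(
    ROBOTS_TXT, "https://yourdomain.com/", ["GPTBot", "ClaudeBot"], fake_fetch_status
))
# prints [('ClaudeBot', 403)]
```

Passing the fetcher in as a parameter keeps the comparison logic testable without network access; swap in http_status (the default) to run it against a live site.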

What to do this week

  1. Open your robots.txt and verify explicit Allow: / lines for GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot, Google-Extended, Applebot-Extended.
  2. Verify Sitemap: directive points to your real sitemap.
  3. Create or refresh llms.txt at site root with 10-30 curated canonical pages organized by topic.
  4. Check CDN bot rules — make sure they’re not overriding robots.txt by blocking AI bots at the network layer.
  5. (Optional) generate llms-full.txt with the full text of those 10-30 pages concatenated for AI systems that ingest in one fetch.
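
For step 5, concatenation can be as simple as the sketch below. The layout it emits (a top-level title, one "##" heading per page, a "Source:" line) is an assumption made for illustration; build_llms_full is a hypothetical helper, and the page bodies are placeholder text.

```python
def build_llms_full(site_title, pages):
    """Concatenate pages into a single llms-full.txt-style document.

    pages is an iterable of (title, url, markdown_body) tuples. The output
    layout is an illustrative assumption, not a format defined by a spec.
    """
    parts = [f"# {site_title}"]
    for title, url, body in pages:
        parts.append(f"## {title}\n\nSource: {url}\n\n{body.strip()}")
    return "\n\n".join(parts) + "\n"


# Placeholder pages; in practice these would be your 10-30 curated pages.
PAGES = [
    (
        "Enterprise SEO",
        "https://yourdomain.com/services/enterprise-seo/",
        "Placeholder body text for the Enterprise SEO page.\n",
    ),
    (
        "Technical SEO",
        "https://yourdomain.com/services/technical-seo/",
        "Placeholder body text for the Technical SEO page.",
    ),
]

print(build_llms_full("YourBrand", PAGES))
```

Feeding this from the same curated list that produces llms.txt keeps the two files from drifting apart.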

The full Resocial setup of these files is part of our Technical SEO service, and the strategic framing for AI search visibility lives under the AI Search & GEO pillar. For the broader discipline context, see GEO vs SEO.

FAQs

Will robots.txt eventually be replaced by llms.txt?

Almost certainly not. They control different things — access vs curation. Both will coexist for the foreseeable future. The legacy of robots.txt is 30 years of standardization; llms.txt is solving an adjacent problem, not the same one.

Does Google honor llms.txt?

Google has not publicly committed to using llms.txt in AI Overviews, and support from other AI engines (ChatGPT, Perplexity, Claude) is not formally documented either. Treat it as a low-cost hint, not a binding directive.

What if I block GPTBot in robots.txt?

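That opts you out of OpenAI's training crawl, but it is not what controls ChatGPT search citations: OpenAI crawls for search with a separate agent, OAI-SearchBot. Block OAI-SearchBot as well and your content is unlikely to be cited by ChatGPT's web search. For most brands that's a costly mistake: [AI-referred traffic](/glossary/ai-referred-traffic/) converts at 4.4× the rate of traditional organic. The brands that block GPTBot are usually large publishers worried about training data extraction (a separate concern from web search citation).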

Can I put llms.txt at a subdirectory instead of root?

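The proposed standard expects it at site root (https://yourdomain.com/llms.txt). The proposal mentions a subpath as an option, but tools are unlikely to look there, so subdirectory placement won't be reliably discovered. In practice, treat it like robots.txt and put it at the root.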
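
To see where a tool would look, you can derive the root location from any page URL. A small sketch using the standard library; llms_txt_url is a hypothetical helper name.

```python
from urllib.parse import urljoin


def llms_txt_url(any_page_url):
    """Return the root-level llms.txt URL for the site serving any_page_url."""
    # A leading slash makes urljoin resolve against the site root,
    # discarding the page's own path and query string.
    return urljoin(any_page_url, "/llms.txt")


print(llms_txt_url("https://yourdomain.com/services/technical-seo/?ref=nav"))
# prints https://yourdomain.com/llms.txt
```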
