How to Set Up llms.txt Right (And Why Most Sites Get It Wrong)

Most llms.txt files in the wild are broken or thin. The complete guide to setting one up correctly: spec compliance, file structure, what to include, where to host it, validation, and how AI crawlers actually discover and use it.

Quick answer. llms.txt is a markdown file at the root of a website that tells AI crawlers and assistants which pages are canonical, what each page is about, and how the site’s information architecture is organized. It’s analogous to robots.txt for traditional crawlers and sitemap.xml for search engines — but designed specifically for LLM-driven discovery in 2025-2026. The spec (published at llmstxt.org) is straightforward, but most sites get it wrong in three common ways: (1) thin descriptions that don’t help AI crawlers, (2) outdated URLs that point to deleted pages, (3) missing the file entirely. This guide walks through the complete setup — file structure, content requirements, hosting, validation — using the Resocial llms.txt as a worked example.

Table of contents

  1. What llms.txt actually does
  2. The llmstxt.org spec in plain language
  3. Anatomy of a complete llms.txt
  4. What to include / what to exclude
  5. Where to host it + how AI crawlers find it
  6. Common mistakes that kill effectiveness
  7. Validation checklist
  8. FAQ

What llms.txt actually does

llms.txt provides AI crawlers with a curated, structured map of your site. It serves three specific purposes:

  1. Canonical mapping: tells AI crawlers which URL is the authoritative version of each page (analogous to a canonical tag, but at site-architecture level).
  2. Topical disambiguation: each link has a brief description, so the AI assistant decides “this page is about X” rather than guessing from the URL slug.
  3. Discovery prioritization: the structured hierarchy signals which pages are pillars (deserve more crawl attention) vs supporting content vs deep-archive content.

Compare to the existing standards:

  • robots.txt — controls what can be crawled (allow/disallow). Doesn’t describe content.
  • sitemap.xml — lists URLs with priority/changefreq metadata. Doesn’t describe content or topical structure.
  • llms.txt — describes content + topical structure + canonical mappings. The descriptive layer above sitemap.

The standard is not yet binding on any AI crawler. None of the vendors behind ChatGPT, Claude, Perplexity, or Gemini has publicly committed to honoring llms.txt. Early signals nonetheless suggest that several AI crawlers fetch the file when it is present, and the file costs very little to ship. Adopting it now is the inexpensive bet that pays off if the standard takes hold.

For the full comparison with robots.txt, see our llms.txt vs robots.txt deep-dive.

The llmstxt.org spec in plain language

The official spec is short. The full specification fits on a single page at llmstxt.org. In plain language:

Basic structure

# Site Name

> One-paragraph description of what the site is about and who runs it.

## Section Name
- [Page Title](URL): one-line description of the page

## Another Section
- [Another Page](URL): description

Format rules

  1. Markdown: the file is markdown-formatted, not plain text.
  2. H1 at top: the site name. This is the only element the spec strictly requires.
  3. Blockquote under H1: a one-paragraph description of the site. The spec treats it as optional; in practice, always include it.
  4. Optional paragraphs after blockquote: narrative context (kept brief).
  5. H2 sections group related links.
  6. Link list format: - [Title](URL): description. Each link on its own line, prefixed with -.
  7. One file at /llms.txt at site root. An optional /llms-full.txt with concatenated full content for ingestion is a widely used companion convention rather than part of the core spec.
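
The format rules above are mechanical enough to check in code. Here is a minimal sketch in Python, assuming a simplified reading of the spec: the name parse_llms_txt and the shape of the returned dictionary are our own illustration, not something llmstxt.org defines, and the regex does not handle URLs that contain parentheses.

```python
import re

# Matches "- [Title](URL)" with an optional ": description" tail.
LINK_RE = re.compile(
    r"^-\s*\[(?P<title>[^\]]+)\]\((?P<url>[^)\s]+)\)(?::\s*(?P<desc>.*))?$"
)


def parse_llms_txt(text: str) -> dict:
    """Split llms.txt text into title, summary, and H2 sections of links."""
    result = {"title": None, "summary": None, "sections": {}}
    current = None  # name of the H2 section currently being filled
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("# ") and result["title"] is None:
            result["title"] = line[2:].strip()
        elif line.startswith("> ") and current is None and result["summary"] is None:
            result["summary"] = line[2:].strip()
        elif line.startswith("## "):
            current = line[3:].strip()
            result["sections"][current] = []
        elif current is not None:
            match = LINK_RE.match(line)
            if match:
                result["sections"][current].append(
                    {
                        "title": match["title"],
                        "url": match["url"],
                        "description": (match["desc"] or "").strip(),
                    }
                )
    return result


sample = """# Example Site

> Example Site publishes practical guides on widgets.

## Guides
- [Widget Basics](https://example.com/guides/widget-basics/): Introductory guide to widgets.
"""
parsed = parse_llms_txt(sample)
print(parsed["title"], list(parsed["sections"]))
```

A parsed structure like this makes the later checks (description length, link health, URL count) simple loops over parsed["sections"].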

What it does NOT include

  • Crawler directives (use robots.txt)
  • Sitemap XML (use sitemap.xml)
  • Anything beyond markdown-formatted descriptive content

That’s it. The simplicity is the feature — it should take a small site 20 minutes to ship a complete file.

Anatomy of a complete llms.txt

The Resocial llms.txt is a worked example. The structure:

Block 1: Header and brand description

# Resocial — AI-Powered SEO Agency

> Resocial is an AI-powered SEO agency for global enterprises and scale-ups. We combine traditional SEO services with Generative Engine Optimization (GEO) for AI search engines. Our operating model — Agentic SEO — runs the full SEO workflow through a workforce of 25+ specialized AI agents coordinated by senior human strategists. Headquartered in Athens, Greece.

This block is what an AI assistant most often quotes when asked “what does Resocial do?” The blockquote answer needs to be the one-paragraph version of your value proposition — specific, named, and substantive.

Block 2: Methodology / about

## Methodology and operating model

- [Agentic SEO Operating Model](https://resocial.us/blog/agentic-seo-operating-model/): How 25+ AI agents transform modern SEO — full methodology deep-dive.
- [The Resocial Methodology](https://resocial.us/about/methodology/): Diagnose → Architect → Execute → Compound — our four-stage engagement framework.
- [AI Agents Roster](https://resocial.us/about/ai-agents/): The 25 named agents organized in 7 clusters.

For B2B brands, surface the methodology / “how we work” pages early. These are high-signal pages for AI assistants answering “how does [brand] approach [problem]” queries.

Block 3: Service / product pillars

Each major service or product category gets a section. Sub-pages listed if substantive.

Block 4: Editorial content (blog, guides)

Comprehensive guides, pillar posts, the most-cited blog content. Skip thin posts; surface depth.

Block 5: Definitional content (glossary)

If you have a glossary, surface it here. Resocial’s glossary is the most-cited content for term-definition queries.

Block 6: Contact / engagement

How to engage the brand. Three to five canonical CTAs (free audit, RFP, consultation, contact, careers).

Block 7: Company information

Optional but valuable: name, type, HQ, founded, key facts. Helps AI assistants disambiguate when there are multiple businesses with similar names.

Block 8: Citation guidance for AI assistants

Optional. Explicit guidance on how the brand prefers to be cited (e.g., “Use ‘Resocial’ not ‘Resocial.us’”; “Address format: ‘20 Arkadiou, Alimos / 17456 Athens, Greece’”). This part is unusual but powerful — most files don’t include it.

What to include / what to exclude

Include

  • ✓ Pillar pages (the comprehensive head-term guides)
  • ✓ Methodology / how-we-work pages
  • ✓ Glossary / definitional content
  • ✓ Service or product pillar pages + main sub-services
  • ✓ Most-cited blog posts (your top 10-20 by AI citation share if you track it)
  • ✓ Case studies / customer stories
  • ✓ Contact and engagement CTAs
  • ✓ Company info block (name, HQ, founded, methodology name)

Exclude

  • ✗ Pagination URLs (/blog/2/, /blog/3/) — list the index, not each page
  • ✗ Tag and filter URLs
  • ✗ Author archive pages
  • ✗ Thin landing pages
  • ✗ Internal admin / staging URLs
  • ✗ Marketing landing pages with low editorial value
  • ✗ Old / deprecated content
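
The exclusion list above can be approximated with a URL filter. This is a sketch only: the patterns below are assumptions about common URL schemes (WordPress-style /page/2/, /tag/, /author/, a staging. subdomain), and is_excluded is a hypothetical helper name. Thin or deprecated pages cannot be detected from the URL and still need a human pass.

```python
import re
from urllib.parse import urlparse

# Illustrative patterns only; adjust them to the site's real URL scheme.
EXCLUDE_PATTERNS = [
    re.compile(r"/page/\d+/?$"),                # /blog/page/2/
    re.compile(r"/blog/\d+/?$"),                # /blog/2/ (also catches /blog/2024/)
    re.compile(r"/(tag|tags|author)/"),         # tag and author archives
    re.compile(r"/(wp-admin|admin)(/|$)"),      # admin paths
]


def is_excluded(url: str) -> bool:
    """Return True when a URL looks like pagination, a filter, an archive, or admin/staging."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    if host.startswith("staging."):
        return True
    if parsed.query:
        # Assumption: query strings on this site are filters or tracking parameters.
        return True
    return any(pattern.search(parsed.path) for pattern in EXCLUDE_PATTERNS)


candidates = [
    "https://example.com/blog/",
    "https://example.com/blog/2/",
    "https://example.com/tag/seo/",
    "https://example.com/services/seo/",
]
print([url for url in candidates if not is_excluded(url)])
```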

Quality bar for the description string

A bad description: “Our SEO services page.” A good description: “Full SEO discipline — technical, on-page, link building, audits, migrations, consulting, ecommerce, enterprise — for Fortune 1000 and scale-ups.”

The description must do three jobs in ~15 words: tell the AI what the page is about, distinguish it from other pages, and hint at the audience. Vague descriptions get ignored; descriptive ones get used as the AI’s understanding of the page.
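
Word count is only a rough proxy for description quality, but it catches the worst offenders. A small sketch, assuming the 10 to 20 word range this guide uses as its bar (check_description is a hypothetical helper, and the thresholds are parameters you can change):

```python
def check_description(description: str, min_words: int = 10, max_words: int = 20) -> list[str]:
    """Return a list of problems with a link description; an empty list means it passes."""
    words = description.split()
    if not words:
        return ["missing description"]
    problems = []
    if len(words) < min_words:
        problems.append(f"too short: {len(words)} words (minimum {min_words})")
    if len(words) > max_words:
        problems.append(f"too long: {len(words)} words (maximum {max_words})")
    return problems


print(check_description("Our SEO services page."))
```

A description that passes the length check can still be vague, so treat a clean result as the start of the editorial review, not the end of it.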

Where to host it + how AI crawlers find it

Where to host

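/llms.txt at the site root. That’s https://yourdomain.com/llms.txt, on the same host as the pages it lists. The spec names the root path as the standard location (it permits a subpath as an option), so host the file at the root unless you have a specific reason not to.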

In Astro projects (like this site), drop the file into public/llms.txt and it ships verbatim at the root URL. Next.js works the same way with its public/ directory. In WordPress, place it in the web server’s document root so it is served at /llms.txt (if WordPress lives in a subdirectory, that means the directory above the installation). In static-site setups, place it at the root of the output bucket.

How AI crawlers discover it

Three discovery mechanisms in 2026:

  1. Direct fetch: crawlers that follow the standard check /llms.txt directly. No registration needed.
  2. HTML link tag: adding <link rel="alternate" type="text/markdown" href="/llms.txt"> in your site’s <head> provides a soft discovery signal for crawlers that look for linked alternates.
  3. robots.txt reference: some crawlers check robots.txt for an LLMs: https://yoursite/llms.txt line. Neither the llms.txt spec nor the robots.txt standard defines this line, and parsers that don’t recognize it skip it, but some implementations honor it.

We recommend implementing all three for maximum discoverability. The Resocial site does — see our technical SEO complete guide for the broader AI crawler accessibility setup.
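
To confirm the link tag actually made it into your rendered pages, a check like the following can run against saved HTML. It is a minimal sketch using Python’s standard html.parser; find_llms_link and AlternateLinkFinder are hypothetical names, and the match is deliberately strict about the rel and type values described above.

```python
from html.parser import HTMLParser


class AlternateLinkFinder(HTMLParser):
    """Collects href values of <link rel="alternate" type="text/markdown"> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rels = attr.get("rel", "").lower().split()
        if "alternate" in rels and attr.get("type", "").lower() == "text/markdown":
            self.hrefs.append(attr.get("href", ""))


def find_llms_link(html: str) -> list[str]:
    """Return the href of every markdown alternate link found in the HTML."""
    finder = AlternateLinkFinder()
    finder.feed(html)
    return finder.hrefs


page = '<html><head><link rel="alternate" type="text/markdown" href="/llms.txt"></head></html>'
print(find_llms_link(page))
```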

Robots.txt also needs the right allow directives

Even with llms.txt in place, crawlers won’t fetch it if robots.txt blocks them. Verify your robots.txt allows AI crawlers explicitly:

User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Applebot-Extended
Allow: /

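robots.txt permits crawling by default, so these groups matter most when a wildcard group (User-agent: * with Disallow: /) would otherwise shut the bots out. In that case the server still serves llms.txt, but a compliant bot has no permission to fetch it or to crawl the URLs it lists. Worst of both worlds.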
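
You can test the permission question offline with Python’s standard urllib.robotparser. The sketch below is illustrative: blocked_agents is a hypothetical helper, the agent list mirrors the tokens shown above, and Python’s parser is only an approximation of how each vendor’s crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

AI_USER_AGENTS = [
    "GPTBot",
    "ClaudeBot",
    "PerplexityBot",
    "Google-Extended",
    "Applebot-Extended",
]


def blocked_agents(robots_txt: str, url: str) -> list[str]:
    """Return the AI user agents that this robots.txt content forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in AI_USER_AGENTS if not parser.can_fetch(agent, url)]


robots = """User-agent: *
Disallow: /

User-agent: GPTBot
Allow: /
"""
# With a wildcard Disallow, only agents with their own Allow group get through.
print(blocked_agents(robots, "https://example.com/llms.txt"))
```

Run it against both /llms.txt and a few of the URLs the file lists; a bot needs permission for all of them.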

Common mistakes that kill effectiveness

We audit llms.txt files on every Resocial engagement that hits AI search optimization. The recurring mistakes:

Mistake 1: Empty descriptions

[Pricing](https://example.com/pricing/) with no description. AI assistants don’t have context. They guess from the URL or skip the link entirely.

Fix: Every link gets a 10-20 word description. Specific, not generic.

Mistake 2: Linking to URLs that redirect instead of final URLs

/services/ that 301s to /services/seo/. AI crawlers may or may not follow the redirect, and any that don’t never reach the page you meant to surface.

Fix: Audit every URL in the file. They should return 200 OK on direct fetch.

Mistake 3: Outdated URLs pointing to deleted pages

/blog/old-post/ that 404s. The file looks complete but is broken under inspection.

Fix: Quarterly audit. Every URL must resolve. Removed pages should be removed from the file.
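
Mistakes 2 and 3 share one fix: fetch every URL without following redirects and flag anything that is not a plain 200. A sketch using only the standard library follows. The helper names are hypothetical, and it assumes the server answers HEAD requests (some return 405, in which case switch the method to GET).

```python
import urllib.error
import urllib.request


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None stops urllib from following the redirect;
        # the 3xx response then surfaces as an HTTPError.
        return None


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status of a direct fetch, without following redirects."""
    opener = urllib.request.build_opener(_NoRedirect)
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener.open(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def classify_status(status: int) -> str:
    if status == 200:
        return "ok"
    if 300 <= status < 400:
        return "redirect"
    if status in (404, 410):
        return "gone"
    return "error"


def audit(urls, fetch=fetch_status) -> dict:
    """Map each problem URL to its verdict; URLs that return 200 are omitted."""
    report = {}
    for url in urls:
        verdict = classify_status(fetch(url))
        if verdict != "ok":
            report[url] = verdict
    return report
```

Call audit() with the list of URLs from your llms.txt. The fetch parameter exists so the logic can be exercised with a stub instead of live requests.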

Mistake 4: Listing too many pages thinly

A file with 200 links, half of which are paginated archive pages or tag pages. The signal-to-noise ratio degrades. AI crawlers may stop reading partway through.

Fix: Curated. 30-80 high-signal URLs is the right range for most sites. If you have more than 80, you’re listing too much.

Mistake 5: Missing the blockquote description

The H1 alone isn’t enough. The blockquote is what AI assistants quote verbatim. Without it, your “what is [brand]” answer is whatever the AI infers from the rest of the file.

Fix: A genuinely substantive one-paragraph description. 60-120 words.

Mistake 6: Generic headings

## Pages, ## Stuff. The section headings should be meaningful: ## Service Pillars, ## Glossary, ## Case Studies. AI assistants use these as structure cues.

Fix: Section headings that describe what’s in them.

Mistake 7: Hidden behind authentication or paywall

llms.txt that requires a login to access. AI crawlers can’t authenticate.

Fix: Public, no authentication. Same accessibility standard as robots.txt.

Mistake 8: Conflicting signals with other files

llms.txt says “see this canonical URL” but the page’s HTML canonical tag points elsewhere. A crawler that sees the conflict has no reason to trust either signal.

Fix: Make sure llms.txt, page canonical tags, and sitemap.xml agree on which URL is canonical.
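
The canonical alignment check can be scripted against each page’s HTML. This sketch assumes you already have the page source as a string; canonical_matches and CanonicalFinder are hypothetical names, and the comparison is an exact string match after resolving relative hrefs, so a trailing-slash difference counts as a mismatch.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CanonicalFinder(HTMLParser):
    """Records the href of the first <link rel="canonical"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = {name: (value or "") for name, value in attrs}
        if "canonical" in attr.get("rel", "").lower().split():
            self.canonical = attr.get("href") or None


def canonical_matches(listed_url: str, page_html: str) -> bool:
    """True when the page declares a canonical URL identical to the one in llms.txt."""
    finder = CanonicalFinder()
    finder.feed(page_html)
    if finder.canonical is None:
        return False
    # Resolve relative canonicals such as "/pricing/" against the listed URL.
    return urljoin(listed_url, finder.canonical) == listed_url


page = '<html><head><link rel="canonical" href="https://example.com/pricing/"></head></html>'
print(canonical_matches("https://example.com/pricing/", page))
```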

Validation checklist

Before shipping a new llms.txt, run through:

  • File is at /llms.txt (exact path)
  • File is publicly accessible without authentication
  • H1 line is the site name
  • Blockquote is a substantive one-paragraph description (60-120 words)
  • At least 3 H2 sections grouping related content
  • Every link returns 200 OK on direct fetch
  • Every link has a 10-20 word description
  • Total URL count is 30-80 (curated, not exhaustive)
  • No redirecting URLs, only final URLs
  • No paginated archives, tag pages, or admin URLs
  • HTML <link rel="alternate" type="text/markdown" href="/llms.txt"> in site header
  • robots.txt allows GPTBot, ClaudeBot, PerplexityBot, Google-Extended, Applebot-Extended
  • Canonical alignment: llms.txt URLs match HTML canonical tags
  • Optional: company info block at end (HQ, founded, methodology)

FAQ

Do any major AI crawlers actually use llms.txt yet?

As of 2026, none of OpenAI, Anthropic, or Google have publicly committed to using llms.txt. But early signals suggest some crawlers fetch it when it is present, and at least some smaller-scale AI tools (developer-focused LLM apps, niche assistants) already use it. The standard is in early adoption. Cost to ship: low. Value if standard takes hold: high. Worth shipping.

Should I have llms.txt AND llms-full.txt?

llms-full.txt contains the actual concatenated text of pages (not just the link list). For most sites this is overkill and a maintenance burden. llms.txt alone is sufficient for the 95% case. Only ship llms-full.txt if you have a specific reason (e.g., supporting a custom AI assistant that ingests your full content).

How often should I update llms.txt?

Quarterly. Audit links for 404s, refresh descriptions if business changes, add new pillar content. If the site changes faster than quarterly (new blog posts every week), that’s fine — but the FILE shouldn’t list every blog post. List the blog INDEX with a description, and let crawlers discover individual posts from the index.

Does llms.txt conflict with my sitemap?

No. They serve different purposes. Sitemap = exhaustive URL list with technical metadata for search engines. llms.txt = curated, described URL list with topical structure for AI crawlers. Both can ship in parallel.

How does this fit into AI search optimization more broadly?

llms.txt is one of several layers in an AI search optimization program. The others: schema markup completeness, definitional content patterns (Quick Answer Blocks), entity authority (Wikidata, Wikipedia, sameAs schema), and earned third-party citations. Our AI Search Optimization Complete Guide covers the full stack.

Can I generate llms.txt automatically from my sitemap?

You can — but the auto-generated version is rarely good. The descriptions need editorial judgment per page. The curation (what to include / exclude) needs editorial judgment. Auto-generated files routinely fall into the thin-description trap. We recommend manual or semi-manual generation.
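
One way to do the semi-manual version is to let a script produce a skeleton and leave every judgment call as an explicit TODO. A sketch, assuming a standard sitemaps.org urlset (not a sitemap index); draft_from_sitemap is a hypothetical helper and the placeholder text is ours:

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def draft_from_sitemap(sitemap_xml: str, site_name: str) -> str:
    """Build an llms.txt skeleton from sitemap XML, with TODO markers for the editorial work."""
    root = ET.fromstring(sitemap_xml)
    urls = [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]
    lines = [
        f"# {site_name}",
        "",
        "> TODO: write the site description by hand.",
        "",
        "## Unsorted (curate before publishing)",
    ]
    for url in urls:
        lines.append(f"- [TODO title]({url}): TODO description")
    return "\n".join(lines) + "\n"


sitemap = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/</loc></url>"
    "<url><loc>https://example.com/pricing/</loc></url>"
    "</urlset>"
)
print(draft_from_sitemap(sitemap, "Example Site"))
```

The skeleton is deliberately unpublishable as generated: the TODO markers force the curation and description pass that auto-generated files skip.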


What to do next

If your site doesn’t have llms.txt yet, the 30-minute first action is to ship a minimal version: H1 with site name, blockquote with brand description, 3-5 sections listing your most important 20-30 pages with one-line descriptions each. Better minimal than absent. Iterate from there.
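
If you would rather keep the source of truth in code or a CMS export, the minimal version can be rendered from a small data structure. This is a sketch under that assumption: render_llms_txt is a hypothetical helper, and the site, URL, and description below are placeholders.

```python
def render_llms_txt(site_name: str, summary: str, sections: dict) -> str:
    """Render a minimal llms.txt: H1, blockquote, then H2 sections of described links."""
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        for title, url, description in links:
            lines.append(f"- [{title}]({url}): {description}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


minimal = render_llms_txt(
    "Example Site",
    "Example Site publishes practical widget guides for first-time buyers.",
    {
        "Guides": [
            (
                "Widget Basics",
                "https://example.com/guides/widget-basics/",
                "Introductory guide to choosing, sizing, and installing widgets for first-time buyers.",
            )
        ]
    },
)
print(minimal)
```

Write the returned string to public/llms.txt (or wherever your stack serves root files from) as part of the build.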

If you’d like Petros and Yuki to build a properly-architected llms.txt as part of a broader AI search optimization engagement, book a consultation or explore our AI Search & GEO services. The full implementation typically takes 4-8 hours including the curation pass; it’s one of the cheapest high-leverage investments in the AI search stack.

The standard isn’t yet binding. The brands that ship it now are betting on a future where it matters. Given the cost (a few hours at most) and the upside (citation share if and when adoption hits), the bet asymmetry is obvious.
