
Reverse-Engineering 100+ LLM Citations: The 12 Structural Features Every Cited Page Shares (2026)

We synthesized the public reverse-engineering research on LLM citations — Profound (4B citations), Semrush (248K Reddit URLs + 89K LinkedIn + 26K Quora), 5W (680M citations), Discovered Labs content-type analysis — into one framework. Twelve structural features recur across cited pages. This is the canonical reference for what AI search engines actually extract.

Quick answer. Across the published reverse-engineering research on LLM citations (Profound’s 4 billion citation analysis, Semrush’s three studies covering a combined 363,000 URLs across Reddit, LinkedIn, and Quora, 5W’s 680M citation index, and Discovered Labs’ content-type breakdown), twelve structural features recur in pages that AI engines cite at scale. They cluster in four buckets: content structure (definitional openers, Quick Answer Blocks, heading depth, extractable surfaces), authority and credibility (FAQPage schema, stable Organization @id, Person-level attribution, outbound citation density), technical foundations (server-rendered HTML, topic-anchored URLs), and anti-features (no clickbait language, no publish-date dependency). Engagement is not the signal: 80% of cited Reddit posts have fewer than 20 upvotes. Structure is. This is the canonical synthesis.

Table of contents

  1. How we built this analysis (methodology)
  2. Category A: Content structure
  3. Category B: Authority and credibility
  4. Category C: Technical foundations
  5. Category D: Anti-features (what cited pages don’t do)
  6. Five findings that contradict common wisdom
  7. How to apply this to your own site
  8. FAQ

How we built this analysis (methodology)

A short note on what this is and isn’t — same discipline we applied to our Reddit threads analysis and our 12 GEO mistakes audit framework.

What it is: a meta-analysis of the published reverse-engineering research on LLM citations. We did not run our own 100-page crawl from scratch. We synthesized the structural-feature findings from four independent research efforts that have already done this at scale:

  • Profound — 4 billion AI citations, 300M responses, Reddit-collab study published November 2025
  • Semrush — three combined studies analyzing 248K Reddit URLs, 89K LinkedIn URLs, and 26K Quora URLs cited in ChatGPT Search, Google AI Mode, and Perplexity (October-November 2025)
  • 5W AI Platform Citation Source Index 2026 — 680M citations consolidated from 6 underlying studies covering ChatGPT, Claude, Perplexity, Gemini, and Google AI Overviews
  • Discovered Labs — qualitative content-type breakdown across Reddit, LinkedIn, and editorial sources

The 100+ in our title refers to the qualitative review of cited examples we conducted across these studies — not a fresh quantitative crawl. The features below are the patterns that recur across all four research efforts, sometimes corroborating each other on the exact same metrics, sometimes surfacing complementary findings.

What it isn’t: a single new dataset. The quantitative scale (4B+ citations, 363K+ URLs) belongs to the research teams cited. Our contribution is the synthesis — combining their fragmented findings into one operational framework you can audit against. Resocial’s value-add is the 12-feature decomposition, the cross-cutting analysis, and the application layer at the end.

Why this matters: brands trying to “get cited” by AI search often optimize for the wrong things — engagement, word count, viral hooks. The published research is unambiguous that those are not the signals. Structure is. This piece exists so you don’t have to read all four research efforts to know which structural levers to pull.

Category A: Content structure

The first four features are about how content is arranged on the page — independent of authority, schema, or technical setup. Pages that get cited at scale share these structural patterns regardless of vertical, brand size, or domain authority.

#1 — The definitional opener

The feature: the page opens with a direct definitional sentence within the first 60 words. The pattern is "X is Y..." or close variants. No preamble. No anecdote. No “in this article we’ll explore…”

The evidence: across Profound’s analysis and Semrush’s Reddit/LinkedIn/Quora research, the cited pages overwhelmingly start with a direct definitional answer. The structure mirrors how AI engines decompose user queries: “what is X?” → extract the definition from the top of the most authoritative page that has one.

Why it works: AI engines appear to apply a limited effective context window during first-pass extraction. The first 60-100 words of a page carry disproportionate weight in the decision of whether to extract from that page at all. Pages that bury the definition below an introduction get skipped, even when their content is more authoritative.

The applied form: read the first 60 words of any priority page on your site. Does it directly answer the page’s primary query? If you have to read further to find the definition, you have mistake #5 from our 12 GEO mistakes audit. The fix is structural rewriting, not new content.
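The first-60-words check can be roughed out in code. The sketch below is a heuristic only: the regular expression, the six-word limit on the subject, and the function names are assumptions made for illustration, not anything the cited research specifies.

```python
import re

# Assumed heuristic: a definitional opener has a short subject (1-6 words)
# followed directly by "is", "are", "refers to", or "means".
OPENER = re.compile(r"^(?:\S+\s+){1,6}?(?:is|are|refers to|means)\b")

def first_words(text: str, limit: int = 60) -> str:
    """Return the first `limit` whitespace-separated words of `text`."""
    return " ".join(text.split()[:limit])

def has_definitional_opener(text: str, limit: int = 60) -> bool:
    """True if the opening window starts with an 'X is Y'-shaped sentence."""
    return OPENER.match(first_words(text, limit)) is not None
```

A preamble such as "In this article we'll explore what local SEO is" fails the check because the verb arrives too late. Expect false positives on openers like "This is a post about...", so treat the result as a prompt for manual review.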

#2 — The Quick Answer Block

The feature: a visually-distinct 40–80 word summary block at the top of the page, dense with the page’s core answer. Often styled with a different background color, a callout border, or a dedicated container element (we use <div class="qab">).

The evidence: Semrush’s 248K Reddit posts research found that the cited Reddit threads average ~80 words for the original post — short, dense, answer-shaped. This isn’t a Reddit quirk. The same pattern appears in cited LinkedIn posts (Semrush 89K), cited Quora answers (Semrush 26K), and cited blog posts across the Profound 4-billion-citation dataset. AI engines reward short, dense, structured paragraphs because they’re high-extraction-confidence targets.

Why it works: a single high-density block of 40-80 words gives the AI engine one canonical extraction target. Without it, the engine has to assemble an answer from scattered sentences across the page, which lowers extraction confidence and reduces citation likelihood.

The applied form: every priority page on your site needs a Quick Answer Block. 40-80 words. Direct answer to the page’s primary query. Bold the key terms. Visually distinct from body content. This is the single highest-leverage formatting edit any site can make — and the one we add first in every Resocial audit.
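A minimal sketch of the 40-80 word check, assuming the block is marked up as a div with the class qab (the container mentioned above). The class and function names here are hypothetical.

```python
from html.parser import HTMLParser

class QabExtractor(HTMLParser):
    """Collects the text inside the first <div class="qab"> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # div nesting depth inside the block; 0 means outside
        self.done = False   # set once the first block has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div" or self.done:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0:
            self.depth += 1          # nested div inside the block
        elif "qab" in classes:
            self.depth = 1           # entering the block

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)

def qab_word_count(html: str) -> int:
    """Word count of the first Quick Answer Block, or 0 if none is found."""
    parser = QabExtractor()
    parser.feed(html)
    return len(" ".join(parser.chunks).split())

def qab_in_range(html: str, low: int = 40, high: int = 80) -> bool:
    """True if the block exists and falls inside the 40-80 word target."""
    return low <= qab_word_count(html) <= high
```

A count of 0 means the page has no block with that class, which is a different finding from a block that is too long or too short.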

#3 — Heading hierarchy depth

The feature: cited pages have rich heading hierarchies. At minimum, one H2 every 200-300 words. Multiple H3 subsections under each H2. Headings phrased as questions or definitional statements (not marketing copy).

The evidence: the Profound source-stack analysis shows that pages with 8+ H2/H3 headings are cited at materially higher rates than pages with fewer than 5. The heading itself becomes an extraction handle — each H2/H3 is a potential entry point for an AI engine looking for a specific sub-topic answer.

Why it works: headings function as a query-match layer. When a user asks “how does X affect Y?” the AI engine scans heading text first to find the closest match. Pages with thin heading hierarchies offer fewer match opportunities. Pages with rich, question-shaped headings act as a query index of themselves.

The applied form: count the H2/H3 headings on your top 10 pages. If the average is under 6, you have a structural gap. Rewriting prose-only sections into heading-anchored subsections is one of the highest-yield refactors for AI citation lift.
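The heading count is easy to automate. This sketch uses only the standard library; the report keys and the helper name are assumptions for illustration.

```python
from html.parser import HTMLParser

class HeadingCounter(HTMLParser):
    """Counts H2/H3 tags and visible words, ignoring script and style content."""

    def __init__(self):
        super().__init__()
        self.counts = {"h2": 0, "h3": 0}
        self.words = 0
        self._skip = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.counts:
            self.counts[tag] += 1
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.words += len(data.split())

def heading_report(html: str) -> dict:
    """Summarize heading depth for one page."""
    parser = HeadingCounter()
    parser.feed(html)
    h2, h3 = parser.counts["h2"], parser.counts["h3"]
    return {
        "h2": h2,
        "h3": h3,
        "total": h2 + h3,
        "words": parser.words,
        # Average words per H2 section; None when the page has no H2 at all.
        "words_per_h2": parser.words / h2 if h2 else None,
    }
```

Run it over your top 10 pages and average the total values. An average under 6, or a words_per_h2 figure well above 300, points to the structural gap described above.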

#4 — Multiple extraction surfaces

The feature: cited pages have multiple structural extraction surfaces within the body — at least one bulleted list, often a comparison table, sometimes code blocks for technical content, frequently definitional callouts beyond just the opening QAB.

The evidence: across all four research datasets, prose-only pages are cited at materially lower rates than pages with diverse structural elements. Each list item, each table row, each code block becomes an independent potential citation target. A page with a 10-row table has 10 extraction opportunities. A page with 2,000 words of unbroken prose offers far fewer.

Why it works: AI engines extract at the finest reasonable granularity. They prefer to lift a single bullet or table row over a paragraph because the bullet is more atomic, easier to attribute, and harder to misinterpret. Tables especially compound: comparison tables are heavily extracted for “X vs Y” queries because the rows directly mirror the query structure.

The applied form: open a long-form post on your site. Count structural elements (lists, tables, code blocks, callouts). For a 2,000-word piece, aim for 3+ structured units. Convert “first… second… third…” prose patterns into numbered lists. Convert “X is better at A while Y is better at B” prose into comparison tables.
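Counting structural elements can be sketched the same way. The tag set below is an assumption: it covers lists, tables, and code blocks, but not callouts, because callout markup varies by site.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumed mapping of "extraction surfaces" to HTML tags.
SURFACE_TAGS = {"ul", "ol", "table", "pre"}

class SurfaceCounter(HTMLParser):
    """Tallies opening tags that represent structured extraction surfaces."""

    def __init__(self):
        super().__init__()
        self.surfaces = Counter()

    def handle_starttag(self, tag, attrs):
        if tag in SURFACE_TAGS:
            self.surfaces[tag] += 1

def count_surfaces(html: str) -> Counter:
    """Return a per-tag count of structured elements in the page body."""
    parser = SurfaceCounter()
    parser.feed(html)
    return parser.surfaces

def meets_target(html: str, minimum: int = 3) -> bool:
    """True if the page has at least `minimum` structured units."""
    return sum(count_surfaces(html).values()) >= minimum
```

Nested lists are counted as separate units here, and navigation menus built from lists will inflate the total, so feed the function the article body only.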

Category B: Authority and credibility

Content structure makes pages extractable. The next four features make them trustworthy to the AI engine’s entity-disambiguation layer. Both are required — extractable content from an unrecognized entity is still rarely cited.

#5 — FAQPage schema

The feature: pages that get cited by AI Overviews and answer engines disproportionately have FAQPage schema markup. The schema explicitly signals “this page has Q&A content, here are the questions and answers in structured form.”

The evidence: cross-referenced across the Profound 4B-citation dataset and the published AI Overview research, FAQPage schema is consistently listed as the single most reliable structural eligibility signal for Google AI Overviews specifically — and a strong secondary signal for ChatGPT and Perplexity. Pages with valid FAQPage schema are cited at higher rates than equivalent content without it.

Why it works: FAQPage schema gives the AI engine pre-structured Q&A pairs that exactly mirror how the engine assembles answers. Instead of inferring what’s a question and what’s an answer from prose, it reads the schema directly. Extraction confidence is much higher.

The applied form: every priority page should have FAQPage schema with 3-7 Q&A pairs. Phrase questions the way users actually search (starting with What/How/Why/When/Can). Keep answers under 80 words each. The schema should mirror the visual FAQ on the page exactly. See our Schema Markup Complete Guide for implementation details.
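As an illustration, the helper below builds a schema.org FAQPage object and enforces the 3-7 pair and under-80-word guidelines from this section. The function name is hypothetical. The property names (mainEntity, Question, acceptedAnswer, Answer) are the standard schema.org vocabulary.

```python
import json

def build_faq_schema(pairs):
    """Build a schema.org FAQPage dict from a list of (question, answer) tuples."""
    if not 3 <= len(pairs) <= 7:
        raise ValueError("expected 3-7 Q&A pairs")
    for question, answer in pairs:
        if len(answer.split()) >= 80:
            raise ValueError(f"answer to {question!r} should stay under 80 words")
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }

def to_json_ld(schema: dict) -> str:
    """Serialize the schema for a <script type="application/ld+json"> tag."""
    return json.dumps(schema, indent=2, ensure_ascii=False)
```

Generating the markup from the same question and answer list that renders the visible FAQ is one way to keep the two from drifting apart.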

#6 — Stable Organization @id + sameAs depth

The feature: cited pages live on domains where Organization schema is present with a stable @id across all pages, and a sameAs array with 8+ entries linking to external authoritative profiles (LinkedIn, GitHub, Crunchbase, G2, Wikipedia, Wikidata, etc.).

The evidence: Profound’s analysis identifies entity authority as the largest single citation-rate predictor at the domain level. Brands with thin sameAs arrays (under 5 entries) or unstable Organization @ids are cited at materially lower rates than brands with rich entity graphs, even when content quality is comparable.

Why it works: AI engines maintain internal entity graphs. When a brand has rich sameAs linkage, the engine has higher confidence that this Organization is the same one referenced in other authoritative sources. The brand becomes “disambiguated” — known as a single entity rather than a candidate entity competing with similarly-named alternatives. Disambiguated brands get cited preferentially.

The applied form: audit your Organization schema. The @id should be a stable identifier (e.g., https://yoursite.com/#organization) used consistently across every page. The sameAs array should have 8-12 entries: LinkedIn company page, Twitter/X, GitHub, Crunchbase, G2, Capterra, AngelList, Wikipedia (if exists), Wikidata Q-number, industry-specific directories. This is mistakes #1 and #2 from our 12 GEO mistakes framework.
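The audit can be sketched as a small function, assuming you have already extracted the Organization JSON-LD object from each page into a Python dict. The function name and the wording of the issue strings are illustrative.

```python
def audit_organization(nodes, min_same_as=8):
    """Check Organization JSON-LD dicts (one per page) for @id stability and sameAs depth.

    Returns a list of human-readable issues; an empty list means the checks passed.
    """
    issues = []
    ids = {node.get("@id") for node in nodes}
    if None in ids:
        issues.append("at least one page has no @id")
    if len(ids - {None}) > 1:
        issues.append("@id differs across pages")
    # Depth of the thinnest sameAs array found on any page.
    thinnest = min((len(node.get("sameAs", [])) for node in nodes), default=0)
    if thinnest < min_same_as:
        issues.append(
            f"sameAs has {thinnest} entries on at least one page; target is {min_same_as}+"
        )
    return issues
```

For example, two pages that both use https://yoursite.com/#organization but list only two sameAs profiles would return a single sameAs issue and no @id issue.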

#7 — Person-level author attribution

The feature: cited content has named authors with Person schema, including knowsAbout arrays, worksFor linkage to the publishing Organization, and sameAs to LinkedIn/Twitter/speaker bios.

The evidence: across LinkedIn and editorial citation research, AI engines preferentially cite content from named, attributed authors with verifiable expertise areas. Anonymous or thinly-attributed content is cited at lower rates even when the content is technically equivalent. The pattern is strongest in YMYL (Your Money or Your Life) verticals such as health, finance, and legal, but it appears across all categories.

Why it works: AI engines build attribution chains. Content from a named expert with documented expertise area in knowsAbout gets weighted higher than anonymous content because the attribution provides an additional trust signal. Person schema feeds the engine’s E-E-A-T (Experience, Expertise, Authoritativeness, Trustworthiness) calculation.

The applied form: every blog post should have a named author. Every team page should include Person schema for each senior team member. The knowsAbout array should list 6-10 expertise areas. worksFor should resolve to the Organization’s stable @id. sameAs should link to LinkedIn at minimum, ideally Twitter and a Crunchbase founder profile too.
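A minimal sketch of an author record that follows these guidelines. The builder name and its validation rules are assumptions. knowsAbout, worksFor, and sameAs are standard schema.org Person properties.

```python
def build_person(name, org_id, knows_about, same_as):
    """Build a schema.org Person dict linked to the publishing Organization.

    `org_id` should be the Organization's stable @id so that worksFor resolves
    to the same entity on every page.
    """
    if not 6 <= len(knows_about) <= 10:
        raise ValueError("knowsAbout should list 6-10 expertise areas")
    if not same_as:
        raise ValueError("sameAs needs at least one profile URL")
    return {
        "@context": "https://schema.org",
        "@type": "Person",
        "name": name,
        "worksFor": {"@id": org_id},   # reference by @id, not a duplicated Organization
        "knowsAbout": list(knows_about),
        "sameAs": list(same_as),
    }
```

Referencing the Organization by @id rather than embedding a second copy keeps the entity graph pointing at one node.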

#8 — Outbound citation density to canonical sources

The feature: cited pages link out generously to canonical authoritative sources — Wikipedia, .gov, .edu, tier-1 publications, primary research. Not for SEO link-juice reasons (outbound links rarely move rankings significantly), but as a credibility signal AI engines weigh.

The evidence: Profound’s source-stack analysis shows that pages with rich outbound citation networks to canonical authorities are cited at higher rates than pages without. This is independent of inbound link profile. The signal appears to be “this page knows where the canonical knowledge lives, which suggests the author understands the field.”

Why it works: AI engines treat outbound citations as a proxy for editorial quality. A page that cites Wikipedia, primary research, and tier-1 publications is structurally signaling “I know the field’s canonical sources and respect their authority.” Pages that never cite anyone are structurally signaling the opposite — usually unintentionally.

The applied form: audit a long-form post on your site. Count outbound links to canonical sources (Wikipedia, .gov/.edu, primary research papers, tier-1 publications). For a 2,000-word piece, aim for 5-10 outbound canonical citations. Not in a “link dump” section — woven into the body where the citations contextualize claims.
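The count can be approximated with the standard library. The definition of "canonical" below is an assumption (Wikipedia plus hosts ending in .gov or .edu). Tier-1 publications and research hosts have to be supplied by you through the hypothetical extra_hosts parameter.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

def is_canonical(host: str, extra_hosts=()) -> bool:
    """Assumed rule: Wikipedia, .gov, .edu, or a host you list explicitly."""
    host = host.lower()
    return (
        host.endswith((".gov", ".edu"))
        or host == "wikipedia.org"
        or host.endswith(".wikipedia.org")
        or host in extra_hosts
    )

def count_canonical_links(html: str, extra_hosts=()) -> int:
    """Count outbound links whose host passes the canonical-source rule."""
    collector = LinkCollector()
    collector.feed(html)
    hosts = (urlparse(href).hostname for href in collector.hrefs)
    # Relative links have no hostname and are skipped.
    return sum(1 for host in hosts if host and is_canonical(host, extra_hosts))
```

Country-specific government or academic domains (for example hosts ending in .gov.uk or .ac.uk) are not matched by this rule and would need to be added to extra_hosts or to the suffix check.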

Category C: Technical foundations

The next two features are technical preconditions. They’re not visible in the content, but they materially affect whether content can be cited at all.

#9 — Server-rendered HTML

The feature: cited pages render their content in server-rendered HTML that’s visible without JavaScript execution. SSR (server-side rendering), SSG (static site generation), or traditional server-rendered templates. Pages that require JS execution to expose their content are cited at materially lower rates.

The evidence: not all AI crawlers execute JavaScript. ChatGPT-User runs limited JS. PerplexityBot has improved but is inconsistent. ClaudeBot’s JS behavior is variable across versions. The result: content visible to users via a JS-heavy SPA is invisible to a non-trivial fraction of AI crawlers. Pages on SSR/SSG architectures dominate the cited-pages datasets across all four research efforts.

Why it works: it’s not that AI engines “prefer” SSR — it’s that they can actually parse it reliably. JS-rendered content is a coin flip depending on crawler version. Server-rendered HTML is a guarantee.

The applied form: disable JavaScript in Chrome DevTools and reload your priority pages. Is the content still visible in the initial HTML? If not, this is your single biggest technical citation cap. Migration to SSR or SSG is a significant project, but it’s the only durable fix. For Resocial we chose Astro specifically because its static-first model eliminates this entire failure mode. This is mistake #11 in our audit framework.
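The same check can be scripted against the raw HTML response, which is what a crawler that does not execute JavaScript receives. This is a sketch under assumptions: the function names are hypothetical, and you supply the key phrases that should appear on each page.

```python
from html.parser import HTMLParser

class VisibleText(HTMLParser):
    """Collects text nodes, skipping anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self._skip = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

def missing_without_js(raw_html: str, key_phrases) -> list:
    """Return the key phrases that do not appear in the no-JavaScript text."""
    parser = VisibleText()
    parser.feed(raw_html)
    # Normalize whitespace and case before matching.
    text = " ".join(" ".join(parser.chunks).split()).lower()
    return [p for p in key_phrases if " ".join(p.split()).lower() not in text]
```

The raw HTML can be fetched with any client that does not run scripts, such as urllib.request.urlopen from the standard library. An empty result means every phrase is present in the initial response.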

#10 — Topic-anchored URL slug

The feature: cited pages have URLs anchored to the topic, not to a date. /services/local-seo/ not /2024/01/local-seo-post/. /products/iphone-16/ not /2024/news/apple-event-iphone-16/. The URL stays stable as the content updates.

The evidence: Profound’s data shows the average cited page is ~1 year old, and 4% of cited pages are from 2019 or earlier. These persistent citation patterns require a stable URL. Date-anchored URLs make every “update” effectively a new URL or a stale one — and citations get fragmented across versions.

Why it works: AI engines build cumulative citation weight on stable URLs. A topic-anchored URL accumulates inbound citations, schema markup, and entity authority over years. A date-anchored URL caps that accumulation at the publish date. AI engines preferentially cite URLs with deeper accumulated weight.

The applied form: audit your URL structure. If your blog uses /YYYY/MM/slug/ patterns, that’s a structural cap. The migration is non-trivial (301 redirects from old URLs to new, sitemap regeneration, internal link updates) but the long-term citation compounding benefit is real. This is one of the core architectural choices our Technical SEO service addresses on day one.
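The first step of that migration, finding date-anchored paths and drafting a redirect map, might look like the sketch below. The /blog prefix and the decision to reuse the existing slug are assumptions. In practice you may want to rename slugs by hand.

```python
import re

# Matches paths shaped like /YYYY/MM/slug/ and captures the slug.
DATE_SLUG = re.compile(r"^/\d{4}/\d{1,2}/(?P<slug>[^/]+)/?$")

def redirect_map(paths, prefix="/blog"):
    """Map each date-anchored path to a topic-anchored target for a 301 redirect.

    Paths that are already topic-anchored are left out of the result.
    """
    mapping = {}
    for path in paths:
        match = DATE_SLUG.match(path)
        if match:
            mapping[path] = f"{prefix}/{match.group('slug')}/"
    return mapping
```

Two posts from different months that share a slug would map to the same target, so check the values for collisions before turning the map into server redirect rules.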

Category D: Anti-features (what cited pages don’t do)

The final two features are inverted — they’re things that kill citation rate when present. Many content marketers default to these without realizing the cost.

#11 — No clickbait or marketing language

The feature: cited pages use neutral, declarative, definitional language. They avoid promotional phrases (“revolutionary,” “game-changing,” “you won’t believe…”), hyperbolic claims (“the only X you’ll ever need”), and emotional hooks. The tone is closer to Wikipedia than to a marketing landing page.

The evidence: Profound’s sentiment analysis is unambiguous — AI engines cite negatively-sentimented content at 6.1% and positively-sentimented content at 5.0%. Nearly identical. The signal is balanced honesty, not enthusiasm. Pages that read like marketing copy are filtered out. Pages that read like neutral encyclopedic explanations are filtered in.

Why it works: AI engines are explicitly trained to filter promotional language because it signals low-trust content. The training pipeline downweights pages with promotional patterns. When the engine assembles an answer, it preferentially extracts from neutral sources because they pass the model’s internal trust filter.

The applied form: read your priority pages out loud. Do they sound like a Wikipedia article or like a sales page? If sales page, you have a citation cap that no schema or structural change can lift. The rewrite is editorial — neutral language, factual claims with sources, no superlatives. Some content marketers find this hard because it conflicts with conversion-optimization training. The answer is: separate concerns. CTAs and conversion-optimized copy live below the fold. The body content above the fold reads neutrally.
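A first-pass scan for promotional phrasing can be automated before the read-aloud review. The phrase list below contains only the examples quoted in this section. It is a starting point you would extend, not a vetted lexicon.

```python
import re

# Example phrases taken from this section; extend the list for your own audits.
PROMO_PHRASES = ["revolutionary", "game-changing", "you won't believe"]

def find_promo_phrases(text: str, phrases=PROMO_PHRASES) -> dict:
    """Return {phrase: count} for each promotional phrase found in `text`."""
    # Lowercase and normalize typographic apostrophes so "won’t" matches "won't".
    normalized = text.lower().replace("\u2019", "'")
    hits = {}
    for phrase in phrases:
        count = len(re.findall(r"\b" + re.escape(phrase) + r"\b", normalized))
        if count:
            hits[phrase] = count
    return hits
```

A phrase match is a flag for an editor, not a verdict. Tone problems such as unsupported superlatives will not show up in a fixed word list.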

#12 — No publish-date dependency

The feature: cited pages are written so the content stays correct without a date in the body. They use phrases like “as of [updated date]” rather than “in 2024” or “this year.” When the page gets updated 6 months later, the body still reads correctly.

The evidence: pages with hard date dependencies in the body (“In Q1 2024 we saw…” or “This year’s data shows…”) get cited briefly after publication, then decay rapidly as the date language ages out. Pages without date dependencies in the body accumulate citations over years. The Profound 4% finding (4% of cited pages are from 2019 or earlier) is concentrated in date-independent content.

Why it works: AI engines preferentially cite currently-correct content. A page that says “in Q1 2024” still ranks in some contexts, but the engine knows the content is stale relative to its claim. A page that says “as of [date last updated]” is structurally signaling that the body is current. This connects to the “living document” architecture we covered in The Blog Post Is Dead. The Document Is Not. — living documents naturally avoid this anti-feature.

The applied form: audit your top-traffic posts for hard date dependencies in the body. Replace “in 2024” with “as of [last-updated date].” Replace “this year” with “currently.” Move the date marker to the header metadata, not the body prose. The page now reads correctly across update cycles.
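The audit for hard date dependencies can be sketched with one regular expression. The pattern covers only the phrasings named in this section ("in 2024", "In Q1 2024", "this year", "this year's") and is an assumption about what counts as a dependency.

```python
import re

# Matches "in 2024", "in Q1 2024", "this year", and "this year's" (either apostrophe).
DATE_DEPENDENCY = re.compile(
    r"\b(?:in\s+(?:Q[1-4]\s+)?(?:19|20)\d{2}|this\s+year(?:['\u2019]s)?)\b",
    re.IGNORECASE,
)

def find_date_dependencies(body: str) -> list:
    """Return every hard date reference found in the body text, in order."""
    return [match.group(0) for match in DATE_DEPENDENCY.finditer(body)]

sample = "In Q1 2024 we saw growth. This year's data shows more. As of the last update, rates are stable."
print(find_date_dependencies(sample))  # ['In Q1 2024', "This year's"]
```

Run it on body prose only. Dates in header metadata, where this section recommends keeping them, should not be passed in.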

Five findings that contradict common wisdom

Across the four datasets, several findings invert what most content marketers assume. These are the parts of the analysis worth highlighting because they reframe day-to-day editorial decisions.

1. Engagement is not the citation signal. 80% of cited Reddit posts have fewer than 20 upvotes. 70% have fewer than 20 comments. AI engines don’t optimize for popularity; they optimize for extractable structure. Viral content rarely makes the citation list. Quietly clear content frequently does.

2. Length is short, not long. Cited Reddit threads average ~80 words for the original post. Cited LinkedIn posts average under 300 words. Cited Quora answers average under 200. The “comprehensive 2,500-word piece” assumption is a 2010-era Google heuristic that doesn’t apply to AI extraction. Short, dense, structured wins.

3. Sentiment doesn’t predict citation direction. Positive sentiment (5.0%) and negative sentiment (6.1%) are cited at nearly equal rates. AI engines reward honest evaluation, not praise. Brands that publish only positive marketing content are structurally under-cited compared to brands that publish balanced reviews with pros and cons.

4. Recency is overweighted in most strategies. 4% of cited pages are from 2019 or earlier. The average cited page is ~1 year old. Most content marketers chase “fresh content” assuming AI engines prefer it. The data shows AI engines prefer stable canonical content with documented update history over fresh-but-thin new pages. The compounding rate of well-structured evergreen content beats the velocity of new-content publishing.

5. JavaScript is the silent citation killer. Of all 12 features, JS-rendered content is the one site teams least audit for. SSR/SSG architecture isn’t just a performance optimization — it’s a citation-rate precondition. Brands on JS-heavy SPAs are operating under a permanent citation cap that no content or schema work can lift.

How to apply this to your own site

The 12 features above translate into a sequenced audit framework. The order matters — Categories A and B compound first, C is a precondition, D is editorial discipline that takes longest.

Quarter 1: Fix Category A (content structure). Audit your top 10 pages against features #1-4. Add Quick Answer Blocks. Rewrite preamble openings to direct definitional openers. Increase heading hierarchy depth. Add extraction surfaces (lists, tables, callouts). This is rewriting work, not new-content work. Lift in AI citation rate is typically visible within 30-60 days.

Quarter 2: Fix Category B (authority and credibility). Add FAQPage schema everywhere it makes sense. Audit Organization schema for a stable @id and a full sameAs array (8+ entries). Build out Person schema for the senior team. Add outbound canonical citations across long-form content. This is schema and editorial work. Lift is typically visible at 60-90 days.

Quarter 3: Fix Category C (technical preconditions). If you’re on a JS-heavy SPA, plan a migration to SSR or SSG. Audit URL structure and migrate date-anchored slugs to topic-anchored slugs with proper 301s. This is engineering work, and the lift is longer-tail but durable.

Always: Maintain Category D (anti-features discipline). Review every new piece editorially for promotional language and date dependencies. This is a process discipline, not a one-time fix. Build it into the brief-to-publish workflow.

For brands that want to skip the self-audit, our Generative Engine Optimization service runs this 12-feature framework on every engagement, prioritized by your site’s specific gaps. For brands at the entity-authority stage, our ChatGPT visibility service covers features #1-8 in depth.

FAQ

Are all 12 features equally important? No. The categories compound in order: Content Structure (A) → Authority (B) → Technical (C) → Anti-features (D). A and B produce the largest single citation lifts; C is a precondition that becomes a hard cap if unfixed; D is ongoing editorial discipline. Within Category A, the Quick Answer Block (#2) and the definitional opener (#1) typically produce the fastest visible improvement.

How is this different from the 12 GEO mistakes audit? That piece is the negative-space audit — what to detect and fix on existing pages. This piece is the positive-space pattern — what cited pages share. They’re complementary. The mistakes piece tells you what’s broken. This piece tells you what good looks like. Most engagements use both: detect via the mistakes framework, build toward the features framework.

Will fixing these guarantee AI citations? No. They establish structural eligibility — a brand that fixes all 12 features moves from “structurally unable to be cited” to “structurally eligible to be cited.” Whether the engine actually chooses your content depends on factors no one fully predicts (model-internal preferences, real-time query context, competitive content). But eligibility is the floor — without it, citation is essentially random luck.

What about content quality and accuracy? Implicit. The 12 features assume content is accurate, well-researched, and useful. We didn’t list “be accurate” as a feature because it’s the table-stakes precondition for everything else. AI engines do downweight factually inaccurate content over time. But assuming a baseline of accuracy, the 12 structural features are what differentiates “cited” from “not cited.”

Where does this differ across AI engines? Most features apply universally. Engine-specific nuances: ChatGPT weights Wikipedia and editorial sources heavily, so Category B (entity authority + author attribution) compounds faster there. Perplexity weights Reddit and LinkedIn heavily, so cross-platform presence in those surfaces matters. Google AI Overviews weights FAQPage schema disproportionately. Gemini weights YouTube transcripts. The categories apply universally; the relative weight shifts by engine. See our ChatGPT vs Perplexity for SEO comparison for the engine-specific deep dive.


This piece is the third in our citation-research series, following The 25 Most-Cited Domains in ChatGPT and 30 Reddit Threads ChatGPT, Perplexity, and Google AI Cite Most. For the operational audit framework that detects the inverse of these features on your site, see The 12 Most Common GEO Mistakes We See in Live Audits. For the broader context on how AI search is reshaping organic discovery, the Complete Guide to AI Search Optimization in 2026 covers the full discipline framework.

For brands that want to operationalize these 12 features into a measurable program, our Generative Engine Optimization service is built around exactly this framework — both detection (what’s missing) and construction (what to build). For the methodology behind running these audits at scale using our 25-agent workforce, see The Agentic SEO Operating Model.

Yuki & Klara, AI Search Strategy & Schema Architecture leads
