Presets are starting points—tune every row for your counsel-approved policy before production.
Site information
AI crawling preferences
Maps to the companion robots.txt snippet. Tooltips describe each bot; directive help applies to the robots column.
GPTBot
ChatGPT-User
ClaudeBot
PerplexityBot
Google-Extended
Amazonbot
FacebookBot
Bytespider
Content sections for LLMs
Citation, contact, licensing
Best practices
Treat robots.txt and llms.txt as polite signals—malicious scrapers and some automation ignore them. Use authentication, rate limits, and legal terms for real protection.
Ship a coherent story: if llms.txt says “allow blog only,” align robots.txt path rules and on-page meta so you are not contradicting yourself across files.
Re-check vendor docs quarterly. AI user-agents and product names change; yesterday’s GPTBot guidance may not match tomorrow’s console settings.
Log server requests occasionally and compare to declared user-agents—spoofing happens. High-risk content should not rely on crawler honesty alone.
For GEO (Generative Engine Optimization), balance visibility in AI answers with licensing: being crawlable helps discovery but does not automatically grant training rights.
Merge the generated robots snippet into your main robots.txt; avoid duplicate conflicting User-agent blocks for the same token.
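Following the merge advice above, a combined robots.txt might look like the sketch below. The bot tokens are real documented user agents, but every path and URL is a placeholder for illustration:

```
# Existing global rules -- keep your sitemap and wildcard groups intact
User-agent: *
Disallow: /cart/
Sitemap: https://your-domain/sitemap.xml

# Generated AI section, appended once -- no duplicate blocks per token
User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Crawl-delay: 10
```

If a token already appears in your main file, edit the existing block instead of appending a second one; many parsers honor only one group per user agent.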
The SynthQuery llms.txt Generator is a browser-based utility for drafting a clear, structured llms.txt document that explains how you want AI systems to treat your public website, alongside a companion robots.txt snippet you can merge into your existing crawler rules. While traditional robots.txt encodes machine-readable Allow and Disallow paths for polite crawlers, the emerging llms.txt pattern gives publishers a human-readable narrative: who you are, which sections of the site are meant to be referenced by language models, how you prefer citations to look, where to send AI-related legal or partnership questions, and what your licensing posture is in plain language. That combination matters for Generative Engine Optimization (GEO) because AI assistants, answer engines, and retrieval-augmented products increasingly decide what to fetch, quote, and attribute based on both technical signals and published intent.
You do not need an account or server round-trips: every keystroke stays in your tab, drafts persist in local storage between visits, and you can copy or download UTF-8 text when you are ready to commit. Per-bot rows cover GPTBot, ChatGPT-User, ClaudeBot, PerplexityBot, Google-Extended, Amazonbot, FacebookBot, and Bytespider, each with Allow, Disallow, or crawl-delay-oriented modes that map to the robots snippet we generate in parallel. Content-area toggles default to common paths for blog, documentation, products, about, and FAQ pages but remain fully editable so headless CMS routes or international prefixes stay accurate. Template presets jump-start common policies—fully permissive AI access, a hard block posture, selective allow lists, publisher-focused patterns, and a cautious e-commerce stance—while inline tooltips explain what each robots directive does and what each major crawler token is generally associated with.
The page also surfaces a concise best-practices checklist beside the outputs so legal, marketing, and engineering stakeholders review the same facts: robots and llms files are voluntary, spoofed user-agents exist, crawl-delay is not universally honored, and sensitive content still belongs behind authentication. When your policy prose needs polish, internal links route you to the Grammar Checker, SynthRead, and other SynthQuery tools; when crawl mechanics need deeper structure, pair this page with our Robots.txt Generator and XML Sitemap Generator so discovery, exclusion, and AI transparency stay aligned.
What this tool does
Per-bot configuration is the heart of the interface. Instead of memorizing vendor documentation each time a new model provider announces a crawler, you work from a curated list with plain-language tooltips and a three-mode control that directly drives the companion robots snippet. Allow emits an empty disallow (permissive under common parsers), Disallow maps to blocking the entire site for that token, and the crawl-delay mode keeps paths open while appending Crawl-delay with the integer you specify—useful when you want visibility but need to signal politeness for bots that still honor the directive.
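The three modes described above map roughly as follows under common parser behavior (the tokens are real crawler names; the delay value is an arbitrary example):

```
# Allow mode: an empty Disallow is permissive
User-agent: ClaudeBot
Disallow:

# Disallow mode: block the entire site for this token
User-agent: Bytespider
Disallow: /

# Crawl-delay mode: paths stay open, politeness hint appended
User-agent: Amazonbot
Crawl-delay: 10
```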
Content area selection bridges information architecture (IA) and GEO: toggles describe which slices of your site you want highlighted as “safe to reference” in the llms narrative, while editable path fields prevent the tool from forcing incorrect folders on bespoke stacks. Citation preferences and licensing notes give models and compliance teams the same instructions humans read, reducing ambiguity when assistants summarize your pages in chat interfaces or research surfaces. The robots.txt companion generator stays intentionally focused—only the AI-related user agents you configured—so you can paste beneath your global rules without accidentally overwriting sitemap lines, host directives, or wildcard groups maintained elsewhere.
Template presets accelerate workshops: Allow All AI aligns every bot row permissively and enables all content toggles; Block All AI flips the opposite for staging environments or rights-sensitive archives; Allow Select Bots demonstrates a mixed posture often used by publishers who want answer-engine visibility from a subset of providers; Content Sites keeps editorial surfaces such as blog and docs open while leaving transactional paths out of the narrative; E-commerce introduces throttled fetches for certain bots and blocks others commonly associated with aggressive scraping. Explanatory tooltips sit next to both the narrative fields and the robots directives legend, so newcomers learn User-agent scoping, Disallow semantics, and crawl-delay limitations without leaving the page. Finally, the best practices guide encodes operational hygiene—verify tokens regularly, watch logs for spoofing, and pair declarative files with enforceable controls—so the generator functions as lightweight education, not just a textarea replacement.
Technical details
llms.txt is a proposed community convention—not an IETF RFC like robots.txt—meant to live at the site root as a UTF-8 text or markdown-friendly file that humans and tools can read before fetching large swaths of a domain. It differs from robots.txt because it can carry narrative policy, attribution guidance, and contact flows that do not map cleanly to key-value crawler directives. Robots.txt remains the workhorse for path-level exclusion for cooperating bots; llms.txt helps publishers articulate intent for AI-specific use cases that span training, retrieval, summarization, and citation.
Each vendor publishes its own user-agent tokens and product behaviors, and those tokens evolve. GPTBot and ChatGPT-User both relate to OpenAI surfaces but serve different purposes; Google-Extended is distinct from Googlebot; PerplexityBot targets answer-engine indexing; Amazonbot, FacebookBot, and Bytespider align with their respective parent companies’ crawlers. GEO sits at the intersection of crawlability, structured data, brand clarity, and measurement: if assistants cannot fetch your public pages, they cannot cite you, but fetchability without licensing clarity can create friction when models summarize paid content.
SynthQuery assumes the generated files will live at https://your-domain/llms.txt alongside https://your-domain/robots.txt for predictable discovery. Always serve llms.txt with a 200 status, avoid accidental authentication on that path, and keep caching headers reasonable so iterative policy updates propagate. When international properties use hreflang, duplicate the policy per locale host if each host serves meaningfully different rights or contacts.
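Those serving requirements can be sanity-checked with a small script along these lines. The helper below is an illustrative sketch, not part of the tool: it inspects a status code and a dict of response headers that you would obtain from whatever HTTP client you already use.

```python
def check_llms_response(status, headers):
    """Return a list of problems with an llms.txt response.

    `status` is the HTTP status code; `headers` maps lower-cased
    header names to values.
    """
    problems = []
    if status != 200:
        problems.append(f"expected 200, got {status}")
    ctype = headers.get("content-type", "")
    if not ctype.startswith(("text/plain", "text/markdown")):
        problems.append(f"unexpected content type: {ctype!r}")
    # A very long max-age at the edge delays policy updates.
    cache = headers.get("cache-control", "")
    for token in cache.split(","):
        token = token.strip()
        if token.startswith("max-age="):
            try:
                if int(token.split("=", 1)[1]) > 86400:
                    problems.append("cache max-age above one day")
            except ValueError:
                pass
    return problems
```

An empty return list means the path serves anonymously with a sensible content type and cache lifetime; anything else is worth fixing before you rely on the file being fetched.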
Use cases
Newsrooms and independent bloggers use llms.txt to state citation expectations before syndication partners or AI summaries quote their reporting. By listing the blog path explicitly and linking contact details for corrections, they create a paper trail that complements Creative Commons badges or subscription paywalls without pretending robots rules are legally binding by themselves. SaaS documentation teams enable /docs/ while leaving pricing or authenticated app routes out of the highlighted list, then mirror stricter blocks in robots.txt for API endpoints that should never be crawled.
E-commerce operators balance GEO against margin protection: product and FAQ paths might be promoted for assistant shopping flows while /account/ and /cart/ remain absent from the llms narrative and disallowed for bots in robots.txt. Universities publish research blogs and public course catalogs in the content section while pointing AI licensing questions to a general counsel inbox. Agencies maintain starter templates per vertical, download snippets for client repos, and attach both files to onboarding tickets so developers deploy them before launch traffic arrives.
Compliance-led organizations pair this generator with policy reviews: marketing drafts the plain-language licensing paragraph, legal approves, and engineering merges the robots companion during the same release that updates CDN WAF rules. When teams also need SERP presentation checks, they jump to the Meta Checker and Canonical Tag Builder; when structured FAQs accompany AI disclosure pages, they open the FAQ Schema Generator. Throughout, the workflow stays local-first—ideal for regulated environments where even aggregated URL lists feel sensitive until publication.
How SynthQuery compares
Manually authoring llms.txt in a generic text editor works until you need synchronized robots directives, consistent bot spelling, and stakeholder-friendly explanations in one pass. Spreadsheets and wikis drift out of sync with production; this tool keeps the narrative file and the robots companion tied to the same underlying state, highlights syntax for readability, and documents directive semantics inline. Compared to all-in-one enterprise crawler suites, SynthQuery stays free, local, and focused—ideal when you want a vetted starting point without standing up another SSO-protected dashboard.
Per-bot clarity
SynthQuery: Eight major AI-related bots with tooltips, policy modes, optional crawl-delay, and User-agent overrides.
Typical alternatives: Static blog templates with outdated bot names or a single global allow/disallow.

Robots alignment
SynthQuery: Live companion robots.txt snippet generated from the same choices as the narrative file.
Typical alternatives: Separate documents that contradict each other after the first edit.

Education
SynthQuery: Directive tooltips, best practices callout, and long-form FAQ on this page.
Typical alternatives: Minimal placeholder pages with no compliance context.

Privacy posture
SynthQuery: Runs entirely in-browser with optional local persistence; no upload to SynthQuery servers.
Typical alternatives: Hosted generators that may log inputs; check their policies carefully.
How to use this tool effectively
Begin with accurate site metadata because every downstream section inherits your primary URL when we expand relative paths into absolute links. Enter the public brand or publication name, a tight description that states audience and content boundaries, and the canonical https origin without tracking parameters. If you operate multiple properties, generate one file per hostname—llms.txt is conventionally served at the root of each site you care about, not on a centralized dashboard domain unless that domain is the one crawlers actually fetch.
Next, work through AI crawling preferences methodically rather than toggling randomly. For each bot row, read the tooltip summary, decide whether you want full access, a hard disallow that maps to Disallow: / in the robots companion, or a crawl-delay-oriented posture that still leaves paths open but asks for slower fetching. Adjust the User-agent override field only when a vendor documents a different token or when your security team standardizes alternate spellings; otherwise keep the default to reduce copy-paste errors. Remember that crawl-delay is not a substitute for authentication and that major search engines may ignore delay lines for their primary crawlers—use delays for politeness where supported, not secrecy.
Then choose which content sections you want to advertise to LLMs. Enable blog, docs, products, about, and FAQ independently and set path prefixes that mirror your router—trailing slashes are optional but consistency helps operators diff policies across environments. If a section should remain discoverable to humans but not highlighted to models, disable it here while tightening robots rules or meta tags separately. After paths look correct, fill the preferred citation format with explicit instructions: title case for brand, required URL, access date, and any “do not train on” language your counsel approves. Add an AI contact channel that reaches a monitored inbox or ticketing URL, and summarize licensing in short clauses that complement—not replace—your Terms of Service.
When the preview updates, read the llms.txt pane aloud with your team; awkward sentences become obvious. Copy or download the file, place it at /llms.txt on your origin, purge CDN caches if you edge-cache text/plain, and merge the robots snippet into your master robots.txt without duplicating User-agent blocks. Revalidate in staging fetch logs, update changelog entries when marketing rotates messaging, and bookmark /free-tools for the rest of SynthQuery’s lightweight SEO utilities.
Limitations and best practices
Neither llms.txt nor robots.txt blocks determined scrapers, authenticated API abuse, or manual copying. They also do not replace copyright, contracts, paywalls, or opt-out consoles some vendors provide separately. Google does not use crawl-delay for Googlebot, and many AI crawlers may interpret tokens differently than you expect—validate with server logs and official documentation rather than assumptions. Keep duplicate User-agent blocks out of robots.txt when merging snippets, retest after CDN or edge worker changes, and version-control policy updates like any production config. For broader SEO health, maintain clean XML sitemaps, accurate canonical tags, and honest meta descriptions; this generator accelerates AI transparency work but does not fix unrelated technical debt.
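The log check mentioned above can start as simply as pulling out requests whose user-agent claims a known AI token, so the source IPs can be compared against each vendor's published verification guidance. The sketch below assumes the common combined log format (client IP first, user-agent as the final quoted field); adapt the parsing to your own logs.

```python
import re

# Tokens we declared in robots.txt; anything claiming them deserves scrutiny.
AI_TOKENS = ("GPTBot", "ClaudeBot", "PerplexityBot", "Bytespider")

# Minimal combined-log pattern: client IP first, user-agent quoted last.
LOG_LINE = re.compile(r'^(\S+) .*"([^"]*)"$')

def suspect_ips(log_lines):
    """Map each claimed AI token to the set of client IPs using it."""
    hits = {}
    for line in log_lines:
        m = LOG_LINE.match(line)
        if not m:
            continue
        ip, ua = m.groups()
        for token in AI_TOKENS:
            if token in ua:
                hits.setdefault(token, set()).add(ip)
    return hits
```

A claimed GPTBot request from an address outside OpenAI's documented ranges is a strong spoofing signal; treat the output as a shortlist for verification, not proof by itself.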
Frequently asked questions
What is llms.txt?
llms.txt is a community-driven pattern for a small text or markdown file, usually hosted at the root of your website, that tells people and AI systems—in plain language—how you want large language models to use your public content. It often includes a short site description, pointers to sections that are safe to summarize, preferred citation formats, licensing notes, and contact information for AI-related questions. It is not a formal internet standard with the same interoperability guarantees as DNS or TLS, but it is increasingly referenced in GEO conversations because it communicates intent clearly when paired with robots.txt and legal terms.
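Because llms.txt is a convention rather than a fixed schema, files vary in shape, but a minimal example often looks like the sketch below. Every name, path, and address here is a placeholder:

```
# Example Site
> Independent publication covering widget engineering.

## Content
- Blog: https://example.com/blog/ (summaries welcome with citation)
- Docs: https://example.com/docs/

## Citation
Cite as "Example Site" with a link to the source URL and an access date.

## Contact
AI and licensing questions: ai@example.com

## License
Content may be referenced with attribution; training rights are not
granted by default.
```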
How is llms.txt different from robots.txt?
robots.txt is a machine-oriented exclusion protocol crawlers may follow before fetching URLs; it expresses rules like which user agents may access which path prefixes. llms.txt is broader and more narrative: it can explain business context, attribution expectations, and licensing in sentences, not just directives. They solve overlapping but distinct problems—robots for fetch control among polite bots, llms.txt for transparent AI communication—so teams should keep both consistent rather than choosing one as a complete solution.
Which AI bots should I allow or block?
There is no universal answer: it depends on whether you value training data inclusion, live browsing citations, answer-engine presence, and how comfortable you are with each vendor’s documented practices. Publishers chasing AI Overviews or assistant citations sometimes allow major retrieval crawlers while blocking bots tied to training they do not license. Highly regulated data should stay off the public web entirely rather than relying on disallow lines. Always confirm the latest vendor documentation, involve counsel when contracts or copyrights matter, and revisit decisions quarterly because product lines rename crawlers frequently.
Will blocking AI crawlers hurt my traditional Google rankings?
Blocking AI-specific user agents in robots.txt generally does not change how Googlebot indexes your site unless you accidentally misconfigure overlapping rules. However, reducing AI visibility can influence whether assistants quote or recommend your brand in generative interfaces, which is a GEO consideration distinct from classical blue-link rankings. Monitor Search Console for unintended side effects whenever you merge new snippets, and keep canonical URLs, structured data, and internal linking strong so traditional SEO signals remain healthy.
What is GPTBot, and should I allow it?
GPTBot is OpenAI’s documented web crawler identifier associated with fetching public pages for model improvement and related uses, subject to OpenAI’s current policies. Allowing it means you are not using robots.txt to request that OpenAI’s crawler avoid your site; blocking it signals the opposite for well-behaved fetches. The business tradeoff hinges on whether you want participation in OpenAI-driven experiences versus control over training exposure. Read OpenAI’s official guidance alongside your counsel because legal contexts differ for news, software, healthcare, and user-generated content sites.
How do I get cited by AI assistants?
Citations require fetchable, authoritative pages, clear titles, stable URLs, and content structured so models can attribute quotes accurately—headings, publication dates, and explicit author or organization signals help. llms.txt can state how you want mentions formatted, but it cannot force compliance. Technical discoverability still depends on robots rules, sitemaps, performance, and whether assistants choose to browse your domain in real time. Pair this tool with strong on-page metadata, FAQ schema where appropriate, and public documentation hubs assistants are likely to retrieve.
Where should I host llms.txt?
Host it at https://your-domain/llms.txt on the same registrable hostname users associate with your brand, served over HTTPS with a 200 OK for anonymous GET requests. Avoid requiring cookies or interactive challenges on that path, or CDNs may cache error responses. If you operate subdomains per language or product, duplicate or adapt the file per host when policies diverge. Keep deployment in version control like robots.txt so rollbacks are trivial when messaging changes.
Can I stop AI systems from using my content entirely?
You can influence polite crawlers through robots.txt, vendor-specific opt-out tools, contracts, and copyright assertions, but you cannot technologically prevent every copying scenario once content is public. llms.txt communicates preferences; it does not encrypt data. For high-sensitivity material, use authentication, legal agreements, and data minimization rather than declarative text files alone. Document retention policies internally so marketing statements in llms.txt match what engineering actually enforces.
What is the difference between training and retrieval crawling?
Training crawling typically ingests large corpora to update model weights offline, while retrieval crawling fetches specific pages live when a user asks a question that requires up-to-date web context. User agents and vendor policies may differ between these modes—some products separate tokens such as GPTBot versus browsing agents. Your robots and llms policies should spell out how you expect each mode to treat licensing, attribution, and rate limits, understanding that not every provider exposes granular toggles.
How does llms.txt fit into GEO?
GEO focuses on being discoverable, trustworthy, and quotable in AI-generated answers, overviews, and assistant experiences. llms.txt contributes a transparent layer of intent and contact data that complements technical SEO, structured data, and brand-aligned copy. It does not replace measurement—you still need analytics, Search Console, and qualitative checks in AI products—but it reduces ambiguity for teams building responsible AI visibility programs. SynthQuery bundles this generator with robots and sitemap utilities so technical and editorial stakeholders can ship coherent policies together.