What this article is about
robots.txt is a 30-year-old plain text file at the root of every website. In 2026, it is suddenly the most important file you may have never updated. AI bots are 25 percent of all web traffic. The wrong robots.txt either invites the wrong AI bots in or locks out the right ones.
This article shows you what a 2026-era robots.txt should look like, the five most common mistakes I see in audits, and a copy-paste template you can use today.
Why robots.txt matters more in 2026 than it did in 2024
Two things changed.
First, the number of AI bots crawling the open web exploded. Imperva reports that 38 percent of web traffic is now bots, and a quarter of all web requests come from AI specifically. Three years ago, the number was a rounding error.
Second, AI bots are not all the same. Some bots search for live information that gets cited in answers. Others scrape your content to train future models. The first group helps you. The second group may or may not, depending on your business model. Treating them the same is a mistake.
The 2026 mental model looks like this.
| Bot type | Examples | What they do | Should you allow? |
|---|---|---|---|
| Search retrieval | OAI-SearchBot, PerplexityBot, Bravebot, Applebot, Claude-SearchBot | Index your site for live AI search results | Yes, almost always |
| User-initiated browsing | ChatGPT-User, Claude-User, Perplexity-User, MistralAI-User | Fetch a page when a user asks an AI assistant to read it | Yes |
| Training scrapers | GPTBot, ClaudeBot, Google-Extended, CCBot, Amazonbot | Pull content to train future models | Your call |
| Aggressive scrapers | Bytespider | Often ignore robots.txt anyway | Block, but accept they may not respect it |
A modern robots.txt makes those distinctions explicit. The old "allow everything" or "block everything" patterns from the 2010s are no longer fit for purpose.
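If you want to apply the same distinctions to your own server logs, the table can be held as a small lookup. This is a minimal sketch in Python: the bot names come from the table, while the category keys and the `classify` helper are names I made up for the example.

```python
# Bot tokens copied from the table above; the dict keys and helper name are illustrative.
BOT_CATEGORIES = {
    "search_retrieval": ["OAI-SearchBot", "PerplexityBot", "Bravebot", "Applebot", "Claude-SearchBot"],
    "user_initiated": ["ChatGPT-User", "Claude-User", "Perplexity-User", "MistralAI-User"],
    "training": ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot", "Amazonbot"],
    "aggressive": ["Bytespider"],
}


def classify(user_agent: str) -> str:
    """Return the category whose bot token appears in the User-Agent header, else 'unknown'."""
    ua = user_agent.lower()
    for category, tokens in BOT_CATEGORIES.items():
        for token in tokens:
            if token.lower() in ua:
                return category
    return "unknown"
```

Matching on substrings is deliberately loose, because real User-Agent headers wrap the token in version numbers and URLs. It also means a spoofed header is classified as whatever it claims to be.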
The five most common robots.txt mistakes I see
I have audited more than 200 sites in the last few weeks. These are the patterns that come up over and over.
Mistake 1: Blocking GPTBot when you mean to block training only
GPTBot is OpenAI's training scraper. OAI-SearchBot is the bot that indexes for ChatGPT search. They are different. Blocking GPTBot does not block ChatGPT search. Blocking OAI-SearchBot kills your ChatGPT search visibility.
I have seen sites that explicitly Disallow OAI-SearchBot, Disallow PerplexityBot, and then complain that AI does not recommend them. This is the most common cause.
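The difference between the two OpenAI bots is easy to check with the robots.txt parser in Python's standard library. A minimal sketch, assuming a file that blocks only GPTBot; `example.com` is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that blocks the training scraper and nothing else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/pricing"  # placeholder URL
gptbot_allowed = parser.can_fetch("GPTBot", url)
searchbot_allowed = parser.can_fetch("OAI-SearchBot", url)

print("GPTBot:", gptbot_allowed)
print("OAI-SearchBot:", searchbot_allowed)
```

The training scraper is refused and the search bot is not, because no group in the file names it.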
Mistake 2: User-agent: * with Disallow: /
This is a wildcard block. It applies to every bot that respects robots.txt and does not have a named group of its own. If you do this without explicit Allow rules for the AI search bots, you are invisible to AI search.
The fix is to either remove the global block or add named groups with explicit Allow rules for the AI search retrieval bots.
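Here is a sketch of the second fix, checked with Python's standard-library parser. The file keeps the wildcard block and adds one named group for two search bots. `SomeOtherBot` and `example.com` are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Wildcard block plus a named group that lets two AI search bots back in.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: OAI-SearchBot
User-agent: PerplexityBot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/"  # placeholder URL
search_allowed = parser.can_fetch("OAI-SearchBot", url)
perplexity_allowed = parser.can_fetch("PerplexityBot", url)
other_allowed = parser.can_fetch("SomeOtherBot", url)  # falls back to the * group
```

A bot that finds a group naming it follows that group instead of the wildcard one, which is why the two search bots get through while everything else stays blocked.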
Mistake 3: No robots.txt at all
About 12 percent of the small business sites I have audited do not have a robots.txt. Without one, compliant bots treat every page as open to crawling, and you have no way to tell search bots and training scrapers apart. You have no control.
Even if you want to allow everything, ship a robots.txt. It signals that your site is professionally maintained.
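To see what "no control" means in practice, here is a minimal sketch with Python's standard-library parser. It assumes a missing file behaves like an empty one, which is how compliant crawlers generally treat a 404. `example.com` is a placeholder.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([])  # no rules at all, standing in for a missing robots.txt

url = "https://example.com/archive/"  # placeholder URL
training_allowed = parser.can_fetch("GPTBot", url)
search_allowed = parser.can_fetch("OAI-SearchBot", url)
```

With no rules, the training scraper and the search bot get the same answer: yes.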
Mistake 4: Allowing all AI bots to scrape training data with no thought
If your business sells specialized content, you may not want OpenAI's training scraper pulling your full archive for free. The default of "let everything in" used to be benign. In 2026, it has economic implications.
I am not telling you to block training scrapers. I am telling you to make a deliberate decision either way and document it.
Mistake 5: Using robots.txt to "secure" content
robots.txt is a request, not a barrier. Bytespider has been documented to ignore robots.txt entirely. Some scrapers spoof their User-Agent. If a page must not be public, password protect it or remove it. robots.txt does not protect anything.
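A real barrier lives in your server, not in robots.txt. The sketch below shows the shape of that check in Python. It is an illustration under stated assumptions, not a drop-in for any particular web server: the `DENIED_TOKENS` list and the `response_status` helper are hypothetical names.

```python
# Hypothetical deny list for this sketch, matched against the lowercase User-Agent header.
DENIED_TOKENS = ("bytespider",)


def response_status(user_agent: str, path_is_private: bool, is_authenticated: bool) -> int:
    """Pick an HTTP status for a request before any content is served."""
    # Private pages require a login no matter who is asking.
    if path_is_private and not is_authenticated:
        return 401
    # Known-bad crawler tokens are refused outright.
    if any(token in user_agent.lower() for token in DENIED_TOKENS):
        return 403
    return 200
```

The User-Agent check only stops bots that identify themselves honestly. The login check is the part that actually protects a page, because a spoofed header cannot get past it.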
A 2026 robots.txt template for a small business
Drop this at the root of your site as robots.txt. It allows AI search retrieval bots, allows user-initiated browsing bots, and blocks training-only scrapers. Adjust the training section based on your business model.
# Crawler rules — last updated 2026-05
# Allow AI search retrieval bots (the ones that get you cited)
# Block training scrapers (your call, change as needed)
# General
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /
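# AI search retrieval (allow)
# Note: a bot that has its own group below ignores the User-agent: * group above,
# so repeat Disallow: /admin/ and Disallow: /private/ in any named group that needs them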
User-agent: OAI-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Bravebot
Allow: /
User-agent: Applebot
Allow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: Google-CloudVertexBot
Allow: /
User-agent: DuckAssistBot
Allow: /
# User-initiated browsing (allow, these fetch when a user asks)
User-agent: ChatGPT-User
Allow: /
User-agent: Claude-User
Allow: /
User-agent: Perplexity-User
Allow: /
User-agent: MistralAI-User
Allow: /
# Training scrapers (block — change to Allow if you want training inclusion)
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
# Aggressive scrapers (block)
User-agent: Bytespider
Disallow: /
User-agent: FirecrawlAgent
Disallow: /
# Sitemap reference
Sitemap: https://yourdomain.com/sitemap.xml
This file is about 60 lines. It takes 10 minutes to deploy. On its own, it usually pushes the AI Discovery score in an AIFreeAudit report up by 30 to 50 points.
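Before deploying, it is worth running a trimmed copy of the template through a parser to confirm each category behaves as intended. The sketch below uses Python's standard-library parser with one bot per category. `SomeOtherBot` is a placeholder for any crawler the file does not name, and the `allowed` helper is my own. The trimmed file leaves out the `Allow: /` line in the wildcard group, because Python's parser applies the first matching rule rather than the longest one.

```python
from urllib.robotparser import RobotFileParser

# Trimmed copy of the template: one bot per category is enough for the check.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /private/

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Bytespider
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

SITE = "https://yourdomain.com"  # placeholder domain from the template


def allowed(agent: str, path: str) -> bool:
    """True if the parser would let this agent fetch this path."""
    return parser.can_fetch(agent, SITE + path)


results = {
    "search bot, homepage": allowed("OAI-SearchBot", "/"),
    "browsing bot, homepage": allowed("ChatGPT-User", "/"),
    "training bot, homepage": allowed("GPTBot", "/"),
    "unnamed bot, /admin/": allowed("SomeOtherBot", "/admin/"),
    "search bot, /admin/": allowed("OAI-SearchBot", "/admin/"),
}

for check, outcome in results.items():
    print(f"{check}: {'allowed' for _ in [0]}" if False else f"{check}: {outcome}")
```

The last check is the one to watch. A named bot follows only its own group, so the `/admin/` rule in the wildcard group does not apply to it unless you repeat it there.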
How to verify your robots.txt is working
Three tests.
First, fetch the file directly. Go to https://yourdomain.com/robots.txt and read it. Does it match what you intended?
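Second, test specific User-Agents. Run the file through a robots.txt parser or testing tool and check your homepage as GPTBot, then as OAI-SearchBot, and confirm the blocks and allows are correct. Separately, fetch the homepage with curl and a custom User-Agent header. Because robots.txt is only a request, curl will never show you a robots.txt block, but it will show whether your server, CDN, or firewall is rejecting a bot before it ever reads the file.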
Third, run the AIFreeAudit check. Our AI Discovery category specifically tests how each of the major AI bots is handled. It will flag conflicts and missing rules in seconds.
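The first two tests can also be scripted. This is a rough sketch using only Python's standard library. The function names and the `expected` mapping are my own, and the download helper makes a network call, so it is shown but not run here.

```python
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser


def fetch_robots_txt(site: str) -> str:
    """Download robots.txt from the site root (network call)."""
    with urlopen(f"{site.rstrip('/')}/robots.txt", timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def find_mismatches(robots_txt: str, url: str, expected: dict[str, bool]) -> list[str]:
    """Return the bots whose parsed permission differs from what you intended."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        agent
        for agent, should_allow in expected.items()
        if parser.can_fetch(agent, url) != should_allow
    ]


# What you intend: search bot in, training bot out.
EXPECTED = {"OAI-SearchBot": True, "GPTBot": False}

# Example input with a mistake in it: the search bot is blocked too.
sample = """\
User-agent: OAI-SearchBot
Disallow: /

User-agent: GPTBot
Disallow: /
"""
mismatches = find_mismatches(sample, "https://yourdomain.com/", EXPECTED)
```

Against a live site, you would pass `fetch_robots_txt("https://yourdomain.com")` as the first argument instead of `sample`. An empty list means the file matches your intent.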
What about the rest of the bot ecosystem?
Beyond the major AI bots, there are more than 500 bots crawling the web in 2026. Most are harmless. Some are not. The full list is maintained at Dark Visitors and KnownAgents.com. Both are free.
For a small business, you do not need to know all 500. Cover the ones in the template above and you are handling about 95 percent of the AI traffic that matters. Update once a quarter when new major bots appear.
What llms.txt is and whether you need it too
llms.txt is a different file. It is a proposed standard from 2024 that gives AI agents a structured map of your site. Think of it as a sitemap with descriptions, written for LLMs rather than search engines.
The honest answer is that adoption is mixed. Anthropic uses parts of llms.txt. Most other engines do not. Google AI Overviews ignores it entirely. I wrote a separate article on whether llms.txt is worth your time.
Short version. If you have docs, yes. If you have a five-page marketing site, it is the seventh thing on your priority list. robots.txt is the first.
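If you do fall in the docs camp, the file itself is small. As I understand the llms.txt proposal, it is a markdown file with an H1 for the site name, a blockquote summary, and H2 sections that each hold a list of links with short descriptions. A minimal Python sketch that renders one follows. The `render_llms_txt` helper and the sample values are made up for illustration.

```python
def render_llms_txt(title: str, summary: str, sections: dict[str, list[tuple[str, str, str]]]) -> str:
    """Render an llms.txt body: H1 title, blockquote summary, then H2 sections of links."""
    lines = [f"# {title}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for name, url, note in links:
            lines.append(f"- [{name}]({url}): {note}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical site used only to show the output shape.
llms_txt = render_llms_txt(
    "Example Co",
    "Docs for Example Co.",
    {"Docs": [("Quickstart", "https://example.com/quickstart.md", "Install and first run")]},
)
print(llms_txt)
```

The result goes at the root of the site as llms.txt, next to robots.txt.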
Summary
robots.txt in 2026 is not the same file it was in 2018. AI bots are 25 percent of all web traffic, and they fall into four distinct categories with different effects on your business. The right robots.txt explicitly allows AI search retrieval bots, allows user-initiated browsing, and decides on training scrapers based on your business model. The template above is a starting point. The free audit at AIFreeAudit tells you exactly which bots you are blocking by mistake.
If your robots.txt has not been updated in the last 12 months, this is the single highest-impact 10-minute fix you can make today.