The 2026 robots.txt guide: why blocking Googlebot is not enough anymore

AI bots are 25 percent of all web traffic in 2026 and they fall into four very different categories. Most robots.txt files were written for a 2018 web and are now actively hurting AI visibility. Here is a 2026-era template, the five mistakes I see most often, and a copy-paste fix.

What this article is about

robots.txt is a 30-year-old plain text file that sits at the root of a website. In 2026, it is suddenly the most important file you may have never updated. With AI bots at 25 percent of all web traffic, the wrong robots.txt either invites the wrong AI bots in or locks out the right ones.

This article shows you what a 2026-era robots.txt should look like, the five most common mistakes I see in audits, and a copy-paste template you can use today.

Why robots.txt matters more in 2026 than it did in 2024

Two things changed.

First, the number of AI bots crawling the open web exploded. Imperva reports that 38 percent of web traffic is now bots, and a quarter of all web requests come from AI specifically. Three years ago, the number was a rounding error.

Second, AI bots are not all the same. Some bots search for live information that gets cited in answers. Others scrape your content to train future models. The first group helps you. The second group may or may not, depending on your business model. Treating them the same is a mistake.

The 2026 mental model:

Bot type	Examples	What they do	Should you allow?
Search retrieval	OAI-SearchBot, PerplexityBot, Bravebot, Applebot, Claude-SearchBot	Index your site for live AI search results	Yes, almost always
User-initiated browsing	ChatGPT-User, Claude-User, Perplexity-User, MistralAI-User	Fetch a page when a user asks the assistant to read it	Yes
Training scrapers	GPTBot, ClaudeBot, Google-Extended, CCBot, Amazonbot	Pull content to train future models	Your call
Aggressive scrapers	Bytespider	Often ignore robots.txt anyway	Block, but accept they may not respect it

A modern robots.txt makes those distinctions explicit. The old "allow everything" or "block everything" patterns from the 2010s are no longer fit for purpose.
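To make the four categories concrete, here is a minimal Python sketch of the table above as a lookup. The category keys and the classify_bot helper are my own illustrative names, and matching by case-insensitive substring on the User-Agent string is an assumption of this sketch, not how any vendor officially identifies its bots.

```python
# Bot tokens copied from the table above, grouped by category.
BOT_CATEGORIES = {
    "search_retrieval": [
        "OAI-SearchBot", "PerplexityBot", "Bravebot", "Applebot", "Claude-SearchBot",
    ],
    "user_initiated": [
        "ChatGPT-User", "Claude-User", "Perplexity-User", "MistralAI-User",
    ],
    "training": ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot", "Amazonbot"],
    "aggressive": ["Bytespider"],
}


def classify_bot(user_agent: str) -> str:
    """Return the category whose token appears in the User-Agent, else 'unknown'."""
    ua = user_agent.lower()
    for category, tokens in BOT_CATEGORIES.items():
        if any(token.lower() in ua for token in tokens):
            return category
    return "unknown"


print(classify_bot("Mozilla/5.0 (compatible; GPTBot/1.0)"))
print(classify_bot("OAI-SearchBot/1.0"))
```

Anything that comes back as unknown is a bot you have not yet made a decision about.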

The five most common robots.txt mistakes I see

I have audited more than 200 sites in the last few weeks. These are the patterns that come up over and over.

Mistake 1: Blocking GPTBot when you mean to block training only

GPTBot is OpenAI's training scraper. OAI-SearchBot is the bot that indexes for ChatGPT search. They are different. Blocking GPTBot does not block ChatGPT search. Blocking OAI-SearchBot kills your ChatGPT search visibility.

I have seen sites that explicitly Disallow OAI-SearchBot, Disallow PerplexityBot, and then complain that AI does not recommend them. This is the most common cause.
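You can check the difference without deploying anything. The sketch below feeds an in-memory robots.txt that blocks only GPTBot to Python's standard-library urllib.robotparser. The example.com URL is a placeholder and the is_allowed helper is my own name.

```python
from urllib.robotparser import RobotFileParser

# A file that blocks the training scraper and says nothing else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, agent: str, url: str = "https://example.com/") -> bool:
    """Parse robots_txt and report whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


gptbot_allowed = is_allowed(ROBOTS_TXT, "GPTBot")
searchbot_allowed = is_allowed(ROBOTS_TXT, "OAI-SearchBot")
print(gptbot_allowed, searchbot_allowed)  # prints: False True
```

The training scraper is blocked and the search bot is untouched, which is exactly the outcome you want if training is the only thing you meant to opt out of.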

Mistake 2: User-agent: * with Disallow: /

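This is a wildcard block. It applies to every compliant bot that does not have its own named User-agent group. If you do this without a named group for each AI search bot, you are invisible to AI search.

The fix is to either remove the global block or add a named User-agent group with Allow: / for each AI search retrieval bot you want to let in. Adding Allow lines inside the wildcard group itself does not single out any bot.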
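Here is a small sketch of that fix, again with Python's urllib.robotparser. It compares a bare wildcard block with the same file plus one named group for two search bots. The URL is a placeholder, and the two bots in the exception group are just examples.

```python
from urllib.robotparser import RobotFileParser

GLOBAL_BLOCK = """\
User-agent: *
Disallow: /
"""

# Same wildcard block, plus a named group shared by two search bots.
WITH_EXCEPTIONS = GLOBAL_BLOCK + """
User-agent: OAI-SearchBot
User-agent: PerplexityBot
Allow: /
"""


def verdicts(robots_txt: str, agents, url: str = "https://example.com/pricing") -> dict:
    """Return {agent: allowed} for each agent under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


agents = ["OAI-SearchBot", "PerplexityBot", "GPTBot"]
before = verdicts(GLOBAL_BLOCK, agents)
after = verdicts(WITH_EXCEPTIONS, agents)
print(before)
print(after)
```

Before the change all three bots are refused. After it, the two named search bots get in while GPTBot still falls through to the wildcard block.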

Mistake 3: No robots.txt at all

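About 12 percent of the small business sites I have audited do not have a robots.txt. When the file is missing, compliant crawlers treat the whole site as open, training scrapers included, and the impolite ones do what they like anyway. You have stated no preference, so you have no control.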

Even if you want to allow everything, ship a robots.txt. It signals that your site is professionally maintained.

Mistake 4: Allowing all AI bots to scrape training data with no thought

If your business sells specialized content, you may not want OpenAI's training scraper pulling your full archive for free. The default of "let everything in" used to be benign. In 2026, it has economic implications.

I am not telling you to block training scrapers. I am telling you to make a deliberate decision either way and document it.

Mistake 5: Using robots.txt to "secure" content

robots.txt is a request, not a barrier. Bytespider has been documented to ignore robots.txt entirely. Some scrapers spoof their User-Agent. If a page must not be public, password protect it or remove it. robots.txt does not protect anything.

A 2026 robots.txt template for a small business

Drop this at the root of your site as robots.txt. It allows AI search retrieval bots, allows user-initiated browsing bots, and blocks training-only scrapers. Adjust the training section based on your business model.

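# Crawler rules, last updated 2026-05
# Allow AI search retrieval bots (the ones that get you cited)
# Block training scrapers (your call, change as needed)

# General
# Disallow lines come first so that parsers applying the first
# matching rule still keep bots out of the private paths.
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /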

# AI search retrieval (allow)
# A bot that matches a named group ignores the * group,
# so the private paths are repeated in each group below.
User-agent: OAI-SearchBot
User-agent: PerplexityBot
User-agent: Bravebot
User-agent: Applebot
User-agent: Claude-SearchBot
User-agent: Google-CloudVertexBot
User-agent: DuckAssistBot
Disallow: /admin/
Disallow: /private/
Allow: /

# User-initiated browsing (allow, these fetch when a user asks)
User-agent: ChatGPT-User
User-agent: Claude-User
User-agent: Perplexity-User
User-agent: MistralAI-User
Disallow: /admin/
Disallow: /private/
Allow: /

# Training scrapers (block — change to Allow if you want training inclusion)
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

# Aggressive scrapers (block)
User-agent: Bytespider
Disallow: /

User-agent: FirecrawlAgent
Disallow: /

# Sitemap reference
Sitemap: https://yourdomain.com/sitemap.xml

This file is about 60 lines. It takes 10 minutes to deploy. In my audits it usually pushes the AI Discovery score, the AIFreeAudit category covered in the next section, up by 30 to 50 points on its own.
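If you would rather generate the file than hand-edit it, a minimal Python sketch could look like the one below. The build_robots_txt function and its parameters are hypothetical names of mine, not part of any tool. The sketch repeats the private paths inside the allowed group because a crawler that matches a named group does not also read the * group.

```python
def build_robots_txt(allow_bots, block_bots, private_paths, sitemap_url):
    """Render a robots.txt with one wildcard group, one allow group, one block group."""
    private_rules = [f"Disallow: {path}" for path in private_paths]

    # Wildcard group: everything is open except the private paths.
    lines = ["User-agent: *", *private_rules, "Allow: /", ""]

    if allow_bots:
        # Named bots ignore the wildcard group, so repeat the private paths.
        lines += [f"User-agent: {bot}" for bot in allow_bots]
        lines += [*private_rules, "Allow: /", ""]

    if block_bots:
        lines += [f"User-agent: {bot}" for bot in block_bots]
        lines += ["Disallow: /", ""]

    lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines) + "\n"


robots_txt = build_robots_txt(
    allow_bots=["OAI-SearchBot", "PerplexityBot", "ChatGPT-User"],
    block_bots=["GPTBot", "ClaudeBot", "Bytespider"],
    private_paths=["/admin/", "/private/"],
    sitemap_url="https://yourdomain.com/sitemap.xml",
)
print(robots_txt)
```

Changing your mind about training scrapers then means moving a name from one list to the other, which also gives you a single place to document the decision.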

How to verify your robots.txt is working

Three tests.

First, fetch the file directly. Go to https://yourdomain.com/robots.txt and read it. Does it match what you intended?

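Second, test specific User-Agents. Run the file through a robots.txt parser or validator and check the verdict for your homepage as GPTBot, then as OAI-SearchBot, and confirm the blocks and allows are correct. Google has retired its standalone robots.txt Tester, so you will need a parser library or a third-party tool. Fetching the homepage with curl and a custom User-Agent header does not test robots.txt at all, because the server returns the page regardless of what the file says. It is still a useful extra check, since it shows whether a firewall or CDN rule is blocking that User-Agent at a different layer.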
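The second test can be scripted. The sketch below compares what you intended with what a parser actually concludes, using Python's urllib.robotparser. EXPECTED, audit and fetch_robots_txt are illustrative names of mine, and the expected verdicts are a shortened version of the template's policy. One caveat: this parser applies the first matching rule in a group and does not handle path wildcards, so its verdict can differ from a crawler that uses longest-match rules.

```python
from urllib.request import Request, urlopen
from urllib.robotparser import RobotFileParser

# What you intend: True means the bot should be allowed on the test URL.
EXPECTED = {
    "OAI-SearchBot": True,
    "PerplexityBot": True,
    "ChatGPT-User": True,
    "GPTBot": False,
    "ClaudeBot": False,
    "Bytespider": False,
}


def audit(robots_txt: str, expected: dict, url: str = "https://yourdomain.com/") -> dict:
    """Return {agent: actual_verdict} for every agent whose verdict differs from expected."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    mismatches = {}
    for agent, should_allow in expected.items():
        actual = parser.can_fetch(agent, url)
        if actual != should_allow:
            mismatches[agent] = actual
    return mismatches


def fetch_robots_txt(site: str) -> str:
    """Download /robots.txt from site. Raises urllib.error.HTTPError if it is missing."""
    request = Request(
        site.rstrip("/") + "/robots.txt",
        headers={"User-Agent": "robots-audit-sketch/0.1"},  # placeholder identifier
    )
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling audit(fetch_robots_txt("https://yourdomain.com"), EXPECTED) returns an empty dict when the live file matches your intent, and otherwise lists each bot with the verdict it actually gets.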

Third, run the AIFreeAudit check. Our AI Discovery category specifically tests how each of the major AI bots is handled. It will flag conflicts and missing rules in seconds.

What about the rest of the bot ecosystem?

Beyond the major AI bots, there are more than 500 bots crawling the web in 2026. Most are harmless. Some are not. Full lists are maintained at Dark Visitors and KnownAgents.com. Both are free.

For a small business, you do not need to know all 500. Cover the ones in the template above and you are handling about 95 percent of the AI traffic that matters. Update once a quarter when new major bots appear.

What llms.txt is and whether you need it too

llms.txt is a different file. It is a proposed standard from 2024 that gives AI agents a structured map of your site. Think of it as a sitemap with descriptions, written for LLMs rather than search engines.
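To show what "a sitemap with descriptions" means in practice, here is a minimal Python sketch that renders one. The layout (a title line, a one-line summary, then sections of described links) follows my reading of the public llms.txt proposal, so treat it as an approximation. The render_llms_txt function, the company name and the URLs are all made up for illustration.

```python
def render_llms_txt(title: str, summary: str, sections: dict) -> str:
    """Render an llms.txt-style document.

    sections maps a heading to a list of (link_name, url, description) tuples.
    """
    lines = [f"# {title}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for name, url, description in links:
            lines.append(f"- [{name}]({url}): {description}")
        lines.append("")
    return "\n".join(lines)


llms_txt = render_llms_txt(
    title="Example Co",
    summary="Widgets for small teams.",
    sections={
        "Docs": [
            ("Pricing", "https://example.com/pricing", "Plans and costs"),
            ("Setup guide", "https://example.com/docs/setup", "Install in five steps"),
        ],
    },
)
print(llms_txt)
```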

The honest answer is that adoption is mixed. Anthropic uses parts of llms.txt. Most other engines do not. Google AI Overviews ignores it entirely. I wrote a separate article on whether llms.txt is worth your time.

Short version: if you have docs, yes. If you have a five-page marketing site, it is the seventh thing on your priority list. robots.txt is the first.

Summary

robots.txt in 2026 is not the same file it was in 2018. AI bots are 25 percent of all web traffic, and they fall into four distinct categories with different effects on your business. The right robots.txt explicitly allows AI search retrieval bots, allows user-initiated browsing, and decides on training scrapers based on your business model. The template above is a starting point. The free audit at AIFreeAudit tells you exactly which bots you are blocking by mistake.

If your robots.txt has not been updated in the last 12 months, this is the single highest-impact 10-minute fix you can make today.