Verify AI tools and LLMs can access your pages (robots.txt)

robots.txt is the tiny file at your site root that tells AI crawlers which parts of your site they're allowed to read — and the single most common cause of accidental AI invisibility is an old config quietly blocking the wrong bots. This article walks you through checking yours in about ten minutes.

What this article covers

By the end, you'll understand what robots.txt is and how it controls which AI tools can read your site, and you'll be able to check whether you're accidentally blocking the AI crawlers you actually want. This one is mostly a checking exercise, so it's quick.

What is robots.txt?

robots.txt is a small plain-text file that sits at the root of your website (at yoursite.com/robots.txt). It's been around since the 1990s, long before AI, and its job has always been the same: it tells automated visitors (bots) which parts of your site they're allowed to read.

Think of it as a posted notice at your front door. It lists who's welcome and where they can go. The important thing to understand is that it works on the honor system. Well-behaved bots read it and follow the rules. It doesn't physically block anyone; it just states your preferences. The major AI companies (OpenAI, Anthropic, Google, Perplexity) say their main crawlers honor it, and in practice they generally do.

This matters for AI visibility because of one common, costly mistake: many websites were set up years ago to block unfamiliar bots by default, sometimes through a security tool or CDN rather than a deliberate choice. If your robots.txt (or your security settings) is blocking the AI crawlers, then those tools can't read your pages, and your content becomes ineligible to appear in their answers. Before you spend any effort optimizing your content for AI, it's worth confirming the door is actually open.

What robots.txt does (and doesn't do)

What it does: it lets you allow or block specific bots, by name, from reading some or all of your site.

What it doesn't do:

It doesn't force anyone in or out. It's a request, not a lock. A handful of badly behaved crawlers ignore it entirely, and for those the only real defense is at the server level. That's an edge case, not your starting concern.
It doesn't affect your Google Search ranking when you adjust the AI-specific rules. Blocking Google's AI-training token (Google-Extended) does not change your normal Google Search position, because Search indexing follows the rules for Googlebot, not the rules for that token. (More on that below.)
It doesn't help AI tools understand your content. That's a separate file (llms.txt). robots.txt only governs access.

The one distinction that trips everyone up

Not all AI bots do the same job. The big practical split is between two kinds, and they often have separate names from the same company:

Training crawlers collect content to train future AI models. Examples: GPTBot (OpenAI), ClaudeBot (Anthropic), Google-Extended (Google), CCBot (Common Crawl). Blocking these is a reasonable choice if your concern is keeping your content out of model training. It's an intellectual-property decision.

Search and retrieval crawlers fetch your pages so an AI can cite them in a live answer when someone asks a question. Examples: OAI-SearchBot and ChatGPT-User (OpenAI), Claude-SearchBot and Claude-User (Anthropic), PerplexityBot (Perplexity). These are the ones that affect whether you show up in AI answers. Blocking these is what removes you from AI visibility.

Here's the catch that causes most accidental mistakes: these are separate tokens, so blocking the training bot does not block the search bot, and vice versa. If you (or an old config) block ClaudeBot thinking you've blocked Anthropic entirely, Claude-SearchBot can still read your pages. The reverse also happens: a file can cut off the search bot while leaving the training bot open. You have to handle each one explicitly. For most businesses that want to be found in AI answers, the sensible default is to allow the search and retrieval bots, and make a separate, deliberate decision about the training bots.
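To see that the tokens really are independent, here is a minimal sketch using Python's standard-library urllib.robotparser. The rules and the yoursite.com URL are illustrative only, and Python's parser is an approximation of how each company's crawler reads the file, not a copy of it.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules: block Anthropic's training crawler and nothing else.
RULES = """\
User-agent: ClaudeBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

url = "https://yoursite.com/pricing"  # placeholder URL
training_allowed = parser.can_fetch("ClaudeBot", url)
search_allowed = parser.can_fetch("Claude-SearchBot", url)

print("ClaudeBot allowed:", training_allowed)       # False
print("Claude-SearchBot allowed:", search_allowed)  # True
```

The search bot stays allowed because no block names it and there is no wildcard block to fall back on.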

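One useful, often-overlooked fact: blocking Google's AI-training token (Google-Extended) does not affect your normal Google Search ranking. Google has stated this directly. So you can opt out of Gemini training without paying any Search penalty. Strictly speaking, Google-Extended is a control token rather than a separate crawler: Google fetches pages with its existing crawlers and uses the token only to decide whether the content may be used for AI training, while Search indexing follows the Googlebot rules. That also means you won't see Google-Extended as a visitor in your server logs.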

How to check whether AI tools can reach your pages

Step 1 — Open your current robots.txt

Type your domain followed by /robots.txt into a browser, for example https://yoursite.com/robots.txt. Whatever appears is the file every bot reads. If you get a 404 (file not found), you don't have one. That's not a problem in itself, because no file means bots are allowed everywhere by default.
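If you'd rather check from a script than a browser, the sketch below fetches the file with Python's standard library. The function names are made up for this example, yoursite.com is a placeholder, and the plain-English messages are our own wording; only the 200 and 404 cases come from the step above.

```python
import urllib.error
import urllib.request


def interpret_status(status: int) -> str:
    """Translate the HTTP status of /robots.txt into plain English."""
    if status == 200:
        return "file found: read the rules it contains"
    if status == 404:
        return "no robots.txt: bots are allowed everywhere by default"
    return f"unexpected status {status}: check with your host or CDN"


def fetch_robots(domain: str) -> tuple[int, str]:
    """Return (status, body) for https://<domain>/robots.txt."""
    url = f"https://{domain}/robots.txt"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # Error responses such as 404 arrive as exceptions; keep the code.
        return err.code, ""


# Example usage (needs network access; replace the placeholder domain):
#   status, body = fetch_robots("yoursite.com")
#   print(interpret_status(status))
#   print(body)
```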

Step 2 — Read the rules out loud in plain English

The file is made of simple blocks. Each block names a bot, then lists what it can or can't do:

User-agent: GPTBot
Disallow: /

That reads as: "GPTBot is not allowed anywhere on this site." A Disallow: followed by a single / means the whole site is off-limits to that bot. An Allow: / means the bot is welcome everywhere, and a Disallow: with nothing after it means the same thing: nothing is blocked. A line like Disallow: /private/ blocks only that folder.

Scan the file for any of the AI bot names listed earlier. For each one, ask: is it allowed or disallowed? Pay special attention to any block that starts with User-agent: *, because the * is a wildcard meaning "all bots," and a Disallow: / under it would block everything that isn't given its own explicit rule.
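The wildcard behavior is easy to confirm with a small test. This sketch uses Python's standard-library urllib.robotparser on an illustrative file; the rules and URL are examples, not a recommendation.

```python
from urllib.robotparser import RobotFileParser

# Illustrative file: block everyone, then give one bot its own block.
RULES = """\
User-agent: *
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

url = "https://yoursite.com/"  # placeholder URL
results = {
    bot: parser.can_fetch(bot, url)
    for bot in ("OAI-SearchBot", "PerplexityBot", "GPTBot")
}
print(results)
# {'OAI-SearchBot': True, 'PerplexityBot': False, 'GPTBot': False}
```

Only the bot with its own block escapes the wildcard; every other bot inherits the Disallow: /.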

Step 3 — Flag anything that blocks a search or retrieval bot

If you find a Disallow: / under any of the search and retrieval crawlers (OAI-SearchBot, ChatGPT-User, Claude-SearchBot, Claude-User, PerplexityBot), that's likely an accidental visibility problem and worth fixing. Blocking a training crawler may be intentional, so leave those unless you've decided otherwise.
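The same check can be automated. The sketch below is one possible approach, again with Python's urllib.robotparser: it takes the text of a robots.txt file and lists which search and retrieval bots cannot reach the home page. The function name and the sample file are hypothetical.

```python
from urllib.robotparser import RobotFileParser

SEARCH_BOTS = ["OAI-SearchBot", "ChatGPT-User", "Claude-SearchBot",
               "Claude-User", "PerplexityBot"]


def blocked_search_bots(robots_txt: str,
                        url: str = "https://yoursite.com/") -> list[str]:
    """Return the search/retrieval bots that may not fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in SEARCH_BOTS if not parser.can_fetch(bot, url)]


# Hypothetical legacy file: blocks one search bot and one training bot.
sample = """\
User-agent: PerplexityBot
Disallow: /

User-agent: GPTBot
Disallow: /
"""
flagged = blocked_search_bots(sample)
print(flagged)  # ['PerplexityBot']
```

GPTBot is left out of the result on purpose, since blocking a training crawler may be a deliberate choice.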

Step 4 — Watch for deprecated bot names

Some old robots.txt files still reference Claude-Web and anthropic-ai. Anthropic no longer uses these. They're harmless to leave in place, but if your file blocks only those, you are not actually blocking Anthropic's current crawler. Don't assume an old name is still doing the job you think it is.
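A quick way to spot this situation is to list the bot names your file mentions. The following sketch is an assumption-laden helper, not an official tool: it flags a file that names one of the two retired tokens above but never names ClaudeBot.

```python
import re

DEPRECATED = {"claude-web", "anthropic-ai"}  # the retired names from this step
CURRENT = "claudebot"

# Matches lines such as "User-agent: GPTBot" and captures the bot name.
UA_LINE = re.compile(r"^[ \t]*user-agent[ \t]*:[ \t]*([^#\s]+)",
                     re.IGNORECASE | re.MULTILINE)


def user_agent_tokens(robots_txt: str) -> set[str]:
    """Collect every User-agent name in the file, lowercased."""
    return {name.lower() for name in UA_LINE.findall(robots_txt)}


def stale_anthropic_rules(robots_txt: str) -> bool:
    """True if the file names a retired token but not ClaudeBot."""
    tokens = user_agent_tokens(robots_txt)
    return bool(tokens & DEPRECATED) and CURRENT not in tokens


# Hypothetical old file that only knows the retired names.
old_file = "User-agent: anthropic-ai\nDisallow: /\n\nUser-agent: Claude-Web\nDisallow: /\n"
print(stale_anthropic_rules(old_file))  # True
```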

Step 5 — Test it with a free tool instead of guessing

Reading the rules by eye is error-prone, especially with multiple blocks. Free tools let you enter your URL and a specific bot name and tell you plainly whether that bot is allowed or blocked. One reliable option is the Merkle robots.txt Tester. Google Search Console is also worth a look, but its robots.txt report only shows how Google fetches and reads your file, so it won't tell you how other companies' bots are treated. Run each AI bot name you care about through a tester that lets you choose the user agent.

Step 6 — Check that your security tool isn't overriding the file

This is the step most people skip, and it's a common cause of accidental AI invisibility. Even with a perfect robots.txt, a CDN or security service (Cloudflare, Sucuri, and similar) may be silently blocking AI crawlers at a different layer, treating them like malicious scrapers. If you use one, log into its dashboard and look for an "AI bots," "AI scrapers," or "bot management" setting, and confirm it isn't blocking the crawlers you want to allow. If your security tool also manages your robots.txt, make sure it isn't overriding the file on your own server.
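One rough way to spot a security layer at work is to request the same page twice, once with a browser-like User-Agent and once with a string containing a bot's name, and compare the responses. The sketch below assumes simplified User-Agent strings ("Mozilla/5.0" and "GPTBot") rather than the full strings real crawlers send, and security tools may also check IP addresses, so treat the result as a hint, not proof. The dashboard remains the authoritative place to check.

```python
import urllib.error
import urllib.request


def status_for(url: str, user_agent: str) -> int:
    """Fetch `url` with the given User-Agent header; return the HTTP status."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Blocks usually surface as error statuses such as 403.
        return err.code


def treated_differently(browser_status: int, bot_status: int) -> bool:
    """True when the browser-like request succeeds but the bot-labelled one fails."""
    return browser_status < 400 <= bot_status


# Example usage (needs network access; replace the placeholder URL):
#   browser = status_for("https://yoursite.com/", "Mozilla/5.0")
#   bot = status_for("https://yoursite.com/", "GPTBot")  # simplified, assumed
#   print(treated_differently(browser, bot))
```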

Step 7 — Confirm with your server logs after a week or two

The real proof is whether the bots actually show up. Changes to robots.txt take effect only once a crawler re-reads the file, and you'll see the result only when it next crawls your pages, which can take days to weeks. After that, check your server access logs (or your CDN's bot analytics) for the bot names. Searching your logs for tokens like GPTBot, OAI-SearchBot, ClaudeBot, Claude-SearchBot, or PerplexityBot shows which AI crawlers have genuinely reached your site. If a bot you've allowed never appears, it may simply not have visited yet, which is normal for smaller or less frequently updated sites. If it stays absent for a long time, go back to Step 6 and rule out a security layer.
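If your logs are plain text files, a few lines of Python can do the searching. This is a minimal sketch: the sample lines are invented for illustration (they are not real crawler User-Agent strings), and access.log is a hypothetical path.

```python
from collections import Counter

AI_TOKENS = ["GPTBot", "OAI-SearchBot", "ClaudeBot",
             "Claude-SearchBot", "PerplexityBot"]


def count_ai_hits(log_lines) -> Counter:
    """Count log lines that mention each AI crawler token (case-insensitive)."""
    hits = Counter()
    for line in log_lines:
        lowered = line.lower()
        for token in AI_TOKENS:
            if token.lower() in lowered:
                hits[token] += 1
    return hits


# Invented sample lines, for illustration only.
sample_lines = [
    '203.0.113.5 "GET /pricing HTTP/1.1" 200 "ExampleAgent; GPTBot/1.0"',
    '203.0.113.9 "GET / HTTP/1.1" 200 "ExampleAgent; Claude-SearchBot/1.0"',
    '198.51.100.7 "GET / HTTP/1.1" 200 "Mozilla/5.0 (ordinary browser)"',
    '203.0.113.5 "GET /about HTTP/1.1" 200 "ExampleAgent; GPTBot/1.0"',
]
hits = count_ai_hits(sample_lines)
print(dict(hits))  # {'GPTBot': 2, 'Claude-SearchBot': 1}

# With a real file (hypothetical path):
#   with open("access.log", encoding="utf-8", errors="replace") as f:
#       print(count_ai_hits(f))
```

Anyone can put a bot's name in a User-Agent header, so a match here shows that something identifying itself as that bot visited, not a verified visit.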

A safe starting configuration

If you want to be found in AI answers and you're comfortable with your public content being read, a simple "welcome the search bots" setup looks like this. You can paste blocks like these into your robots.txt, adjusting to your own decisions:

# Allow AI search and retrieval crawlers (these affect AI visibility)
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

# Everything else allowed by default
User-agent: *
Allow: /

Sitemap: https://yoursite.com/sitemap.xml

If you also want to keep your content out of AI model training while staying visible in AI answers, add explicit Disallow: / blocks for the training crawlers (GPTBot, ClaudeBot, Google-Extended, CCBot) underneath. That's a separate, legitimate choice, not a requirement.
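Before publishing a combined file, you can sanity-check it offline. The sketch below assembles the search-bot blocks plus the four training-bot blocks into one file and asks Python's urllib.robotparser how each bot would be treated. The Bingbot entry is added here only as an example of a bot with no block of its own, and Python's parser approximates rather than reproduces each crawler's behavior.

```python
from urllib.robotparser import RobotFileParser

SEARCH = ["OAI-SearchBot", "ChatGPT-User", "Claude-SearchBot",
          "Claude-User", "PerplexityBot"]
TRAINING = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot"]

# Build one block per bot, separated by blank lines, then the wildcard block.
blocks = [f"User-agent: {bot}\nAllow: /" for bot in SEARCH]
blocks += [f"User-agent: {bot}\nDisallow: /" for bot in TRAINING]
blocks.append("User-agent: *\nAllow: /")
config = "\n\n".join(blocks) + "\n"

parser = RobotFileParser()
parser.parse(config.splitlines())

url = "https://yoursite.com/"  # placeholder URL
allowed = [b for b in SEARCH + ["Bingbot"] if parser.can_fetch(b, url)]
blocked = [b for b in TRAINING if not parser.can_fetch(b, url)]

print("Allowed:", allowed)
print("Blocked:", blocked)
```

If every search bot appears under Allowed and every training bot under Blocked, the file expresses the split described above.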

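Two small rules to avoid breaking the file: always put at least one Allow: or Disallow: line under every User-agent: line, and leave a blank line between blocks so they stay easy to read. A name with no rules of its own is unpredictable: some parsers ignore it, while others treat it as sharing the rules of the next block, which can block or allow a bot you never meant to touch.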

The honest bottom line

robots.txt won't get you cited by AI on its own. What it does is make sure you're not accidentally locked out, which is a surprisingly common problem on older sites. The whole job here is verification: confirm the search and retrieval bots are allowed, make a deliberate choice about training bots, check that no security layer is quietly overriding you, and then confirm in your logs. Ten minutes of checking can save you from optimizing content that AI tools were never allowed to read.