What is a robots.txt File and How Does it Work?

A robots.txt file tells search engine crawlers which pages to visit and which to skip. Here is how it works, common mistakes, and how to check yours.

16 May 2026 · 6 min read

A robots.txt file is a plain-text file placed at the root of a website that tells search engine crawlers which pages they are and are not allowed to visit. It is one of the first things a crawler checks before it starts crawling a site.

The file lives at yoursite.com/robots.txt and follows a simple syntax understood by all major crawlers: Googlebot, Bingbot, and others.

How does robots.txt work?

When a crawler visits a site, it fetches the robots.txt file first. It reads the rules, determines which apply to it, and then crawls accordingly.

A basic robots.txt file:

User-agent: *
Disallow: /admin/
Disallow: /thank-you/

Sitemap: https://www.example.com/sitemap.xml

  • User-agent: * applies the rules below it to all crawlers
  • Disallow: /admin/ prevents crawlers from visiting any URL whose path starts with /admin/
  • Sitemap: points crawlers to the XML sitemap
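
To see how a crawler might interpret the basic file, here is a minimal sketch using Python's standard-library urllib.robotparser. The test paths and the "ExampleBot" crawler name are made up for illustration. This parser does plain prefix matching in file order and does not understand wildcards, so on more complex files its answers can differ from Googlebot's.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /thank-you/

Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" is a hypothetical crawler; it falls under the "*" group.
for path in ["/", "/blog/post", "/admin/", "/admin/users", "/thank-you/"]:
    url = "https://www.example.com" + path
    verdict = "allowed" if parser.can_fetch("ExampleBot", url) else "blocked"
    print(f"{path}: {verdict}")
```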

You can also target specific crawlers:

User-agent: Googlebot
Disallow: /staging/

User-agent: Bingbot
Disallow:

The second block with an empty Disallow: means Bingbot is allowed to crawl everything.
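
The per-crawler behavior can be checked the same way. This is a sketch with Python's urllib.robotparser; the /staging/new-home URL and the "OtherBot" name are assumptions for the example.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /staging/

User-agent: Bingbot
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://www.example.com/staging/new-home"  # hypothetical URL

googlebot_ok = parser.can_fetch("Googlebot", url)  # its group disallows /staging/
bingbot_ok = parser.can_fetch("Bingbot", url)      # empty Disallow: allows everything
otherbot_ok = parser.can_fetch("OtherBot", url)    # no group matches, so nothing is disallowed

print(googlebot_ok, bingbot_ok, otherbot_ok)
```

A crawler that matches no group and finds no User-agent: * group is not restricted at all, which is easy to overlook when a file only names specific bots.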

What robots.txt does and does not do

robots.txt controls crawling, not indexing. This is a critical distinction.

If a page is blocked in robots.txt, a compliant crawler will not visit it. But if another site links to that page, Google may still index the URL. Because it cannot read the page's content, it will index the URL with no title or description.

To prevent indexing, you need a noindex meta tag on the page itself:

<meta name="robots" content="noindex, nofollow">

Do not combine this with a robots.txt Disallow rule for the same page. If you block a page in robots.txt, the crawler cannot read the noindex tag, so the URL can still end up in the index. Blocking pages you want kept out of search results can therefore create problems rather than solve them.

The safe rule: use robots.txt to block pages you do not want crawled (admin panels, internal search results, duplicate parameter URLs). Use noindex for pages you want crawled but not indexed.
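
Whether a page actually carries a noindex directive can be checked with a few lines of code. The following is a rough sketch using Python's built-in html.parser; the class and function names are made up for this example, and it only looks at the meta robots tag, not at an X-Robots-Tag HTTP header.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Split "noindex, nofollow" into individual lowercase directives.
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def has_noindex(html: str) -> bool:
    """Return True if the HTML contains a meta robots tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(has_noindex(page))
```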

Common robots.txt mistakes

Blocking the entire site

This happens during staging site setup and occasionally gets deployed to production:

User-agent: *
Disallow: /

This blocks every compliant crawler from every page. If this rule ends up on a live site, search engines stop fetching your pages, and over time those pages lose their rankings and can drop out of the index. Run a site crawl and check your robots.txt immediately if rankings drop suddenly across the entire site.
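
A check like this is easy to automate, for example as part of a deployment script. The sketch below assumes the robots.txt content is already available as a string; blocks_entire_site is a hypothetical helper name, and it only catches the literal Disallow: / form, not wildcard variants.

```python
from urllib.robotparser import RobotFileParser


def blocks_entire_site(robots_txt: str, user_agent: str = "*") -> bool:
    """Return True if the given agent may not fetch even the root path "/"."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "/")


staging_rules = "User-agent: *\nDisallow: /\n"
production_rules = "User-agent: *\nDisallow: /admin/\n"

print(blocks_entire_site(staging_rules))
print(blocks_entire_site(production_rules))
```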

Blocking CSS and JavaScript

Old SEO advice suggested blocking CSS and JS in robots.txt to save crawl budget. This is harmful. Google needs to render pages to understand them, and blocking the resources it needs to render your pages makes them look broken to Googlebot.
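
One way to spot this problem is to test the resource URLs a page loads against your rules. In this sketch, the robots.txt rules and the asset URLs are hypothetical, and Python's urllib.robotparser stands in for Googlebot's own matching.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that follows the outdated advice.
ROBOTS_TXT = """\
User-agent: *
Disallow: /assets/css/
Disallow: /assets/js/
"""

# Hypothetical resources referenced by a page template.
RESOURCES = [
    "https://www.example.com/assets/css/main.css",
    "https://www.example.com/assets/js/app.js",
    "https://www.example.com/images/logo.png",
]

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Any rendering resource in this list is invisible to the crawler.
blocked = [url for url in RESOURCES if not parser.can_fetch("Googlebot", url)]
for url in blocked:
    print("blocked:", url)
```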

Not including a sitemap

A robots.txt file is a natural place to declare your sitemap:

Sitemap: https://www.example.com/sitemap.xml

This makes it easier for crawlers to discover all your indexable pages, even if your internal link structure is not perfect.
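
To confirm that a file declares a sitemap, Python's urllib.robotparser exposes the declared URLs through site_maps() (available from Python 3.8). A short sketch, with the file contents assumed for the example:

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns a list of declared sitemap URLs, or None if there are none.
sitemaps = parser.site_maps() or []

if sitemaps:
    print("Declared sitemaps:", sitemaps)
else:
    print("No Sitemap: line found")
```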

Blocking important pages accidentally

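URL path matching in robots.txt is case-sensitive and prefix-based. Disallow: /services will block /services, /services/seo, /services/design, and any other URL starting with /services, including /services-old. It will not block /Services, because the capitalization differs. If you only mean to block the directory, add a trailing slash (Disallow: /services/), and double-check every rule against the URLs you want crawled.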
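
The matching logic can be modeled in a few lines. This is a deliberately simplified sketch rather than the full algorithm: it ignores wildcards and Allow rules, and is_disallowed is a made-up helper name.

```python
def is_disallowed(path: str, disallow_rule: str) -> bool:
    """Simplified model: a non-empty rule blocks any path that starts with it."""
    # str.startswith is case-sensitive, just like robots.txt path matching.
    return bool(disallow_rule) and path.startswith(disallow_rule)


rule = "/services"
for path in ["/services", "/services/seo", "/services-old", "/Services", "/about"]:
    verdict = "blocked" if is_disallowed(path, rule) else "allowed"
    print(f"{path}: {verdict}")
```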

AI crawlers and robots.txt

In 2026, robots.txt is also how you control whether AI systems can use your content. Major AI crawlers and products have their own user-agent tokens:

User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

Blocking any of these tells that AI system not to use your content in its training data or citation pool, depending on what the crawler is for. Crawlers are allowed by default, so explicit Allow rules are optional. If you want your site to be cited in AI search results, what matters is that no rule blocks these crawlers.
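
To audit which of these agents a given file lets in, you can loop over the tokens. The robots.txt policy and the test URL below are hypothetical, and Python's urllib.robotparser is used as an approximation of how each crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

AI_AGENTS = [
    "GPTBot",
    "OAI-SearchBot",
    "ClaudeBot",
    "Claude-SearchBot",
    "PerplexityBot",
    "Google-Extended",
]

# Hypothetical policy: opt out of one crawler, leave the rest unrestricted.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

test_url = "https://www.example.com/blog/post"
status = {agent: parser.can_fetch(agent, test_url) for agent in AI_AGENTS}

for agent, allowed in status.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Only GPTBot is blocked here. The other agents have no group of their own, so they fall back to the User-agent: * group and can crawl the page without any explicit Allow line.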

How to check your robots.txt

Visit yoursite.com/robots.txt directly in a browser. If the file exists and is correctly configured, it will display as plain text.
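
The same check can be scripted. This sketch uses only Python's standard library; the function names are made up for the example. Some servers reject requests from the default Python user agent, so a failed fetch here does not by itself prove the file is missing.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen


def robots_txt_url(page_url: str) -> str:
    """Build the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # robots.txt always lives at the root, whatever page you start from.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def fetch_robots_txt(page_url: str, timeout: float = 10.0) -> str:
    """Download robots.txt as text; urlopen raises HTTPError on a 404."""
    with urlopen(robots_txt_url(page_url), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Example (requires network access):
# print(fetch_robots_txt("https://www.example.com/blog/some-post"))
```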

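Google Search Console also has a robots.txt report under Settings that shows which robots.txt files Google found for your site, when they were last crawled, and any warnings or errors. To check whether a specific URL is blocked for Googlebot, use the URL Inspection tool.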

If you need to generate or update your robots.txt file, use Crawly's free robots.txt generator. It builds a valid file with allow/disallow rules for any crawler, including AI user agents.

How robots.txt interacts with crawl budget

For large sites, robots.txt is one of the most direct ways to manage crawl budget. Blocking low-value URL patterns, including faceted navigation, internal search results, and session ID parameters, keeps Googlebot focused on the pages that matter.

For most small to medium sites, crawl budget is not a significant concern. But for ecommerce sites with thousands of filtered URLs, or large publishing sites with extensive tag and archive pages, thoughtful robots.txt configuration can make a measurable difference.
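
To estimate how much a set of rules would trim, you can run a list of known URLs (from a crawl or server logs) through them. The rules and URLs below are invented for an example ecommerce site. Python's urllib.robotparser only handles plain prefixes, so patterns that need wildcards, such as session ID parameters, are out of scope for this sketch.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for internal search and faceted navigation.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /filter/
"""

# Hypothetical URLs taken from a crawl export.
CRAWLED_URLS = [
    "https://shop.example.com/products/blue-shirt",
    "https://shop.example.com/search?q=shirt",
    "https://shop.example.com/filter/color-blue/size-m",
    "https://shop.example.com/filter/color-red",
    "https://shop.example.com/category/shirts",
]

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

blocked = [url for url in CRAWLED_URLS if not parser.can_fetch("Googlebot", url)]
share = len(blocked) / len(CRAWLED_URLS)
print(f"{len(blocked)} of {len(CRAWLED_URLS)} URLs would be blocked ({share:.0%})")
```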


robots.txt is simple but consequential. A single misplaced rule can block crawlers from an entire section of your site. Check it as part of every technical SEO audit.

Generate a valid robots.txt file in seconds with Crawly's robots.txt generator.