robots.txt: The Complete Guide

What robots.txt does

A robots.txt file sits at the root of your domain (https://example.com/robots.txt) and tells search engine crawlers which URLs they are and are not permitted to visit. It is the first file most crawlers request when they visit a site.

robots.txt controls crawling, not indexing. A page blocked in robots.txt can still appear in search results if other pages link to it. Google may index the URL without crawling it, usually showing it without a snippet. To prevent indexing, use a noindex meta tag or an X-Robots-Tag header, and leave the page crawlable so the directive can be read.

The syntax

A robots.txt file is made up of groups. Each group starts with one or more User-agent lines specifying which crawler the rules apply to, followed by Allow and Disallow directives:

User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /

Sitemap: https://example.com/sitemap.xml

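User-agent: * applies the rules to every crawler that does not have a more specific group. You can specify individual crawlers by name (e.g. User-agent: Googlebot) to create crawler-specific rules; a crawler that finds a group with its own name follows that group only, not the * rules. The Sitemap directive is optional but recommended. It is independent of the groups, can appear anywhere in the file, and tells crawlers where to find your sitemap.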
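As a quick way to see how the sample file above is interpreted, here is a minimal sketch using Python's standard urllib.robotparser module. The ROBOTS_TXT constant and the load_rules helper are illustrative names, and the URLs are placeholders; the rules are parsed from a string, so nothing is fetched over the network.

```python
from urllib.robotparser import RobotFileParser

# The sample file from this section, held in memory for the example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /

Sitemap: https://example.com/sitemap.xml
"""

def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

rules = load_rules(ROBOTS_TXT)

# Googlebot has no group of its own here, so the * group applies.
print(rules.can_fetch("Googlebot", "https://example.com/admin/users"))  # False
print(rules.can_fetch("Googlebot", "https://example.com/blog/post"))    # True

# Sitemap lines are collected separately from the groups (Python 3.8+).
print(rules.site_maps())  # ['https://example.com/sitemap.xml']
```

This module handles plain path prefixes only, so it is best suited to simple files like this one.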

Allow and Disallow rules

Rules are matched against the start of the URL path. Disallow: / blocks every URL on the site. Disallow: /admin/ blocks only URLs whose path starts with /admin/.

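When Allow and Disallow rules conflict, the more specific (longer) rule wins, regardless of the order in which the rules appear. If two matching rules are the same length, Google applies the less restrictive one (Allow). If you disallow /products/ but allow /products/featured/, the featured section is crawlable even though the rest of /products/ is blocked.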

User-agent: *
Disallow: /products/
Allow: /products/featured/
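The precedence rule can be sketched in a few lines of Python. This is an illustration of longest-match selection for plain prefixes (no wildcards), not any crawler's actual code; the is_allowed function and RULES list are hypothetical names. It is written by hand because not every parser follows this rule: Python's built-in urllib.robotparser, for instance, evaluates rules in file order.

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Apply longest-match precedence to plain prefix rules.

    `rules` holds (directive, path_prefix) pairs,
    e.g. ("disallow", "/products/").
    """
    best_len = -1
    allowed = True  # no matching rule means the path is crawlable
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        # A longer prefix wins; on a tie, Allow wins (least restrictive).
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed

RULES = [("disallow", "/products/"), ("allow", "/products/featured/")]

print(is_allowed("/products/featured/red-shoes", RULES))  # True
print(is_allowed("/products/blue-shoes", RULES))          # False
print(is_allowed("/about", RULES))                        # True
```

Because the longest prefix is chosen, reversing the order of the two rules gives the same answers.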

Configuration by site type

Standard content site or blog

For most content sites, the goal is to allow everything except admin areas and internal search result URLs. This example assumes a WordPress site:

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-login.php
Disallow: /?s=
Disallow: /search/

Sitemap: https://example.com/sitemap.xml

Ecommerce site

Ecommerce sites generate many low-value URLs through filtering, sorting, and session parameters. Block these to focus crawl budget on product and category pages:

User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Disallow: /wishlist/
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?page=

Sitemap: https://example.com/sitemap.xml

If your filtering generates SEO-valuable faceted URLs (size, colour, material), consider using canonicalisation rather than robots.txt disallow, so the pages can still pass link equity while the canonical consolidates ranking signals.
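Wildcard rules like the ones in the ecommerce example are easy to get subtly wrong, so it can help to test them against real URLs. The sketch below translates a robots.txt pattern into a regular expression, assuming the common semantics where * matches any run of characters and a trailing $ anchors the end of the URL. The function names are illustrative.

```python
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into an anchored regex.

    '*' matches any run of characters; a trailing '$' pins the end.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and rejoin them with '.*' for each '*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def matches(pattern: str, path_and_query: str) -> bool:
    """Return True when the pattern matches from the start of the URL path."""
    return pattern_to_regex(pattern).match(path_and_query) is not None

# sort is the first parameter: the rule matches.
print(matches("/*?sort=", "/shoes?sort=price"))              # True
# sort comes second, so the literal "?sort=" never appears.
print(matches("/*?sort=", "/shoes?colour=red&sort=price"))   # False
# A second wildcard after the "?" covers both positions.
print(matches("/*?*sort=", "/shoes?colour=red&sort=price"))  # True
```

In other words, Disallow: /*?sort= only catches URLs where sort is the first query parameter. If your platform can emit parameters in any order, a pattern such as /*?*sort= may be closer to what you intend, though it also matches any parameter whose name merely ends in "sort".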

SaaS or app site

Block authenticated areas and API endpoints entirely:

User-agent: *
Disallow: /app/
Disallow: /api/
Disallow: /login/
Disallow: /signup/
Disallow: /dashboard/

Sitemap: https://example.com/sitemap.xml

Allowing AI crawlers

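In 2026, AI search tools (ChatGPT, Perplexity, Claude, Gemini) can only retrieve and cite pages their crawlers are allowed to fetch. Crawlers are allowed by default, so an explicit Allow is not strictly required, but many sites block these bots, either deliberately or through an overly broad disallow. A crawler that finds a group with its own name ignores the User-agent: * group, so giving each bot its own group states your intent explicitly and keeps it from inheriting a broad block: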

User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

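If you block these bots, AI search tools cannot fetch your pages for retrieval-based answers, so they are far less likely to cite you. The tokens do different jobs. According to the vendors' documentation, GPTBot and ClaudeBot collect training data, while OAI-SearchBot, Claude-SearchBot, and PerplexityBot power search features. Google-Extended is a control token rather than a separate crawler: it governs use of your content for Gemini and does not affect Google Search. Blocking does not prevent AI models from using content they already have from earlier training data, but it cuts you off from fresh retrieval-based citations going forward.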
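To see which AI crawlers a given file actually lets in, you can evaluate each user agent in turn. The sketch below uses Python's standard urllib.robotparser; the AI_USER_AGENTS list mirrors the tokens named in this section, and the SAMPLE file and ai_crawler_access function are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Tokens taken from the example in this section.
AI_USER_AGENTS = [
    "GPTBot", "OAI-SearchBot", "ClaudeBot",
    "Claude-SearchBot", "PerplexityBot", "Google-Extended",
]

def ai_crawler_access(robots_txt: str, url: str) -> dict[str, bool]:
    """Map each AI user agent to whether it may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in AI_USER_AGENTS}

# Hypothetical file: GPTBot is blocked, everyone else only loses /admin/.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""

access = ai_crawler_access(SAMPLE, "https://example.com/blog/post")
for agent, allowed in access.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

With this sample, GPTBot is reported as blocked and the other five as allowed, because only GPTBot has a group of its own.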

Common mistakes

Blocking the entire site

The single most damaging mistake is a Disallow: / rule under User-agent: *, which blocks every crawler from every page. It is often carried over from a staging environment and not removed at launch. The result can be a near-total loss of organic traffic within weeks as pages drop out of search results.

Crawly checks your robots.txt on every crawl and flags this immediately in the Issues tab.
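A blanket block can also be caught before launch with a small check in a deployment script. The sketch below is one possible approach, not a description of how Crawly works: it treats a file as blocking the whole site when the given user agent may not fetch the root path. The function name and sample strings are hypothetical.

```python
from urllib.robotparser import RobotFileParser

def blocks_entire_site(robots_txt: str, user_agent: str = "*") -> bool:
    """Return True when `user_agent` may not fetch the root path '/'."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "/")

STAGING = "User-agent: *\nDisallow: /\n"
PRODUCTION = "User-agent: *\nDisallow: /admin/\n"

print(blocks_entire_site(STAGING))     # True
print(blocks_entire_site(PRODUCTION))  # False
```

Failing the build when this returns True for the production file is a cheap safeguard against shipping a staging robots.txt.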

Blocking CSS and JavaScript files

Google renders pages using CSS and JavaScript to understand their content and layout. Blocking these files in robots.txt prevents Google from seeing your site as users see it, which can hurt how your pages are rendered, indexed, and ranked.

Using robots.txt to hide content

Blocking a page in robots.txt does not make it private. The URL can still be shared, linked to, and discovered, and the robots.txt file itself is publicly readable, so it advertises every path you list in it. If a page contains sensitive content, protect it with authentication, not robots.txt.

Blocking pages you want in the index

Accidentally blocking category pages, product pages, or blog posts is more common than it looks. A broad wildcard rule such as Disallow: /*? matches every URL that contains a query string, including pages you want indexed when they are reached through links tagged with UTM parameters.

Noindexing a page that is blocked in robots.txt

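If Googlebot is blocked from visiting a URL by robots.txt, it cannot read the noindex tag on that page, so the page may remain in Google's index even after you add the tag. Do not combine the two. To remove a page from the index, keep it crawlable and apply noindex, and add a robots.txt block only after the page has dropped out, if at all.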

How Crawly checks your robots.txt

Crawly fetches and parses your robots.txt file at the start of every crawl. It reports:

The full content of your robots.txt file
Which crawled URLs are blocked by the current rules
Any pages in your sitemap that are blocked by robots.txt (a direct conflict)
Whether Googlebot, Bingbot, and AI crawlers are allowed or blocked
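The sitemap conflict check in particular is simple to reproduce yourself. The following is a rough sketch of the idea, not Crawly's implementation: given the robots.txt text and a list of URLs already extracted from a sitemap, it returns the ones a crawler may not fetch. All names and URLs are illustrative, and because urllib.robotparser handles plain prefixes only, wildcard rules would need a different matcher.

```python
from urllib.robotparser import RobotFileParser

def blocked_sitemap_urls(
    robots_txt: str,
    sitemap_urls: list[str],
    user_agent: str = "Googlebot",
) -> list[str]:
    """Return the sitemap URLs that `user_agent` is not allowed to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in sitemap_urls if not parser.can_fetch(user_agent, url)]

ROBOTS = """\
User-agent: *
Disallow: /cart/
Disallow: /checkout/
"""

SITEMAP_URLS = [
    "https://example.com/",
    "https://example.com/products/shoes",
    "https://example.com/cart/",
]

print(blocked_sitemap_urls(ROBOTS, SITEMAP_URLS))
# ['https://example.com/cart/']
```

Any URL this returns is listed in the sitemap as worth crawling while robots.txt says the opposite, so one of the two files needs to change.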

You can also check individual URLs using the robots.txt generator tool: enter any URL to see whether it is allowed or blocked for each major crawler.

Audit your robots.txt on every crawl

Crawly checks your robots.txt and flags conflicts automatically. Free to download.
