robots.txt: The Complete Guide
What robots.txt does, the rules that matter, the mistakes that cost rankings, and how to configure it correctly for your site type.
What robots.txt does
A robots.txt file sits at the root of your domain (https://example.com/robots.txt) and tells search engine crawlers which URLs they are and are not permitted to visit. It is the first file most crawlers request when they visit a site.
robots.txt controls crawling, not indexing. A page blocked in robots.txt can still appear in search results if other pages link to it: Google may index the URL without being able to crawl it, in which case the result simply has no snippet. To prevent indexing, use a noindex tag on a page that crawlers are allowed to fetch.
The syntax
A robots.txt file is made up of groups. Each group starts with one or more User-agent lines specifying which crawler the rules apply to, followed by Allow and Disallow directives:
    User-agent: *
    Disallow: /admin/
    Disallow: /private/
    Allow: /

    Sitemap: https://example.com/sitemap.xml
User-agent: * applies the rules to all crawlers. You can specify individual crawlers by name (e.g. User-agent: Googlebot) to create crawler-specific rules; a crawler that has its own named group follows that group and ignores the * rules. The Sitemap directive is optional but recommended: it tells crawlers where to find your sitemap, and it applies to the whole file wherever it is placed.
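To see how a parser reads these groups, here is a minimal sketch that feeds the example file to Python's standard-library urllib.robotparser. One caveat: that module applies rules in file order and does not implement wildcards, so it is only a reasonable stand-in for simple prefix rules like these. The test URLs are illustrative.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse from memory, no network request

# can_fetch(user_agent, url) answers "may this crawler visit this URL?"
admin_allowed = parser.can_fetch("*", "https://example.com/admin/users")
blog_allowed = parser.can_fetch("*", "https://example.com/blog/post")

# site_maps() returns the Sitemap lines (available in Python 3.8+)
sitemaps = parser.site_maps()

print("admin:", admin_allowed)
print("blog:", blog_allowed)
print("sitemaps:", sitemaps)
```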
Allow and Disallow rules
Rules are matched against the URL path. A Disallow: / blocks all pages. A Disallow: /admin/ blocks only URLs starting with /admin/.
When Allow and Disallow rules conflict, the more specific (longer) rule wins. If you disallow /products/ but allow /products/featured/, the featured section is crawlable even though the rest of /products/ is blocked.
    User-agent: *
    Disallow: /products/
    Allow: /products/featured/
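The longest-match rule can be sketched in a few lines of Python. This is an illustration of the precedence logic only (prefix rules, no wildcards), and the function name is hypothetical. It also assumes that when an Allow and a Disallow rule are equally long, Allow wins, which is the least-restrictive behaviour described in RFC 9309 and Google's documentation.

```python
def is_allowed(path, rules):
    """Decide whether `path` is crawlable.

    `rules` is a list of (directive, prefix) pairs,
    e.g. ("disallow", "/products/").
    """
    best_len = -1
    allowed = True  # no matching rule means the path is crawlable
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        # The longer prefix wins; on a tie, Allow wins.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


RULES = [
    ("disallow", "/products/"),
    ("allow", "/products/featured/"),
]

print(is_allowed("/products/featured/summer", RULES))  # True
print(is_allowed("/products/shoes", RULES))            # False
print(is_allowed("/about", RULES))                     # True
```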
Configuration by site type
Standard content site or blog
For most content sites, the goal is to allow everything except admin areas and low-value parameter URLs:
    User-agent: *
    Disallow: /wp-admin/
    Disallow: /wp-login.php
    Disallow: /?s=
    Disallow: /search/

    Sitemap: https://example.com/sitemap.xml
Ecommerce site
Ecommerce sites generate many low-value URLs through filtering, sorting, and session parameters. Block these to focus crawl budget on product and category pages:
    User-agent: *
    Disallow: /cart/
    Disallow: /checkout/
    Disallow: /account/
    Disallow: /wishlist/
    Disallow: /*?sort=
    Disallow: /*?filter=
    Disallow: /*?page=

    Sitemap: https://example.com/sitemap.xml
If your filtering generates SEO-valuable faceted URLs (size, colour, material), consider using canonicalisation rather than robots.txt disallow, so the pages can still pass link equity while the canonical consolidates ranking signals.
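Wildcard rules like the ones in the ecommerce example are easy to get subtly wrong, so it can help to test them against real URLs before deploying. The sketch below translates a robots.txt pattern into a regular expression, assuming the common wildcard semantics (* matches any run of characters, a trailing $ anchors the end of the URL). The helper names are hypothetical. Note what it reveals: /*?sort= only matches when sort is the first query parameter.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal chunk, then join the chunks with ".*" for each "*".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern, path_and_query):
    """True if the pattern matches from the start of the path."""
    return pattern_to_regex(pattern).match(path_and_query) is not None


print(matches("/*?sort=", "/shoes?sort=price"))             # True
print(matches("/*?sort=", "/shoes?colour=red&sort=price"))  # False
print(matches("/*&sort=", "/shoes?colour=red&sort=price"))  # True
```

If sort or filter can appear after another parameter on your site, a second rule such as Disallow: /*&sort= would be needed to cover that case.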
SaaS or app site
Block authenticated areas and API endpoints entirely:
    User-agent: *
    Disallow: /app/
    Disallow: /api/
    Disallow: /login/
    Disallow: /signup/
    Disallow: /dashboard/

    Sitemap: https://example.com/sitemap.xml
Allowing AI crawlers
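In 2026, AI search tools (ChatGPT, Perplexity, Claude, Gemini) can only retrieve and cite pages that their crawlers are permitted to fetch. Crawlers are allowed by default, so explicit Allow rules are not strictly required, but many sites end up blocking these bots with an overly broad disallow. To allow them explicitly: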
    User-agent: GPTBot
    Allow: /

    User-agent: OAI-SearchBot
    Allow: /

    User-agent: ClaudeBot
    Allow: /

    User-agent: Claude-SearchBot
    Allow: /

    User-agent: PerplexityBot
    Allow: /

    User-agent: Google-Extended
    Allow: /
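If you block these bots, AI search tools cannot fetch your pages and are far less likely to cite them. Blocking them does not prevent AI models from using content they already have from earlier training data, but it cuts you off from fresh retrieval-based citations going forward. Vendors generally document separate tokens for training and for search retrieval (for example GPTBot versus OAI-SearchBot), so check each vendor's documentation before deciding which ones to block.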
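A quick way to check which of these crawlers a given robots.txt lets in is to ask a parser once per user agent. The sketch below uses Python's standard-library urllib.robotparser, with a hypothetical robots.txt that blocks one bot. It assumes simple prefix rules, since that module does not handle wildcards.

```python
from urllib.robotparser import RobotFileParser

# Crawler tokens taken from the example above.
AI_CRAWLERS = [
    "GPTBot",
    "OAI-SearchBot",
    "ClaudeBot",
    "Claude-SearchBot",
    "PerplexityBot",
    "Google-Extended",
]

# Hypothetical file: blocks GPTBot entirely, restricts everyone else lightly.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def crawler_access(robots_txt, url, agents):
    """Map each user agent to True (allowed) or False (blocked) for `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


report = crawler_access(ROBOTS_TXT, "https://example.com/blog/post", AI_CRAWLERS)
for agent, allowed in report.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```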
Common mistakes
Blocking the entire site
The single most damaging mistake: Disallow: / for all user agents blocks every crawler from every page. This is sometimes left in place from a staging environment and not removed at launch. The result can be a near-complete loss of organic traffic within weeks.
Crawly checks your robots.txt on every crawl and flags this immediately in the Issues tab.
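If you also want a guard in your own deployment pipeline, a coarse check along these lines could catch a leftover staging file before launch. This is a sketch with a hypothetical function name, not a full parser: it only looks for a bare Disallow: / inside a group that applies to all crawlers, and it does not account for Allow rules that might override it.

```python
def blocks_whole_site(robots_txt):
    """Return True if a 'User-agent: *' group contains 'Disallow: /'."""
    agents = []            # user agents of the group currently being read
    reading_rules = False  # True once the group's first rule has been seen
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if reading_rules:  # a User-agent line after rules starts a new group
                agents, reading_rules = [], False
            agents.append(value)
        elif field in ("allow", "disallow"):
            reading_rules = True
            if field == "disallow" and value == "/" and "*" in agents:
                return True
    return False


staging = "User-agent: *\nDisallow: /\n"
production = "User-agent: *\nDisallow: /admin/\n"

print(blocks_whole_site(staging))     # True
print(blocks_whole_site(production))  # False
```

A deploy script could call this on the file it is about to publish and stop the release when it returns True.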
Blocking CSS and JavaScript files
Google renders pages using CSS and JavaScript to evaluate user experience signals and understand content. Blocking these files in robots.txt prevents Google from seeing your site as users see it, which can affect how it understands and ranks your pages.
Using robots.txt to hide content
Blocking a page in robots.txt does not make it private. The URL can still be shared, linked to, and discovered. If a page contains sensitive content, protect it with authentication, not robots.txt.
Blocking pages you want in the index
Accidentally blocking category pages, product pages, or blog posts is more common than it looks. A broad wildcard rule such as Disallow: /*? matches every URL that contains a query string, including URLs with UTM parameters that point to pages you want indexed.
Noindexing a page that is blocked in robots.txt
If Googlebot is blocked from visiting a URL by robots.txt, it cannot read the noindex tag on that page. The page may remain in Google's index even after you add a noindex tag. To remove a page from the index, leave it crawlable and use noindex. Use one or the other, not both.
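This conflict is straightforward to detect once you have a page's HTML and your list of disallowed paths. The sketch below uses Python's standard-library html.parser to read the robots meta tag. The class and function names are hypothetical, and the path check assumes plain prefix rules.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for token in attr.get("content", "").split(","):
                self.directives.add(token.strip().lower())


def has_noindex(html):
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


def noindex_is_unreadable(path, html, disallowed_prefixes):
    """True when a page carries noindex but crawlers are blocked from reading it."""
    blocked = any(path.startswith(prefix) for prefix in disallowed_prefixes)
    return blocked and has_noindex(html)


PAGE = '<html><head><meta name="robots" content="noindex, follow"></head></html>'

print(noindex_is_unreadable("/old/page", PAGE, ["/old/"]))   # True
print(noindex_is_unreadable("/blog/page", PAGE, ["/old/"]))  # False
```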
How Crawly checks your robots.txt
Crawly fetches and parses your robots.txt file at the start of every crawl. It reports:
- The full content of your robots.txt file
- Which crawled URLs are blocked by the current rules
- Any pages in your sitemap that are blocked by robots.txt (a direct conflict)
- Whether Googlebot, Bingbot, and AI crawlers are allowed or blocked
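To illustrate the sitemap conflict check in particular, here is a minimal sketch of the idea using only Python's standard library. It is not Crawly's implementation: the function names and sample data are hypothetical, and urllib.robotparser only covers simple prefix rules.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

ROBOTS_TXT = "User-agent: *\nDisallow: /search/\n"

SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/blog/post</loc></url>
  <url><loc>https://example.com/search/shoes</loc></url>
</urlset>"""


def sitemap_urls(sitemap_xml):
    """Extract every <loc> value from a urlset sitemap."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)]


def blocked_sitemap_urls(robots_txt, sitemap_xml, agent="*"):
    """Return sitemap URLs that the robots.txt rules forbid `agent` to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in sitemap_urls(sitemap_xml)
            if not parser.can_fetch(agent, url)]


conflicts = blocked_sitemap_urls(ROBOTS_TXT, SITEMAP_XML)
print(conflicts)
```

Any URL this returns is being advertised to crawlers in the sitemap while robots.txt tells them to stay away, so one of the two files needs to change.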
You can also check individual URLs using the robots.txt generator tool — enter any URL and see whether it is allowed or blocked for each major crawler.