robots.txt: The Small File That Controls Crawlers

Of all the small technical files that ship with a website, robots.txt is the one most often misunderstood and most often slightly wrong. It is a plain-text file that lives at the root of the domain (yoursite.com/robots.txt) and tells search engine crawlers what they can and cannot access. The format is simple. The implications are not.

This post explains what robots.txt actually does, what it does not do, what mine looks like for a typical service-business site, and the common ways it goes wrong.

What robots.txt actually is

robots.txt is a file at the root of a website that follows the Robots Exclusion Protocol, a convention that has been in use since 1994 and was published as RFC 9309 in 2022. The file is a polite request: it tells well-behaved crawlers (Googlebot, Bingbot, Yandex, DuckDuckBot, etc.) what areas of the site they are welcome to crawl and what areas they should avoid.

The format is plain text, one rule per line. A typical entry reads:

User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml

That says "to all crawlers, avoid the /admin/ path, and the canonical sitemap is at this URL." The crawler reads the file, applies the rules to its crawl plan, and proceeds.
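
As a rough illustration of how a crawler applies these rules, Python's standard-library urllib.robotparser can parse the sample above and answer per-URL questions. The user-agent string and the test URLs are placeholders, and real crawlers use their own parsers, so treat this as a sketch rather than a description of how any particular bot behaves.

```python
from urllib.robotparser import RobotFileParser

# The sample file from above, supplied as text so nothing is fetched over the network.
SAMPLE = """\
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(SAMPLE.splitlines())

# can_fetch(user_agent, url) answers "may this crawler request this URL?"
admin_ok = parser.can_fetch("Googlebot", "https://example.com/admin/users")
home_ok = parser.can_fetch("Googlebot", "https://example.com/")

# site_maps() (Python 3.8+) returns the Sitemap URLs listed in the file.
sitemaps = parser.site_maps()

print(admin_ok, home_ok, sitemaps)
```

The /admin/ URL comes back as not fetchable, the home page as fetchable, and the sitemap list contains the one URL from the file.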

What it does NOT do

Three common misunderstandings worth naming up front.

robots.txt does not provide security. The file is publicly readable; anyone can see what paths are listed. A "Disallow: /private-customer-data/" entry tells search engines not to crawl that path, but it also tells anyone reading robots.txt exactly where the private data lives. If something needs to be private, it needs server-side authentication, not a robots.txt entry.

robots.txt does not guarantee compliance. Well-behaved crawlers (the ones from Google, Bing, Yandex, etc.) respect the file. Less well-behaved crawlers (some scrapers, some AI training bots, some bad-actor bots) ignore it entirely. The file is a request, not an enforcement mechanism.

robots.txt does not remove pages from search results. If a page is already indexed and you add a Disallow entry, the page may stay in the index for some time, because the crawler can no longer fetch it to learn that it should be removed. A disallowed URL can even be indexed without its content if other pages link to it. To remove a page from search results, the right tool is the noindex meta tag or HTTP header on the page itself, and the page has to stay crawlable so the crawler can see that signal.

What mine looks like for a typical small-business site

The robots.txt I ship on every site I build is short and conservative. The whole file is usually a dozen lines:

User-agent: *
Allow: /
Disallow: /thank-you/
Disallow: /client-portal/
Disallow: /onboarding/
Disallow: /404.html

Sitemap: https://yoursite.com/sitemap.xml

The breakdown:

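Allow: / explicitly welcomes all crawlers to all paths by default. Crawlers already treat anything that is not disallowed as allowed, so the line is technically redundant, but it states the intent plainly. Under the longest-match rule that major crawlers use, the more specific Disallow lines still take precedence over it. This is the conservative starting point. Some templates start with Disallow: / (block everything) and then carve out specific allowed paths, which is the wrong default for a public marketing site.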

Disallow: /thank-you/ tells crawlers not to fetch the post-form-submission thank-you page. The page is meaningful only to visitors who just submitted a form; having it appear in search results would be confusing.

Disallow: /client-portal/ blocks the existing-client portal. Not because it contains private data (it does not; it is just a directory of client-side links and forms), but because there is no value in indexing it.

Disallow: /onboarding/ blocks the onboarding form. The form is meaningful only to clients who have signed up; appearing in search results would be misleading.

Disallow: /404.html keeps crawlers from fetching the custom 404 page directly, so it is not treated as a real destination.

Sitemap: ... tells crawlers where the XML sitemap is. This single line can speed up discovery of new content: crawlers that read it can find new pages from the sitemap rather than waiting to discover them through link-following.
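
The interaction between Allow: / and the Disallow lines can be sketched in a few lines of Python. RFC 9309 says the most specific (longest) matching rule wins, with Allow preferred on a tie. The function below is a hand-rolled simplification of that rule: it does plain prefix matching only, ignores the * and $ wildcards, and its name and the rule-list shape are made up for this example.

```python
def is_allowed(rules, path):
    """Return True if `path` may be crawled under `rules`.

    `rules` is a list of (directive, prefix) pairs for one User-agent group.
    Simplification: plain prefix matching only, no * or $ wildcards.
    """
    allowed = True      # a path that matches no rule is allowed
    best_length = -1
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        longer = len(prefix) > best_length
        tie_for_allow = len(prefix) == best_length and directive == "allow"
        if longer or tie_for_allow:
            best_length = len(prefix)
            allowed = directive == "allow"
    return allowed


# The rules from the file above, in order.
RULES = [
    ("allow", "/"),
    ("disallow", "/thank-you/"),
    ("disallow", "/client-portal/"),
    ("disallow", "/onboarding/"),
    ("disallow", "/404.html"),
]

for path in ["/", "/services/", "/thank-you/", "/thank-you", "/404.html"]:
    print(path, is_allowed(RULES, path))
```

One thing the sketch makes visible: /thank-you without the trailing slash is not covered by Disallow: /thank-you/, so it is worth checking which form of the URL your site actually serves.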

What I do not include

A few patterns that show up in robots.txt files of inherited sites and that I generally remove:

Crawl-delay directives. Some templates include "Crawl-delay: 10" or similar, asking crawlers to space out their visits. Googlebot ignores this directive entirely, and support among other crawlers is uneven (Bingbot is one that honors it). For a small site on a CDN, the crawl rate is rarely an issue, so the directive does not earn its keep.

Per-bot blocks. Some inherited sites have lists of specific User-agents to block (BadBotName, ScraperBot, etc.). The lists are perpetually out of date and the bad actors do not respect the rules anyway. Server-side rate-limiting at the CDN is a more effective tool.

AI-training bot blocks. Whether to block AI-training crawlers such as GPTBot and ClaudeBot is a policy decision worth making explicitly. Related tokens such as ChatGPT-User identify fetches made on a user's behalf rather than training crawls, and the names change over time, so check each vendor's current documentation before writing rules. For most service-business sites, allowing them is fine (the AI training is unlikely to harm the business and may help it via inclusion in AI-powered search tools). For sites with strong opinions about AI training, the relevant rules are well-documented and easy to add.
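
For readers who do decide to block a named crawler, here is a minimal sketch of what such a rule set looks like and how to sanity-check it with Python's urllib.robotparser. The policy text is a hypothetical example, not the file I ship, and GPTBot stands in for whichever token the vendor currently documents.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: block one named crawler, allow everyone else.
POLICY = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(POLICY.splitlines())

url = "https://example.com/services/"
gptbot_ok = parser.can_fetch("GPTBot", url)        # matched by its own group
googlebot_ok = parser.can_fetch("Googlebot", url)  # falls through to the * group

print(gptbot_ok, googlebot_ok)
```

A crawler uses the group that names it and only falls back to the * group when no named group matches, which is why the second group does not re-admit the blocked bot.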

For my own site, I currently allow all reasonable crawlers, including AI-training bots. The decision is reviewed annually as the AI landscape evolves.

How to check your own robots.txt

The fastest check is just to visit the URL directly. Go to https://yoursite.com/robots.txt in a browser. If you see a plain-text file, it exists. If you see a 404, your site does not have one. If you are redirected, check that the destination is still a plain-text robots.txt on the host you expect.

Three things to look for in the content:

Is there a Sitemap line? Without one, crawlers have to discover the sitemap through other channels (Search Console submission, primarily). Adding the line gives every crawler that reads the file a direct pointer to it.

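Are low-value paths blocked? Thank-you pages, admin paths, internal-only directories. If they are not blocked, crawlers will spend time on them and they may show up in search results in awkward ways. Anything genuinely private needs authentication rather than a Disallow line.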

Are public paths inadvertently blocked? The most common mistake is a stale "Disallow: /" left over from a development environment. If your robots.txt blocks everything, compliant crawlers will not fetch any page, and the site will effectively disappear from search results.
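
Two of these three checks are mechanical enough to script. The sketch below flags a missing Sitemap line and a blanket Disallow: / in the group that applies to all crawlers. The function name and the wording of the findings are invented for this example, and the second check (whether the right low-value paths are listed) is site-specific, so it is left out.

```python
def audit_robots_txt(text):
    """Return a list of plain-English findings for a robots.txt body."""
    findings = []
    has_sitemap = False
    agents = []          # user-agents of the group currently being read
    in_rules = False     # True once the current group has at least one rule

    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()

        if key == "sitemap":
            has_sitemap = True
        elif key == "user-agent":
            if in_rules:             # a rule was seen, so this starts a new group
                agents, in_rules = [], False
            agents.append(value)
        elif key in ("allow", "disallow"):
            in_rules = True
            if key == "disallow" and value == "/" and "*" in agents:
                findings.append("Disallow: / applies to all crawlers")

    if not has_sitemap:
        findings.append("no Sitemap line")
    return findings


# A stale development file: blocks everything and has no sitemap.
print(audit_robots_txt("User-agent: *\nDisallow: /\n"))
```

To run it against a live site, paste the contents of your own /robots.txt into the function; an empty list means neither problem was found.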

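Google Search Console has a robots.txt report (under Settings) that shows which robots.txt files Google found for the site, when each was last fetched, and any warnings or errors. It replaced the older robots.txt Tester. To check whether a specific URL is blocked, run it through the URL Inspection tool. Useful when investigating why a page is or is not being crawled.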

The relationship to noindex

robots.txt and the noindex meta tag are often confused but solve different problems:

robots.txt controls crawling. It tells the crawler whether to fetch the page at all.

noindex controls indexing. It tells the crawler that, even after fetching the page, the page should not appear in search results.

For a typical small-business site, the right tool is usually noindex on individual pages that should not appear in search results. robots.txt is the right tool for blocking entire directory trees from being crawled at all.

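For sites I build, every individual page that should not appear in search results carries a noindex meta tag. robots.txt is reserved for directory-level rules that apply to many pages at once. The two tools work together; neither replaces the other. One caveat when combining them: a crawler can only read noindex on a page it is allowed to fetch, so a URL that is disallowed in robots.txt cannot have its noindex honored.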
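
The noindex signal can live in the page's HTML or in an X-Robots-Tag response header. A minimal sketch of detecting either one with Python's standard library follows. The class and function names are made up for this example, and the header check is a deliberately loose substring match rather than a full parse of the header's syntax.

```python
from html.parser import HTMLParser


class _RobotsMetaFinder(HTMLParser):
    """Records whether a <meta name="robots"> tag contains noindex."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        tokens = (attributes.get("content") or "").lower().split(",")
        if "noindex" in (token.strip() for token in tokens):
            self.noindex = True


def has_noindex(html, headers=None):
    """True if the page asks not to be indexed, via meta tag or X-Robots-Tag."""
    for name, value in (headers or {}).items():
        if name.lower() == "x-robots-tag" and "noindex" in value.lower():
            return True
    finder = _RobotsMetaFinder()
    finder.feed(html)
    return finder.noindex


page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(has_noindex(page))
```

A check like this is only meaningful for pages that robots.txt leaves crawlable, since those are the only pages where a crawler will ever see the tag or header.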

If your current site has no robots.txt

The fix is small. Create a plain-text file named robots.txt, put it at the root of your domain, and include at minimum:

User-agent: *
Allow: /

Sitemap: https://yoursite.com/sitemap.xml

That is the minimum-viable file. It welcomes all crawlers to the entire site and points them at the sitemap. From there, you can add Disallow lines for any paths you do not want crawled.

For sites I build, the robots.txt is part of the build and ships at every deploy. For sites I do not build, the platform almost always handles robots.txt; the work is to find where it lives and verify it says what you want it to say.
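
Generating the file as part of a build can be as small as one function. This is a sketch of the idea, not my actual build step: the function name, its parameters, and the sitemap URL are placeholders.

```python
def build_robots_txt(sitemap_url, disallow=()):
    """Render a robots.txt body: allow everything, then list exclusions."""
    lines = ["User-agent: *", "Allow: /"]
    # Keep the group's rules contiguous: some older parsers treat a blank
    # line as the end of a User-agent group.
    lines.extend(f"Disallow: {path}" for path in disallow)
    lines += ["", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


minimal = build_robots_txt("https://yoursite.com/sitemap.xml")
with_rules = build_robots_txt(
    "https://yoursite.com/sitemap.xml",
    disallow=["/thank-you/", "/client-portal/"],
)
print(with_rules)
```

A build step would write the returned string to robots.txt in the site's output root, so the file is regenerated from one source of truth on every deploy.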
