What exactly is robots.txt in SEO?

Robots.txt

The Robots.txt is a text document that, among other things, tells crawlers and bots (e.g. from search engines) which content on a website may be read and which may not.

Almost every website on the Internet contains a Robots.txt file, but not all website operators are familiar with its function or even know that it exists.

In the following, we explain what the Robots.txt is and show you how you can use it to make your website more search-engine friendly.

What is a Robots.txt file?

The so-called Robots.txt file is a text document that tells search engines which content of a website may and may not be crawled.

Search engines regularly check the Robots.txt for instructions, so-called "directives", which tell them what may be read.
If no Robots.txt is available, the search engine examines all content that is linked in the source code. Note that search engines ultimately decide for themselves whether to follow the instructions in the Robots.txt or to ignore them partially or even completely.

The Robots.txt can also be used to block other crawlers, such as SEO analysis tools or other bots, from "reading" the website, for example if you want to protect your content from them.
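To give a first impression, a very simple Robots.txt could look like this (the folder name is purely illustrative):

User-agent: *
Disallow: /internal-folder/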

How does the Robots.txt work?

Search engines like Google and Bing constantly scour websites to discover content and make it available to their users. To do this, they follow internal and external links; this behavior is commonly referred to as "crawling", which is why these programs are called "crawlers". The path the crawlers take resembles a spider web, which is why crawlers are also known as spiders.

When a search engine's crawler arrives at a website, it first looks for a robots.txt file. If it finds one, the crawler reads this file before continuing to crawl the website.
Since the robots.txt may contain instructions on how the search engine should crawl, the information found there directs the crawler's further actions on that particular website.
If the robots.txt does not contain any instructions that prohibit the user agent's activity (or if the website has no robots.txt file at all), the crawler will crawl all information on the website.
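As an illustration, the following minimal Python sketch uses the standard urllib.robotparser module to do what a well-behaved crawler does: it fetches the robots.txt first and only then decides whether a URL may be crawled (the domain and URL are placeholders):

from urllib.robotparser import RobotFileParser

# Fetch and parse the robots.txt before crawling, as a polite crawler would
rp = RobotFileParser()
rp.set_url("https://www.example.de/robots.txt")
rp.read()

url = "https://www.example.de/example-folder/page.html"
if rp.can_fetch("Googlebot", url):
    print("Crawling allowed:", url)
else:
    print("Blocked by robots.txt:", url)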

The Robots.txt syntax

The syntax can be thought of as the "language" of Robots.txt files.
The following directives are the most common commands in a Robots.txt:

User-agent:

The specific web crawler that you are giving crawl instructions to (usually a search engine).

Disallow:

The command used to tell a user agent not to crawl a specific URL. Only one "Disallow:" line is allowed per URL.

Allow:

This directive is primarily respected by the Googlebot. It tells the crawler that it may access a page or subfolder even though the parent folder is disallowed.

Crawl-delay:

Specifies how many seconds a crawler should wait before loading and crawling page content (largely ignored by Google).

Sitemap:

Used to point out the location of any XML sitemap(s) associated with this URL.
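Put together, a Robots.txt group using these directives could look like the following sketch (the folder, file and sitemap names are placeholders):

User-agent: *
Crawl-delay: 10
Disallow: /example-folder/
Allow: /example-folder/public-page.html

Sitemap: https://www.example.de/sitemap.xml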

Creation and editing of a Robots.txt file

If you don't have a Robots.txt yet, you can easily create one.

In WordPress, a sample Robots.txt can be created with one click using the Yoast plug-in under ‘Tools’. The Robots.txt can then also be edited there.

The classic way is via the server of your website.
Using an FTP client such as FileZilla, you simply create a text document (with Notepad, for example) named ‘robots.txt’ in the root directory.

This file can then also be edited very easily via the server. To be on the safe side, you should of course always make a backup copy of your old Robots.txt file before making changes.

Incidentally, Google has provided webmasters with instructions for creating a Robots.txt file.
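If you create the file yourself, a minimal starting point for a WordPress site could look like this (whether you block wp-admin is your decision; this is only a sketch, not a recommendation for every site):

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php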

Examples of a Robots.txt

In the following we show you examples of the content of Robots.txt files:

Example domain: www.example.de
URL of the Robots.txt: www.example.de/robots.txt

Classic format of a Robots.txt:

User-agent: [name of the user agent / search engine crawler; * = all crawlers]
Disallow: [URL, subdirectory, or element to be excluded from crawling]

Block all content for all crawlers:

User-agent: *
Disallow: /

Using this syntax in a robots.txt file would tell all web crawlers not to crawl any pages on www.example.de, including the home page.

Allow all content for all crawlers:

User-agent: *
Disallow:

Using this syntax in a robots.txt file instructs web crawlers to crawl all pages on www.example.de, including the home page.

Block an entire subfolder from a specific crawler:

User-agent: Googlebot
Disallow: /example-folder/

This syntax means that only the Google crawler (user agent name: Googlebot) is not allowed to crawl pages whose URL contains the string www.example.de/example-folder/.

Block a specific subpage for a specific crawler:

User-agent: Bingbot
Disallow: /example-folder/blocked-page.html

This syntax tells only the Bing crawler (user agent name: Bingbot) not to crawl the specific page at www.example.de/example-folder/blocked-page.html.

Block specific URLs using wildcard characters:

User-agent: *
Disallow: /*.php
Disallow: /copyrighted-images/*.jpg

In the example above, the wildcard * stands in for any file name or path element. Here, all URLs whose path contains ‘.php’ are blocked, as well as all ‘.jpg’ images in the ‘copyrighted-images’ folder.

Test Robots.txt

In the old version of the Google Search Console, webmasters still have the option to test the functionality and correctness of their Robots.txt file:

Simply select the corresponding property of your website and enter the URL path for which you want to check the Robots.txt in the text field below.

If the respective crawler can read the URL, a green "Authorized" appears to the right of the bar.
If the crawler cannot read the URL, a red "Blocked" appears there, and in the window above, the line with the Robots.txt command that blocks this URL is highlighted.
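If you prefer to check your rules before uploading the file, you can also test them locally with a few lines of Python using the standard urllib.robotparser module; the rules and URLs below are placeholders:

from urllib.robotparser import RobotFileParser

# Parse a draft robots.txt and test individual URLs against its rules
rules = [
    "User-agent: *",
    "Disallow: /wp-admin/",
]
rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("Googlebot", "https://www.beispiel.de/wp-admin/options.php"))  # False
print(rp.can_fetch("Googlebot", "https://www.beispiel.de/blog/"))                 # True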

Where is the Robots.txt file located?

The robots.txt of your website should always be located in the root directory of your domain. So if your website can be reached at www.beispiel.de, the Robots.txt file should be at https://www.beispiel.de/robots.txt. It is also crucial for its functionality that the file is actually named ‘robots.txt’. The correct name is essential for it to be found and read by search engines.

Do I need a Robots.txt?

Depending on the website, the Robots.txt can play an important role in search engine optimization (SEO).

Incidentally, many SEOs follow the creed of not blocking any content from search engines via the Robots.txt, so that the search engines can decide for themselves which content is relevant and which is not.

With WordPress, for example, access to the admin area (wp-admin) is often blocked by default via the Robots.txt, partly because it provides access to sensitive data and the database.

Other types of websites, such as online shops, block certain parameters or IDs via the Robots.txt in order to prevent duplicate content, limit the number of irrelevant pages presented to search engines, and steer the crawl focus towards relevant content.
The function of the Robots.txt should nevertheless always be handled with caution. On the one hand, search engines decide for themselves whether to follow the instructions in the Robots.txt file or to ignore them. On the other hand, incorrect instructions can make relevant content inaccessible to search engines.
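To illustrate the parameter blocking mentioned above, such a rule for an online shop could look like the following sketch; the parameter names are purely illustrative:

User-agent: *
Disallow: /*?sessionid=
Disallow: /*?sort=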

XML sitemaps in the Robots.txt

While the main use of the robots.txt file is to tell search engines which pages or content should not be crawled, it can also be used to point search engines to the XML sitemap more quickly. This method is supported by Google, Bing, Yahoo and Ask, among others.

The XML sitemap can or should be listed at the end of the robots.txt as an absolute URL (e.g. https://www.beispiel.de/sitemap1.xml). Referencing the XML sitemap in the robots.txt file is one of the best practices for making search engines aware of your website's XML sitemaps, even if you have already submitted your XML sitemap via Google Search Console and Bing Webmaster Tools, for example.

When maintaining the Robots.txt, remember: there are more search engines than Google, even though Google is the most widely used.

It is of course also possible to store several XML sitemaps in a robots.txt file.

Example:
User-agent: *
Disallow: /wp-admin/
Sitemap: https://www.beispiel.de/sitemap1.xml
Sitemap: https://www.beispiel.de/sitemap2.xml

Meta robots vs. Robots.txt

Before you fill your Robots.txt with content, the "meta robots" tag should be mentioned.
If you want to keep individual subpages out of the index, you should set the "meta robots" tag in the source code of those pages to "noindex" instead of excluding each single URL via the Robots.txt.
This is the safer way, not least because, as we have learned, search engines are free to decide whether to follow the instructions in the Robots.txt.
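For reference, such a tag is placed in the <head> section of the page in question and could look like this:

<meta name="robots" content="noindex">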

Conclusion & further information

The Robots.txt determines the crawl behavior for a website, while the meta robots tag can determine the indexing behavior at the level of the individual page (or page element).

We recommend a Robots.txt file for every website. However, it is not always easy to use. For smaller websites, a Robots.txt does not have to contain numerous instructions, but for larger sites and online shops, correct operation of the Robots.txt can play an important role for crawlability and proper indexing.



Philippe Grossmann

Philippe develops online marketing campaigns, is a web analytics enthusiast and a big fan of PPC marketing.