Learn All About Robots.txt

Learn all about the robots.txt file and the wildcard operators used in it. By understanding all the aspects covered in this guide, you can also create your own custom robots.txt file.

What Is a Robots.txt File?

The robots.txt file is used by webmasters to give instructions about their website to web robots and crawlers. This mechanism is also known as the Robots Exclusion Protocol. With it, you can instruct web robots and crawlers how to crawl your website.

When a web robot or crawler visits a website, it first looks at robots.txt for permission.
Example:

User-agent: *
Disallow:

The first question that comes to mind (and one that interviewers often ask) is:

“What does User-agent: * specify?”

This user-agent line means the section applies to every web robot: the asterisk (*) is a wildcard operator that stands for any series of characters. The empty “Disallow:” value means nothing is blocked, so robots may crawl all of the website’s pages.
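
If you want to check how a crawler would read these rules, Python’s standard urllib.robotparser module can parse a robots.txt file and answer allow-or-block questions. Below is a minimal sketch, assuming a hypothetical site www.example.com:

import urllib.robotparser

# The introductory example: every robot is addressed, nothing is disallowed.
rules = """\
User-agent: *
Disallow:
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# An empty Disallow value blocks nothing, so every path is crawlable.
print(rp.can_fetch("*", "https://www.example.com/"))          # True
print(rp.can_fetch("*", "https://www.example.com/any/page"))  # True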

Webmasters also often confuse the different robots.txt formats, so let’s take a look at each of them.

1. Allow All Web Robots and Crawlers

Whether you write

User-agent: *
Disallow:

Or

User-agent: *
Allow: /

Both formats give the same instruction to all robots: crawling and indexing of everything on the website is allowed.

2. Block All Web Robots or Crawlers

If you insert a forward slash after Disallow (Disallow: /), you are telling all web crawlers that crawling and indexing of everything on the website is disallowed.

User-agent: *
Disallow: /

3. Block Particular Folder From Crawling

To disallow crawling of a specific folder on a website, use this format:

User-agent: *
Disallow: /folder/
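
To see which paths such a rule actually affects, here is the same kind of urllib.robotparser check, again against a hypothetical www.example.com with a hypothetical /folder/ directory:

import urllib.robotparser

rules = """\
User-agent: *
Disallow: /folder/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Everything under /folder/ is blocked; other paths stay crawlable.
print(rp.can_fetch("*", "https://www.example.com/folder/page.html"))  # False
print(rp.can_fetch("*", "https://www.example.com/about.html"))        # True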

4. Block Particular Web Robot Or Crawler From Crawling Your Website

To exclude a single web robot from crawling your website, use this format (here Googlebot, Google’s crawler, is blocked):

User-agent: Googlebot
Disallow: /
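
To confirm that only the named robot is affected, the sketch below (again Python’s urllib.robotparser with a hypothetical www.example.com) asks the same question for two different user agents:

import urllib.robotparser

rules = """\
User-agent: Googlebot
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Only the robot named in the User-agent line is blocked.
print(rp.can_fetch("Googlebot", "https://www.example.com/"))  # False
print(rp.can_fetch("Bingbot", "https://www.example.com/"))    # True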

5. Block Website URLs With “?” From Crawling

If you want to block all website URLs that contain a “?” (that is, URLs with query strings), use this format:

User-agent: *
Disallow: /*?

6. Block URLs With a Specific Ending From Crawling

The special character $ can be used to match the end of a URL.

Example: to block crawling of URLs that end in .asp, use this format:
User-agent: Googlebot
Disallow: /*.asp$
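
The * and $ wildcards are extensions honored by major crawlers such as Googlebot; Python’s standard urllib.robotparser only does simple prefix matching and does not understand them, so here is a small hand-rolled sketch of how this style of pattern matching works (the function names and sample paths are just illustrations):

import re

def pattern_to_regex(pattern):
    """Convert a robots.txt path pattern to a regex: '*' matches any run
    of characters, a trailing '$' anchors the match to the end of the URL."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    regex = re.escape(pattern).replace(r"\*", ".*")
    return re.compile(regex + ("$" if anchored else ""))

def is_blocked(path, disallow_pattern):
    """True if the URL path matches the Disallow pattern."""
    return pattern_to_regex(disallow_pattern).match(path) is not None

# Disallow: /*?  -- blocks any URL containing a question mark
print(is_blocked("/products?id=5", "/*?"))           # True
print(is_blocked("/products/shoes", "/*?"))          # False

# Disallow: /*.asp$  -- blocks URLs that end in .asp
print(is_blocked("/catalog/item.asp", "/*.asp$"))    # True
print(is_blocked("/catalog/item.aspx", "/*.asp$"))   # False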

7. Exclude Unwanted Pages From Crawling

If you have an e-commerce website, you can exclude the checkout, cart, and payment-info pages from crawling.

Example:
User-agent: *
Allow: /
Disallow: /checkout
Disallow: /cart
Disallow: /checkout/paymentinfo
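
When Allow and Disallow rules overlap like this, Google applies the most specific (longest) matching rule, and Allow wins an exact tie, which is why /checkout stays blocked even though Allow: / is present. The sketch below is a simplified, prefix-only illustration of that precedence rule (real matching also honors the * and $ wildcards shown earlier):

def is_allowed(path, rules):
    """rules is a list of (directive, pattern) pairs, e.g. ("Disallow", "/checkout")."""
    best_directive, best_len = "Allow", -1  # no matching rule means the path is allowed
    for directive, pattern in rules:
        if pattern and path.startswith(pattern):
            # The longer (more specific) pattern wins; Allow wins an exact tie.
            if len(pattern) > best_len or (len(pattern) == best_len and directive == "Allow"):
                best_directive, best_len = directive, len(pattern)
    return best_directive == "Allow"

shop_rules = [
    ("Allow", "/"),
    ("Disallow", "/checkout"),
    ("Disallow", "/cart"),
    ("Disallow", "/checkout/paymentinfo"),
]

print(is_allowed("/products/shoes", shop_rules))        # True  -- only "Allow: /" matches
print(is_allowed("/checkout", shop_rules))              # False -- "/checkout" is more specific than "/"
print(is_allowed("/cart", shop_rules))                  # False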

Whether you have a WordPress website or a site built on any other CMS, you can easily use the robots.txt file to exclude pages or folders from crawling, either for all web robots or for a particular one.

You can also block pages with a meta noindex tag. The meta robots tag (<meta name="robots" content="noindex">) tells web robots that they may visit a web page but should not add it to search results. This meta robots tag is often considered better than blocking with the robots.txt file.

The reason is that the meta robots tag only blocks indexing; web crawlers can still crawl the web page, so any links it contains can still be followed and pass link juice. To get more clarification, you can read Moz’s guide on robots.txt (https://moz.com/learn/seo/robotstxt).

Follow #Scoolico on social media channels for SEO best practices and the latest online marketing tips.
