Robots.txt



A Guide to Using MediaWiki in a Hosted Environment

An instructional website by the developer of mh370wiki.net - a MediaWiki site about Malaysia Airlines Flight MH370.


Directives to Robots

The most common method for website administrators to control web crawlers is by creating a robots.txt file. This is placed in the website root directory, usually with a sitemap.

The robots.txt file tells search engine robots where to find a sitemap, and contains instructions which allow or disallow robots' access to parts of the website. For example, some namespaces may be left open to crawling while others are disallowed.
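
As a sketch, a minimal robots.txt for a MediaWiki site might look like the lines below. The paths and the sitemap URL are illustrative only - they depend on how the wiki is installed and whether short URLs are configured.

  # Illustrative example - adjust paths to your installation
  User-agent: *
  Disallow: /index.php
  Sitemap: https://example.org/sitemap.xml

Here every crawler is told not to fetch URLs beginning with /index.php (which, on a short-URL installation, covers edit, history and other action links), while the Sitemap line tells crawlers where the sitemap file can be found.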

Google provides some insight into how their web crawler will interpret a robots.txt file, and also provides a tool to assist in the creation and testing of a robots.txt file.

However, MediaWiki presents some interesting problems because of its namespaces and Talk namespaces, Special pages, and other areas which may need to be specifically denied.

Robots and MediaWiki

MediaWiki has several methods to control robots:-

  1. A robots.txt file can be placed in the root directory. This can also link to a sitemap.
  2. Robot policies can be defined using variables in LocalSettings.php - see Manual:Robot policy
  3. Magic Words can be added to specific pages to control indexing (see the example after this list).
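
For method 3, a behaviour switch such as __NOINDEX__ can be added anywhere in a page's wikitext to ask search engines not to index that page, and __INDEX__ requests the opposite. Note that these magic words are ignored in any namespace listed in $wgExemptFromUserRobotsControl.

  <!-- anywhere in the page's wikitext -->
  __NOINDEX__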

Robots.txt File is Visible

Because the robots.txt file is always placed in the root directory, anyone can browse to websiteURL/robots.txt and the text will be displayed in a browser window.

For a normal website, if you list files or directories which you don't want robots to scan, you are also revealing that those files or directories exist to anyone who reads the file.

For a MediaWiki-based website the 'files or directories' are virtual - they are pages or articles and namespaces.

To prevent crawlers from accessing specific namespaces it is probably better to rely on the variables which can be configured in LocalSettings.php, and avoid publishing the namespace names in a robots.txt file.
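
As a sketch of that approach, namespace-level policies can be set with $wgNamespaceRobotPolicies in LocalSettings.php. The namespace choices below are illustrative only, using MediaWiki's built-in namespace constants:

  # LocalSettings.php - illustrative namespace robot policies
  $wgNamespaceRobotPolicies = [
      NS_TALK      => 'noindex,nofollow',  # Talk pages
      NS_USER      => 'noindex,nofollow',  # User pages
      NS_USER_TALK => 'noindex,nofollow',  # User talk pages
  ];

Because the exclusion is applied as a robots meta tag in the HTML of the pages themselves, the namespace names never need to appear in robots.txt.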


Manual:Robot policy

Although this Manual has limited information, it does list the variables which can be configured to tell web crawlers and other tools what to exclude. The 'Allow' directive was not part of the original robots.txt standard and is not honoured by every crawler, although major crawlers such as Googlebot do support it; in general robots.txt works by exclusion - anything not disallowed is allowed.
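
Where a crawler does support 'Allow' (Googlebot does), it is typically used to re-open a sub-path inside an otherwise disallowed directory; crawlers that ignore the directive simply apply the Disallow rules. The paths below are illustrative only:

  User-agent: *
  Disallow: /w/
  Allow: /w/load.php

Here the script directory /w/ is blocked, but load.php (which serves stylesheets and scripts) is re-allowed, so crawlers that honour 'Allow' can still fetch the resources needed to render pages.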

The configuration settings listed in the Manual:Robot policy are:-

  1. $wgDefaultRobotPolicy - the default indexing and following policy applied to every page.
  2. $wgNamespaceRobotPolicies - robot policies set per namespace.
  3. $wgArticleRobotPolicies - robot policies set for individual pages.
  4. $wgExemptFromUserRobotsControl - namespaces in which the __INDEX__ and __NOINDEX__ magic words are ignored.
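
A minimal sketch of the first and third of these in LocalSettings.php (the page title 'Sandbox' and the policy values are illustrative only; $wgNamespaceRobotPolicies is shown in the earlier example):

  # Default policy for every page on the wiki
  $wgDefaultRobotPolicy = 'index,follow';
  # Per-page overrides, keyed by page title
  $wgArticleRobotPolicies = [
      'Sandbox' => 'noindex,nofollow',
  ];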


Links

Manual:Robot policy
https://www.mediawiki.org/wiki/Manual:Robot_policy
Manual:robots.txt
https://www.mediawiki.org/wiki/Manual:Robots.txt
Manual:Noindex
https://www.mediawiki.org/wiki/Manual:Noindex
Introduction to robots.txt (Google Search Central)
https://developers.google.com/search/docs/crawling-indexing/robots/intro
How to write and submit a robots.txt file (Google Search Central)
https://developers.google.com/search/docs/crawling-indexing/robots/create-robots-txt
The Ultimate 2025 List of Web Crawlers and Good Bots: Identification, Examples, and Best Practices
https://www.humansecurity.com/learn/blog/crawlers-list-known-bots-guide/
Manual:Handling web crawlers
https://www.mediawiki.org/wiki/Manual:Handling_web_crawlers
This is a relatively new resource on the MediaWiki website which also mentions a new Extension:CrawlerProtection.