
Friday, April 29, 2016

Robots.txt File


Why shouldn’t I edit it with my Dreamweaver FTP client, for instance?
Because all those fancy apps insert useless crap like formatting, HTML code and whatnot. Most probably search engines aren’t capable of interpreting a robots.txt file like:
DOCTYPE text/plain PUBLIC 
"-//W3C//DTD TEXT 1.0 Transitional//Swahili" 
"http://www.w3.org/TR/text/DTD/plain1-transitional.dtd"> 
{\b\lang2057\langfe1031\langnp2057\insrsid6911344\charrsid11089941 
User-agent: Googlebot}
{ \lang2057\langfe1031\langnp2057\insrsid6911344\charrsid11089941 \line 
Disallow: / \line Allow: }{\cs15\i\lang2057\langfe1031\langnp2057\insrsid6911344\charrsid2903095
{\i\lang2057\langfe1031\langnp2057\insrsid6911344\charrsid2903095 content}{ \cs15\i\lang2057\langfe1031\langnp2057\insrsid6911344\charrsid2903095 /} ...

(Ok Ok, I’ve made up this example, but it represents the raw contents of text files saved with HTML editors and word processors.)
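If you want to check whether an editor has smuggled junk into your file, a rough sanity check could look like the sketch below. It is only an illustration: the function name `suspect_lines` is made up, and the list of accepted directives is simply the set discussed in this guide (User-agent, Disallow, Allow, Sitemap), so anything else, including directives some engines do accept, gets flagged.

```python
import re

# Directives covered in this guide; anything else is reported as suspect.
KNOWN = re.compile(r"^(user-agent|disallow|allow|sitemap)\s*:", re.IGNORECASE)


def suspect_lines(data: bytes) -> list[int]:
    """Return 1-based numbers of lines that don't look like plain robots.txt."""
    bad = []
    for number, raw in enumerate(data.splitlines(), start=1):
        try:
            line = raw.decode("ascii").strip()
        except UnicodeDecodeError:
            # Non-ASCII bytes usually mean the editor saved something fancy.
            bad.append(number)
            continue
        # Blank lines and comments are fine; everything else must be a directive.
        if line and not line.startswith("#") and not KNOWN.match(line):
            bad.append(number)
    return bad


clean = b"# hello crawlers\nUser-agent: *\nDisallow:\n"
messy = b"{\\b\\lang2057\\langfe1031\nUser-agent: Googlebot}\n"
print(suspect_lines(clean))
print(suspect_lines(messy))
```

An empty list means every line is blank, a comment, or one of the four directives. The check only looks at how a line starts, so trailing debris on an otherwise valid line slips through.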
Where do I put robots.txt?
Robots.txt resides in the root directory of your Web space, that’s either a domain or a subdomain, for example
"/web/user/htdocs/example.com/robots.txt"
resolving to
http://example.com/robots.txt.
Can I use Robots.txt in sub directories?
Of course you’re free to create robots.txt files in all your subdirectories, but you shouldn’t expect search engines to request/obey those. If for some weird reason you use subdomains like crap.example.com, then the example.com/robots.txt is not exactly a suitable instrument to steer crawling of subdomains, hence ensure each subdomain serves its own robots.txt. When you upload your robots.txt, make sure to do it in ASCII mode. Your FTP client usually offers “ASCII|Auto|Binary” – choose “ASCII” even when you’ve used an ANSI editor to create it.
Why?
Because plain text files contain ASCII content only. Sometimes standards that say “upload *.htm *.php *.txt .htaccess *.xml files in ASCII mode to prevent inadvertent corruption during the transfer, storing with invalid EOL codes, etc.” do make sense. (You asked for the idiot version, didn’t you?)
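The placement rule from the answers above (one robots.txt per host, always in the root) can be sketched in a few lines. The helper name `robots_url` is made up for this example; it only shows which file a crawler would look for when it visits a given page.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs the host of page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host, throw away path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://example.com/some/dir/page.html"))
print(robots_url("http://crap.example.com/deep/sub/dir/"))
```

The second call resolves to the subdomain's own file, which is why a robots.txt in a subdirectory, or on the main domain only, does nothing for crap.example.com.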
What about if I am on a Free Host?
If you’re on a free host, robots.txt is not for you. Your hosting service will create a read-only robots.txt “file” that’s suitable to steal even more traffic than its ads that you can’t remove from your headers and footers. Now, if you’re still interested in the topic, you must learn how search engines work to understand what you can achieve with a robots.txt file and what’s just myths posted on your favorite forum.
Sebastian, Do you know how search engines work, then?
Yep, to some degree. ;) Basically, a search engine has three major components: A crawler that burns your bandwidth fetching your unchanged files over and over until you’re belly up. An indexer that buries your stuff unless you’re Matt Cutts or blog on a server that gained search engine love making use of the cruelest black hat tactics you can think of. A query engine that accepts search queries and pulls results from the search index but ignores your stuff coz you’re neither me nor Matt Cutts.
What goes into the robots.txt file?
Your robots.txt file contains useful but pretty much ignored statements like
 # Please don't crawl this site during our business hours!
(the crawler is not aware of your time zone and doesn’t grab your office hours from your site), as well as actual crawler directives. In other words, everything you write in your robots.txt is a directive for crawlers (dumb Web robots that can fetch your contents but nothing more), not indexers (highly sophisticated algorithms that rank only brain farts from Matt and me).
Currently, there are only three statements you can use in robots.txt:
Disallow: /path
Allow: /path
Sitemap: http://example.com/sitemap.xml
Some search engines support other directives like “crawl-delay”, but that’s utter nonsense, hence safely ignore those.
The content of a robots.txt file consists of sections dedicated to particular crawlers. If you’ve nothing to hide, then your robots.txt file looks like:

 User-agent: *
 Disallow:
 Allow: /
 Sitemap: http://example.com/sitemap.xml 

If you’re comfortable with Google but MSN scares you, then write:

 User-agent: *
 Disallow:

 User-agent: Googlebot
 Disallow:

 User-agent: msnbot
 Disallow: /

Please note that you must terminate every crawler section with an empty line. You can gather the names of crawlers by visiting a search engine’s Webmaster section.
From the examples above you’ve learned that each search engine has its own section (at least if you want to hide anything from a particular SE), that each section starts with a

 User-agent: [crawler name]
line, and that each section is terminated with a blank line. The user agent name “*” stands for the universal Web robot, that means that if your robots.txt lacks a section for a particular crawler, it will use the “*” directives, and that when you’ve a section for a particular crawler, it will ignore the “*” section. In other words, if you create a section for a crawler, you must duplicate all statements from the “all crawlers” (“User-agent: *”) section before you edit the code.
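To see the “a specific section replaces the * section” rule in action, here is a small sketch using Python’s standard library parser (`urllib.robotparser`). The robots.txt content and the helper name `can_fetch` are invented for this example, and a real search engine’s parser may differ in the details.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: the Googlebot section forgot to repeat /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /tmp/
"""


def can_fetch(agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Parse robots_txt and ask whether agent may crawl url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# A crawler without its own section falls back to "*".
print(can_fetch("SomeOtherBot", "http://example.com/private/a.html"))
# Googlebot has its own section, so the "*" rules no longer apply to it.
print(can_fetch("Googlebot", "http://example.com/private/a.html"))
print(can_fetch("Googlebot", "http://example.com/tmp/a.html"))
```

Googlebot may crawl /private/ here, because its own section never mentions it. That is exactly why you have to copy the “all crawlers” statements into every crawler-specific section.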
Now to the directives. The most important crawler directive is

 Disallow: /path
“Disallow” means that a crawler must not fetch contents from URIs that match “/path”. “/path” is either a relative URI or a URI pattern (“*” matches any string and “$” marks the end of a URI). Not all search engines support wildcards, for example MSN lacks any wildcard support (they might grow up some day).
URIs are always relative to the Web space’s root, so if you copy and paste URLs then remove the http://example.com part but not the leading slash.
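The matching described above can be sketched as follows. This is an illustration under assumptions, not any search engine’s actual code: the function names are made up, and a plain “/path” is treated as a prefix match, with “*” and “$” handled as explained.

```python
import re
from urllib.parse import urlsplit


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regular expression."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything, then turn the escaped "*" back into "any string".
    body = re.escape(pattern).replace(r"\*", ".*")
    return re.compile("^" + body + ("$" if anchored else ""))


def rule_matches(pattern: str, url: str) -> bool:
    """True if the root-relative part of url matches the pattern."""
    parts = urlsplit(url)
    # Drop scheme and host, keep the leading slash (and the query string).
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return pattern_to_regex(pattern).match(path) is not None


print(rule_matches("/path", "http://example.com/path/page.html"))
print(rule_matches("/*.pdf$", "http://example.com/docs/a.pdf"))
print(rule_matches("/*.pdf$", "http://example.com/docs/a.pdf?x=1"))
```

The last call does not match, because “$” insists that the URI ends right after “.pdf”.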

Allow: /path/
refines Disallow: statements, for example
 User-agent: Googlebot 
 Disallow: / 
 Allow: /content/
allows crawling only within http://example.com/content/
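How a crawler decides between a Disallow: and an Allow: that both match is not spelled out here. The sketch below assumes the “most specific rule wins” behavior, where the longest matching path decides and Allow wins a tie. The function name and the rule format are made up, and wildcards are left out to keep it short.

```python
def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Decide whether path may be crawled.

    rules holds (directive, prefix) pairs from ONE crawler section,
    e.g. ("Disallow", "/"). Assumption: the longest matching prefix
    wins, and Allow beats Disallow when both are equally long.
    """
    best_len, allowed = -1, True  # no matching rule means "crawl it"
    for directive, prefix in rules:
        # An empty prefix ("Disallow:") restricts nothing, so skip it.
        if prefix and path.startswith(prefix):
            is_allow = directive.lower() == "allow"
            if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
                best_len, allowed = len(prefix), is_allow
    return allowed


googlebot_rules = [("Disallow", "/"), ("Allow", "/content/")]
print(is_allowed(googlebot_rules, "/content/page.html"))
print(is_allowed(googlebot_rules, "/secret/page.html"))
```

With the Googlebot section from the example, only URIs below /content/ come back as crawlable.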

Sitemap: http://example.com/sitemap.xml
points search engines that support the sitemaps protocol to the submission files.
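If you want to check which sitemaps your robots.txt actually announces, Python’s standard library parser can list them (`site_maps()` needs Python 3.8 or later). The helper name `sitemap_urls` is made up for this sketch.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow:
Allow: /
Sitemap: http://example.com/sitemap.xml
"""


def sitemap_urls(robots_txt: str) -> list[str]:
    """Return all Sitemap: URLs found in the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when the file has no Sitemap: lines.
    return parser.site_maps() or []


print(sitemap_urls(ROBOTS_TXT))
```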
Please note that all robots.txt directives are crawler directives that don’t affect indexing. Search engines do index disallow’ed URLs pulling title and snippet from foreign sources, for example ODP (DMOZ – The Open Directory) listings or the Yahoo directory. Some search engines provide a method to remove disallow’ed contents from their SERPs on request.
Say I want to keep a file / folder out of Google. Exactly what would I need to do?
You’d check each HTTP request for Googlebot and serve it a 403 or 410 HTTP response code. Or put a “noindex,noarchive” Googlebot meta tag in the page
(<meta name="googlebot" content="noindex,noarchive" />). Robots.txt blocks with Disallow: don’t prevent indexing. Don’t block crawling of pages that you want to have deindexed, because the crawler has to fetch a page to see its noindex tag, unless you want to use Google’s robots.txt based URL terminator every six months.
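The first option (answering Googlebot with a 410) could look roughly like this as a tiny WSGI application. Treat it as a sketch: the path list is hypothetical, and matching on the User-Agent string alone is naive, since anybody can fake that header.

```python
# Hypothetical list of paths that should vanish from Google.
BLOCKED_PREFIXES = ("/private/", "/old-stuff/")


def app(environ, start_response):
    """Minimal WSGI app: tell Googlebot that blocked pages are gone."""
    agent = environ.get("HTTP_USER_AGENT", "").lower()
    path = environ.get("PATH_INFO", "/")
    if "googlebot" in agent and path.startswith(BLOCKED_PREFIXES):
        start_response("410 Gone", [("Content-Type", "text/plain")])
        return [b"Gone\n"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Regular page content\n"]


def status_for(agent: str, path: str) -> str:
    """Call the app with a fake request and return the status line."""
    seen = []
    app({"HTTP_USER_AGENT": agent, "PATH_INFO": path},
        lambda status, headers: seen.append(status))
    return seen[0]


print(status_for("Mozilla/5.0 (compatible; Googlebot/2.1)", "/private/x.html"))
print(status_for("Mozilla/5.0 (Windows NT 10.0)", "/private/x.html"))
```

To try it on your own machine you could hand `app` to `wsgiref.simple_server.make_server` from the standard library.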
If someone wants to know more about robots.txt, where do they go?
Honestly, I don’t know a better resource than my brain, partly dumped here. I even developed a few new robots.txt directives and posted a request for comments a few days ago. I hope that Google, the one and only search engine that seriously invests in REP evolvements, will not ignore this post because of the sneakily embedded “Google bashing”. I plan to write a few more posts, not that technical and with real world examples.
Can I ask you how you auto generate and mask robots.txt, or is that not for idiots? Is that even ethical?
Of course you can ask, and yes, it’s for everybody and 100% ethical. It’s a very simple task, in fact it’s plain cloaking. The trick is to make the robots.txt file a server-side script. Then check all requests for verified crawlers and serve the right contents to each search engine. A smart robots.txt even maintains crawler IP lists and stores raw data for reports. I recently wrote a manual on cloaked robots.txt files on request of a loyal reader.
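A bare-bones version of that idea might look like the sketch below. It is not the script from the manual mentioned above. The robots.txt bodies and function names are invented, and the verification step assumes the reverse-then-forward DNS check that Google describes for Googlebot (host names ending in googlebot.com or google.com). IP lists and reporting are left out.

```python
import socket

# Hypothetical contents: friendly to verified Googlebot, closed to the rest.
ROBOTS = {
    "googlebot": "User-agent: Googlebot\nDisallow:\n",
    "*": "User-agent: *\nDisallow: /\n",
}


def verify_googlebot(ip: str) -> bool:
    """Reverse DNS, check the domain, then forward DNS back to the same IP."""
    try:
        host = socket.gethostbyaddr(ip)[0]
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        return ip in socket.gethostbyname_ex(host)[2]
    except OSError:  # lookup failed, so treat the visitor as unverified
        return False


def robots_txt_for(user_agent: str, ip: str, verify=verify_googlebot) -> str:
    """Pick the robots.txt body for one request."""
    if "googlebot" in user_agent.lower() and verify(ip):
        return ROBOTS["googlebot"]
    return ROBOTS["*"]


# A fake verifier keeps the demo offline; a real script would use the default.
print(robots_txt_for("Googlebot/2.1", "192.0.2.1", verify=lambda ip: True))
print(robots_txt_for("Googlebot/2.1", "192.0.2.1", verify=lambda ip: False))
```

A visitor that merely claims to be Googlebot fails the check and gets the “everybody else” version.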
Think Disney will come after you for your avatar now you are famous after being interviewed on the Hobo blog?
I’m sure they will try it, since your blog will become an authority on grumpy red crabs called Sebastian. I’m not too afraid though, because I use only a tiny thumbnailed version of an image created by a designer who –hopefully– didn’t scrape it from Disney, as icon/avatar. If they become nasty, I’ll just pay a license fee and change my avatar on all social media sites, but I doubt that’s necessary. To avoid such hassles I’ve bought an individually drawn red crab from an awesome cartoonist last year. That’s what you see on my blog, and I use it as avatar as well, at least with new profiles.
Who do you work for?
I’m a freelancer loosely affiliated with a company that sells IT consulting services in several industries. I do Web developer training, software design / engineering (mostly the architectural tasks), and grab development / (technical) SEO projects myself to educate yours truly. I’m a dad of three little monsters, working at home. If you want to hire me, drop me a line. ;)
Sebastian, a big thanks for slapping me about Robots.txt and indeed for helping me craft the Idiot’s Guide To Robots.txt. I certainly learned a lot from talking to you for a day, and I hope some others can learn from this snippet article. You’re a gentleman spammer. :)
If you enjoyed this step by step guide for beginners – you can take your knowledge to the next level at http://sebastians-pamphlets.com/
What Google says about Robots.txt files
A robots.txt file restricts access to your site by search engine robots that crawl the web. These bots are automated, and before they access pages of a site, they check to see if a robots.txt file exists that prevents them from accessing certain pages. (All respectable robots will respect the directives in a robots.txt file, although some may interpret them differently. However, a robots.txt is not enforceable, and some spammers and other troublemakers may ignore it. For this reason, we recommend password protecting confidential information.) To see which URLs Google has been blocked from crawling, visit the Blocked URLs page of the Health section of Webmaster Tools. You need a robots.txt file only if your site includes content that you don’t want search engines to index. If you want search engines to index everything in your site, you don’t need a robots.txt file (not even an empty one). While Google won’t crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web. As a result, the URL of the page and, potentially, other publicly available information such as anchor text in links to the site, or the title from the Open Directory Project (www.dmoz.org), can appear in Google search results.
Article published on http://www.hobo-web.co.uk/
