Digital hypocrisy: crossing the use/misuse continuum on the Web with ROBOTS.TXT
I've always found ROBOTS.TXT, the optional file you can save in a Web site, sound in purpose but extremely hypocritical and potentially lethal to a site's integrity. As a guy who’s been in technical marketing for more than a decade, I've always been interested in seeing how tidbits of information can be put to practical use in giving a site maximum exposure. As a budding developer years ago, this was also one of my first forays into “security”.
As a refresher, ROBOTS.TXT is a simple text file stored in the root directory of a Web site. It contains directives that tell search engine spiders which directories and subdirectories to avoid crawling, so that sensitive information stays out of their indexes. A simple concept, but the fact that the file can be read by any idiot with a browser and an Internet connection of any speed makes it dangerous.
For more on ROBOTS.TXT, visit http://www.robotstxt.org/wc/robots.html
It's literally like saying, "Hey, there are certain directories I have secretive content stashed in, and I don't want you to see them at all...and here they are."
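To make the mechanics concrete, here is a minimal Python sketch of how a well-behaved spider might consult the file before fetching a page. It uses the standard library's urllib.robotparser. The sample file contents, the directory names, the "ExampleBot" user agent and the www.example.com host are all invented for illustration and are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; the directory names are made up.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /internal-downloads/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines rather than one big string.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/index.html", "/admin/login.html"):
        url = "http://www.example.com" + path
        print(path, is_allowed(SAMPLE_ROBOTS_TXT, "ExampleBot", url))
```

Note that nothing here is enforced by the server. The check happens entirely on the spider's side, so it only restrains clients that choose to run it.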
Need proof? Check out these URLs for some good examples of how organizations in various industries creatively use the file:
http://www.intel.com/robots.txt
http://msn.espn.go.com/robots.txt
http://www.ford.com/robots.txt
http://www.cisco.com/robots.txt
http://www.cnet.com/robots.txt
http://slashdot.org/robots.txt
In fact, if memory serves, an engineer at Sun Microsystems wrote quite the scathing criticism several years back about the use of ROBOTS.TXT on www.sun.com, seeing as how it gave hackers one less challenge in breaking their stuff. (Sun apparently had a bunch of internal download sections, CGI scripts and administrative utilities located in directories they didn't want search engine spiders to find out about.) By storing the directory names in ROBOTS.TXT, Sun was essentially handing people the direct URL(s) to its private information. Granted, that information was password-protected, but the file still removed arguably THE major hurdle of hacking a site: figuring out which directories contain the good stuff.
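How little effort that takes can be sketched in a few lines of Python. The function below simply collects every non-empty Disallow value from the text of a ROBOTS.TXT file, which is all a curious visitor needs to do. The sample contents and directory names are hypothetical, and this is a rough illustration rather than a complete parser for the format.

```python
# Hypothetical robots.txt contents; the directory names are made up.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/        # scripts
Disallow: /admin-tools/
Disallow:
Allow: /public/
"""


def disallowed_paths(robots_txt: str) -> list[str]:
    """Collect every non-empty Disallow value, in file order."""
    paths = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        value = value.strip()
        # An empty Disallow means "nothing is off limits", so skip it.
        if field.strip().lower() == "disallow" and value:
            paths.append(value)
    return paths


if __name__ == "__main__":
    for path in disallowed_paths(SAMPLE_ROBOTS_TXT):
        print(path)
```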
As for me, I constantly use the robots META tag, something along the lines of <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">, in pages I don't want spiders to see. That normally does the trick. Using ROBOTS.TXT improperly just invites users savvy enough to know it exists (as many of you now do, after reading this) to type in your site’s domain name and append “/robots.txt”.
To play the file’s proponent, it does do an effective job of keeping well-behaved spiders from indexing your stuff. And sure, that keeps the content out of sight of, I'd dare say, 97% of the Web browsing community. It would only be Web developers trying to hack Web developers, and one would hope that there would be enough honor among thieves, as it were, or at least an appreciation for parity, that savvy people would not engage in such pursuits.
However, some organizations do use the file to their advantage, implementing it not as a security measure but as a way to keep redundant content, or data that would otherwise clutter the Web even more, from being indexed.
Check out:
http://www.asp.net/robots.txt
http://www.google.com/robots.txt
And just in case you’re wondering, don’t even bother looking for the file on my site - it doesn’t exist. :)