Jason Salas' WebLog

On-air and online: making people laugh, making people think, pissing people off


Digital hypocrisy: crossing the use/misuse continuum on the Web with ROBOTS.TXT

I've always found ROBOTS.TXT, the optional file you can save in a Web site, sound in purpose but extremely hypocritical and potentially lethal to a site's integrity. As a guy who's been in technical marketing for more than a decade, it's always been an interest of mine to see how tidbits of information can be put to practical use in giving a site maximum exposure. As a budding developer years ago, this was also one of my first forays into "security".

As a refresher, ROBOTS.TXT is a simple text file stored in the root directory of a Web site. It contains metadata telling search engine spiders which directories and subdirectories to avoid crawling, so that sensitive information stays out of their indexes. A simple concept, but the fact that these files can be read by any idiot with a browser and an Internet connection of any speed makes them dangerous.
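To make the mechanics concrete, here is a minimal Python sketch of how a well-behaved spider might consult the file before fetching a page. The file contents, the directory names, the `ExampleSpider` user agent and the `www.example.com` URLs are all made up for illustration; only the standard library's `urllib.robotparser` module is real.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; these directory names are invented
# for illustration and do not come from any real site.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /downloads/internal/
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file as a list of lines, so no network access
    # is needed for this sketch.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://www.example.com/index.html",
        "http://www.example.com/admin/login.html",
    ):
        verdict = "fetch" if allowed(SAMPLE_ROBOTS_TXT, "ExampleSpider", url) else "skip"
        print(verdict, url)
```

Note that nothing here is enforced: the spider simply chooses to obey the file, and a client that ignores it can request the same URLs.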

For more on ROBOTS.TXT, visit
http://www.robotstxt.org/wc/robots.html

It's literally like saying, "Hey, there are certain directories where I have secret content stashed, and I don't want you to see them at all...and here they are."
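How little effort that takes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample file and its paths are hypothetical, and `disallowed_paths` is a helper name invented here, not part of any library.

```python
from typing import List

# Hypothetical robots.txt contents, invented for illustration.
SAMPLE_ROBOTS_TXT = """\
# keep spiders out of the back office
User-agent: *
Disallow: /admin/      # administrative utilities
Disallow: /cgi-bin/
Disallow:
"""


def disallowed_paths(robots_txt: str) -> List[str]:
    """Collect every non-empty Disallow value, in order, without duplicates."""
    paths: List[str] = []
    for raw_line in robots_txt.splitlines():
        # Strip trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow":
            value = value.strip()
            # An empty Disallow means "nothing is off limits", so skip it.
            if value and value not in paths:
                paths.append(value)
    return paths


if __name__ == "__main__":
    for path in disallowed_paths(SAMPLE_ROBOTS_TXT):
        print("http://www.example.com" + path)
```

The output is exactly the list of directories the site owner hoped nobody would look at.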

Need proof? Check out these URLs for some good examples of how various organizations in various industries creatively use the file:

http://www.intel.com/robots.txt
http://msn.espn.go.com/robots.txt
http://www.ford.com/robots.txt
http://www.cisco.com/robots.txt 
http://www.cnet.com/robots.txt 
http://slashdot.org/robots.txt 

In fact, if memory serves, an engineer at Sun Microsystems several years back wrote quite the scathing criticism of the use of ROBOTS.TXT on www.sun.com, seeing as how it gave hackers one less hurdle to clear in breaking into their stuff (Sun apparently had a bunch of internal download sections, CGI scripts and administrative utilities located in directories they didn't want search engine spiders to find out about). By storing the directory names in ROBOTS.TXT, Sun was essentially giving people the direct URL(s) to their private information. Granted, that information was password-protected, but the file still removed arguably THE major hurdle of hacking a site - figuring out which directories contain the good stuff.

As for me, I consistently use the robots META tag in pages I don't want spiders to see. That normally does the trick. Using ROBOTS.TXT improperly just invites users savvy enough to know it exists (as many of you now do, after reading this) to type in your site's domain name and append "/robots.txt".
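For reference, here is a rough sketch of the per-page approach, assuming the tag in question is the usual robots META tag with a `noindex` directive. The sample page is hypothetical, and `RobotsMetaParser` and `robots_directives` are names invented for this example; the parsing relies only on Python's standard `html.parser` module.

```python
from html.parser import HTMLParser

# Hypothetical page; the META tag sits in the page itself, so there is
# no central list of hidden directories for anyone to read.
SAMPLE_PAGE = """\
<html>
  <head>
    <title>Internal notes</title>
    <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
  </head>
  <body>Nothing to see here.</body>
</html>
"""


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() != "robots":
            return
        for token in attr_map.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html: str) -> set:
    """Return the set of robots directives declared in the page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    directives = robots_directives(SAMPLE_PAGE)
    print("index this page?", "noindex" not in directives)
```

A spider only learns about the restriction after it has already found the page, which is the point: the tag hides nothing from visitors, but it also advertises nothing.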

To play the file's proponent, it does do an effective job of preventing spiders from indexing your stuff. And sure, this locks out unwanted access from, I'd dare say, 97% of the Web browsing community. That would leave only Web developers trying to hack Web developers, and one would hope that there would be enough honor among thieves, as it were, or at least an appreciation for parity, that savvy people would not engage in such pursuits.

However, some organizations do use the file to their advantage, implementing it not as a security measure but as a way to keep redundant content, or data that would otherwise clutter the Web even more, from being indexed.

Check out:
http://www.asp.net/robots.txt 
http://www.google.com/robots.txt 

And just in case you’re wondering, don’t even bother looking for the file on my site - it doesn’t exist. :)

Comments

Russ C. said:

I thought this one was quite good,

http://www.whitehouse.gov/robots.txt
# December 23, 2003 4:52 AM

Jason Salas said:

Good grief! :)
# December 23, 2003 5:04 AM
