Sep 19th

Robot Theft for Webmasters

Filed under Webmaster/Blogger | Posted by Saphrym

Reprinted from Theebs.com, my old site.[1]


Warning! Are you letting people know your web site’s innermost secrets?

Robots.txt. Many webmasters see this file name pop up in their error logs, which means a search engine tried to fetch it but couldn’t find it.

The reason they tried to fetch it is that they (they being the NICE search engine robots) want to make sure they don’t go snooping somewhere they don’t belong.

The problem is that, for this to work, the robots.txt file has to be in your main directory, and it has to be publicly accessible.
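To see why the file has to sit in the main directory, here is a minimal Python sketch of how a robot works out where to look. The helper name robots_url is made up for illustration; the only real calls are urlsplit and urlunsplit from the standard library.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    """Return the robots.txt address a robot would request for page_url."""
    parts = urlsplit(page_url)
    # Only the scheme and host survive; the path is always /robots.txt.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("http://www.saphrym.com/free/something.html"))
# prints http://www.saphrym.com/robots.txt
```

No matter how deep the page is, the robot asks for the same top-level file, so a robots.txt tucked into a subdirectory is never read.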

Here’s a good example:

EXAMPLE

That’s right! It’s Google’s robots.txt file!

So anyone can see this file.

Now that doesn’t mean you need to go and completely delete the file to make sure people can’t see your secret stuff. It’s better to keep this file so the search engines keep your secret stuff out of their lists.

But there are unscrupulous people (we’ll call them Robot Thieves) who use these files to GAIN access to your private stuff. Some Robot Thieves even send out their own little spy robots to read robots.txt files just to find vulnerabilities (or free products) on your site.

To make sure they don’t have access to your private files, follow these steps:

  1. Make sure you do not have any references to actual files in your robots.txt file.
  2. Use only directory entries, and put “no access” index files in those directories. A “no access” index file would be a .html or .php file that either redirects back to the main page or tells the person they don’t have access. This will be explained further in a bit.

In other words:

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /free/

Is a good robot file.

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /free/something.html

Is NOT a good robot file.
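If you want to check your own file against rule 1, a rough Python sketch like the one below can flag the risky lines. The function name file_entries is hypothetical, and it leans on one assumption of mine: a Disallow path that ends in "/" is a directory, and anything else is treated as a file.

```python
def file_entries(robots_txt):
    """Return Disallow paths that look like files rather than directories."""
    flagged = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        field, _, value = line.partition(":")
        if field.strip().lower() != "disallow":
            continue
        path = value.strip()
        # Assumption: directory entries end with "/"; the rest are files.
        if path and not path.endswith("/"):
            flagged.append(path)
    return flagged

good = "User-agent: *\nDisallow: /cgi-bin/\nDisallow: /images/\nDisallow: /free/\n"
bad = "User-agent: *\nDisallow: /cgi-bin/\nDisallow: /images/\nDisallow: /free/something.html\n"

print(file_entries(good))  # prints []
print(file_entries(bad))   # prints ['/free/something.html']
```

An entry such as /private (no trailing slash) would also be flagged, which is probably what you want, since it tells a Robot Thief just as much.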

The first one was actually an exact duplicate of my robots.txt file.

Within those three directories I have an index.html file that redirects the user to the main web page by using the following:

<HTML><HEAD>
<TITLE>Sorry, this page is for private viewing. We’re taking you back to the main page.</TITLE>
</HEAD>
<BODY onLoad="window.setTimeout(function () { location.href = 'http://www.saphrym.com'; }, 0)">
<A HREF="http://www.saphrym.com">Click here</A> if you don’t go back automatically.
</BODY></HTML>

Or if you are using PHP, make an index.php with the following code:

<?php

// Send the visitor back to the main page, then stop so nothing else is output.
header("Location: http://www.saphrym.com");
exit;

?>

That keeps the snoopers from getting a listing of those directories, and it still lets you keep the NICE search engines (the ones that actually listen to robots.txt) from spidering the more private parts of your web site.
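For the curious, here is roughly what "listening" looks like from the robot's side, sketched with Python's standard urllib.robotparser module. The robot name ExampleBot and the helper allowed are placeholders I made up; the rules are the three-directory file shown earlier.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /free/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())  # feed the rules in directly, no network needed

def allowed(path):
    """Would a polite robot called ExampleBot (made-up name) fetch this path?"""
    return parser.can_fetch("ExampleBot", "http://www.saphrym.com" + path)

print(allowed("/index.html"))           # prints True
print(allowed("/free/something.html"))  # prints False
```

A Robot Thief simply skips this check, which is why the "no access" index files matter.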

OVERVIEW:

  1. Create a directory to hold your private files.
  2. Create a “no access” index.html or index.php file in the folder.
  3. Add that directory to your robots.txt file.
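The three steps above can be sketched as one small Python script. Everything named here is my own illustration: protect_directory is a hypothetical helper, the page uses a meta refresh to the site root as a stand-in for the redirect pages shown earlier, and the script assumes the last block in your robots.txt is the "User-agent: *" one, since it just appends the new rule at the end.

```python
import tempfile
from pathlib import Path

# Stand-in "no access" page (assumption: a meta refresh back to the site root).
NO_ACCESS_PAGE = """<HTML><HEAD>
<TITLE>Sorry, this page is for private viewing.</TITLE>
<META HTTP-EQUIV="refresh" CONTENT="0; URL=/">
</HEAD>
<BODY><A HREF="/">Click here</A> if you don't go back automatically.</BODY></HTML>
"""

def protect_directory(site_root, name):
    """Create the private directory, its index file, and the robots.txt rule."""
    root = Path(site_root)
    private = root / name
    private.mkdir(parents=True, exist_ok=True)      # step 1: the directory
    index = private / "index.html"
    if not index.exists():
        index.write_text(NO_ACCESS_PAGE)            # step 2: the "no access" page
    robots = root / "robots.txt"
    rule = f"Disallow: /{name}/"
    text = robots.read_text() if robots.exists() else "User-agent: *\n"
    if rule not in text.splitlines():               # step 3: add the rule once
        if not text.endswith("\n"):
            text += "\n"
        robots.write_text(text + rule + "\n")
    return private

# Try it in a throwaway directory instead of a live site.
demo_root = tempfile.mkdtemp()
protect_directory(demo_root, "free")
protect_directory(demo_root, "free")  # running it twice does not duplicate the rule
print((Path(demo_root) / "robots.txt").read_text())
```

Point site_root at a copy of your site first and look over the resulting robots.txt before uploading anything.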

That’s all there is to it. Kick those Robot Thief butts!

  1. An article I wrote when I ran the Theebs.com website back in 2000.

