How to stop automated web robots from visiting ASP/ASP.NET websites
While the growth in website users over the last few years has been spectacular, there has been a corresponding increase in unwelcome website visitors. Many websites are now plagued by unwanted automated web robots which steal content, interfere with interactive website elements and consume large amounts of bandwidth. This article helps you determine whether your website has a robot problem, and what to do about it if it does.
Do you have a robots problem?
The scale of the robot problem largely depends on the type of website as well as the type of content it offers. Indicators consistent with robot activity include unusually high bandwidth usage from a single IP address, large numbers of requests arriving at regular intervals or faster than a human could browse, and visitors that request pages but never the images or style sheets within them.
Although some of these indicators will be identified by web server statistics packages, it is often necessary to look at the log files manually in a text editor or to use a specialized log reporting tool such as Microsoft's Log Parser.
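For example, Log Parser can summarize which user agents are making the most requests in an IIS W3C log. This query is a sketch; the ex*.log file name pattern is illustrative and depends on your logging configuration:

```
LogParser.exe -i:IISW3C "SELECT cs(User-Agent), COUNT(*) AS Hits FROM ex*.log GROUP BY cs(User-Agent) ORDER BY Hits DESC"
```

A handful of user agents accounting for a disproportionate share of hits is often the first sign of robot activity.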
Why robots are a problem
There are a number of problems associated with robots.
Large amounts of web robot traffic cause an increase in the bandwidth consumed by the website. On top of the increased financial cost of bandwidth, the bandwidth usage can reduce overall server performance, especially if the robots are making large numbers of requests to resource intensive pages, such as database search results pages.
Automated web traffic can distort website statistics, especially if there is a large amount of robot traffic or the robot traffic varies significantly from month to month. This can lead to awkward questions from senior management if they notice unusual traffic peaks. It also makes it difficult to gauge the success of marketing campaigns, etc.
The robots may be harvesting your website's content. Stealing website content and republishing it in order to profit from pay-per-click advertising is a highly profitable industry.
Techniques for Stopping Robots
The Web Robots Exclusion Standard
There is a semi-official standard for preventing robots from visiting all or part of a website. This is the Standard for Robot Exclusion, the details of which are at http://www.robotstxt.org/wc/norobots.html. The standard proposes that a website wanting to control the behavior of visiting robots should do so through a robots.txt text file placed in the root of the web server (e.g. http://www.foo.com/robots.txt).
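For example, a robots.txt file that asks all robots to stay out of a /private/ folder, and one named robot to stay away from the site entirely, looks like this (the folder path and robot name are illustrative):

```
# Ask all robots to avoid the /private/ section of the site
User-agent: *
Disallow: /private/

# Ask one badly behaved robot to avoid the whole site
User-agent: BadBot
Disallow: /
```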
Unfortunately, the Standard for Robot Exclusion is not an official standard and has never been ratified by an official Internet organization. Robots are under no obligation to follow the guidelines in a robots.txt file, so in practice a robots.txt file will stop only the most well-behaved robots from visiting.
The robots meta tag
Although the web robots exclusion standard is useful for stopping certain robots from visiting an entire website or sections of it, it is not well suited to stopping robots visiting individual pages. The other drawback is that a robots.txt file must be placed in the root folder of the website - something that is not always possible, depending on the configuration of the web hosting plan or the internal IT regulations of a large corporation.
For this reason it is sometimes better to use the robots meta tag in individual pages of the website. The HTML required for stopping a robot indexing a page is:
<meta name="robots" content="noindex">
This HTML should be placed within the <head> element of the document.
It is also possible to stop a robot from following the links from a particular document using the following syntax.
<meta name="robots" content="nofollow">
The two instructions can also be combined in a single meta tag.
<meta name="robots" content="noindex, nofollow">
However, as with robots.txt, the robots meta tag will stop only the most well-behaved robots.
Make registration mandatory
If you have valuable content on your website and it is appropriate to do so, it may be worthwhile to make all or part of the website content only accessible once a user has logged in.
The main drawback is that a mandatory login will also stop search engines' own robots from visiting the website's content, making the website less visible in search engine indexes. If a significant portion of your website's revenue-earning traffic comes from search engine referrals, this technique will obviously be counter-productive.
Slowing robots down
An alternative to stopping robots altogether is to slow them down. Many of the common legitimate robots that visit websites and obey the robots exclusion protocol can be slowed down. For example, to slow down Yahoo!'s robot so that it requests URLs with reduced frequency, the following lines can be added to the robots.txt file.
User-agent: Slurp
Crawl-delay: 10
Note that Crawl-delay is measured in seconds.
Unfortunately, there is no agreed standard for slowing down robots, so it has to be implemented on a robot by robot basis.
For robots that do not understand any instructions to slow down, it is possible to force them to slow down. This could be achieved by writing a custom add-on to a website that introduces a delay in returning content should a specific user make more than a certain number of requests in a specific time period.
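As a sketch of this idea - in JavaScript, which classic ASP supports as the JScript scripting language; the function name, thresholds and in-memory storage are all illustrative rather than part of any ASP API - a simple throttle might track request times per client and report how long to delay once a client exceeds a limit:

```javascript
// Minimal request throttle: tracks request timestamps per client key
// (e.g. an IP address) and returns a delay in milliseconds once that
// client exceeds maxRequests within a sliding window of windowMs.
// All names and thresholds here are illustrative.
function createThrottle(maxRequests, windowMs, delayMs) {
  var history = {}; // clientKey -> array of request timestamps

  return function (clientKey, now) {
    var times = history[clientKey] || (history[clientKey] = []);
    // Discard timestamps that have fallen outside the sliding window
    while (times.length > 0 && now - times[0] > windowMs) {
      times.shift();
    }
    times.push(now);
    // Over the limit: tell the caller how long to delay the response
    return times.length > maxRequests ? delayMs : 0;
  };
}
```

In a real page the client key would typically be the value of Request.ServerVariables("REMOTE_ADDR"), and a production version would also need to cope with application restarts and unbounded memory growth.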
As an alternative to writing a custom add-on, it is possible to find commercial offerings that will accomplish the same. The Slow Down Manager ASP.NET component within VAM: Visual Input Security is able to slow down anyone who makes repeated requests for pages and can be configured to deny them access to the pages if they make more than a certain number of requests. Further details about the Slow Down Manager are available from http://www.peterblum.com/VAM/VISETools.aspx#SDM.
While slowing down robots is in theory a good solution, it is fraught with difficulties. For example, most robots can be configured to visit websites at preset intervals. If the robot user noticed it was being slowed down, it could simply increase the time interval between robot visits. Slowing down website visitors based on IP address may also reduce response times for legitimate users using the same web cache/proxy server as the robot user. Slowing down robots by introducing a delay in the response time would also use up processor resources while the delay was introduced.
While stopping robots from visiting is one solution, the other is to make your website a lot less useful to them. This can be achieved by either making the website structure difficult to navigate, or by obfuscating the content so that it is more difficult to parse and extract content.
Obfuscating the content of the website
A straightforward way of making life more difficult for robots is to use the .NET Framework. The HTML produced by ASP.NET can be more difficult to parse than that created using classic ASP. This is particularly so if the content the robots are interested in can only be displayed after a form is posted back. The .NET Framework gives form fields names such as _ctl10__ctl1_DropDownListPrice which can often be inconsistent if the page contains different numbers of controls each time it is viewed or it contains controls with many subcontrols within them, such as DataGrids.
Blocking robot user-agents
Most requests made to a web server contain a description of the web browser or automated web robot being used - the "user agent string". This description can be accessed via the HTTP_USER_AGENT server variable, Request.ServerVariables("HTTP_USER_AGENT"), in both VBScript in classic ASP and VB.NET in ASP.NET. Most legitimate robots identify themselves. For example, Google's content retrieval robot identifies itself as:
Mozilla/5.0 (compatible; Googlebot/2.1; http://www.google.com/bot.html)
A web browser will generally identify itself as something like:
Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0).
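The check itself is straightforward string matching. Here is a minimal sketch in JavaScript (classic ASP's JScript dialect), with an illustrative and deliberately incomplete signature list:

```javascript
// Returns true if a user agent string matches a list of known robot
// signatures. The signature list is illustrative only; real robot
// lists are far longer and need regular maintenance.
function looksLikeRobot(userAgent) {
  var signatures = ["googlebot", "slurp", "msnbot", "crawler", "spider"];
  if (!userAgent) {
    // Many robots (and some proxies) send no user agent at all
    return true;
  }
  var ua = userAgent.toLowerCase();
  for (var i = 0; i < signatures.length; i++) {
    if (ua.indexOf(signatures[i]) !== -1) {
      return true;
    }
  }
  return false;
}
```

Because the user agent string can be faked or stripped out, a check like this should only ever be treated as a first filter.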
However, there are now so many variants of the user agent string that it can be difficult to keep up with things. Classic ASP used to have a Browser Capabilities component that could be used to identify web browsers, but it relied on manually updating the server's browscap.ini file as new web browsers were released.
Commercial alternatives to the Browser Capabilities component are often much better at identification of user agents. Of the various commercial offerings, BrowserHawk is probably the best known. Its ASP component contains a Crawler property that can be used to determine if the client is a robot.
While the user agent string can in theory be used to identify and block robots, robot users are able to "fake" the string. The usual method is to copy the user agent string of a commonly used web browser, such as Internet Explorer 6 on Windows. The web server is then unable to distinguish the robot from normal website users unless more sophisticated robot detection techniques are employed.
A further problem is that an increasing number of proxy servers are configured to strip out information such as the user agent string from the request, so it is not uncommon to see the user agent masked or absent altogether.
Robot honey pot
Since the user agent string is open to abuse, a more sophisticated method of stopping robots is required.
One way of achieving this is to look for website visitors that request a high ratio of pages to other content such as images. Robots are primarily interested in text content, so this is a good way of identifying them. The downside is that this is not straightforward to accomplish in ASP or ASP.NET, but it can be done by analyzing the web server's log files, for example with Microsoft's Log Parser. Alternatively, the analysis could be done in near real time by using an ISAPI filter to log requests as they are made to the web server. Requests could also be logged to SQL Server, but for a large website this would require substantial SQL Server resources to cope with the amount of data generated.
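The ratio check itself is simple once the log has been parsed. Here is a sketch in JavaScript, assuming the log entries have already been reduced to records with ip and url fields (the record shape, file extensions and threshold are all illustrative):

```javascript
// Given parsed log records of the form { ip: "...", url: "..." },
// return the IPs whose ratio of page requests to image requests
// exceeds a threshold. An IP that requests many pages but few or no
// images is behaving like a robot rather than a browser.
function findSuspectIps(records, threshold) {
  var counts = {}; // ip -> { pages: n, images: n }
  for (var i = 0; i < records.length; i++) {
    var r = records[i];
    var c = counts[r.ip] || (counts[r.ip] = { pages: 0, images: 0 });
    if (/\.(gif|jpe?g|png)$/i.test(r.url)) {
      c.images++;
    } else if (/\.(asp|aspx|html?)$/i.test(r.url)) {
      c.pages++;
    }
  }
  var suspects = [];
  for (var ip in counts) {
    var c2 = counts[ip];
    // +1 avoids division by zero for visitors with no image requests
    if (c2.pages / (c2.images + 1) > threshold) {
      suspects.push(ip);
    }
  }
  return suspects;
}
```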
A variant of this is to look for website visitors that just request the dynamic parts of the site. For example, an online store may have product catalog pages that robots will tend to visit in order to extract the product details and republish on another site, such as a shopping comparison site. The exact pattern of robot usage will tend to vary depending on the type of content offered by the website.
Instead of looking through log files, an alternative way of identifying robots is to put a hidden link on a page which only robots will follow. The link takes the robot to an ASP page that logs its IP address to a database. This technique is not effective against robots that only visit specific pages within the website, but it is reasonably good at identifying robots that crawl entire sites.
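For example, the hidden link could be styled so that browsers never display it, while crawlers that follow every href in the page still request the trap page (trap.asp is an illustrative name):

```
<!-- Invisible to human visitors; robots that follow every link will
     request trap.asp, which records the client IP address -->
<a href="trap.asp" style="display:none"></a>
```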
Once a robot has been identified, it can be blocked from the site. The usual method is to deny requests from the robot's IP address.
Testing your robot defenses
If you want to robot-proof your website and then test the results, I wrote a small utility - The Website Utility - to simulate a robot's-eye view of a website.