CrawlWall™ Technology Overview
CrawlWall uses the following technologies to secure your website and protect your content. The methods are designed to work together so that permitted spiders and legitimate visitors reach your website without issue, while rogue crawlers are stopped before they gain access.
Dynamic Robots.txt
All spiders are shown a custom robots.txt file based on user-defined permissions. Each robot that is allowed to crawl your website is shown an individual robots.txt file reflecting its permissions. All other spiders are shown a robots.txt file that denies everything, so if they continue to crawl we know they ignore robots.txt and stop them instantly.
An additional benefit of this technique is that the list of allowed robots is never exposed to the outside world. CrawlWall also monitors for spiders that test different user agents to gain access, so changing user agents, or "spoofing", results in an instant ban of the rogue robot.
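As a rough illustration of the idea above, here is a minimal Python sketch of serving a per-spider robots.txt. It is not CrawlWall's actual implementation: the ALLOWED_SPIDERS table, its paths, and the robots_txt_for function are hypothetical names chosen for this example. Note that the individual file uses "User-agent: *", so the response never reveals which other spiders are permitted.

```python
# Hypothetical permission table: user-agent token -> paths that spider may NOT crawl.
# An empty list means the spider may crawl everything.
ALLOWED_SPIDERS = {
    "googlebot": ["/private/"],
    "slurp": [],
}

# Served to every spider that has not been given permission.
DENY_ALL = "User-agent: *\nDisallow: /\n"


def robots_txt_for(user_agent: str) -> str:
    """Return the robots.txt body to serve for the given User-Agent header."""
    ua = user_agent.lower()
    for token, disallowed in ALLOWED_SPIDERS.items():
        if token in ua:
            lines = ["User-agent: *"]
            if disallowed:
                lines += [f"Disallow: {path}" for path in disallowed]
            else:
                # An empty Disallow value means nothing is off limits.
                lines.append("Disallow:")
            return "\n".join(lines) + "\n"
    return DENY_ALL
```

Any spider that receives the deny-all file and keeps requesting pages has shown that it ignores robots.txt, which is the signal described above.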
White-List Opt-In Permissions
There are simply too many spiders to block individually, so CrawlWall doesn't play that game: we're opt-in only.
Spiders that identify themselves with a proper user agent which has been given permission to crawl, like "Googlebot" or "Yahoo Slurp", are permitted access only if they match all known criteria. CrawlWall checks both the spider name and the known IP addresses from which that spider originates before it is allowed to crawl. This allows CrawlWall to easily filter out fake spiders and spiders crawling through a proxy site.
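The name-plus-address check can be sketched in Python as follows. This is an assumption-laden illustration rather than CrawlWall's code: KNOWN_SPIDERS and is_permitted_spider are hypothetical, and the networks shown are reserved documentation ranges, not the real address ranges of any search engine.

```python
import ipaddress

# Hypothetical whitelist: user-agent token -> networks that spider is known to crawl from.
# The ranges below are reserved documentation networks, used purely as placeholders.
KNOWN_SPIDERS = {
    "googlebot": [ipaddress.ip_network("192.0.2.0/24")],
    "slurp": [ipaddress.ip_network("198.51.100.0/24")],
}


def is_permitted_spider(user_agent: str, remote_ip: str) -> bool:
    """Allow a spider only if its name AND its source address both match."""
    ua = user_agent.lower()
    addr = ipaddress.ip_address(remote_ip)
    for token, networks in KNOWN_SPIDERS.items():
        if token in ua:
            # The name matched; the address must also come from a known network.
            return any(addr in net for net in networks)
    # Unknown names are never permitted (opt-in only).
    return False
```

A request claiming to be "Googlebot" from an address outside the listed networks fails the check, which is how a fake spider or a spider relayed through a proxy would be filtered out.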
Second Pass Filters
Everything that passes the first stage of permissions is then filtered for additional crawlers and website downloaders, which usually claim to be something like Internet Explorer but include a word such as "Downloader" or "Crawler" in the user agent string.
Surprisingly, there are a lot of these types of applications and the list grows daily.
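A second-pass filter of this kind might look like the Python sketch below. The pattern only covers the two words named above; a real list would be much longer, and the names SUSPICIOUS_WORDS and is_disguised_crawler are hypothetical.

```python
import re

# Only the two example words from the text; a production list would keep growing.
SUSPICIOUS_WORDS = re.compile(r"downloader|crawler", re.IGNORECASE)


def is_disguised_crawler(user_agent: str) -> bool:
    """Flag browser-like user agents that also contain a telltale word."""
    return bool(SUSPICIOUS_WORDS.search(user_agent))
```

The match is a case-insensitive substring search, so compound names such as "SiteDownloader" are caught as well.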
Ban by IP or Address
Many rogue spiders and bots originate from server-only hosting locations, so all such locations that we know about are filtered out as well. This means that server farms hosting new spiders or data aggregators tend to get blocked before they crawl the first page.
Proxy Blocking
Some spiders, people with malicious intent, and snooping competitors use anonymous proxy servers to avoid being identified. CrawlWall blocks these proxies, as most anonymous proxy sites filter your web pages, remove your advertisements, break your navigation, and then place third-party ads on your web pages.
Presenting a Challenge
When robot activity is suspected, CrawlWall presents a random challenge to the visitor. A human can get it right in one or two tries, but it traps a robot that just keeps asking for more and more pages without ever successfully answering the challenge.
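The trap logic can be sketched in Python as a small per-session tracker. Everything here is an assumption made for illustration: the ChallengeSession class, the MAX_UNANSWERED_REQUESTS threshold of 5, and the "serve", "challenge", and "trapped" outcomes are not taken from CrawlWall itself.

```python
from dataclasses import dataclass

# Hypothetical threshold: page requests tolerated while the challenge is unanswered.
MAX_UNANSWERED_REQUESTS = 5


@dataclass
class ChallengeSession:
    expected_answer: str
    passed: bool = False
    unanswered_requests: int = 0

    def submit_answer(self, answer: str) -> bool:
        """A human may retry; one correct answer passes the session."""
        if answer.strip().lower() == self.expected_answer.lower():
            self.passed = True
        return self.passed

    def request_page(self) -> str:
        """Decide what to do with a page request in this session."""
        if self.passed:
            return "serve"
        self.unanswered_requests += 1
        if self.unanswered_requests > MAX_UNANSWERED_REQUESTS:
            # Keeps asking for pages but never answers: treat it as a robot.
            return "trapped"
        return "challenge"
```

A visitor who answers wrongly once and then correctly is served normally, while a client that requests page after page without answering crosses the threshold and is trapped.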
Quarantined IPs
Badly behaving IP addresses are temporarily quarantined until CrawlWall can determine whether the IP address is a shared proxy server, like AOL, which may be used by both spiders and humans, or a dedicated crawler location. If it's a shared location, then each subsequent session is met with a challenge that allows a human access but stops spiders cold. If the IP address is never used by a human, meaning the challenge is never answered, then the IP address is eventually escalated to banned.
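The quarantine lifecycle described above can be pictured as a small state machine. The Python sketch below is one possible reading, not CrawlWall's implementation: the IpRecord class, the status names, and the BAN_AFTER_UNANSWERED_SESSIONS threshold of 10 are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical threshold: unanswered sessions before a quarantined IP is banned.
BAN_AFTER_UNANSWERED_SESSIONS = 10


@dataclass
class IpRecord:
    status: str = "quarantined"  # "quarantined", "shared", or "banned"
    unanswered_sessions: int = 0

    def record_session(self, challenge_answered: bool) -> str:
        """Update the IP's status after one session and return the new status."""
        if self.status == "banned":
            return self.status
        if challenge_answered:
            # A human answered from this address, so treat it as a shared location.
            self.status = "shared"
        elif self.status == "quarantined":
            self.unanswered_sessions += 1
            if self.unanswered_sessions >= BAN_AFTER_UNANSWERED_SESSIONS:
                self.status = "banned"
        return self.status
```

In this sketch an address marked "shared" is never banned outright; each later session from it would still be met with a challenge, so humans get through and spiders do not.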