Tuesday, 6 December 2005
Crawlers and Aggregators and Apache Logging Tricks 
Short and geekly sweet this post will be. I've talked about crawlers and aggregators and how they affect web site traffic patterns twice in the past: first in Web 2.0 Tagging Study, Part 4: Nocturnal Activity Pattern, and then in Web 1.0 vs. Web 2.0 Web Activity Pattern.
To get a better understanding of Simpy's usage, I wanted to weed out the numerous crawlers and aggregators that constantly hit Simpy, crawl it, fetch its web pages and suck its feeds. As Simpy uses the Apache web server, this was quite simple to do, using Apache's mod_setenvif module. The following is a snippet from Simpy's web server configuration file:
CustomLog /log/apache/access_log combined env=!DoNotLog
CustomLog /log/apache/robot_log combined env=ROBOT

<IfModule mod_setenvif.c>
  ## you need to make all this be a single line
  BrowserMatchNoCase "Feed|Google Desktop|Bloglines|Teoma|NaverBot|msnbot|Googlebot|
  Slurp|crawler|Crawler|rojonet|NewsGator|MagpieRSS|PubSub-RSS|AppleSyndication|
  ZyBorg|ia_archiver|kinjabot|Yahoo-Blogs|BecomeBot|Blogslive|Blogpulse|SBIder|
  NetNewsWire|ping.blo.gs|RssBandit|Baiduspider" ROBOT=1 DoNotLog=1
</IfModule>
Clearly, the trick is to list all the user agents that you want to exclude, set an environment variable whenever one of them matches, and then use that variable to do conditional logging of web requests. You will notice that my configuration is a bit redundant: ROBOT and DoNotLog are always set together, so I could really accomplish the same with only one variable.
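For what it's worth, here is a minimal sketch of what the one-variable version might look like. It assumes the same log paths as above, and the user-agent pattern is shortened purely for illustration; it is not the configuration Simpy actually runs.

```apache
# Human traffic: log every request that is NOT flagged as a robot
CustomLog /log/apache/access_log combined env=!ROBOT
# Robot traffic: log only the requests that ARE flagged as a robot
CustomLog /log/apache/robot_log combined env=ROBOT

<IfModule mod_setenvif.c>
  # Illustrative, shortened pattern; keep the whole directive on a single line
  BrowserMatchNoCase "Feed|Bloglines|msnbot|Googlebot|Slurp|crawler" ROBOT=1
</IfModule>
```

Since the match is case-insensitive, a single "crawler" entry should cover "Crawler" as well. Keeping two variables only pays off if you ever want to exclude something from the main log without also sending it to the robot log.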
This is obviously not rocket science, but I find it useful to log human and non-human requests separately, and I thought I'd share this, in case anyone else wants to do the same.
Technorati Tags: simpy crawler robot spider aggregator apache httpd mod_setenvif
