Tuesday, 6 December 2005
Crawlers and Aggregators and Apache Logging Tricks 
Short and geekly sweet this post will be. I've talked about crawlers and aggregators and how they affect web site traffic patterns twice in the past: first in Web 2.0 Tagging Study, Part 4: Nocturnal Activity Pattern, and then in Web 1.0 vs. Web 2.0 Web Activity Pattern.
To get a better understanding of Simpy's usage, I wanted to weed out the numerous crawlers and aggregators that constantly hit Simpy, crawling it, fetching its web pages, and sucking down its feeds. Since Simpy runs on the Apache web server, this was quite simple to do with the mod_setenvif Apache module. The following is a snippet from Simpy's web server configuration file:
CustomLog /log/apache/access_log combined env=!DoNotLog
CustomLog /log/apache/robot_log combined env=ROBOT
<IfModule mod_setenvif.c>
  ## the BrowserMatchNoCase directive below must stay on a single line
  BrowserMatchNoCase "Feed|Google Desktop|Bloglines|Teoma|NaverBot|msnbot|Googlebot|Slurp|crawler|Crawler|rojonet|NewsGator|MagpieRSS|PubSub-RSS|AppleSyndication|ZyBorg|ia_archiver|kinjabot|Yahoo-Blogs|BecomeBot|Blogslive|Blogpulse|SBIder|NetNewsWire|ping.blo.gs|RssBandit|Baiduspider" ROBOT=1 DoNotLog=1
</IfModule>
The trick is to list the user agents you want to exclude, have a match set an environment variable, and then use that variable to log web requests conditionally. You will notice that my configuration is a bit redundant: ROBOT and DoNotLog are always set together, so a single variable would accomplish the same thing.
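For illustration, here is a rough sketch of what a single-variable version might look like. It is not taken from Simpy's actual configuration: the user-agent list is shortened for readability, and the log paths are simply carried over from the snippet above.

```apache
<IfModule mod_setenvif.c>
  # Shortened, illustrative list; substitute the full set of user agents.
  # A case-insensitive match on the User-Agent header sets ROBOT=1.
  BrowserMatchNoCase "Googlebot|Slurp|msnbot|Bloglines|crawler" ROBOT=1
</IfModule>

# Requests where ROBOT is not set (presumably humans) go to the main log.
CustomLog /log/apache/access_log combined env=!ROBOT

# Requests where ROBOT is set go to the robot log.
CustomLog /log/apache/robot_log combined env=ROBOT
```

With this arrangement the same variable drives both logs: env=!ROBOT excludes the matched requests from one, and env=ROBOT selects them for the other. If mod_setenvif is not loaded, ROBOT is never set and every request should land in the main log.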
This is obviously not rocket science, but I find it useful to log human and non-human requests separately, and I thought I'd share this, in case anyone else wants to do the same.
Technorati Tags: simpy crawler robot spider aggregator apache httpd mod_setenvif
