Tuesday, 6 December 2005

Crawlers and Aggregators and Apache Logging Tricks


Short and geekly sweet this post will be. I've talked about crawlers and aggregators and how they affect web site traffic patterns twice in the past: first in Web 2.0 Tagging Study, Part 4: Nocturnal Activity Pattern, and then in Web 1.0 vs. Web 2.0 Web Activity Pattern.

To get a better understanding of Simpy's usage, I wanted to weed out the numerous crawlers and aggregators that constantly hit Simpy, crawl its pages, and suck down its feeds. Since Simpy runs on the Apache web server, this was quite simple to do with Apache's mod_setenvif module. The following is a snippet from Simpy's web server configuration file:

CustomLog /log/apache/access_log combined env=!DoNotLog
CustomLog /log/apache/robot_log combined env=ROBOT

<IfModule mod_setenvif.c>
  ## the pattern must stay on a single line; matching is case-insensitive
  BrowserMatchNoCase "Feed|Google Desktop|Bloglines|Teoma|NaverBot|msnbot|Googlebot|Slurp|crawler|rojonet|NewsGator|MagpieRSS|PubSub-RSS|AppleSyndication|ZyBorg|ia_archiver|kinjabot|Yahoo-Blogs|BecomeBot|Blogslive|Blogpulse|SBIder|NetNewsWire|ping.blo.gs|RssBandit|Baiduspider" ROBOT=1 DoNotLog=1
</IfModule>

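The trick is to list all the user agents you want to exclude, set an environment variable when one of them matches, and then use that variable to log requests conditionally: access_log receives only requests where DoNotLog is not set, while robot_log receives only requests where ROBOT is set. You will notice that my configuration is a bit redundant; since both variables are always set together, I could accomplish the same thing with a single variable.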
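
For the curious, a single-variable version might look roughly like the sketch below. It assumes the same log paths as above and uses a shortened user-agent pattern purely for illustration; this is not what Simpy actually runs.

```apache
<IfModule mod_setenvif.c>
  # Abbreviated pattern for illustration only; use your full user-agent list here.
  # With no explicit value given, mod_setenvif sets ROBOT to "1".
  BrowserMatchNoCase "Googlebot|msnbot|Slurp|Bloglines|crawler" ROBOT
</IfModule>

# Requests without ROBOT set land in the main log...
CustomLog /log/apache/access_log combined env=!ROBOT
# ...and requests with ROBOT set land in the robot log.
CustomLog /log/apache/robot_log combined env=ROBOT
```

Keeping two variables does have one possible use: it would let you log some agents to robot_log while still keeping them in access_log, simply by setting ROBOT without DoNotLog for those agents.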

This is obviously not rocket science, but I find it useful to log human and non-human requests separately, and I thought I'd share this, in case anyone else wants to do the same.

Posted by otis at 2:07 AM in Tips & Tricks

