See also Sisynala: http://mithrandr.moria.org/code/sisynala/ , written in Python.

I've been trying to find a good OpenSource web log analysis tool. Here are tools I found so far:

* AWStats_ -- written in Perl
* Analog_ -- written in C
* Webalizer_ -- written in C

.. _AWStats: http://awstats.sourceforge.net/
.. _Analog: http://www.statslab.cam.ac.uk/~sret1/analog/
.. _Webalizer: http://www.mrunix.net/webalizer/

IMO, all of them are very inflexible and completely tied to their web interfaces (which suck for the most part). None of them provide a programmer's interface, which is what I want.

Here are the details of what I'm looking for:

* must be OpenSource, preferably with a Python / BSD or LGPL style license rather than GPL
  (well, not everyone necessarily agrees with that: I consider the GPL a good thing -- Main.IanBicking

  I agree for applications, but I'm thinking about something that can be used as a library without dictating the license terms of the applications that use it. --TavisRudd_

  Of course, the GPL only affects proprietary programs that are distributed, so I don't see a big problem...? I wouldn't be happy with people making my work proprietary when I gave it to them in good faith, and that's exactly what the GPL prevents --IanBicking_)
* must provide a programmer's interface, not just a web interface
* must be completely portable
* must be easy to install and configure
* must work with Apache combined log format + IIS format

  * Actually, IIS uses the W3C logfile format (http://www.w3.org/TR/WD-logfile.html), so W3C compliance would do. --ChristianPackmann_
* should be able to work with the Common Log Format
* should be possible to write new parser modules for new log formats
* fast parsing of the log files
* preferably written in Python, or providing a Python interface
* provides reverse DNS lookup
* can process log files split by load balancing mechanisms or log rotation
* reports:

  * page views
  * unique visits
  * unique human visits
  * referrers
  * authenticated users
  * robots (+ can filter them out)
  * file MIME types
  * browsers
  * OS
  * HTTP errors
  * 404 errors
  * keywords/phrases from search engines
  * entry pages
  * exit pages
  * domains/countries
* allows filtering by anything: IP address, MIME type, domain, whatever
* can provide the following summaries:

  * hourly, daily, weekly, monthly, yearly summary of all logged variables
  * ranked (highest-to-lowest + vice versa) listings of all logged variables
  * event-based summary periods -- e.g., you'd want to reset logs after doing a lot of search engine submittal, which may not be aligned with a weekly or monthly summary, but which you'd like grouped together in a larger unit than daily.
* can use cookie vars to trace and summarize traffic flows (i.e.
  userid cookies) [see mod_session]

Flexibility and extensibility matter more to me than raw performance. Unless a package that meets these requirements exists, I'm proposing that we start a project to build one.

-- TavisRudd_ - 03 Nov 2001

--------------------

*Implementation thoughts*

* the parsing and log reading classes should be completely separate from the rest of the classes
* some (configurable) manager to track (a) where you left off parsing last time and (b) where all possible logfiles are. This should deal both with a logfile growing since the last parsing, and with logfile rotation.
* the raw log data should be parsed and then stored in a simple DBM style database (there may be faster formats, depending on how the data is queried -- usually I imagine it would be sequentially based on a start/stop time)
* all derived data should be calculated from the DBM store and in turn stored in a DBM format
* the dates should be handled using mx.DateTime
* all config settings should be managed using the SettingsManager API used in Cheetah

Rough Class layout:

* SettingsManager
* Parser & associated classes for getting raw data out of the logs
* Storage classes for getting data in and out of the internal data store
* Classes for calculating all the basic derived data (reverse DNS lookup, etc.)
* Classes for calculating the advanced derived data and summaries -- with some sort of caching mechanism for storing the results.
* Interface classes for putting it all together

-- TavisRudd_ - 03 Nov 2001
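
To make the first three implementation points a bit more concrete, here is a minimal sketch in Python: a parser for the Apache combined log format that knows nothing about storage, plus a small ingest step that remembers where it left off and writes the parsed records into a DBM file. Everything named here (``parse_line``, ``ingest``, the ``__offset__`` bookkeeping key, the JSON encoding of records) is made up for illustration and is not an agreed design. It uses only the standard library (``re``, ``dbm``, ``json``, ``datetime``) instead of mx.DateTime or the SettingsManager API, and it only handles a logfile that grows, not one that gets rotated.

```python
import dbm
import json
import re
from datetime import datetime

# Apache combined log format:
# host ident user [time] "request" status bytes "referer" "user-agent"
COMBINED_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Parse one combined-format line into a dict, or return None."""
    match = COMBINED_RE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Timestamps look like "03/Nov/2001:13:55:36 -0700"; store them as ISO 8601.
    stamp = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["time"] = stamp.isoformat()
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


def ingest(log_path, db_path):
    """Store the lines added to log_path since the last call; return the count.

    Records are keyed by their zero-padded byte offset in the logfile, and the
    resume position is kept in the same DBM file under the (hypothetical)
    key "__offset__".
    """
    stored = 0
    with dbm.open(db_path, "c") as db:
        offset = int(db.get(b"__offset__", b"0"))
        with open(log_path, "rb") as log:
            log.seek(offset)
            while True:
                start = log.tell()
                raw = log.readline()
                if not raw.endswith(b"\n"):
                    # End of file, or a line still being written: stop here
                    # and pick it up on the next run.
                    break
                record = parse_line(raw.decode("utf-8", "replace"))
                if record is not None:
                    db[b"%020d" % start] = json.dumps(record).encode("utf-8")
                    stored += 1
                offset = log.tell()
        db[b"__offset__"] = str(offset).encode("ascii")
    return stored
```

Run periodically, something like ``ingest("access.log", "weblog-store")`` (both paths are placeholders) would pick up only the lines added since the previous run. Keeping ``parse_line`` free of any storage code is what would let a W3C or Common Log Format parser be swapped in later without touching ``ingest``.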