See also Sisynala: http://mithrandr.moria.org/code/sisynala/ , written in Python.
I've been trying to find a good open source web log analysis tool. Here
are the tools I've found so far:
* AWStats_ -- written in Perl
* Analog_ -- written in C
* Webalizer_ -- written in C
.. _AWStats: http://awstats.sourceforge.net/
.. _Analog: http://www.statslab.cam.ac.uk/~sret1/analog/
.. _Webalizer: http://www.mrunix.net/webalizer/
IMO, all of them are very inflexible and completely tied to their web
interfaces (which suck for the most part). None of them provide a
programmer's interface, which is what I want.
Here are the details of what I'm looking for:
* must be open source, preferably under a Python-, BSD-, or LGPL-style
license rather than the GPL
(well, not everyone necessarily agrees with that: I consider the GPL
a good thing -- Main.IanBicking
I agree for applications, but I'm thinking about something that can be
used as a library without dictating the license terms of the
applications that use it. --TavisRudd_
Of course, the GPL only affects proprietary programs that are
distributed, so I don't see a big problem...? I wouldn't be happy with
people making my work proprietary when I gave it to them in good
faith, and that's exactly what the GPL prevents --IanBicking_)
* must provide a programmer's interface, not just a web interface
* must be completely portable
* must be easy to install and configure
* must work with Apache combined log format + IIS format
* Actually, IIS uses the W3C logfile format (http://www.w3.org/TR/WD-logfile.html), so W3C compliance would do. --ChristianPackmann_
* should be able to work with the Common Log Format
* should be possible to write new parser modules for new log formats
* fast parsing of the log-files
* preferably written in Python, or providing a Python interface
* provides reverse DNS lookup
* process log files split by load balancing mechanisms or log rotation
* reports:
* page views
* unique visits
* unique human visits
* referrers
* authenticated users
* robots (+ can filter them out)
* file mime types
* browsers
* OS
* HTTP errors
* 404 errors
* keywords/phrases from search engines
* entry pages
* exit pages
* domains/countries
* allows filtering by anything: IP address, mime type, domain, whatever
* can provide the following summaries:
* hourly, daily, weekly, monthly, yearly summary of all logged variables
* ranked (highest-to-lowest + vice versa) listings of all logged variables
* event-based summary periods -- e.g., you'd want to start a new
period after a round of search engine submissions; that period may
not line up with a weekly or monthly summary, but you'd still like
the data grouped in a larger unit than a day.
* can use cookie vars to trace and summarize traffic flows (e.g. userid cookies) [see mod_session]
Flexibility and extensibility matter more to me than raw performance.
Unless a package that meets these requirements exists, I'm proposing
that we start a project to build one.
-- TavisRudd_ - 03 Nov 2001
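To make the "programmer's interface" and "new parser modules" requirements a bit more concrete, here is a minimal sketch of what a pluggable parser could look like. It assumes the usual Apache combined layout (host, ident, user, time, request, status, size, referrer, user agent) and treats the last two fields as optional so that Common Log Format lines also match. The names ``parse_combined``, ``PARSERS`` and ``parse_lines`` are made up for illustration and do not come from any existing package; requests containing escaped quotes are not handled.

```python
import re
from datetime import datetime

# Assumed layout of one Apache log line. The trailing referrer and
# user-agent fields are optional, so plain Common Log Format matches too.
COMBINED_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def parse_combined(line):
    """Return a dict for one log line, or None if the line does not match."""
    m = COMBINED_RE.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    rec["status"] = int(rec["status"])
    rec["size"] = 0 if rec["size"] == "-" else int(rec["size"])
    return rec


# Hypothetical registry: supporting a new log format means adding one
# function here, without touching the code that consumes the records.
PARSERS = {"combined": parse_combined}


def parse_lines(lines, fmt="combined"):
    """Yield parsed records, silently skipping lines that do not parse."""
    parser = PARSERS[fmt]
    for line in lines:
        rec = parser(line)
        if rec is not None:
            yield rec
```

A W3C-format parser would be registered the same way, e.g. ``PARSERS["w3c"] = parse_w3c``, where ``parse_w3c`` is a function that would still have to be written.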
--------------------
*Implementation thoughts*
* the parsing and log reading classes should be completely separate from the rest of the classes
* a (configurable) manager that tracks (a) where parsing left off last time and (b) where all the candidate log files are. It should handle both a log file that has grown since the last parse and log file rotation.
* the raw log data should be parsed and then stored in a simple DBM style database (there may be faster formats, depending on how the data is queried -- usually I imagine it would be sequentially based on a start/stop time)
* all derived data should be calculated from the DBM store and in turn stored in a DBM format
* the dates should be handled using mx.DateTime
* all config settings should be managed using the SettingsManager API used in Cheetah
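As a rough illustration of the "where you left off" manager combined with the DBM idea, the sketch below remembers a byte offset per log file in a DBM file via the standard library's ``dbm`` module. Detecting rotation by comparing the inode and the file size is an assumption of this sketch, not something settled above, and ``st_ino`` may not be meaningful on every platform. ``read_new_lines`` and its parameters are hypothetical names.

```python
import dbm
import os


def read_new_lines(log_path, state_path):
    """Return the complete lines added to log_path since the previous call.

    The inode and byte offset reached last time are kept in a DBM file
    at state_path, keyed by the absolute path of the log.
    """
    st = os.stat(log_path)
    key = os.path.abspath(log_path).encode()
    with dbm.open(state_path, "c") as state:
        offset = 0
        saved = state.get(key)
        if saved is not None:
            inode, old_offset = (int(x) for x in saved.decode().split(":"))
            # Same file and it has not shrunk: resume where we stopped.
            # Otherwise assume rotation or truncation and start over.
            if inode == st.st_ino and old_offset <= st.st_size:
                offset = old_offset
        with open(log_path, "rb") as f:
            f.seek(offset)
            data = f.read()
        # Consume only complete lines; a partially written last line
        # is left in place for the next call.
        end = data.rfind(b"\n") + 1
        state[key] = b"%d:%d" % (st.st_ino, offset + end)
    return data[:end].decode("utf-8", "replace").splitlines()
```

Calling it repeatedly on a growing file returns only what was appended in between; after the log is rotated away and recreated, the new file is read from its beginning. Picking up the tail of the rotated-out file (e.g. ``access.log.1``) is left out here.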
Rough class layout:
* SettingsManager
* Parser & associated classes for getting raw data out of the logs
* Storage classes for getting data in and out of the internal data store
* Classes for calculating all the basic derived data (reverse DNS lookup, etc.)
* Classes for calculating the advanced derived data and summaries --
with some sort of caching mechanism for storing the results.
* Interface classes for putting it all together
-- TavisRudd_ - 03 Nov 2001
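For the summary classes, something along these lines might be enough to start with. This is only a sketch: it works on plain dicts with a ``time`` key holding a date/time object, uses the standard library's ``datetime`` in the example rather than mx.DateTime, and the function names (``summarize``, ``by_period``, ``ranked``) are invented here. It covers the "filter by anything", periodic summary and ranked listing requirements, but not caching or event-based periods.

```python
from collections import Counter
from datetime import datetime


def summarize(records, field, keep=lambda rec: True):
    """Count the values of `field` over the records accepted by `keep`."""
    return Counter(rec[field] for rec in records if keep(rec))


def by_period(records, fmt="%Y-%m-%d %H:00"):
    """Count hits per period; `fmt` is a strftime pattern naming the bucket.

    "%Y-%m-%d %H:00" gives hourly buckets, "%Y-%m-%d" daily,
    "%Y-%m" monthly and "%Y" yearly.
    """
    return Counter(rec["time"].strftime(fmt) for rec in records)


def ranked(counter, descending=True):
    """Return (value, count) pairs ordered by count."""
    return sorted(counter.items(), key=lambda kv: kv[1], reverse=descending)


if __name__ == "__main__":
    hits = [
        {"time": datetime(2001, 11, 3, 10, 5), "status": 200, "url": "/a"},
        {"time": datetime(2001, 11, 3, 10, 40), "status": 404, "url": "/b"},
        {"time": datetime(2001, 11, 3, 11, 1), "status": 200, "url": "/a"},
    ]
    print(ranked(summarize(hits, "url")))
    print(summarize(hits, "url", keep=lambda r: r["status"] == 404))
    print(by_period(hits, "%Y-%m-%d"))
```

Because the filter is just a callable, filtering by IP address, MIME type or domain needs no extra code in the summary layer, as long as the record carries that field.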