See also Sisynala: http://mithrandr.moria.org/code/sisynala/ , written in Python.
I've been trying to find a good OpenSource web log analysis tool. Here
are tools I found so far:
IMO, all of them are very inflexible and completely tied to their web
interfaces (which suck for the most part). None of them provides a
programmer's interface, which is what I want.
Here are the details of what I'm looking for:
- must be OpenSource, preferably with a Python / BSD or LGPL style
license rather than GPL
(Well, not everyone necessarily agrees with that: I consider the GPL
a good thing. -- Main.IanBicking
I agree for applications, but I'm thinking about something that can be
used as a library without dictating the license terms of the
applications that use it. -- TavisRudd
Of course, the GPL only affects proprietary programs that are
distributed, so I don't see a big problem...? I wouldn't be happy with
people making my work proprietary when I gave it to them in good
faith, and that's exactly what the GPL prevents. -- IanBicking)
- must provide a programmer's interface, not just a web interface
- must be completely portable
- must be easy to install and configure
- must work with Apache combined log format + IIS format
- should be able to work with the Common Log Format
- should be possible to write new parser modules for new log formats
- fast parsing of the log-files
- preferably written in Python, or providing a Python interface
- provides reverse DNS lookup
- process log files split by load balancing mechanisms or log rotation
- reports:
  - page views
  - unique visits
  - unique human visits
  - referrers
  - authenticated users
  - robots (+ can filter them out)
  - file MIME types
  - browsers
  - OS
  - HTTP errors
  - 404 errors
  - keywords/phrases from search engines
  - entry pages
  - exit pages
  - domains/countries
- allows filtering by anything: IP address, mime type, domain, whatever
- can provide the following summaries:
  - hourly, daily, weekly, monthly, yearly summary of all logged variables
  - ranked (highest-to-lowest + vice versa) listings of all logged variables
  - event-based summary periods -- e.g., you'd want to reset the logs
    after doing a lot of search engine submissions, which may not line
    up with a weekly or monthly summary, but which you'd like grouped
    together in a unit larger than a day.
- can use cookie vars to trace and summarize traffic flows (i.e. userid cookies) [see mod_session]
Flexibility and extensibility matter more to me than raw performance.
Unless a package that meets these requirements exists, I'm proposing
that we start a project to build one.
-- TavisRudd - 03 Nov 2001
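To make the "programmer's interface" and "new parser modules" requirements above a bit more concrete, here is a minimal sketch in Python. It is only an illustration, not an existing package: the names `PARSERS`, `register_parser`, `parse_combined` and `parse_lines` are hypothetical, and the regular expression assumes the usual Apache combined layout (host, ident, user, time, request, status, size, referrer, user agent) without escaped quotes inside the quoted fields. Because the last two fields are optional in the pattern, Common Log Format lines parse too.

```python
import re

# Registry of log-format parsers, keyed by a format name (hypothetical design).
PARSERS = {}


def register_parser(name):
    """Decorator that adds a line-parsing function to the registry."""
    def decorator(func):
        PARSERS[name] = func
        return func
    return decorator


# host ident user [time] "request" status size ["referrer" "user-agent"]
COMBINED_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)


@register_parser("combined")
def parse_combined(line):
    """Return a dict of fields, or None if the line does not match."""
    match = COMBINED_RE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # Apache writes "-" when no body bytes were sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


def parse_lines(lines, fmt="combined"):
    """Yield one record per parseable line, using the named parser."""
    parser = PARSERS[fmt]
    for line in lines:
        record = parser(line)
        if record is not None:
            yield record
```

A parser for another format (IIS, say) would be one more function decorated with `register_parser`, and callers would just iterate over `parse_lines(open(path), fmt=...)`.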
Implementation thoughts
- the parsing and log reading classes should be completely separate from the rest of the classes
- some (configurable) manager to see (a) where you left off parsing last time and (b) where all possible logfiles are. This should deal both with a logfile growing since the last parsing, and with logfile rotation.
- the raw log data should be parsed and then stored in a simple DBM style database (there may be faster formats, depending on how the data is queried -- usually I imagine it would be sequentially based on a start/stop time)
- all derived data should be calculated from the DBM store and in turn stored in a DBM format
- the dates should be handled using mx.DateTime
- all config settings should be managed using the SettingsManager API used in Cheetah
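The "where you left off" manager and the DBM store could be sketched roughly like this. This is only one possible shape, under a few assumptions: it uses the standard-library `dbm` and `json` modules, the function name `ingest` and the key names `_offset`, `_count` and `rec:NNNNNNNNNN` are made up for the example, and `parse` is any function that turns a line into a dict (or None). Rotation is detected only by the file having shrunk, which will miss a rotated file that has already grown past the old offset.

```python
import dbm
import json
import os


def ingest(log_path, store_path, parse):
    """Parse lines added to log_path since the last run and store them.

    Returns the total number of records in the store.
    """
    with dbm.open(store_path, "c") as db:
        offset = int(db.get(b"_offset", b"0"))
        count = int(db.get(b"_count", b"0"))
        # A file smaller than the saved offset was probably rotated: start over.
        if os.path.getsize(log_path) < offset:
            offset = 0
        with open(log_path, "rb") as log:
            log.seek(offset)
            for raw in log:
                if not raw.endswith(b"\n"):
                    break  # incomplete last line; pick it up on the next run
                offset += len(raw)
                record = parse(raw.decode("utf-8", "replace").rstrip("\r\n"))
                if record is not None:
                    # Zero-padded sequence numbers keep keys in arrival order.
                    db[b"rec:%010d" % count] = json.dumps(record).encode("utf-8")
                    count += 1
        db[b"_offset"] = str(offset).encode("ascii")
        db[b"_count"] = str(count).encode("ascii")
    return count
```

Running `ingest` repeatedly (from cron, for instance) only reads the new tail of the log each time. Handling several logfiles would mean keeping one offset per file rather than the single `_offset` key used here.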
Rough Class layout:
- SettingsManager
- Parser & associated classes for getting raw data out of the logs
- Storage classes for getting data in and out of the internal data store
- Classes for calculating all the basic derived data (reverse dns lookup, etc.)
- Classes for calculating the advanced derived data and summaries --
with some sort of caching mechanism for storing the results.
- Interface classes for putting it all together
-- TavisRudd - 03 Nov 2001
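As a rough idea of what the summary classes might compute, here is a small sketch of ranked listings and per-period hit counts. It assumes records are dicts with a `time` field in Apache's `10/Oct/2000:13:55:36 -0700` style, and it uses the standard-library `datetime` module in place of the mx.DateTime package mentioned above. The function names `ranked` and `hits_per_period` are hypothetical, caching is left out, and the `%b` month parsing depends on an English/C locale.

```python
from collections import Counter
from datetime import datetime

# Timestamp layout used in Apache access logs (assumed for this sketch).
APACHE_TIME = "%d/%b/%Y:%H:%M:%S %z"


def ranked(records, field, descending=True):
    """Return (value, count) pairs for one field, ordered by count."""
    counts = Counter(r[field] for r in records if r.get(field) is not None)
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=descending)


def hits_per_period(records, fmt="%Y-%m-%d %H:00"):
    """Count records per period; fmt is a strftime pattern naming the bucket."""
    buckets = Counter()
    for r in records:
        when = datetime.strptime(r["time"], APACHE_TIME)
        buckets[when.strftime(fmt)] += 1
    return dict(sorted(buckets.items()))
```

The default pattern gives hourly buckets; `"%Y-%m-%d"` gives daily ones and `"%Y-%m"` monthly ones. Filtering "by anything" can be done by passing in a filtered iterable, for example `ranked((r for r in records if r["status"] == 404), "request")`.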