Skip to content

Stone Steps Webalizer is a fast command line application for web server and web proxy log file analysis. It supports multiple log formats and produces highly customizable HTML reports in many languages.

License

Notifications You must be signed in to change notification settings

StoneStepsInc/StoneStepsWebalizer

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation


Stone Steps Webalizer (v6.4.0)

Copyright (c) 2004-2022, Stone Steps Inc. (www.stonesteps.ca)

The version of The Webalizer provided with this distribution is a fork based on the version 2.01-10 of the original Webalizer:

The Webalizer - A web server log file analysis tool Copyright 1997-2000 by Bradford L. Barrett (brad@mrunix.net)

Distributed under the GNU GPL. See the files "COPYING" and "Copyright" supplied with the distribution for additional info.


What is The Webalizer?

The Webalizer is a web server log file analysis program which produces usage statistics in HTML format for viewing with a browser. The results are presented in both columnar and graphical format, which facilitates interpretation. Yearly, monthly, daily and hourly usage statistics are presented, along with the ability to display usage by host, URL, referrer, user agent (browser), search string, entry/exit page, username and country (some information is only available if supported and present in the log files being processed). Processed data may also be exported into most database and spreadsheet programs that support tab delimited data formats.

The Webalizer supports W3C, IIS, Nginx, Apache, Squid, Common Log Format, as well as Combined Log Format. The latter two are referred to as CLF in this document and the difference between them is handled trasparently when log files are being processed.

Gzip compressed logs may be used as input directly. Any log filename that ends with a .gz extension will be assumed to be in gzip format and uncompressed on the fly as it is being read.

In addition, the Webalizer also supports DNS and GeoIP lookup capabilities.

Installing the Webalizer

Windows

Windows pre-built package contains all run-time dependencies and can be used as-is. Extract package contents to any directory, such as c:\tools\webalizer\, and run it from there.

If you intend to use just one configuration file, the installation directory is probably the best place for it. Otherwise, you can use the -c option to specify any configuration file. See Configuration Files for details on how configuration files are processed.

Linux

Stone Steps Webalizer depends on the following packages:

  • GD Library v2 or newer
  • Berkeley DB v4.3 or newer
  • ZLIB v1 or newer
  • MaxMindDB v1.2 or newer

You can see development package names for some common Linux flavors in devops/Docker.* files in the source repository to figure out binary packages.

Extract contents of a pre-built Linux package to any directory and run the sudo ./install script from there.

The script makes use of the following set of directories.

  • /usr/local/bin/webalizer-stonesteps

    Contains the executable webalizer. If there is no existing file /usr/local/bin/webalizer, a symbolic link with this name is created to point to the executable in this directory.

  • /usr/local/share/webalizer-stonesteps/www

    Contains shared CSS and JavaScript source files that should be copied from this location to a directory where they can be referenced by HTML reports.

    Set HTMLCssPath and HTMLJsPath to point to the corresponding source files in that final target directory that is accessible from the web site serveing HTML reports generated by Stone Steps Webalizer. These source files may be shared by multiple HTML reports.

  • /usr/local/share/webalizer-stonesteps/lang

    Contains language localization files. Set LanguageFile to point to a language file of your choice.

  • /usr/local/share/webalizer-stonesteps/maxmind

    This directory is created as a location for MaxMind GeoIP and ASN databases, which you need to download on your own from locations described further in this file under their configuration values.

    Set GeoIPDBPath and ASNDBPath to point to the corresponding MaxMind database files in this location.

  • /usr/local/share/doc/webalizer-stonesteps/

    Contains documentation files, licenses and change logs.

  • /var/local/lib/webalizer-stonesteps/

    Set DbPath to point to this directory, which will contain current and historic state database files, such as webalizer.db and webalizer_202102.db. These state files may be used to generate HTML reports with the information gathered when they were created.

If you intend to use just one configuration file, /etc is the best place for it. Otherwise, you can use the -c option to specify any configuration file. See Configuration Files for details on how configuration files are processed.

Run sudo ./uninstall from the directory where this script is located to remove all of the directories above. Note that existing state databases and MaxMind database files will not be deleted and if there are any, you will need to delete them manually.

Building from Source

Building from the source on Windows requires Visual Studio 2019 or newer. All source dependencies are configured as Nuget packages and should be pulled automatically during a build. Compiled binaries are generated in the build directory.

Building from the source on Linux requires development packages for all dependencies. See devops/Docker.* files for development package names for some common Linux flavors.

Once all dependencies are installed, change to the source directory and run make.

Compiled binaries are generated in the build directory, if you want to copy them into directories of your choice.

You can also install all files into directories described in the installation section above by running sudo make install. A short reference for installation directories and configuration variables is available via make install-info.

Run sudo make uninstall to uninstall.

Running the Webalizer

The Webalizer was designed to be run from a Linux or Windows command line prompt or as a cron job. There are several command line options which will modify the results it produces, and configuration files can be used as well. The format of the command line is:

webalizer [options ...] [logfile [[ logfile]...] | report-database]

Where options can be one or more of the supported command line switches described below.

logfile is the name of the log file to process. Log file names are collected from these distinct sources:

  • LogFile variables in the default configuration file and its includes.
  • Command line, LogFile variables in configuration files specified with the -c option and Include directives.
  • If the --pipe-log-names option was used, log file names are read from the standard input.

If a log file name is found in any of the sources above, the previous set of log file names is cleared, which prevents the possibility of the same log file name accepted for processing more than once. For example, if log files A and B are specified in webalizer.conf, and log files C and D are specified on the command line, then only C and D will be processed.

Multiple log file collected from the same source will be processed in their log record time stamp order. This feature is intended for processing load-balanced log files from multiple web servers behind the same site, but can also be used to process log files from the same web server, in which case newer logs will just be processed after older logs.

If a dash (-) is specified for the log-file name, STDIN will be used.

report-database is the name of the database, not including file path or file extension, that will be used to generate a report.

Once executed, the general flow of the program follows:

  • The default configuration file is searched for in these locations:

    • current directory
    • system configuration directory (/etc on Linux or c:\windows on Windows)
    • the directory where webalizer or webalizer.exe is located

    If found, the default configuration file will be processed regardless of whether one or more -c options are used on the command line.

  • Any command line arguments given to the program are parsed. This may include the specification of a configuration file (-c), which is processed at the time it is encountered.

  • If any of the configuration files contains Include directives, all included configuration files are processed as well.

  • All configuration files are processed in the order they are specified on the command line or via Include directories. Single-value variables overwrite those seen erarlier, but those that take multiple values, such as IgnoreURL or SearchEngine, will all be collected in the order they are encountered.

  • If --prepare-report was specified on the command line, the last argument will be interpreted as a database file name. In this case a report will be generated from the data stored in the database and no log files will be processed.

  • If a log file was specified, it is opened and made ready for processing. If no log file was given, or the filename - is specified on the command line, STDIN is used for input.

  • If an output directory was specified, the program will generate output in this directory. If no output directory was given, the current directory is used.

  • If a non-zero number of DNSChildren processes were specified, the corresponding number of DNS worker threads will be started, and IP addresses in the specified log file will be either resolved to host names or looked up in the GeoIP database, or both.

  • If no hostname was given, the program attempts to get the hostname using a uname system call. If that fails, localhost is used.

  • A history file is searched for. This file keeps previous month totals used on the main index.html page. The default file is named webalizer.hist, kept in the specified output directory, however may be changed using the HistoryName configuration file keyword.

  • A database file containing the internal state data of the program at the end of a previous run is searched for and opened, if found. The default database file is webalizer.db and is kept in the directory specified using the DbPath configuration variable.

  • Main processing begins on the log file. If the log spans multiple months, a separate HTML document is created for each month.

  • After main processing, the main index.html page is created, which has totals by month and links to each months HTML document.

  • A new history file is saved to disk, which includes totals generated by The Webalizer during the current run.

  • The database file is updated to contain the internal state data at the end of this run.

Incremental Processing

Incremental processing preserves current log items and numbers that you see in all reports, such as URLs, IP addresses, request counts, transfer amounts, etc., in a database file called webalizer.db and adds new items to this database or updates numbers for existing items when new log files are processed.

Originally, incremental processing was an optional mode that made it possible for website administrators to process multiple rollover log files created by the web server in the course of one month. The non-incremental mode remained the default mode and was intended for large log files that contained one or more months worth of data. With the rapid increase of web traffic over the years, large log files became hard to manage and most website administrators choose to rotate their logs either by size or by date. Given this trend, the incremental mode was changed to be the default processing mode in Stone Steps Webalizer v4.2.1.

IMPORTANT: Stone Steps Webalizer uses the time stamp in each log record to track log records that have been already processed and if your logs are rotated based on the log size, the time stamp in the first few records of each new log may fall into the same second as last records of the previous log and will be ignored when that log will be processed. Configure your log rotation based on the date instead to avoid this, so each new log has distinct log record time stamps.

Some special precautions need to be taken when using the incremental run capability of The Webalizer. Configuration options should not be changed between runs, as that could cause corruption of the internal stored data.

For example, changing the MangleAgents level will cause different representations of user agents to be stored, producing invalid results in the user agents section of the report. If you need to change configuration options, do it at the end of the month after normal processing of the previous month and before processing the current month. You may also want to delete the database file as well.

The Webalizer also attempts to prevent data duplication by keeping track of the timestamp of the last record processed. This timestamp is then compared to current records being processed, and any records that were logged previous to that timestamp are ignored. This, in theory, should allow you to re-process logs that have already been processed, or process logs that contain a mix of processed/not yet processed records, and not produce duplication of statistics. The only time this may break is if you have duplicate timestamps in two separate log files... any records in the second log file that do have the same timestamp as the last record in the previous log file processed, will be discarded as if they had already been processed. This setup also necessitates that you always process logs in chronological order, otherwise data loss will occur as a result of the timestamp compare.

Output Produced

The Webalizer produces several reports (html) and graphics for each month processed. In addition, a summary page is generated for the current and previous months, a history file is created and the current month's processed data.

The exact location and names of these files can be changed using configuration files and command line options. The files produced, (default names) are:

name description
index.html Main summary page (extension may be changed)
usage.png Yearly graph displayed on the main index page
usage_YYYYMM.html Monthly summary page (extension may be changed)
usage_YYYYMM.png Monthly usage graph for specified month/year
daily_usage_YYYYMM.png Daily usage graph for specified month/year
hourly_usage_YYYYMM.png Hourly usage graph for specified month/year
site_YYYYMM.html All hosts listing (if enabled)
url_YYYYMM.html All urls listing (if enabled)
ref_YYYYMM.html All referrers listing (if enabled)
agent_YYYYMM.html All user agents listing (if enabled)
search_YYYYMM.html All search strings listing (if enabled)
site_YYYYMM.tab tab delimited hosts file
url_YYYYMM.tab tab delimited urls file
ref_YYYYMM.tab tab delimited referrers file
agent_YYYYMM.tab tab delimited user agents file
user_YYYYMM.tab tab delimited usernames file
search_YYYYMM.tab tab delimited search string file
agent_YYYYMM.json JSON array of user agents
asn_YYYYMM.json JSON array of Autonomous System Numbers (ASN)
city_YYYYMM.json JSON array of Cities
country_YYYYMM.json JSON array of Countries
daily_YYYYMM.json JSON array of days
dl_YYYYMM.json JSON array of downloads
err_YYYYMM.json JSON array of errors
host_YYYYMM.json JSON array of hosts
hourly_YYYYMM.json JSON array of hours
ref_YYYYMM.json JSON array of referrers
search_YYYYMM.json JSON array of searches
url_YYYYMM.json JSON array of URLs
usage_YYYYMM.json JSON object for the named month
user_YYYYMM.json JSON array of users
webalizer.hist Previous month history (may be changed)
webalizer.db Incremental Data (may be changed)
webalizer_YYYYMM.db Incremental Data (may be changed)

The yearly (index) report shows statistics for a number of months specified by the HistoryLength configuration parameter and links to each month. The monthly report has detailed statistics for that month with additional links to any URL's and referrers found. The various totals shown are explained below.

Hits

Any request made to the server which is logged, is considered a hit. The requests can be for anything... html pages, graphic images, audio files, CGI scripts, etc... Each valid line in the server log is counted as a hit. This number represents the total number of requests that were made to the server during the specified report period. A request does not have to be successful to be counted as a hit.

For example, if a non-existing file is requested, the web server will respond with a 404 (Not Found) error and the log file will contain an entry for this request, which will be counted as one hit.

Files

Successful requests served by the server are counted as files. A file can be an html page or a dynamically processed page, such as a PHP or ASP page or an image. File total in the reports is a subset of of hits total.

Pages

Pages are, well, pages! Generally, any HTML document, or anything that generates an HTML document, would be considered a page. This does not include the other stuff that goes into a document, such as graphic images, audio clips, etc... This number represents the number of pages requested only, and does not include the other stuff that is in the page. What actually constitutes a page can vary from server to server. The default action is to treat anything with the extension .htm, .html or .cgi as a page. A lot of sites will probably define other extensions, such as .phtml, .php3 and .pl as pages as well. Some other programs (and people :) refer to this as Pageviews.

For example, if a request for an HTML document is made that contains two links to images and one of these images is missing, Stone Steps Webalizer will count three hits (one for the HTML document and two for linked image files), two files (one for the HTML document and one for existing image) and one page (just the HTML document).

Hosts

Each request made to the server comes from a unique host, which can be referenced by a name or, ultimately, an IP address. The hosts number shows how many unique IP addresses made requests to the server during the reporting time period. This DOES NOT mean the number of unique individual users (real people) that visited, which is impossible to determine using just logs and the HTTP protocol (however, this number might be about as close as you will get).

Visits

Whenever a request is made to the server from a given IP address (host), the amount of time since a previous request by the address is calculated. If the time difference is greater than a pre-configured visit timeout value (or has never made a request before), it is considered a new visit, and this total is incremented for the host. The default timeout value is 30 minutes (can be changed), so if a user visits your site at 1:00 in the afternoon, and then returns at 3:00, two visits would be registered.

Note: in the Top Hosts table, the visits total is a sum of all visits of the grouped hosts.

Note: A visit is started when any request is made to the server, whether it was successful or not. Consider the following time diagram, where each p represents a successful page request and each o represents any other type of request (e.g. a failed page request, a file request, etc).

        ~~~~ visit ~~~~                  ~~ visit ~~
       v               v                v           v
 ------o---p------o-o--o----------------o---o-o-----o---------->
       ^~~~^           ^~~~~~~~~~~~~~~~~^                  time
       5 min                 40 min

Stone Steps Webalizer will count in this case two visits, even though the second visit did not request any pages.

Transfer

The Transfer value shows the amount of data that was sent out by the server during the specified reporting period. This value is generated directly from the log file, so it is up to the web server to produce accurate numbers in the logs. In general, this should be a fairly accurate representation of the amount of traffic the server had, regardless of the web servers reporting quirks.

By default, most servers only log outgoing amounts (i.e. response sizes). IIS, Nginx and Apache may log incoming amounts as well (i.e. request sizes). Stone Steps Webalizer will include this type of traffic into the amount reported as Transfer if UpstreamTraffic is set to yes in the configuration file.

Transfer amounts are reported since v4.2 with a unit suffix, such as 12.3 GB, or as a number kilobytes. This behavior may be changed by setting ClassicKBytes to yes. One kilobyte is counted as either 1024 or 1000 bytes, depending on the value of the DecimalKBytes configuration variable.

Top Entry and Exit Pages

The Top Entry and Exit tables give a rough estimate of what URL's are used to enter your site, and what the last pages viewed are. Because of limitations in the HTTP protocol, log rotations, etc... this number should be considered a good "rough guess" of the actual numbers, however will give a good indication of the overall trend in where users come into, and exit, your site.

Sometimes web servers log linked content before logging the page containing the links. Stone Steps Webalizer will track for each host whether a page request has been made or not during the current visit and will report the first page URL, if any, as an entry URL. For example, if an HTML page contained two linked image files, these files may be logged before the page itself. Nevertheless, the page URL will be reported as the entry page.

JSON Output

JSON output is similar to TSV output in that it is intended for importing data into a database, except that JSON output is formatted using the Mongo DB flavor of JSON in order to deal with the lack of support for 64-bit integers in native JSON and to accommodate repeated imports within the same database.

Extended JSON

JavaScript natively supports double-precision floating point numbers, which limits integer values in JavaScript and, consequently, in JSON to 52 bits allocated for the significand. Mongo DB defines Extended JSON syntax, which introduces special syntax for types that cannot be expressed natively in JSON, including 64-bit numbers, which look like this in JSON output:

{
  "hits": { "$numberLong": "12345678" }
}

When imported into Mongo DB, data fields formatted this way will have the type Int64.

The _id Field

Each log record item is represented in JSON as an object and each of those objects is identified via the _id field, which Mongo DB recognizes as a special unique document identifier. This identifier is structured in JSON output to keep all data items within the year and month of their log files.

For example, a host with an IP address 12.34.56.78 may be tracked within the monthly state database, webalizer.db, for April under a sequential identifier 321. The same IP address may be tracked in the state database for May under a sequential identifier 654. When this host is exported as a JSON object for each of these months, their _id fields will be constructed from the year, month and item ID from their monthly state database, packed into a 64-bit integer, and would look like this in the Mongo DB collection storing hosts.

JSON output host_202104.json:

{
  "_id": { "$numberLong": "4551450373411307841" },
  "item_id": { "$numberLong": "321" },
  "year": 2021,
  "month": 4,
  "ipaddr": "12.34.56.78",
  ...
}

JSON output host_202105.json:

{
  "_id": { "$numberLong": "4551591110899663502" },
  "item_id": { "$numberLong": "654" },
  "year": 2021,
  "month": 5,
  "ipaddr": "12.34.56.78",
  ...
}

This means that host_202104.json and host_202105.json may be imported into the same Mongo DB collection without collisions for their monthly state database identifiers. Moreover, the same JSON output file may be imported repeatedly into the same Mongo DB collection, as more log files are processed for the same month, using Mongo DB insert-or-update operation, which is known as upsert.

The value of the _id field for most items is comprised of 12 bits for year, shifted left to follow the sign bit, 4 bits for the month shifted left to follow the year and the rest used for the item ID, which means that data items sorted by _id will sort by year and month first and only then by their item ID.

The _id field for daily and hourly objects is comprised of the year, month and either day or hour, respectively, represented as a 32-bit integer (e.g. "_id": 20210308).

The _id field for monthly usage data is simply a year and a month reporented as a 32-bit integer (e.g. "_id": 202105).

Data Units

All counts are reported as-is, so 1000 hits means that the item for which this value is reported was requested 1000 times.

Transfer amounts are reported in bytes, so the 12304 in xfer means that 12304 bytes were transferred for that item.

Visit duration is reported in minutes.

URL response times are reported in seconds.

Percentages are reported against the corresponding total number in monthly stats. For example, 10% in daily hits means that the day for which this number is reported accounts for 10% of all requests reported for the corresponding month.

Data Layout

Data item objects described in this section may have different structure, depending on data being processed, which may affect database queries issued against these collections.

Simple searches are reported as plain strings, such as this:

{
  ...
  "hits": { "$numberLong": "2" },
  "visits": { "$numberLong": "1" },
  "search": "webalizer setup aspx sample"
}

Advanced searches that capture additional search details, such as whether all words must match, may be reported as an array of search terms and their types, such as this:

"hits": { "$numberLong": "1" },
"visits": { "$numberLong": "1" },
"termcnt": 2,
"search": [
  {
    "type": "All Words",
    "term": "stone steps"
  },
  {
    "type": "Any Word",
    "term": "webalizer"
  }
]

Importing JSON Output

JSON output may be imported into a Mobo DB using the mongoimport utility provided by Mongo DB.

Command line examples below use simplified Mongo DB connection syntax, assuming the database server is running locally. See mongoimport documentation for connecting to a remote database with authentication.

All examples should appear as single-line commands and are split onto multiple lines via the \ character for visiblity. This character will work as a line continuation on Linux, but will not on Windows. Either remove these characters or use the ^ character for Windows command prompt.

All data items, as well as hourly and daily data, are structured as JSON arrays and may be imported using following syntax.

mongoimport --db=webalizer_db \
            --collection=agents \
            --mode=upsert \
            --type=json \
            --jsonArray \
            mongodb://localhost agent_202105.json

Monthly usage data in JSON output contains a single JSON object and must be imported without the --jsonArray option.

mongoimport --db=webalizer_db \
            --collection=usage \
            --mode=upsert \
            --type=json \
            mongodb://localhost usage_202105.json

Database and collection names can be anything, as long as they are they same for all imports. The --mode=upsert option will update existing documents with matching _id fields and insert new ones.

Command Line Options

The Webalizer supports many different configuration options that will alter the way the program behaves and generates output. Most of these can be specified on the command line, while some can only be specified in a configuration file. The command line options are listed below, with references to the corresponding configuration file keywords.

Note that most command line options are case-sensitive. That is, -F and -f are different options.

General Options

  • -h, --help

    Display all available command line options and exit program.

  • -v, -V, --version

    Display program version and exit program.

  • -w, -W, --warranty

    Displays the GNU warranty disclaimer.

  • -d

    Display additional debugging information for errors and warnings produced during processing. This normally would not be used except to determine why you are getting all those errors and wanted to see the actual data. Normally The Webalizer will just tell you it found an error, not the actual data. This option will display the data as well.

    Config file keyword: Debug

  • -F

    Specify that the log file format. By default, Stone Steps Webalizer expects IIS log file, but may be instructed to process other formats: W3C, IIS, Nginx, Apache, CLF, Squid.

    Config file keyword: LogType

  • -i

    Ignore history file. USE WITH CAUTION. This causes The Webalizer to ignore any existing history file produced from previous runs and generate it's output from scratch. The effect will be as if The Webalizer is being run for the first time and any previous statistics will be lost (although the HTML documents, if any, will not be deleted) on the main index.html (yearly) web page.

    Config file keyword: IgnoreHist

  • -q

    Quiet mode. Normally, The Webalizer will produce various messages while it runs letting you know what it's doing. This option will suppress those messages. It should be noted that this WILL NOT suppress errors and warnings, which are output to STDERR.

    Config file keyword: Quiet

  • -Q

    ReallyQuiet mode. This allows suppression of all messages generated by The Webalizer, including warnings and errors. Useful when The Webalizer is run as a cron job.

    Config file keyword: ReallyQuiet

  • -T

    Display timing information. The Webalizer keeps track of the time it begins and ends processing, and normally displays the total processing time at the end of each run. If quiet mode (-q or Quiet yes in configuration file) is specified, this information is not displayed. This option forces the display of timing totals if quiet mode has been specified, otherwise it is redundant and will have no effect.

    Config file keyword: TimeMe

  • -c file

    This option specifies a configuration file to use. Configuration files allow greater control over how The Webalizer behaves, and there are several ways to use them. Note that the default configuration file is processed regardless of whether any -c option is specified.

  • -n name

    This option specifies the hostname for the reports generated. The hostname is used in the title of all reports, and is also prepended to URL's in the reports. This allows The Webalizer to be run on log files for virtual web servers or web servers that are different than the machine the reports are located on, and still allows clicking on the URL's to go to the proper location. If a hostname is not specified, either on the command line or in a configuration file, The Webalizer attempts to determine the hostname using a uname system call. If this fails, localhost will be used as the hostname.

    Config file keyword: SiteName

  • -o dir

    This options specifies the output directory for the reports. If not specified here or in a configuration file, the current default directory will be used for output.

    Config file keyword: OutputDir

  • -x name

    This option allows the generated pages to have an extension other than .html, which is the default. Do not include the leading period (.) when you specify the extension.

    Config file keyword: HTMLExtension

  • -P name

    Specify the file extensions for pages. Pages (sometimes called PageViews) are normally html documents and CGI scripts that display the whole page, not just parts of it. Some system will need to define a few more, such as phtml, php3 or pl in order to have them counted as well. The default is htm* and cgi for web logs.

    Config file keyword: PageType

  • -t name

    This option specifies the title string for all reports. This string is used, in conjunction with the hostname (if not blank) to produce the actual title. If not specified, the default of "Usage Statistics for" will be used.

    Config file keyword: ReportTitle

  • -Y

    Suppress Country graph. Normally, The Webalizer produces country statistics in both Graph and Columnar forms. This option will suppress the Country Graph from being generated.

    Config file keyword: CountryGraph

  • -G

    Suppress hourly graph. Normally, The Webalizer produces hourly statistics in both Graph and Columnar forms. This option will suppress the Hourly Graph only from being generated.

    Config file keyword: HourlyGraph

  • -H

    Suppress Hourly statistics. Normally, The Webalizer produces hourly statistics in both Graph and Columnar forms. This option will suppress the Hourly Statistics table only from being generated.

    Config file keyword: HourlyStats

  • -L

    Disable Graph Legends. The color coded legends displayed on the in-line graphs can be disabled with this option. The default is to display the legends.

    Config file keyword: GraphLegend

  • -l num

    Graph Lines. Specify the number of background reference lines displayed on the in-line graphics produced. The default is 2 lines, however can range anywhere from zero (0) for no lines, up to 20 lines (looks funny!).

    Config file keyword: GraphLines

  • -P name

    Page type. This is the extension of files you consider to be pages for Pages calculations (sometimes called pageviews). The default is htm* and cgi (plus whatever HTMLExtension you specified if it is different). Don't use a period!

  • -m num

    Specify a visit timeout. Visits are calculated by looking at the time difference between the current and last request made by a specific host. If the difference is greater that the visit timeout value, the request is considered a new visit. This value is specified in number of seconds. The default is 30 minutes (1800). Optional suffixes m and h may be used to specify this value in minutes or hours, respectively.

    Config file keyword: VisitTimeout

  • -M num

    Mangle user agent names. See MangleAgents entry below for details about this option.

    Configuration file keyword: MangleAgents

  • -g num

    This option allows you to specify the level of domains name grouping to be performed. The numeric value represents the level of grouping, and can be thought of as the "number of dots" to be displayed. The default value of 0 disables any domain name grouping.

    Configuration file keyword: GroupDomains

  • -D name

    This allows the specification of a DNS cache file name. This filename MUST be specified if you have dns lookups enabled (using the -N command line switch or DNSChildren configuration keyword). The filename is relative to the default output directory if an absolute path is not specified (ie: starts with a leading /).

  • -N num

    Number of DNS child processes to use for reverse DNS and GeoIP lookups. If specified, DNSCache or GeoIPDBPath must also be specified. If you do not wish a DNS cache file to be generated, omit DNSCache to disable DNS resolution. A value 0 will disable DNS and GeoIP look-ups.

  • --prepare-report

    Instructs Stone Steps Webalizer to interpret the last argument as a name of the database file rather than a log file name and generate monthly report using the data from the database.

  • --last-log

    Allows SSW to avoid generating an unnecessary end-of-month report at the beginning of the next month. That is, when a log file is being processed, it is not known whether there is more data to process or not and, consequently, all active visits and downloads are kept active at the end of the run. When the first log record from the next month is processed, all active visits and downloads are ended and the final report is generated. The --last-month option allows Stone Steps Webalizer to avoid this step by explicitly marking the current log file as the last one for the month, so the final report can be generated.

  • --batch

    Instructs Stone Steps Webalizer to run in the batch mode. For details, the Database Configuration Options section of this document.

  • --end-month

    End all active visits in the current database, close it and roll over the database file. This command works only against the current database and requires only the output path. For example:

      webalizer -o reports --end-month
    
  • --compact-db

    Compact the database file to attempt to decrease its size.

  • --db-info

    Print information about the specified database

  • --pipe-log-names

    Instructs Stone Steps Webalizer to read log file names from the standard input. Each log name must be on its own line.

    Windows (line break is for readability):

      dir /s /b srv-a\logs\ex1202*.log srv-b\logs\ex1202*.log |
          webalizer -o reports --pipe-log-names
    

    Linux (line break is for readability):

      ls -1 srv-a/logs/ex1202*.log srv-b/logs/ex1202*.log |
          webalizer -o reports --pipe-log-names
    

Hide Options

The following options take a string argument to use as a comparison for matching. Except for the IndexAlias, ExcludeSearchArg and IncludeSearchArg options, the string argument can be plain text, or plain text that either starts or ends with the wildcard character *.

A string argument without an asterisk will be interpreted as a substring and will match anywhere in the original string. A string ending with an asterisk will match the beginning of the string. A string starting with an asterisk will match the end of the string.

Note that asterisks may be specified only at the beginning or end of the argument and not in the middle. That is, this argument is invalid: some*text. For example:

Given the string yourmama/was/here, the arguments was, *here and your* will all produce a match.

  • -a name

    This option allows hiding of user agents (browsers) from the "Top User Agents" table in the report. This option really isn't too useful as there are a zillion different names that current browsers go by, depending where they were obtained, however you might have some particular user agents that hit your site a lot that you would like to exclude from the list. You must have a web server that includes user agents in it's log files for this option to be of any use. In addition, it is also useless if you disable the user agent table in the report (see the -A command line option or TopAgents configuration file keyword). You can specify as many of these as you want on the command line. The wildcard character * can be used either in front of or at the end of the string. (ie: Mozilla/4.0* would match anything that starts with Mozilla/4.0).

    Config file keyword: HideAgent

  • -r name

    This option allows hiding of referrers from the "Top Referrer" table in the report. Referrers are URL's, either on your own local site or a remote site, that referred the user to a URL on your web server. This option is normally used to hide your own server from the table, as your own pages are usually the top referrers to your own pages (well, you get the idea). You must have a web server that includes referrer information in the log files for this option to be of any use. In addition, it is also useless if you disable the referrers table in the report (see the -R command line option or TopReferrers configuration file keyword). You can specify as many of these as you like on the command line.

    Config file keyword: HideReferrer

  • -s name

    This option allows hiding of hosts from the "Top Hosts" table in the report. Normally, you will only want to hide your own domain name from the report, as it usually is one of the top hosts to visit your web server. This option is of no use if you disable the top hosts table in the report (see the -S command line option or TopSites configuration file option).

    Config file keyword: HideSite

  • -X

    This causes all individual hosts to be hidden, which results in only grouped hosts to be displayed on the report.

    Config file keyword: HideAllHosts

  • -u name

    This option allows hiding of URL's from the "Top URL's" table in the report. Normally, this option is used to hide images, audio files and other objects your web server dishes out that would otherwise clutter up the table. This option is of no use if you disable the top URL's table in the report (see the -U command line option or TopURLs configuration file keyword).

    Config file keyword: HideURL

  • -I name

    This option allows you to specify additional index.html aliases. Unless NoDefaultIndexAlias is specified in the configuration file, Stone Steps Webalizer strips the string index. from URL's before processing, which has the effect of turning a URL such as /somedir/index.html into just /somedir/ which is really the same URL and should be treated as such. This option allows you to specify additional strings that are to be treated the same way. Use with care, improper use could cause unexpected results.

    For example, if you specify the alias string of home, a URL such as /somedir/homepages/brad/home.html would be converted into just /somedir/ which probably isn't what was intended.

    This option is useful if your web server uses a different default index page other than the standard index.html or index.htm, such as home.html or homepage.html. The string specified is searched for anywhere in the URL, so home.htm would turn both /somedir/home.htm and /somedir/home.html into just /somedir/. Go easy on this one, each string specified will be scanned for in EVERY log record, so if you specify a bunch of these, you will notice degraded performance. Wildcards are not allowed on this one.

    Config file keyword: IndexAlias

Table Size Options

  • -e num

    This option specifies the number of entries to display in the "Top Entry Pages" table. To disable the table, use a value of zero (0).

    Config file keyword: TopEntry

  • -E num

    This option specifies the number of entries to display in the "Top Exit Pages" table. To disable the table, use a value of zero (0).

    Config file keyword: TopExit

  • -A num

    This option specifies the number of entries to display in the "Top User Agents" table. To disable the table, use a value of zero (0).

    Config file keyword: TopAgents

  • -C num

    This option specifies the number of entries to display in the "Top Countries" table. To disable the table, use a value of zero (0).

    Config file keyword: TopCountries

  • -R num

    This option specifies the number of entries to display in the "Top Referrers" table. To disable the table, use a value of zero (0).

    Config file keyword: TopReferrers

  • -S num

    This option specifies the number of entries to display in the "Top Sites" table. To disable the table, use a value of zero (0).

    Config file keyword: TopSites

  • -U num

    This option specifies the number of entries to display in the "Top URL's" table. To disable the table, use a value of zero (0).

    Config file keyword: TopURLs

Configuration Files

The Webalizer allows configuration files to be used in order to simplify life for all. There are several ways that configuration files are accessed by the Webalizer. When The Webalizer first executes, it looks for a default configuration file named webalizer.conf in the following directories, in this order:

  • current directory
  • system configuration directory (/etc on Linux or c:\windows on Windows)
  • the directory where the webalizer or webalizer.exe is located

Custom configuration files may be specified on the command line with the -c option. Custom configuration files will be processed after the default configuration file, in the order they are encountered on the command line.

The default configuration file and custom configuration files may include additional configuration files via the Include directive, which may contain an optional domain name as a condition.

For example, the following configuration parameter will instruct Stone Steps Webalizer to read the configuration file called webalizer_hide.conf located in the specified directory:

Include    c:\tools\webalizer\webalizer_hide.conf

Configuration files may be included based on the domain name. That is, if a domain name is specified with the -n option, the domain name in the Include directive will be compared with the command line domain name.

For example, given these two configuration lines:

Include c:\tools\webalizer\webalizer_hide-a.conf  www.a.com
Include c:\tools\webalizer\webalizer_hide-b.conf  www.b.com

, the first include file will be processed if the command line contains -n www.a.com and the second one will be processed if there is -n www.b.com.

Domain-specific includes are particularly useful when processing log files for multiple sites, as they allow to maintain common configuration in a single file, which greatly simplifies configuration maintenance.

There are lots of different ways you can combine the use of configuration files and command line options to produce various results. The evaluation order is as follows:

  • The default configuration file, if found, is processed.

  • Configuration files specified with the -c option, are processed as they are encountered. Values for options that are further on the command line override earlier options, including those found in any configuration files.

  • Configuration files specified in Include directives are queued for further processing.

  • After all custom configuration files are processed, queued include files are processed in the order they appeared in their configuration files.

If you specify a configuration file on the command line, you can override those options by additional command line options which follow.

Some options cannot be overridden, such as Quiet yes because the command line option -q only enables this behavior and there is no option to disable it.

The configuration files are standard ASCII text files that may be created or edited using any standard editor. Blank lines and lines that begin with a pound sign (#) are ignored. Any other lines are considered to be configuration lines, and have the form "Keyword Value", where the Keyword is one of the currently available configuration keywords defined below, and Value is the value to assign to that particular option. Any text found after the keyword up to the end of the line is considered the keyword's value, so you should not include anything after the actual value on the line that is not actually part of the value being assigned. The file sample.conf provided with the distribution contains lots of useful documentation and examples as well. It should be noted that you do not have to use any configuration files at all, in which case, default values will be used (which should be sufficient for most sites).

General Configuration Keywords

  • LogFile

    This defines the log file to use. It can be a fully qualified path or a relative path. In the latter case, the current directory and the directory identified with LogDir will be searched. If used more than one time, Stone Steps Webalizer will process log records from each file in their time stamp order. If LogFile is not specified and none is provided through the command line, the logfile defaults to STDIN.

  • LogDir

    Defines an optional path to the log directory. If LogDir is not empty and LogFile is not an absolute path (i.e. not starting with a drive letter or a path separator slash character), the complete log file path will be derived by combining LogDir and LogFile.

    Default value: none

  • LogType

    This specified the log file type being used. Values may be either w3c, iis, apache, clf or squid. Ensure that you specify the proper file type, otherwise you will be presented with a long stream of invalid record messages ;)

    Command line argument: -F

  • OutputDir

    This defines the output directory to use for the reports. If it is not specified, the current directory is used.

    Command line argument: -o

  • OutputFormat

    Specifies the format of the generated reports. These formats, are supported:

    • HTML
    • TSV
    • JSON

    HTML is the default format and will be used if no other format is specified.

    TSV stands for "tab-separated values" and will instruct Stone Steps Webalizer to generate .tab files, as if all DumpX options were set to yes. Note that if at least one DumpX option is used, TSV report is added automatically to the list of output formats.

    JSON stands for JavaScript Object Notation and outputs all items with the intent of importing them into Mongo DB.

    Multiple OutputFormat entries may be used in order to generate reports in more than one format.

  • HistoryName

    Allows specification of a history path/filename if desired. The default is to use the file named webalizer.hist, kept in the normal output directory (OutputDir above). Any name specified is relative to the normal output directory unless an absolute path name is given (ie: starts with a /).

  • ReportTitle

    This specifies the title to use for the generated reports. It is used in conjunction with the hostname (unless blank) to produce the final report titles. If not defined, the default of "Usage Statistics for" is used.

    Command line argument: -t

  • SiteName

    This defines the site hostname. The hostname is used in the report title as well as being prepended to URL's in the "Top URL's" table. This allows The Webalizer to be run on "virtual" web servers, or servers that do not reside on the local machine, and allows clicking on the URL to go to the right place. If not specified, The Webalizer attempts to get the hostname via a uname system call, and if that fails, will default to localhost.

    Command line argument: -n

  • SiteAlias

    Specifies an alternative site host name. Multiple SiteAlias entries can be used in configuration files to register more than one alias. Site aliases are used as a white list when checking for spam links in search strings.

  • UseHTTPS

    Causes the links in the Top URL's table to use https:// instead of the default http:// prefix if the request port is either not present in the logs or if the URL was requested over a secure port at least once. Ports are identified by HttpsPort and HttpPort settings, respectively. UseHTTPS is ignored for URLs that were requested over the port identified by HttpPort or some port other than HttpsPort.

    For example, if HttpPort is 80 and HttpsPort is 443 and

    • page A was requested over port 80,
    • page B was requested over port 443,
    • page C was requested over ports 80 and 443,
    • page D was requested over ports 80 and 8080,
    • page E was requested over ports 8080 and 443

    , then when UseHTTPS is set to no, only page B will be rendered with the https:// prefix and when UseHTTPS is set to yes, pages B, C and E will be rendered with the https:// prefix.

  • Quiet

    This allows you to enable or disable informational messages while it is running. The values for this keyword can be either yes or no. Using Quiet yes will suppress these messages, while Quiet no will enable them. The default is no if not specified, which will allow The Webalizer to display informational messages. It should be noted that this option has no effect on Warning or Error messages that may be generated, as they go to STDERR.

    Command line argument: -q

  • TimeMe

    This allows you to display timing information regardless of any "quiet mode" specified. Useful only if you did in fact tell the webalizer to be quiet either by using the -q command line option or the Quiet keyword, otherwise timing stats are normally displayed anyway. Values may be either yes or no, with the default being no.

    Command line argument: -T

  • UTCTime, GMTTime

    This keyword allows timestamps to be displayed in GMT (UTC) time instead of local time. Normally The Webalizer will display timestamps in the time-zone of the local machine (ie: PST or EDT). This keyword allows you to specify the display of timestamps in GMT (UTC) time instead. Values may be either yes or no.

    Default is no.

  • UTCOffset

    Specifies the difference between the local time and UTC time, without factoring in the daylight saving time adjustment. For example, Eastern Standard Time (EST) is 5 hours behind UTC and the UTCOffset value for EST would be -5h. UTCOffset is only applied when the log time zone type doesn't match the UTCTime value. In other words, log time in Apache and CLF logs will not be adjusted by UTCOffset if UTCTime is set to no, but IIS logs will be in the same configuration.

    Default value: 0

  • LocalUTCOffset

    Controls if the UTC offset of the machine running Stone Steps Webalizer can be used to automatically set UTCOffset. Note that daylight savings time offset is not included into UTCOffset.

    Default value: no

  • DSTOffset

    Specifies the difference between the standard and daylight saving time. For example, setting DSTOffset to 1h instructs Stone Steps Webalizer to add one hour to those log time stamps that are greater than or equal to DSTStart and less than DSTEnd.

    Default value: 0

  • DSTStart, DSTEnd

    Specifies the beginning and the end of the daylight saving time. DSTStart is in local time and DSTEnd is in local daylight saving time. All log time stamps greater than or equal to DSTStart and less than DSTEnd will be adjusted by DSTOffset.

    Multiple DSTStart and DSTEnd ranges may be added to cover more than one year. For example, this configures Eastern Daylight Saving Time for 2009 and 2010:

      DSTStart		2009/03/09 2:00
      DSTEnd		2009/11/01 2:00
      DSTStart		2010/03/14 2:00
      DSTEnd		2010/11/07 2:00
    

    DSTStart and DSTEnd are not evaluated if DSTOffset is zero.

    Default value: none

  • Debug

    This tells The Webalizer to display additional information when it encounters Warnings or Errors. Normally, The Webalizer will just tell you it found a bad record or field. This option will enable the display of the actual data that produced the Warning or Error as well. Useful only if you start getting lots of Warnings or Errors and want to determine the cause. Values may be either yes or no, with the default being no.

    Command line argument: -d

  • IgnoreHist

    This suppresses the reading of a history file. USE WITH EXTREME CAUTION as the history file is how The Webalizer keeps track of previous months. The effect of this option is as if The Webalizer was being run for the very first time, and any previous data is discarded. Values may be either yes or no, with the default being no.

    Command line argument: -i

  • VisitTimeout

    Set the visit timeout value. Visits are determined by looking at the time difference between the current and last request made by a specific host. If the difference in time is greater than the visit timeout value, the request is considered a new visit. The value is in number of seconds, and defaults to 30 minutes (1800). Optional suffixes m and h may be used to specify this value in minutes or hours, respectively.

    Command line argument: -m

  • MaxVisitLength

    Sets the maximum visit length, which will forcibly end long visits, regardless of the VisitTimeout value.

    Default value: 0.

  • MinVisitLength

    Sets the minimum visit value for human visitors. Intended to account for people searching for specific things and clicking the Back button soon after skimming or scanning through some of the page text. Used only if the visit length would be computed as zero otherwise and if the visit has a successful page or a file request.

    Default value: 0

  • PageType

    Allows you to define the page type extension. Normally, people consider HTML and CGI scripts as pages. This option allows you to specify what extensions you consider a page. Default is txt, php, htm* and cgi for Apache and CLF logs, txt, asp, aspx, and htm* for IIS logs.

    Command line argument: -P

  • PageEntryURL

    HTTP requests are logged at the end of the request processing cycle, which may cause a page to be logged after all of the resources it references, such as images or style sheets. In this case, an entry URL may be an image or some other file, which is not very informative. When PageEntryURL is set to yes, Stone Steps Webalizer will ignore non-pages when collecting data for entry URL. If set to no, any successful first request in a visit will be reported as an entry URL, regardless of its type.

    Default value: yes

  • GraphLegend

    Enable/disable the display of color coded legends on the produced graphs. Default is yes, to display them.

    Command line argument: -L

  • GraphLines

    Specify the number of background reference lines to display on produced graphs. The default is 2. To disable the use of background lines, use zero (0).

    Command line argument: -l

  • CountryGraph

    This keyword is used to either enable or disable the creation and display of the Country Usage graph. Values may be either yes or no, with the default being yes.

    Command line argument: -Y

  • JavaScriptChartsMap

    If CountryGraph is enabled and JavaScriptCharts is configured, the country chart will be rendered as a world map instead of a pie chart.

  • DailyGraph

    This keyword is used to either enable or disable the creation and display of the Daily Usage graph. Values may be either yes or no, with the default being yes.

  • DailyStats

    This keyword is used to either enable or disable the creation and display of the Daily Usage statistics table. Values may be either yes or no, with the default being yes.

  • HourlyGraph

    This keyword is used to either enable or disable the creation and display of the Hourly Usage graph. Values may be either yes or no, with the default being yes.

    Command line argument: -G

  • HourlyStats

    This keyword is used to either enable or disable the creation and display of the Hourly Usage statistics table. Values may be either yes or no, with the default being yes.

    Command line argument: -H

  • IndexAlias

    This allows additional index.html aliases to be defined. Normally, The Webalizer scans for and strips the string "index." from URL's before processing them. This turns a URL such as /somedir/index.html into just /somedir/ which is really the same URL. This keyword allows additional names to be treated in the same fashion for sites that use different default names, such as home.html. The string is scanned for anywhere in the URL, so care should be used if and when you define additional aliases.

    For example, if you were to use an alias such as home, the URL /somedir/homepages/brad/home.html would be turned into just /somedir/ which probably isn't the intended result. Instead, you should have specified home.htm which would correctly turn the URL into /somedir/homepages/brad/ like intended.

    It should also be noted that specified aliases are scanned for in EVERY log record... A bunch of aliases will noticeably degrade performance as each record has to be scanned for every alias defined. You don't have to specify index. as it is always the default.

    Command line argument: -I

  • MangleAgents

    The MangleAgents keyword specifies the level of user agent name mangling, if any.

    Normally, The Webalizer will keep track of the user agent field verbatim. Unfortunately, there are a ton of different names that user agents go by, and the field also reports other items such as machine type and OS used.

    For example, Netscape 4.03 running on Windows 95 will report a different string than Netscape 4.03 running on Windows NT, so even though they are the same browser type, they will be considered as two totally different browsers by The Webalizer. For that matter, Netscape 4.0 running on Windows NT will report different names if one is run on an Alpha and the other on an Intel processor! Internet Exploder is even worse, as it reports itself as if it were Netscape and you have to search the given string a little deeper to discover that it is really MSIE! In order to consolidate generic browser types, this option will cause The Webalizer to "mangle" the user agent field, attempting to consolidate generic browser types.

    Stone Steps Webalizer has two methods for mangling user agent names. One is the classic Webalizer method and one is a new method based on user agent filters (ExcludeAgentArgs, IncludeAgentArgs and GroupAgentArgs).

    Classic User Agent Name Mangling

    Classic user agent mangling method allows to specify 5 levels of mangling, each producing different level of detail.

    • Level 5 displays only the browser name (MSIE or Mozilla) and the major version number.

    • Level 4 will also display the minor version number (single decimal place).

    • Level 3 will display the minor version number to two decimal places.

    • Level 2 will add any sub-level designation (such as Mozilla/3.01Gold or MSIE 3.0b).

    • Level 1 will also attempt to add the system type.

    The default Level 0 will disable name mangling and leave the user agent field unmodified, producing the greatest amount of detail.

    User Agent Name Mangling Filters

    If UseClassicMangleAgents is set to no, which is the default setting, filter-based mangling will be used. In this mode, user agent arguments are classified as product versions, URLs or generic arguments and may be manipulated via mangle level or via user agent include, exclude and group filters.

    Before v6.2.0 setting mangle level to a non-zero value resulted in a few predefined filters added automatically. Starting from this version, there are no default filters added. If you would like to mimic the pre-v6.2.0 behavior, you can add the same filters into your configuration, which makes them more visible and easier to maintain.

    MangleAgents level values are interpreted as follows. Examples are shown with the assumption that user agent include, exclude and group filters are empty.

    • Level 0: At this level user agent values are not modified and are reported as they appear in the log file.

      • Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
      • Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:100.0) Gecko/20100101 Firefox/100.0
    • Levels 1, 2: At these levels user agent arguments are reported as individual arguments separated with ; sequences.

      • Mozilla/5.0; compatible; Googlebot/2.1; +http://www.google.com/bot.html
      • Mozilla/5.0; Windows NT 10.0; Win64; x64; rv:100.0; Gecko/20100101; Firefox/100.0
    • Level 3: At this level product versions are truncated to the major version.

    • Level 4: At this level product versions are removed.

    • Level 5: At this level URL tokens are removed.

      • Mozilla; compatible; Googlebot
      • Mozilla; Windows NT 10.0; Win64; x64; rv:100.0; Gecko; Firefox

    In addition to mangle levels described above, you can use IncludeAgentArgs and ExcludeAgentArgs to remove individual user agent arguments, regarless of their type.

    These filters work the same way as search argument filters and when text without an asterisk is specified, the entire argument must match. For example ExcludeAgentArgs KHTML will not filter out KHTML, like Gecko, but the one shown below will.

      ExcludeAgentArgs    KHTML, like Gecko
    

    EnablePhraseValues must be set to yes to allow spaces in configuration values, like in this example.

    Generic user agent arguments (i.e. not product versions or URLs) may also be rewritten using GroupAgentArgs. For example, to avoid fragmentation for various versions of Mac OS 10, such as Intel Mac OS X 10_15_7 and Intel Mac OS X 10_11_6, this group agent argument filter may be used.

      GroupAgentArgs		Intel Mac OS X*		Intel Mac OS X
    

    GroupAgentArgs is ignored for product version and URL arguments, so Edge product version Edg/101.0.1210.53 cannot be rewritten as Edge or Edge/101.0.1210.53.

    User agent values often do not follow HTTP RFC guidelines and may have product versions in field comments, which is the text between ( and ) characters, or may use product versions for arbitrary text, such as Version/7.0 or Language/zh_CN. You may want to experiment with include, exclude and group user agent argument filter that work best for you. Keep in mind, however, that changing filters permanently should not be done in the middle of the month in the log processing cycle because it will produce arbitrarily fragmented groupings.

    Command line argument: -M

  • SearchEngine

    This keyword allows specification of search engines and their query strings. Search strings are obtained from the referrer field in the record, and in order to work properly, the Webalizer needs to know what query strings different search engines use. The SearchEngine allows you to specify the search engine and it's query string to parse the search string from. The line is formatted as: SearchEngine engine-string query-string where engine-string is a substring for matching the search engine with, such as yahoo.com or altavista. The query-string is the unique query string that is added to the URL for the search engine, such as search= or MT= with the actual search strings appended to the end. There is no command line option for this keyword.

    This configuration parameter supports additional syntax to be able to combine various search terms.

    For example, somebody using Google to find pages that contain all words and to be a certain file type (e.g. PDF) use different search arguments compared to a usual search. Following configuration allows Stone Steps Webalizer to process such cases using this configuration:

      SearchEngine    www.google.     as_q=All Words
      SearchEngine    www.google.     as_filetype=File Type
    

    All matching search strings will be reported on one line, separated by the term qualifier.

    For example, the following line describes that somebody was looking for a PDF file containing words webalizer and apache:

      [All Words] webalizer apache [File Type] pdf
    

    For performance reasons, all search terms for the same site (e.g. www.google.) must be grouped. The first line with the mismatching domain name pattern will cancel further search.

  • Incremental

    This allows incremental processing to be enabled or disabled. Incremental processing allows processing partial logs without the loss of detail data from previous runs in the same month. This feature saves the internal state of the program so that it may be restored in following runs. See the section above titled "Incremental Processing" for additional information. The value may be yes or no, with the default being yes.

  • ApacheLogFormat

    Defines the format Stone Steps Webalizer will use when parsing Apache log files. This configuration variable will only be evaluated when the current log file type is Apache.

    Default value: none

  • BundleGroups

    Controls whether grouped items in the reports should be bundled together at the beginning of the report or not. Bundling groups together makes it easier to stack them up against each other.

    Default value: yes

  • ConvURLsLowerCase

    Controls whether URL characters will be converted to lower case (yes) or not (no).

    Default value: no

  • DownloadPath

    Lists a URL path that Stone Steps Webalizer will use to detect file downloads for the downloads report.

    For example, if you would like to track downloads of a file called util.zip located in the /downloads/ directory, add the following entry to webalizer.conf:

      DownloadPath    /downloads/util.zip    Utility Download
    

    Wildcard characters (*) may be used at the beginning and the end of the path to list partial paths. Note that query strings are ignored when logged URL's are compared with DownloadPath entries. Multiple DownloadPath entries may be used to track more than one path.

    Default value: none

  • DownloadTimeout

    Maximum number of seconds between consecutive partial download requests that are counted towards the same download job.

    Default value: 180

  • EnablePhraseValues

    If this configuration parameter is set to yes, Stone Steps Webalizer will treat the tab character as a name/value separator when parsing two-part configuration entries, such as GroupAgent or SearchEngine and will ignore spaces embedded in the values. For example:

      Mozilla/4.0 (compatible; MSIE 6* <tab> Internet Explorer v6
      ^----------- value ------------^       ^------ name ------^
    

    Default value: no

  • GroupURLDomains

    Squid log files contain absolute URLs, along with fully- qualified domain names. GroupURLDomains may be used to group these domains in the URL report. The value of this configuration parameter is the number of domain labels, past the top-level one, to report.

    For example, if GroupURLDomains is set to 1, two labels will be reported (e.g. stonesteps.ca); if this parameter is set to 2, three labels will be reported (e.g. forums.stonesteps.ca); and so on. If GroupURLDomains is set to 0, no this type of grouping will not be performed.

    Default value: 0

  • HistoryLength

    Defines the maximum number of months reported on the main index page. The minimum number of months in the history is 12.

    Default value: 24

  • HttpPort

    Defines the TCP/IP port used by the web server to serve HTTP requests.

    Default value: 80

  • HttpsPort

    Defines the TCP/IP port used by the web server to serve HTTPS requests.

    Default value: 443

  • Include

    Instructs Stone Steps Webalizer to process the specified configuration file after the main configuration file has been processed. This parameter may optionally be followed by the domain name to make the include directive domain specific.

    For example, the following include directive will only be processed if the domain name specified with the -n option is www.a.com.

      Include   c:\tools\webalizer\a.conf    www.a.com
    

    Default value: none

  • IncludeSearchArg, ExcludeSearchArg

    Define include and exclude search arguments filters. Each configuration parameter is expected to be either a complete or a partial name of a search argument to include or exclude. A single asterisk may be used to include or exclude all search arguments. Multiple include/exclude directives may be used if more than one search argument is to be included or excluded. The include filter takes precedence over the exclude filter.

    Unlike other include/ignore filters, non-wildcard IncludeSearchArg and ExcludeSearchArg values are not treated as sub-strings and must match search argument names exactly in order for the corresponding filter to be activated.

    For example, the following two exclude filters will remove search arguments x and y, which are commonly submitted by browsers if image-based buttons are used on the page, but will not affect search arguments that contain characters x or y, such as query or xpath.

      ExcludeSearchArg   x
      ExcludeSearchArg   y
    

    Default value: none

  • IncludeAgentArgs, ExcludeAgentArgs

    These filters are implemented in the same way as are include and exclude search argument filters in that each filter without an asterisk must match the entire argument.

  • GroupAgentArgs

    This filter makes it possible to rename matching user agent arguments that are not classified as product versions or URLs. For example, this filter replaces "Windows NT 5.1" with "Windows XP" in the final output:

      GroupAgentArgs   Windows NT 5.1  Windows XP
    

    Group filters are not applied against product version and URL arguments. In other words, "Edg/*" cannot be replaced with "Edge".

    Note that if the pattern contains spaces (e.g. Windows NT 5.1), there must be a tab character between the pattern and the alias and EnablePhraseValues must be set to yes. Otherwise, the first space will end the pattern (in this example, the pattern will end after the first word Windows).

  • LanguageFile

    Specifies a fully-qualified path to the language file.

  • MonthlyTotals

    Specifies whether Stone Steps Webalizer should generate the monthly totals report or not.

    Default value: yes

  • NginxLogFormat

    Defines the format Stone Steps Webalizer will use when parsing Nginx log files. This configuration variable will only be evaluated when the current log file type is Nginx.

    Default value: none

  • NoDefaultIndexAlias

    Many web servers make it possible to configure a default document for each directory. If a user requests a URL that is a directory (e.g. http://127.0.0.1/books/), the default document from that directory will be served.

    For example, IIS is usually configured with default.htm as a default document; Apache, as well as many other Unix-originated web servers, is configured to serve index.html as a default document. Stone Steps Webalizer, by default, adds index. to the list of default documents. When processing a URL, Stone Steps Webalizer checks if the requested file matches any entries in the default document list and if it does, strips off the file name from the URL.

    This feature allows Stone Steps Webalizer to avoid fragmenting default document statistics if the same document was requested using multiple aliases (e.g. index.html, index.php, etc). However, in some cases, it is undesirable to use index. as a default alias (e.g. if there is a directory named index.ext). Setting NoDefaultIndexAlias to yes prevents Stone Steps Webalizer from adding index. to the default document list.

    Default value: no

  • SortSearchArgs

    Controls whether search arguments will be sorted alphabetically or not. Sorting search arguments helps defragmenting URL reports.

    Default value: yes

  • SpamReferrer

    Each entry lists a keyword identifying the visitor as a spammer. Multiple values may be used to specify more than one keyword. Once identified as a spammer, visitor's IP address will be remembered for the rest of the month and any requests originating from this IP address will be treated as spam.

    Default value: none

  • UpstreamTraffic

    Indicates whether to track upstream data transfers (i.e. uploads) or not. Note that upstream and downstream transfer amounts are not tracked separately - their values are added together and shown as Transfer.

    Default value: no

  • Robot

    Defines a pattern used to identify robot user agents, such as search engines. The pattern may contain a leading or a trailing wildcard character (*) to indicate that the pattern should be matched from the end or from the beginnning of the string, respectively. Otherwise, the pattern will be treated as a sub-string. Leading wildcards are not very useful for filtering user agents.

    For example, msnbot/* will match the first user agent and Googlebot/ will match the second one.

      msnbot/1.0 ( http://search.msn.com/msnbot.htm)
      Mozilla/5.0 (compatible; Googlebot/2.1;)
    

    Robot entries may contain aliases that will be used instead of patterns when grouping robot entries. For example:

      Robot   Mediapartners-Google*     Google Adsense
      Robot   msnbot*                   MSN Live Search
    

    Default value: none

  • TargetURL

    Visitors who browse a website and then purchase, download something or just visit a designated page are called converted visitors. Converted visitors may be tracked by specifying target URL patterns in the configuration file. Each TargetURL entry designates a URL pattern. For example,

      TargetURL    /orders/receipt.asp*
    

    This entry instructs Stone Steps Webalizer to interpret the URL that begins with /orders/receipt.asp to be treated as a target URL.

    Multiple TargetURL entries can be used to specify more than one pattern.

    Default value: none

  • TargetDownload

    If downloads are being tracked using DownloadPath, setting TargetDownload to yes will instruct Stone Steps Webalizer to treat those requests that matched DownloadPath as if they had a corresponding TargetURL entry. This improves overall performance by reducing the number of target URL entries and simplifies configuration.

    Default value: yes

  • ClassicKBytes

    Starting with v4.2, Stone Steps Webalizer shows transfer amounts in reports and charts as numbers with a unit suffix, such as KB, MB, GB, etc., as opposed to the traditional numbers of kilobytes. If you would like to revert to the previous behavior, set ClassicKBytes to yes in the configuration. You will also need to change the word Transfer to KBytes in the language file, so report titles look the same as before v4.2.

    Default value: no

  • DecimalKBytes

    Traditionally, the Webalizer interpreted one kilobyte as 1024 bytes. You can set DecimalKBytes to yes to compute transfer amounts in multiples of 1000 bytes.

    Default value: no

  • PageTitle

    Associates a page title with a URL pattern. If a URL in the Top N URLs and Entry/Exit URL reports match some pattern, the pattern name will be rendered instead of the URL. For example, blog pages might be set up like this:

      PageTitle     /post/12*   Wild Life Photography
      PageTitle     /post/34*   Action Photography
    

    In the All URLs report the matching pattern name will not be rendered instead of the URL text, but instead will be added as a URL title, so hovering over that entry will show the title. This way URLs are shown more consistently in a tabular report with a lot of other URLs that are not titled.

    The pattern syntax is the same as for hide/ignore options and if one is specified without an asterisk, it will match any text within the URL. Trailing asterisk, like that shown above, will match URLs from the beginning. Note, however, that the order of entries matches and longer ones should be listed first.

    For example, an entry with an ID 1 and a trailing asterisk would mask the following ID 18 and the latter would never appear in the report:

      PageTitle     /post/1*    Wild Life Photography
      PageTitle     /post/18*   Action Photography
    

    Reordering 18* before 1* will assign correct titles in the report, but will 1* will still match other sequences starting with 1 if they are not explicitly listed as page titles (e.g. /post/19). Alternatively, you can use a leading asterisk, which will work for cases above, but may misfire for unrelated URLs ending with the same sequence, such as login redirects (e.g. /login/?ref=/post/19 will be titled with the pattern */post/19).

    Note that without an asterisk the pattern will be match any emedded text, such as /post/1 in a URL /?ref=/post/123.

    URL patterns are matched against normalized URLs, so if you want to match some non-ASCII character, use the actual character and not a URL-encoded sequence for that character. See the URL Normalization section for details.

    Page titles can be styled in webalizer.css. By default report table cells containing page titles are styled to insert a Unicode page character before title text.

DNS Resolution Configuration Options

  • DNSCache

    Specifies the DNS cache filename. This name is relative to the default output directory unless an absolute name is given (ie: starts with /). If DNSCache is not specified, but GeoIPDBPath is, DNS resolver workers will still be created to lookup country codes during log file processing.

  • ASNDBPath

    A fully-qualified path to a MaxMind's Autonomous System numbers (ASN) database. A free ASN database may be downloaded from this location:

    https://dev.maxmind.com/geoip/geoip2/geolite2/

    When ASN database is configured, ASN columns will be added to the Hosts and Downloads reports and an additional Top ASN report will be generated, as well as a tab-separated file containing ASN entries. Use TopASN and DumpASN configuration parameters to control whether to generate a top ASN report and to generate a tab-separated ASN file.

  • GeoIPDBPath

    This configuration parameter is expected to be a fully- qualified path to the MaxMind's GeoIP Country or GeoIP City database file in the binary format. The database may be downloaded at the following URL:

    http://www.maxmind.com/app/geoip_country

    Less precise, but free GeoIP databases are available at this location:

    http://dev.maxmind.com/geoip/geoip2/geolite2/

    Setting GeoIPDBPath will instruct Stone Steps Webalizer to use the information in the GeoIP database to generate the country report. If this parameter is not set or if it points to a non-existent or invalid GeoIP database, Stone Steps Webalizer will use domain name suffixes, such as .ca or .jp, to generate the country report.

  • GeoIPCity

    Value yes will instruct Stone Steps Webalizer to look up country and city names in the GeoIP database, while value no will disable city look-ups. Note that you need GeoIP City database for this parameter to have any effect.

    Default value: yes

  • DNSChildren

    Number of DNS child processes to use for reverse DNS and GeoIP lookups. If specified, DNSCache or GeoIPDBPath must also be specified. If you do not wish a DNS cache file to be generated, omit DNSCache to disable DNS resolution. A value 0 will disable DNS and GeoIP look-ups.

  • DNSCacheTTL

    Specifies Time To Live (TTL) in days for DNS cache entries. In most cases it is reasonable to set this value to 30 days.

    Default value: 30

  • DNSLookups

    Specifies whether to resolve host IP addresses to domain names (yes) or not (no) if DNSCache and DNSChildren are configured to enable DNS components. When set to no, DNS resolution is disabled, but the DNSCache database is still used to preserve some IP address information, such as whether some IP address is marked as a spammer, from month to month.

    Default: yes

  • AcceptHostNames

    Specifies whether to accept host names instead of IP addresses in log files or not. Configuring your web server to resolve IP addresses to host names will slow down the server. Lack of IP addresses also will disable address-based visitor country identification.

    Default value: no

  • ExternalMapURL

    Specifies a fully-qualified URL of any map service that can interpret latitude and longitude expressed as signed decimal degrees. Use {{lat}} and {{lon}} to pass coordinates to the map service. For example, use this URL to show a map centered on the location of the IP address in Google Maps:

    https://www.google.ca/maps/@{{lat}},{{lon}},12z

Graph Configuration

  • GraphBackgroundAlpha

    Sets the transparency of the background of graph images, in percent. The value of 100 makes background completely transparent, while the value of 0 makes it completely opaque. Making graph backgrounds transparent makes it possible to use another image, such as a logo, a gradient or a pattern as a graph background. GraphTrueColor must be set to yes in order for GraphBackgroundAlpha to work.

    Default value: 0

  • GraphBorderWidth

    Defines the width of the 3D border around image graphs in pixels. If set to zero, graph images are generated without a border. This value cannot be greater than 8.

    Default value: 0

  • GraphFontNormal, GraphFontBold

    Define fully-qualified paths to TrueType font files that Stone Steps Webalizer will use when creating graphs. If these paths are not specified, Stone Steps Webalizer will use default raster fonts.

    Default value: none

  • GraphBackgroundColor

    Defines the background color for all graphs generated by Stone Steps Webalizer. The value must be specified as six hexadecimal digits, two for each color - red, green and blue.

    Default value: C0C0C0

  • GraphFontMedium

    Specifies the size, in points, of the small and medium fonts used in graphs. The medium font is used for graph titles and country names in the country report.

    Default value: 9.5

  • GraphFontSmall

    Specifies the size, in points, of the small font used in graphs. The small font is used for graph legends (e.g. Hits, Visits, etc) and axis markers.

    Default value: 8

  • GraphFontSmoothing

    Specifies whether Stone Steps Webalizer will create graphs using smoothed TrueType fonts. This value is ignored if default raster fonts are used.

    Default value: yes

  • GraphGridlineColor

    Defines the color of graph gridlines. The value must be specified as six hexadecimal digits, two for each color - red, green and blue.

    Default value: 808080

  • GraphHitsColor

    Defines the color of the graph and legend associated with Hits. The value must be specified as six hexadecimal digits, two for each color - red, green and blue.

    Default value: 00805C

  • GraphHostsColor

    Defines the color of the graph and legend associated with Hosts. The value must be specified as six hexadecimal digits, two for each color - red, green and blue.

    Default value: FF8000

  • GraphLegendColor

    Defines the base color of the X-axis legend. The value must be specified as six hexadecimal digits, two for each color - red, green and blue.

    Default value: 000000

  • GraphOutlineColor

    Defines the color of the graph bar outlines. The value must be specified as six hexadecimal digits, two for each color - red, green and blue.

    Default value: 000000

  • GraphPagesColor

    Defines the color of the graph and legend associated with Pages. The value must be specified as six hexadecimal digits, two for each color - red, green and blue.

    Default value: 00C0FF

  • GraphTitleColor

    Defines the color of the graph title. The value must be specified as six hexadecimal digits, two for each color - red, green and blue.

    Default value: 0000FF

  • GraphVisitsColor

    Defines the color of the graph and legend associated with Visits. The value must be specified as six hexadecimal digits, two for each color - red, green and blue.

    Default value: FFFF00

  • GraphTransferColor

    Defines the color of the graph and legend associated with Transfer. The value must be specified as six hexadecimal digits, two for each color - red, green and blue.

    Default value: FF0000

  • GraphWeekendColor

    Defines the color of the weekend days in the monthly traffic report. The value must be specified as six hexadecimal digits, two for each color - red, green and blue.

    Default value: 00805C

  • GraphShadowColor

    Defines the color of the legend shadow for all graphs generated by Stone Steps Webalizer. The value must be specified as six hexadecimal digits, two for each color - red, green and blue.

    Default value: 333333

  • GraphTitleColor

    Defines the color of graph titles. The value must be specified as six hexadecimal digits, two for each color - red, green and blue.

    Default value: 0000FF

  • GraphTrueColor

    Specifies whether Stone Steps Webalizer will create TrueColor or palette-based graph images. TrueColor images are larger in size, but are of better quality, especially if font smoothing is turned on.

    Default value: no

  • JavaScriptCharts

    Specifies the type the the JavaScript chart package to use. Only Highcharts is supported at the moment. By default this value is empty and all charts are rendered as PNG images.

    If you configure this value, you also need to configure the JavaScript directory via HTMLJsPath.

    Please note that Highcharts is not an Open Source package and may be used without buying a license only for non-commercial purposes. Read Highcharts` page for additional information about their licensing before enabling Highcharts in the reports:

    https://shop.highsoft.com/faq#Non-Commercial-0

    For more information about the JavaScript charts package used in this implementation see this page:

    https://www.highcharts.com/products/highstock

  • JavaScriptChartsPath

    JavaScripts charts by default will use implementation scripts from the 3rd-party chart vendor's location. If you would like to serve charts scripts from your servers or use some alternative, JavaScript charts integration, specify as many of these variables as you need in the configuration file and each will be output in the head element of each report.

Database Configuration Options

  • BatchProcessing, Batch

    Instructs Stone Steps Webalizer to run in a batch mode, which avoids generating report files at the end of each run. Once the last log file has been processed, a monthly report may be generated with the command line argument --prepare-report.

    Default value: no
    Command line argument: --batch

  • DbDirect

    Configures Berkeley DB not to use the operating system caching and use only the internal Berkeley DB caching mechanism. This may help reduce memory footprint on some systems.

    Default value: no

  • DbSeqCacheSize

    Is the number of cached DB sequence numbers used by Stone Steps Webalizer. The default value of 100 is sufficient in most cases.

  • DbCacheSize

    This configuration value is used as a guiding number to limit the amount of memory used by Stone Steps Webalizer. The specified value is not used as a hard memory limit, but rather as a base value to initialize various components, such as Berkeley DB cache and internal memory tables for log file items, such as hosts and URLs. The actual amount of memory used may be about three-four times as much, but it should level at some point. You may need to experiment with different values to arrive at an acceptable limit. Values may be suffixed with K, M or G for kilo, mega and giga multipliers. The minimum value is 1 MB.

    Default value: 50 MB

  • DbPath

    Specifies a fully-qualified directory path where the state database will be created.

    Default value: output directory

  • DbName

    Sets the name of the database file.

    Default value: webalizer

  • DbExt

    Sets the extension of the database file.

    Default value: db

Top Table Keywords

  • TopAgents

    This allows you to specify how many "Top" user agents are displayed in the "Top User Agents" table. The default is 15. If you do not want to display user agent statistics, specify a value of zero (0). The display of user agents will only work if your web server includes this information in its log file (ie: a combined log format file).

    Command line argument: -A

  • AllAgents

    Will cause a separate HTML page to be generated for all normally visable User Agents. A link will be added to the bottom of the "Top User Agents" table if enabled. Value can be either yes or no, with no being the default.

  • TopCountries

    This allows you to specify how many "Top" countries are displayed in the "Top Countries" table. The default is 30. If you want to disable the countries table, specify a value of zero (0).

    Command line argument: -C

  • TopCities

    Limits the number of cities listed in the Top Cities report. This report is generated only if GeoIP is configured and if GeoIPCity is set to yes. Set TopCities to zero if you would like to disable the city report, but show city names in other reports.

    Default value: 30

  • TopReferrers

    This allows you to specify how many "Top" referrers are displayed in the "Top Referrers" table. The default is 30. If you want to disable the referrers table, specify a value of zero (0). The display of referrer information will only work if your web server includes this information in its log file (ie: a combined log format file).

    Command line argument: -R

  • AllReferrers

    Will cause a separate HTML page to be generated for all normally visable Referrers. A link will be added to the "Top Referrers" table if enabled. Value can be either yes or no, with no being the default.

  • TopHosts

    This allows you to specify how many "Top" hosts are displayed in the "Top Hosts" table. The default is 30. If you want to disable the hosts table, specify a value of zero (0).

    Command line argument: -S

  • TopKHosts

    Identical to TopSites, except for the by KByte table. Default is 10. No command line switch for this one.

  • AllHosts

    Will cause a separate HTML page to be generated for all normally visible Sites. A link will be added to the bottom of the "Top Sites" table if enabled. Value can be either yes or no, with no being the default.

  • TopURLs

    This allows you to specify how many "Top" URL's are displayed in the "Top URL's" table. The default is 30. If you want to disable the URL's table, specify a value of zero (0).

    Command line argument: -U

  • TopKURLs

    Identical to TopURLs, except for the by KByte table. Default is 10. No command line switch for this one.

  • AllURLs

    Will cause a separate HTML page to be generated for all normally visible URLs. A link will be added to the bottom of the "Top URLs" table if enabled. Value can be either yes or no, with no being the default.

  • TopEntry

    Allows you to specify how many "Top Entry Pages" are displayed in the table. The default is 10. If you want to disable the table, specify a value of zero (0).

    Command line argument: -e

  • TopExit

    Allows you to specify how many "Top Exit Pages" are displayed in the table. The default is 10. If you want to disable the table, specify a value of zero (0).

    Command line argument: -E

  • TopSearch

    Allows you to specify how many "Top Search Strings" are displayed in the table. The default is 20. If you want to disable the table, specify a value of zero (0). Only works if using combined log format (ie: contains referrer information).

  • TopUsers

    This allows you to specify how many "Top" usernames are displayed in the "Top Usernames" table. Usernames are only available if you use http authentication on your web server. The default value is 20. If you want to disable the Username table, specify a value of zero (0).

  • AllUsers

    Will cause a separate HTML page to be generated for all normally visible user names. A link will be added to the bottom of the "Top Usernames" table if enabled. Value can be either yes or no, with no being the default.

  • AllSearchStr

    Will create a separate HTML page to be generated for all normally visable Search Strings. A link will be added to the bottom of the "Top Search Strings" table if enabled. Value can be either yes or no, with no being the default.

  • AllDownloads

    If this configuration parameter is set to yes and the number of downloads is greater than the number of lines in the download report (i.e. greater than the value of TopDownloads), Stone Steps Webalizer will generate a standalone downloads report, listing all downloads for the current month.

    Default value: no

  • AllErrors

    If this configuration parameter is set to yes and the number of HTTP errors is greater than the number of lines in the HTTP error report (i.e. greater than the value of TopErrors), Stone Steps Webalizer will generate a standalone HTTP error report, listing all HTTP errors for the current month.

    Default value: no

  • TopDownloads

    Defines the maximum number of lines in the downloads report. If the number of actual downloads is greater than this value, the rest of the downloads will either be discarded or generated as a separate downloads report, depending on the value of AllDownloads.

    Default value: 20

  • TopErrors

    Defines the maximum number of lines in the HTTP error report. If the number of actual errors is greater than this value, the rest of the errors will either be discarded or generated as a separate HTTP error report, depending on the value of AllErrors.

    Default value: 20

  • TopASN

    Defines the maximum number of rows in the top Autonomous System report.

    Default value: 30

Hide Object Keywords

These keywords allow you to hide user agents, referrers, hosts, URL's and usernames from the various "Top" tables. The value for these keywords are the same as those used in their command line counterparts. You can specify as many of these as you want without limit. Refer to the section above on "Command Line Options" for a description of the string formatting used as the value. Values cannot exceed 80 characters in length.

  • HideAgent

    This allows specified user agents to be hidden from the "Top User Agents" table. Not very useful, since there a zillion different names by which browsers go by today, but could be useful if there is a particular user agent (ie: robots, spiders, real-audio, etc..) that hits your site frequently enough to make it into the top user agent listing. This keyword is useless if 1) your log file does not provide user agent information or 2) you disable the user agent table.

    Command line argument: -a

  • HideReferrer

    This allows you to hide specified referrers from the "Top Referrers" table. Normally, you would only specify your own web server to be hidden, as it is usually the top generator of references to your own pages. Of course, this keyword is useless if 1) your log file does not include referrer information or 2) you disable the top referrers table.

    Command line argument: -r

  • HideHost

    This allows you to hide specified hosts from the "Top Hosts" table. Normally, you would only specify your own web server or other local machines to be hidden, as they are usually the highest hitters of your web site, especially if you have their browsers home page pointing to it.

    Command line argument: -s

  • HideAllHosts

    This allows hiding all individual hosts from the display, which can be useful when a lot of groupings are being used (since grouped records cannot be hidden). It is particularly useful in conjunction with the GroupDomains feature, however can be useful in other situations as well. Value can be either yes or no, with no the default.

    Command line argument: -X

  • HideURL

    This allows you to hide URL's from the "Top URL's" table. Normally, this is used to hide items such as graphic files, audio files or other non-HTML files that are transferred to the visiting user.

    Command line argument: -u

  • HideUser

    This allows you to hide Usernames from the "Top Usernames" table. Usernames are only available if you use http based authentication on your web server.

  • HideRobots

    If set to yes, this option allows you to hide all robots from the Top Hosts and Top Agents reports. Robot groups, if there are any, will still be displayed in the Top Agents report. Use the Robot configuration parameter to identify robots.

    Default value: no

  • HideGroupedItems

    Controls whether items that are being currently grouped with one of the grouping configuration variables are also added to the corresponding hide item list or not.

    This configuration value may be used multiple times between multiple sets of group configuration values. For example, in this configuration FireFox, Chrome and Opera user agents will be reported in a group and individually, but Internet Explorer will be reported only as a group.

      GroupAgent         Chrome
      GroupAgent         Firefox
    
      HideGroupedItems   yes
    
      GroupAgent         MSIE        Internet Explorer
    
      HideGroupedItems   no
    
      GroupAgent         Opera
    

    Default value: no

Group Object Keywords

The Group* keywords allow object grouping based on Host, URL, Referrer, User Agent and Usernames. Combined with the Hide* keywords, you can customize exactly what will be displayed in the Top tables.

For example, to only display totals for a particular directory, use a GroupURL and HideURL with the same value (ie: /help/*). Group processing is only done after the individual record has been fully processed, so name mangling and site total updates have already been performed. Because of this, groups are not counted in the main site total (as that would cause duplication). Groups can be displayed in bold and shaded as well. Grouped records are not, by default, hidden from the report. This allows you to display a grouped total, while still being able to see the individual records, even if they are part of the group. If you want to hide the detail records, follow the Group* directive with a Hide* one using the same value. There are no command line switches for these keywords. The Group* keywords also accept an optional label to be displayed instead of the actual value used. This label should be separated from the value by at least one whitespace character, such as a space or tab character. See the sample.conf file for examples.

GroupReferrer Allows grouping Referrers. Can be handy for some of the major search engines that have multiple host names a referral could come from.

  • GroupURL

    This keyword allows grouping URL's. Useful for grouping complete directory trees.

  • GroupHost

    This keywords allows grouping Sites. Most used for grouping top level domains and unresolved IP address for local dial-ups, etc...

  • GroupAgent

    Groups User Agents. You could use Firefox, Chrome and Edge as the values for GroupAgent and HideAgent keywords. Make sure you put Edge first because it is based on Chrome and also lists Chrome in its user agent string.

  • GroupDomains

    Allows automatic grouping of domains. The numeric value represents the level of grouping, and can be thought of as 'the number of dots' to display. A 1 will display second level domains only (xxx.xxx), a 2 will display third level domains (xxx.xxx.xxx) etc... The default value of 0 disables any domain grouping.

    Command line argument: -g

  • GroupUser

    Allows grouping of usernames. Combined with a group name, this can be handy for displaying statistics on a particular group of users without displaying their real usernames.

  • GroupShading

    Allows shading of table rows for groups. Value can be yes or no, with the default being yes.

    GroupHighlight Allows bolding of table rows for groups. Value can be yes or no, with the default being yes.

  • GroupRobots

    If set to yes, will instruct Stone Steps Webalizer to group automated user agents (robots) in the Top Agents report. Each group will be assigned a CSS class robot to distinguish them from non-robot user agents. Use the Robot configuration parameter to identify robots.

    Default value: no

Ignore/Include Object Keywords

These keywords allow you to completely ignore log records when generating statistics, or to force their inclusion regardless of ignore criteria. Records can be ignored or included based on host, URL, user agent, referrer and username. Be aware that by choosing to ignore records, the accuracy of the generated statistics become skewed, making it impossible to produce an accurate representation of load on the web server. These keywords behave identical to the Hide* keywords above, where the value can have a leading or trailing wildcard *. These keywords, like the Hide* ones, have an absolute limit of 80 characters for their values. These keywords do not have any command line switch counterparts, so they may only be specified in a configuration file. It should also be pointed out that using the Ignore/Include combination to selectively exclude an entire site while including a particular chunk is extremely inefficient, and should be avoided. Try grep'ing the records into a separate file and process it instead.

  • IgnoreHost

    This allows specified hosts to be completely ignored from the generated statistics.

  • IgnoreURL

    This allows specified URL's to be completely ignored from the generated statistics. One use for this keyword would be to ignore all hits to a temporary directory where development work is being done, but is not accessible to the outside world.

    Unlike other ignore keywords, IgnoreURL can take optional search argument names and values. Multiple IgnoreURL entries can be used to specify different search arguments for the same URL. If multiple IgnoreURL values are used, they must follow one another in the configuration file or those that are out of order will be ignored.

    The entire search argument name is matched, not a part of it, so abc will only match abc and not abcd.

    For example, if you would like to ignore index.html with search arguments abc and xyz, you would use these configuration variables:

      IgnoreURL    /index.html*   abc
      IgnoreURL    /index.html*   xyz
    

    This will ignore log records containing these URLs:

      /index.html?abc=123&x=1&y=2
      /index.html?x=1&y=2&xyz=123
    

    , but will process and report these URLs:

      /index.html?abcd=123&x=1&y=2
      /index.html?x=1&y=2
    

    IgnoreURL may also include search argument values, in which case both, name and value must match for a log line to be ignored.

    For example, if the following IgnoreURL entry is used in the configuration file:

      IgnoreURL    /index.html*   page=/test
    

    , then a log line containing this URL will be egnored:

      index.html?abc=123&page=/test&xyz=456
    

    , but a log record containing this URL will be processed and will appear in the report:

      index.html?abc=123&page=/catalog&xyz=456
    

    Note that search argument filtering is done before the ignore logic is applied, so if you filtered out the argument that is used in one of the IgnoreURL entries, log records containing excluded search arguments will not be ignored.

    See URL Normalization for additional details on what URL characters should be used in ignore patterns.

    IgnoreReferrer This allows records to be ignored based on the referrer field.

  • IgnoreAgent

    This allows specified User Agent records to be completely ignored from the statistics. Maybe useful if you really don't want to see all those hits from MSIE :)

  • IgnoreUser

    This allows specified username records to be completely ignored from the statistics. Usernames can only be used if you use http authentication on your server.

  • IncludeHost

    Force the record to be processed based on hostname. This takes precedence over the Ignore* keywords.

  • IncludeURL

    Force the record to be processed based on URL. This takes precedence over the Ignore* keywords.

    IncludeReferrer Force the record to be processed based on referrer. This takes precedence over the Ignore* keywords.

  • IncludeAgent

    Force the record to be processed based on user agent. This takes precedence over the Ignore* keywords.

  • IncludeUser

    Force the record to be processed based on username. Usernames are only available if you use http based authentication on your server. This takes precedence over the Ignore* keywords.

  • IgnoreRobots

    If set to yes, forces all records submitted by a robot user agent to be completely ignored. Use the Robot configuration parameter to identify robots.

    Default value: no

Dump Object Keywords

The Dump* Keywords allow text files to be generated that can then be used for import into most database, spreadsheet and other external programs. The file is a standard tab delimited text file, meaning that each column is separated by a tab (0x09) character. A header record may be included if required, using the DumpHeader keyword. Since these files contain all records that have been processed, including normally hidden records, an alternate location for the files can be specified using the DumpPath keyword, otherwise they will be located in the default output directory.

  • DumpPath

    Specifies an alternate location for the dump files. The default output location will be used otherwise. The value is the path portion to use, and normally should be an absolute path (ie: has a leading / character), however relative path names can be used as well, and will be relative to the output directory location.

  • DumpExtension

    Allows the dump filename extensions to be specified. The default extension is tab, however may be changed with this option.

  • DumpHeader

    Allows a header record to be written as the first record of the file. Value can be either yes or no, with the default being no.

  • DumpHosts

    Dump tab-delimited hosts file. Value can be either yes or no, with the default being no. The filename used is site_YYYYMM.tab (YYYY=year, MM=month).

  • DumpURLs

    Dump tab delimited url file. Value can be either yes or no, with the default being no. The filename used is url_YYYYMM.tab (YYYY=year, MM=month).

  • DumpReferrers

    Dump tab delimited referrer file. Value can be either yes or no, with the default being no. Filename used is ref_YYYYMM.tab (YYYY=year, MM=month). Referrer information is only available if present in the log file (ie: combined web server log).

  • DumpAgents

    Dump tab delmited user agent file. Value can be either yes or no, with the default being no. Filename used is agent_YYYYMM.tab (YYYY=year, MM=month). User agent information is only available if present in the log file (ie: combined web server log).

  • DumpUsers

    Dump tab delimited username file. Value can be either yes or no, with the default being no. FIlename used is user_YYYYMM.tab (YYYY=year, MM=month). The username data is only avilable if http authentication is used on the web server and that information is present in the log.

  • DumpSearchStr

    Dump tab delimited search string file. Value can be either yes or no, with the default being no. Filename used is search_YYYYMM.tab (YYYY=year, MM=month). the search string data is only available if referrer information is present in the log being processed and recognized search engines were found and processed.

  • DumpDownloads

    If this configuration parameter is set to yes, Stone Steps Webalizer will generate a tab-delimited file listing all downloads for the current month.

    Default value: no

  • DumpErrors

    If this configuration parameter is set to yes, Stone Steps Webalizer will generate a tab-delimited file listing all HTTP errors for the current month.

    Default value: no

  • DumpCountries

    Generate a tab-delimited data file for all countries. The file name will be country_YYYYMM.tab.

    Default value: no

  • DumpCities

    Generate a tab-delimited data file for all cities. The file name will be city_YYYYMM.tab.

    Default value: no

  • DumpASN

    Generates a tab-delimited data file for all ASN entries. The file will be named asn_YYYYMM.tab.

    Default value: no

HTML Generation Keywords

These keywords allow you to customize the HTML code that The Webalizer produces, such as adding a corporate logo or links to other web pages. You can specify as many of these keywords as you like, and they will be used in the order that they are found in the file. Values cannot exceed 80 characters in length, so you may have to break long lines up into two or more lines. There are no command line counterparts to these keywords.

  • HTMLExtension

    Allows generated pages to use something other than the default html extension for the filenames. Do not include the leading period (.) when you specify the extension.

    Command line argument: -x

  • HTMLPre

    Allows code to be inserted at the very beginning of the HTML files. Be careful not to include any HTML here, as it is inserted before the <html> tag in the file. Use it for server-side scripting capabilities, such as php3, to insert scripting files and other directives.

    It is an error to have the <!DOCTYPE> tag in any HTMLPre entries.

  • HTMLHead

    Allows you to insert HTML code between the <head></head> block. There is no default. Useful for adding scripts to the HTML page, such as JavaScript or php3, or even just for adding a few META tags to the document.

  • HTMLBody

    This keyword defines HTML code to be placed immediately after the start <body> tag of the report, just before the title and "summary period/generated on" lines. Keep in mind the placement of this code in relation to the title and other aspects of the web page. A typical use is to add a corporate logo (graphic) in the top right.

    It is an error to have the <body> tag in any HTMLBody entries.

  • HTMLPost

    This keyword defines HTML code that is placed after the title and "summary period/generated on" lines, just before the initial horizontal rule <hr> tag. Normally this keyword isn't needed, but is provided in case you included a large graphic or some other weird formatting tag in the HTMLHead section that needs to be cleaned up or terminated before the main report section.

  • HTMLTail

    This keyword defines HTML code that is placed at the bottom of the report. Normally this keyword is used to provide a link back to your home page or insert a small graphic at the bottom right of the page.

  • HTMLEnd

    This allows insertion of closing code for HTMLBody tags, at the very end of the page.

    It is an error to have the closing </body> tag in any HTMLEnd entries.

  • HTMLCssPath

    Specifies a URL path to the webalizer.css file, not including the file name. The path must be a URL path, even if it refers to a local file. You can reference one CSS file in many reports to make it easier to change report layout in one place.

    Default value: none

  • HTMLJsPath

    Specifies a URL path to the webalizer.js file, not including the file name. The path must be a URL path, even if it refers to a local file. You can reference one JavaScript file in many reports to make it easier to change report layout in one place.

    Default value: none

  • HTMLMetaNoIndex

    Controls whether Stone Steps Webalizer will generate HTML reports that may be indexed by robots or not.

    Default value: yes

  • HTMLExtensionLang

    Configures Stone Steps Webalizer to append the current language code to the generated HTML and image files, so Apache language extensions can be used to browse language- specific reports.

    For example, if the current language is Japanese, index.html will be named index.html.ja.

    Default value: no

Notes on Web Log Files

Stone Steps Webalizer supports W3C, IIS, Nginx, Apache, CLF and Squid log formats.

Avoid processing the same log files more than once because in every subsequent run Stone Steps Webalizer will use the latest processed log time stamp to skip log lines that have already been processed in previous runs and, considering that modern log files will most likely contain multiple log lines with the same time stamp value, some of those log lines had been processed before and some have not yet been processed, but will be discarded because of the matching time stamp values anyway.

When configuring your web server to rotate log files, keep in mind that as soon as a log line with the time stamp from the next month is processed, the current month will be ended and reports will be generated. If you would like to end the current month after processing the log file you know to be the last one, use --end-month switch to do so and then --prepare-report to generate the monthly report from the rolled over state database. If you choose to use this workflow and intend to process multiple log files one after another, you will achieve better performance using the --batch switch, which prevents Stone Steps Webalizer from generating intermediate monthly reports after each log file.

IIS and W3C

W3C Extended Log File Format (W3C) defines special directives describing the physical structure of the log file. Stone Steps Webalizer recognizes #Fields directives and dynamically reconfigures its parser to process log file entries following this directive in the matching order.

IIS log format mostly follows the W3C standard, with one excepion - it outputs request processing time (time-taken) in milliseconds instead of seconds.

Apache

Apache logs may be customized using LogFormat and CustomLog directives (these are Apache configuration keywords, not those used by Stone Steps Webalizer). Stone Steps Webalizer can parse the CustomLog directive, if it's specified anywhere in the configuration using the ApacheLogFormat configuration parameter.

For example (the line is broken for display purposes; it would actually appear as a single line in the configuration file):

ApacheLogFormat
      %a %l \"%u\" %t %m "%U" \"%q\" %p %>s %b %D
      \"%{Referer}i\" \"%{User-Agent}i\"

In the preceding example the user name field (%u) is enclosed in quotes because user names may contain spaces. The URL stem field (%U) is quoted as well because Apache logs URL file paths decoded and URLs may contain spaces. The query string field (%q) is quoted because it may be reported as an empty string. Numeric fields, on the other hand, such as request processing time (%D), do not need to be quoted.

It is important to understand that Apache log files do not contain log format information (unlike log files in W3C extended format) and switching log file format without renaming the current log file will result in a log file that contains log information in mixed formats. Such log files cannot be analyzed unless they are split onto multiple consistently-formatted log files.

If log formats specified in httpd.conf and ssl.conf for any shared log file are not the same, the resulting log file will contain log information in mixed formats and cannot be analyzed. We also recommend that you use the %p field (port number), as shown in the example above, to make it possible to distinguish HTTP and HTTPS requests.

Common Log Format (CLF)

The Webalizer supports CLF log formats, which should work for just about everyone. If you want User Agent or Referrer information, you need to make sure your web server supplies this information in it's log file, and in a format that the Webalizer can understand. While The Webalizer will try to handle many of the subtle variations in log formats, some will not work at all. Most web servers output CLF format logs by default.

For Apache, in order to produce the proper log format, add the following to the httpd.conf file:

LogFormat "%h %l %u %t \"%r\" %s %b \"%{Referer}i\" \"%{User-agent}i\""

This instructs the Apache web server to produce a combined log that includes the referrer and user agent information on the end of each record, enclosed in quotes (This is the standard recommended by both Apache and NCSA).

Nginx

Nginx logs can be customized using log_format directive. Stone Steps Webalizer will recognize a limited set of Nginx variables that can be used in log_format via the NginxLogFormat configuration variable, which enables the nginx log type in -F and LogType.

Note that log_format cannot be used verbatim in NginxLogFormat, which expects only variable names listed on a single line, without the single quotes that allow log_format span multiple lines. For example, the following log_format configuration:

log_format combined '$remote_addr - $remote_user [$time_local] '
                  '"$request" $status $body_bytes_sent '
                  '"$http_referer" "$http_user_agent"';

, would appear as follows in NginxLogFormat, separated by a single space character:

NginxLogFormat  $remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent"

The - represents the ident field in a CLF log format, which has no matching variable in Nginx, so a dash is used to represent a non-existent field value. Such field will be treated as an unknown field and will be ignored.

Following Nginx variables are recognized in NginxLogFormat.

  • time_iso8601, time_local
  • remote_addr
  • remote_user
  • server_port
  • request_method
  • request
  • request_uri, uri, args, query_string
  • request_time
  • status
  • request_length, bytes_received, bytes_sent
  • http_user_agent
  • http_referer

For information on log_format and what these variables represent, see Nginx documentation:

http://nginx.org/en/docs/http/ngx_http_log_module.html#log_format

http://nginx.org/en/docs/varindex.html

Referrers

Referrers are weird critters... They take many shapes and forms, which makes it much harder to analyze than a typical URL, which at least has some standardization. What is contained in the referrer field of your log files varies depending on many factors, such as what site did the referral, what type of system it comes from and how the actual referral was generated. Why is this? Well, because a user can get to your site in many ways... They may have your site bookmarked in their browser, they may simply type your sites URL field in their browser, they could have clicked on a link on some remote web page or they may have found your site from one of the many search engines and site indexes found on the web. The Webalizer attempts to deal with all this variation in an intelligent way by doing certain things to the referrer string which makes it easier to analyze. Of course, if your web server doesn't provide referrer information, you probably don't really care and are asking yourself why you are reading this section...

Most referrer's will take the form of http://somesite.com/somepage.html, which is what you will get if the user clicks on a link somewhere on the web in order to get to your site. Some will be a variation of this, and look something like file:/some/such/sillyname, which is a reference from a HTML document on the users local machine. Several variations of this can be used, depending on what type of system the user has, if he/she is on a local network, the type of network, etc... To complicate things even more, dynamic HTML documents and HTML documents that are generated by CGI scripts or external programs produce lots of extra information which is tacked on to the end of the referrer string in an almost infinite number of ways. If the user just typed your URL into their browser or clicked on a bookmark, there won't be any information in the referrer field and will take the form -.

In order to handle all these variations, The Webalizer parses the referrer field in a certain way. First, if the referrer string begins with http, it assumes it is a normal referral and converts the http:// and following hostname to lowercase in order to simplify hiding if desired.

For example, the referrer

HTTP://WWW.MyHost.Com/This/Is/A/HTML/Document.html

will become

http://www.myhost.com/This/Is/A/HTML/Document.html

Notice that only the http:// and hostname are converted to lower case. The rest of the referrer field is left alone. This follows standard convention, as the actual method (HTTP) and hostname are always case insensitive, while the document name portion is case sensitive.

Referrers that came from search engines, dynamic HTML documents, CGI scripts and other external programs usually tack on additional information that it used to create the page. A common example of this can be found in referrals that come from search engines and site indexes common on the web. Sometimes, these referrers URL's can be several hundred characters long and include all the information that the user typed in to search for your site. The Webalizer deals with this type of referrer by stripping off all the query information, which starts with a question mark ?. The Referrer http://search.yahoo.com/search?p=usa%26global%26link will be converted to just http://search.yahoo.com/search.

When a user comes to your site by using one of their bookmarks or by typing in your URL directly into their browser, the referrer field is blank, and looks like -. Most sites will get more of these referrals than any other type. The Webalizer converts this type of referral into the string - (Direct Request). This is done in order to make it easier to hide via a command line option or configuration file option. This is because the character - is a valid character elsewhere in a referrer field, and if not turned into something unique, could not be hidden without possibly hiding other referrers that shouldn't be.

Stone Steps Webalizer supports a configuration parameter, SpamReferrer, which lists referrer patterns considered as spam. Visitors submitting these requests will be red-flagged and marked in the hosts report as spammers.

Multiple SpamReferrer entries may be used to specify more than one pattern. For example, the first two entries below will red-flag all requests with the referrer URL containing words gambling or poker anywhere in the referrer URL. The third entry will match only if the referrer URL begins with the string of characters preceding the asterisk.

SpamReferrer    gambling
SpamReferrer    casino
SpamReferrer    http://www.instantlinkexchange.com*

Once a visitor is identified as a spammer, all requests from this IP address will be treated as spam for the rest of the currently- reported month. Spam requests will be counted as usual in all reports, except the referrer report, to prevent spam referrers from appearing in the report as clickable links. Spamming hosts will also highlighted in red color in the hosts report.

If you would like to change the color of the highlighting, locate the following line in webalizer.css and change the color to any other value:

td.spammer, span.spammer {color: red;}

In addition to highlighting, the all-hosts and the tab-separated host reports will have an asterisk output next to the spammer's host.

URL Normalization

In general, URLs are supposed to be uniformly encoded in such a way that keeps them simple, but still usable even if they are printed on paper, pronounced on the radio, or appear in other contexts where it may be impossible to distinguish characters from different languages. This encoding is described by the internet standard RFC-3986 and defines which characters may appear in their natural representation and which should be percent-encoded as one or more sequences of a percent character followed by two hexadecimal digits that represent that character (e.g. & is encoded as %26).

Sometimes URL characters may be encoded incorrectly, which may be because of various historical reasons, or because of bugs in user agents, or in an attempt to avoid simple spam filters that do not percent-decode URL sequences before looking for spam keywords. In either case, having the same URL encoded differently fragments reports by creating aliases (e.g. /xAy/ and /x%41y/ are counted as two URLs) and makes spam detection more difficult because all possible aliases must be filtered individually. In order to deal with these issues, Stone Steps Webalizer normalizes all URLs extracted from log files to reduce aliasing and improve report readability.

IMPORTANT: A normalized URL is not a well-formed URL in the sense that it should not be used verbatim in HTML as a copy-and-paste href link, but it could be used in the URL field in a web browser because URL normalization does not change the meaning of existing URL components, but merely makes them more readable.

For example, a normalized URL /?q="abc" is more readable than the equivalent well-formed URL /?q=%22abc%22, but if it is copy-pasted into an href attribute in HTML without proper HTML encoding, it will break that HTML because the double quotes in the URL will interfere with double quotes in the HTML attribute.

URL normalization is performed before any other work is done against all URLs, which follows the rules described below, so all configuration filters should use normalized characters in all ignore, hide and group URL patterns.

  • Following characters are not encoded and if they are found in the encoded form, such as %41 instead of the character A, they will be decoded to their original form shown below. If any of these characters is percent-encoded, it will be decoded.

    A-Z, a-z, 0-9, - . _ ~ " % < > space

  • Following characters have special meaning within URLs and will not be encoded or decoded and will remain in their current form, whatever it is (i.e. a&b and a%26b will remain exactly as they were before normalization).

    : / ? # [ ] @ ! $ & ` ( ) * + , ; = %

  • Percent-encoded control characters will not be decoded and unencoded control characters will be percent-encoded.

  • Percent-encoded multibyte UTF-8 characters will be decoded to their UTF-8 form.

  • Percent-encoded non-UTF-8 characters are not allowed in URLs and will be decoded as if they were characters from the Latin-1 (Western) alphabet and converted to UTF-8.

  • A percent character that is not a part of a percent-encoding sequence in a URL will be percent-encoded (e.g. /a%b/ will become /a%25b/).

If you intend to filter URLs with spaces, make sure to use EnablePhraseValues, so the space in the pattern wouldn't be misinterpreted as a pattern/value separator. In other words, if EnablePhraseValues is not enabled, the following pattern with a space will be interpreted as as a URL "*/ab" with a search argument "cd/", as in "/ab/?cd/".

IgnoreURL  */ab cd/

Note that once some URL path pattern is found, only search arguments of the matching path pattern will be checked, but no further. This may produce unexpected results if broader URL path patterns (e.g. *) are placed first in the list of IgnoreURL filters. Consider these filters:

IgnoreURL   /abc/*
IgnoreURL   /xyz/*   pageid=1
IgnoreURL   /xyz/*   pageid=2
IgnoreURL   /xyz/*   pageid=3
IgnoreURL   /def/*

Let's say that the current URL is /xyz/?x=1&pageid=2&y=2. First, the URL path, which is /xyz/, is checked against /abc/*, which will not match. Then /xyz/ is checked against the first occurrence of /xyz/*, which will match and will trigger a special search mode that will stop searching URL path patterns after either all /xyz/* patterns are checked or a matching search argument is found. In this example, it will be the filter with pageid=2. Because the URL path matched, the next URL path pattern, def/*, will not even be checked. This is done for performance reasons, so a long list of ignore filters wouldn't slow down log processing too much.

Now, consider a slightly different list of filters:

IgnoreURL   /abc/*
IgnoreURL   /*   pageid=1
IgnoreURL   /*   pageid=2
IgnoreURL   /*   pageid=3
IgnoreURL   /def/*

The pattern /* will match any URL, so if a URL in the log line is /def/p.html, the pattern /* will match this URL path, but the actual URL doesn't have any search arguments, so none of the three filters with pageid= will match. However, because /* matches any URL, the /def/* pattern will not ever be checked.

One way to work this around is to have those broader filters at the very end, so all other patterns are matched first, but this would be an error-prone approach if more than one catch-all patterns is needed.

Search String Analysis

The Webalizer will do a minimal analysis on referrer strings that it finds, looking for well known search string patterns. Most of the major search engines are supported, such as Yahoo!, Altavista, Lycos, etc... Unfortunately, search engines are always changing their internal/CGI query formats, new search engines are coming on line every day, and the ability to detect all search strings is nearly impossible. However, it should be accurate enough to give a good indication of what users were searching for when they stumbled across your site. Note: as of version 1.31, search engines can now be specified within a configuration file. See the sample.conf file for examples of how to specify additional search engines.

Visits/Entry/Exit Figures

The majority of data analyzed and reported on by The Webalizer is as accurate and correct as possible based on the input log file. However, due to the limitation of the HTTP protocol, the use of firewalls, proxy servers, multi-user systems, the rotation of your log files, and a myriad of other conditions, some of these numbers cannot, without absolute accuracy, be calculated. In particular, Visits, Entry Pages and Exit Pages are suspect to random errors due to the above and other conditions. The reason for this is twofold,

  1. Log files are finite in size and time interval, and
  2. There is no way to distinguish multiple individual users apart given only an IP address.

Because log files are finite, they have a beginning and ending, which can be represented as a fixed time period. There is no way of knowing what happened previous to this time period, nor is it possible to predict future events based on it. Also, because it is impossible to distinguish individual users apart, multiple users that have the same IP address all appear to be a single user, and are treated as such. This is most common where corporate users sit behind a proxy/firewall to the outside world, and all requests appear to come from the same location (the address of the proxy/firewall itself). Dynamic IP assignment (used with dial-up internet accounts) also present a problem, since the same user will appear as to come from multiple places.

For example, suppose two users visit your server from XYZ company, which has their network connected to the Internet by a proxy server fw.xyz.com. All requests from the network look as though they originated from fw.xyz.com, even though they were really initiated from two separate users on different PC`s. The Webalizer would see these requests as from the same location, and would record only 1 visit, when in reality, there were two. Because entry and exit pages are calculated in conjunction with visits, this situation would also only record 1 entry and 1 exit page, when in reality, there should be 2.

As another example, say a single user at XYZ company is surfing around your website.. They arrive at 11:52pm the last day of the month, and continue surfing until 12:30am, which is now a new day (in a new month). Since a common practice is to rotate (save then clear) the server logs at the end of the month, you now have the users visit logged in two different files (current and previous months). Because of this (and the fact that the Webalizer clears history between months), the first page the user requests after midnight will be counted as an entry page. This is unavoidable, since it is the first request seen by that particular IP address in the new month.

For the most part, the numbers shown for visits, entry and exit pages are pretty good "guesses", even though they may not be 100% accurate. They do provide a good indication of overall trends, and shouldn't be that far off from the real numbers to count much. You should probably consider them as the minimum amount possible, since the actual (real) values should always be equal or greater in all cases.

Exporting Webalizer Data

The Webalizer now has the ability to dump all object tables to tab delimited ASCII text files, which can then be imported into most popular database and spreadsheet programs. The files are not normally produced, as on some sites they could become quite large, and are only enabled by the use of the Dump* configuration keywords. The filename extensions default to .tab however may be changed using the DumpExtension keyword. Since this data contains all items, even those normally hidden, it may not be desirable to have them located in the output directory where they may be visible to normal web users.. For this reason, the DumpPath configuration keyword is available, and allows the placement of these files somewhere outside the normal web server document tree. An optional header record may be written to these files as well, and is useful when the data is to be imported into a spreadsheet.. databases will not normally need the header. If enabled, the header is simply the column names as the first record of the file, tab separated.

Language Support

Stone Steps Webalizer supports dynamic languages loaded at run time. If the language file is found, its content will be used to produce reports and progress messages. A new configuration variable, LanguageFile, can be used to specify the location of the file. For example

LanguageFile    c:\tools\webalizer\lang\webalizer_lang.german

A typical language file contains series of name/value pairs. The name identifies a text variable used by Stone Steps Webalizer and the value provides language-specific text. For example, the English version of the error message reported if a log file cannot be opened is defined as follows:

msg_log_err = Error: Can't open log file

Some language file entries, such as the list of months shown below, may contain multiple elements. In this case, individual elements must be separated by commas:

s_month = Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec

The whitespace between the end of each element and the comma is preserved and may be used for padding purposes. The whitespace following the comma is stripped off, unless the element is enclosed in double quotes.

If an individual element of a comma-separated list contains a comma, as shown in the example below, this element must be enclosed in double quotes:

fr,       France,
fx,       "France, Metropolitan",
ga,       Gabon,

The file webalizer_lang.english contains additional information about the structure of a language file.

Language files must be saved in the UTF-8 character encoding.

All existing language files have been converted to UTF-8. If you would like to convert some other character encoding to UTF-8, you can use the iconv utility.

For example, the following command converts a Japanese language file from euc-jp to utf-8:

$ iconv -f euc-jp -t utf-8 -o webalizer_lang.utf-8.japanese webalizer_lang.japanese

Stone Steps Webalizer may be configured to generate usage graphs using TrueType fonts and UTF-8 character sets. In order to configure Stone Steps Webalizer to use TrueType fonts, add GraphFontNormal and GraphFontBold directives to the webalizer.conf file. Each of these configuration variables must be a fully-qualified path to the selected TrueType font file(s).

For example, the following two lines configure Stone Steps Webalizer to use Lucida Console for all graph legends and axis markers and Tahoma Bold for all graph titles:

GraphFontNormal     c:\winnt\fonts\lucon.ttf
GraphFontBold       c:\winnt\fonts\tahomabd.ttf

If GraphFontNormal and GraphFontBold are not specified, or if the associated font files cannot be found, Stone Steps Webalizer will use the default raster fonts to generate text for the graphs. Note that raster fonts may not have suitable character representation for non-Latin characters.

You can control the appearance of the generated text using three configuration variables shown below. The first two variables define the size of the normal and bold fonts (in points). The third one instructs Stone Steps Webalizer whether to smooth font edges or not.

GraphFontSmall      8
GraphFontMedium     9.5
GraphFontSmoothing  y

If you would like to use non-Latin UTF-8 characters in your language files, make sure that the TrueType font you selected contains the characters you need.

For example, Lucida Console shipped with the English version of Windows does not have Japanese characters and if used to generate graphs will result in unusable graphs.

Robots

Robots are identified before user agents are mangled. Some robot related features, such as highlighting robots in the Top Agents report, may be disabled if agent mangling is active.

Log records matching IgnoreRobot entries, are completely ignored and none of the robot-related entries are updated in this case.

Hosts are marked as robots when user agent matches one of the Robot entries and only when a host is seen for the first time (i.e. when a database host entry is created). If a human and a robot share the same IP address, this address will be marked as robot or non-robot depending on which user agent was active when the first hit was logged by the web server.

Active visits are marked as robot visits when user agents matches one of the Robot entries, regardless whether the corresponding hosts are marked as robots or not. Visit robot flag is used when user agents are classified as robots or not and when website and country totals are updated. Country totals do not include robot activity.

Known Issues

  • Country Totals

    Stone Steps Webalizer computes country totals at when ending visits. Consequently, in the incremental mode active visit data is not included into country totals until the last log file for the month is processed. The net effect of this is that the pie chart of all intermediate reports will show the Others slice bigger than it really is, because visit totals used as a 100% when computing pie slices are those of started visits. All active visits are terminated at the end of the month, so that the final pie chart accurately depicts the percentage of other countries.

  • Memory Usage

    The Webalizer makes liberal use of memory for internal data structures during analysis. Lack of real physical memory will noticeably degrade performance by doing lots of swapping between memory and disk. One user who had a rather large log file noticed that The Webalizer took over 7 hours to run with only 16 Meg of memory. Once memory was increased, the time was reduced to a few minutes.

  • Performance

    The Hide*, Group*, Ignore*, Include* and IndexAlias configuration options can cause a performance decrease if lots of them are used. The reason for this is that every log record must be scanned for each item in each list.

    For example, if you are Hiding 20 objects, Grouping 20 more, and Ignoring 5, each record is scanned, at most, 46 times (20+20+5+ an IndexAlias scan).

    On really large log files, this can have a profound impact. It is recommended that you use the least amount of these configuration options that you can, as it will greatly improve performance.

Final Notes

A lot of time and effort went into making The Webalizer, and to ensure that the results are as accurate as possible. If you find any abnormalities or inconsistent results, bugs, errors, omissions or anything else that doesn't look right, please let me know so I can investigate the problem or correct the error. This goes for the minimal documentation as well. Suggestions for future versions are also welcome and appreciated.

Visit Stone Steps Webalizer project on GitHub if you would like to log a bug:

https://github.com/StoneStepsInc/StoneStepsWebalizer