Update docs
tomachalek committed Dec 10, 2024
1 parent ae3021b commit 877b914
Showing 3 changed files with 195 additions and 78 deletions.
143 changes: 67 additions & 76 deletions README.md
# Klogproc

![build status](https://travis-ci.org/czcorpus/klogproc.svg?branch=master)

*Klogproc* is a service for processing and archiving logs generated by applications
run by the Institute of the Czech National Corpus (CNC).

In general, *Klogproc* continuously reads an application-specific log record format from a file,
parses individual lines and converts them into a target format which is then stored in an
ElasticSearch database.

In the CNC, *Klogproc* replaced LogStash as a less resource-hungry alternative. All the processing
(reading, writing, handling multiple files) is performed concurrently, which makes it quite fast.

## Overview

### Supported applications

| Name       | config code | versions | scripting | note                       |
|------------|-------------|----------|-----------|----------------------------|
| Akalex     | akalex      | :x:      | :x:       | a Shiny app with a custom log (:asterisk:) |
| APIGuard   | apiguard    | :x:      | :x:       | CNC's internal API proxy and watchdog |
| Calc       | calc        | :x:      | :x:       | a Shiny app with a custom log (:asterisk:) |
| CNC-VLO    | vlo         | :x:      | :x:       | a custom CNC node for the [Clarin VLO](https://vlo.clarin.eu/) (JSONL log) |
| Gramatikat | gramatikat  | :x:      | :x:       | a Shiny app with a custom log (:asterisk:) |
| KonText    | kontext     | `0.13`, `0.14`, `0.15`, `0.16`, `0.17`, `0.18` | :white_check_mark: | |
| KorpusDB   | korpus-db   | :x:      | :x:       | |
| Kwords     | kwords      | `1`, `2` | :white_check_mark: | |
| Lists      | lists       | :x:      | :x:       | a Shiny app with a custom log (:asterisk:) |
| Mapka      | mapka       | `1`, `2`, `3` | :white_check_mark: (v3) | using Nginx/Apache access log |
| Morfio     | morfio      | :x:      | :x:       | |
| MQuery-SRU | mquery-sru  | :x:      | :x:       | a [Clarin FCS](https://www.clarin.eu/content/federated-content-search-clarin-fcs-technical-details) endpoint (JSONL log) |
| QuitaUP    | quita-up    | :x:      | :x:       | a Shiny app with a custom log (:asterisk:) |
| SkE        | ske         | :x:      | :x:       | using Nginx/Apache access log |
| SyD        | syd         | :x:      | :x:       | a custom app log |
| Treq       | treq        | current, `v1-api` | :white_check_mark: | a custom app log |
| WaG        | wag         | `0.6`, `0.7` | :white_check_mark: | web access log, currently without user credentials |

(:asterisk:) All the Shiny apps use the same log format.

The program can work in two modes: `batch` and `tail`.

### Batch - ad-hoc processing of a directory or a file

For non-regular imports, e.g. when migrating older data or when debugging log processing routines,
`batch` mode allows importing of multiple files from a single directory. The contents of the directory
can even be changed over time by adding **newer** log records, and *klogproc* will
be able to import only the new items as it keeps a worklog with the newest record
currently processed.

### Tail - listening for changes in multiple files

This is the mode which replaces CNC's LogStash solution and it is a typical
mode of use. One or more log file listeners can be configured to read newly
added lines. The log files are checked at regular intervals (i.e. a change is
not detected immediately). Klogproc remembers the current inode and seek position
for each watched file, so it should be able to continue after outages etc. (as long as
the log files are not overwritten in the meantime due to log rotation).


## Installation

[Install](https://golang.org/doc/install) the *Go* language if it is not already
available on your system.

Clone the `klogproc` project:

`git clone https://github.com/czcorpus/klogproc.git`

Build the project:

`make`

Copy the binary somewhere:

`sudo cp klogproc /opt/klogproc/bin`

Create a config file (e.g. in `/opt/klogproc/etc/klogproc.json`):

```json
{
  "logging": {
    "path": "/opt/klogproc/var/log/klogproc.log"
  },
  "logTail": {
    "intervalSecs": 15,
    "worklogDir": "/opt/klogproc/var/worklog-tail",
    "files": [
      {"path": "/var/log/ucnk/syd.log", "appType": "syd"},
      {"path": "/var/log/treq/treq.log", "appType": "treq"},
      ...
    "scrollTtl": "3m",
    "reqTimeoutSecs": 10
  },
  "geoIPDbPath": "/opt/klogproc/var/GeoLite2-City.mmdb",
  "anonymousUsers": [0, 1, 2]
}
```

Notes:

- Do not forget to create the directories for logging and the worklog, and also to
  download and save the GeoLite2-City database.
- The applied `tzShift` for the *kwords* app is just an example; it should be applied iff the stored
  datetime values provide an incorrect time zone (e.g. if a value looks like UTC time but actually
  represents local time) - see the section Time-zone notes for more info.

Configure systemd (`/etc/systemd/system/klogproc.service`):

```ini
[Unit]
# ...
After=network.target

[Service]
Type=simple
ExecStart=/opt/klogproc/bin/klogproc tail /opt/klogproc/etc/klogproc.json
User=klogproc
Group=klogproc

# ...
```

...

For the tail action, the config is as follows:

```json
{
  "logTail": {
    "intervalSecs": 5,
    "worklogDir": "/path/to/tail-worklog",
    "numErrorsAlarm": 0,
    "errCountTimeRangeSecs": 15,
    "files": [
      ...
```

For the batch mode, the config is like this:

```json
{
  "logFiles": {
    "appType": "korpus-db",
    "worklogDir": "/path/to/batch-worklog",
    "srcPath": "/path/to/log/files/dir",
    "tzShift": 120,
    "partiallyMatchingFiles": false
  }
}
```

Note: setting `partiallyMatchingFiles` to `true` will allow processing of files which are partially
older than the requested minimum datetime (but still, only the matching records will be accepted).

## ElasticSearch compatibility notes

Because ElasticSearch underwent some backward-incompatible changes between versions `5` and `6`,
the configuration contains the `majorVersion` key which specifies how Klogproc stores the data.

### ElasticSearch 5

This version supports multiple data types ("mappings") per index, which was also
the default approach to storing CNC application logs - a single index with multiple document
types (one per application). In this case, the configuration directive `elasticSearch.index`
directly specifies the index name Klogproc works with. Individual document types
can be distinguished either via the ES-internal `_type` property or via the normal `type` property
which is created by Klogproc.

### ElasticSearch 6

In ES6, multiple data mappings per index have been removed. Klogproc in this case
uses its `elasticSearch.index` key as a **prefix for the index name** created for an individual
application. E.g. the index `log_archive` with configured `treq` and `morfio` apps expects
you to have two indices: `log_archive_treq` and `log_archive_morfio`. Please note
that Klogproc does not create the indices for you. The property `type` is still present in documents
for backward compatibility.

## Customizing log processing with Lua scripts

See the [docs/scripting.md](docs/scripting.md) page.
126 changes: 126 additions & 0 deletions docs/scripting.md
# Scripting Klogproc with Lua

For some applications, Klogproc allows customizing their log processing without
recompiling Klogproc itself. The key principle is that a user defines two functions
which are then applied repeatedly to each processed log record of the specified type.
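
A minimal script therefore defines both functions. The following sketch simply delegates
to the default behaviour (both functions and the helper `transform_default` are described
in the sections below):

```lua
-- a minimal sketch of a customization script for some application;
-- both functions are described in detail in the sections below
function preprocess(input_rec, buffer)
    -- no extra preprocessing: pass the record through unchanged
    return {input_rec}
end

function transform(input_rec)
    -- start from the default conversion (here with no time-zone shift)
    return transform_default(input_rec, 0)
end
```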

## Input record


The input record provides access to its attributes in the same way as in the Go language - i.e. the attributes use camel case and start with an uppercase letter.

As each application may provide different date and time encodings, the input record has a `GetTime` method that returns the RFC3339-encoded date and time.
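
For instance, a script for an application whose input record exposes a `Path` attribute
(the concrete attribute set differs per application) may read it and the record's time
like this (a sketch; `transform_default` is described in the next section and the colon
method-call syntax for `GetTime` is an assumption):

```lua
-- a sketch: `Path` is only an example attribute - the available attributes
-- depend on the concrete application's input record type
function transform(input_rec)
    print(input_rec.Path)       -- Go-style, uppercase attribute access
    print(input_rec:GetTime())  -- RFC3339-encoded date and time (colon call syntax assumed)
    return transform_default(input_rec, 0)
end
```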


## Output record

The output record represents a normalized record shared by all logged applications.
Klogproc typically provides a default way of converting the application's own input log
into the output format. If a Lua script is configured for the application, Klogproc will
only call the Lua-defined transformation function, which means that the default conversion
is omitted. For use cases where the default conversion is still required and the purpose
of the Lua script is just to customize it, it can be called explicitly:

```lua
local out = transform_default(input_rec, tz_offset_min)
-- modify the output
-- ...
```

For cases where a new empty record is needed for the script, just use:

```lua
local out = new_out_record()
-- set output properties
-- ...
```

To set a property in the output record, Klogproc requires using the `set_out_prop` function:

```lua
set_out_prop(out, name, value)
```

In case the attribute cannot be set (typically because it does not exist),
the script ends with an error.

To test whether a record (input or output) has a property:

```lua
if record_prop_exists(any_rec, name, value) then
    -- we're sure the attribute `name` can be set
end
```

Once the output record is set, it is necessary to set an ID that will be used as the database ID. Klogproc prefers deterministic IDs, which allow repeated data import without duplicating data with the same content but different IDs. To obtain a deterministic ID for an output record, use:

```lua
local id = out_rec_deterministic_id(out)
```

The function can be called repeatedly; for the same attribute values (the ID itself is always ignored), it will return the same ID (hash).
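
A sketch of how the pieces may be tied together follows; note that exposing the ID via a
property named `ID` is an assumption here - the actual property name may differ for a
concrete output record type:

```lua
function transform(input_rec)
    local out = transform_default(input_rec, 0)
    -- assumption: the database ID is stored in a property called "ID";
    -- the actual property name may differ per output record type
    local id = out_rec_deterministic_id(out)
    set_out_prop(out, "ID", id)
    return out
end
```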


## Debugging, logging

For printing contents of a value, use:

```lua
dump = require('dump')
print(dump(my_value))
```

For logging:

```lua
log.info("message", map_with_args)
-- other available logging levels are: "warn", "debug", "error"
```
The second argument is optional.
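
A minimal sketch (the keys of the argument map are arbitrary):

```lua
-- a minimal sketch of logging from a script; the map keys are arbitrary
log.info("importing file", {path = "/var/log/ucnk/syd.log"})
log.error("failed to parse line", {line_num = 42})
```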


## Global variables

* `app_type` - an id representing a logged application (`kontext`, `wag`, `korpusdb` etc.),
* `app_version` - a string representing a variant of the application; for some applications, versions are not defined,
* `anonymous_users` - a list of CNC database IDs defined in the Klogproc configuration (see the sketch below)
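
A small sketch using the globals (assuming `anonymous_users` behaves like a regular Lua
array, so `#` returns its length):

```lua
-- a sketch: log the context the script runs in
log.info("lua script loaded", {
    app = app_type,
    version = app_version,
    num_anonymous_users = #anonymous_users
})
```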

## Function preprocess()

The preprocess function is called before an input record is transformed into
an output record. Its purpose is to provide the following options:

1. decide whether to process the input_rec at all
   (just return `{}` to skip the record)
1. for applications where a "query request" is hard to define (e.g. `mapka`),
   it allows generating "virtual" input records that are somehow derived from the real
   ones - e.g. in the `mapka v3` application, we search for activity clusters and
   generate a single record for each detected cluster.

Since all the application transformers written in Go define their own `preprocess`
step, and Klogproc will not call it automatically once a Lua script is defined, it can
be invoked manually from Lua via the function `preprocess_default(input_rec, buffer)`.
In case no preprocessing is required, simply return the original record in a table:

```lua
function preprocess(input_rec, buffer)
    return {input_rec}
end
```
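
And a sketch of the delegating variant (assuming the value returned by
`preprocess_default` can be returned from `preprocess` as is):

```lua
-- a sketch: delegate to the built-in (Go) preprocessing
function preprocess(input_rec, buffer)
    return preprocess_default(input_rec, buffer)
end
```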

### Buffer access

`TODO`

## Function transform()

The `transform` function converts an input record into a normalized form.
As mentioned above, if a Lua script is defined for an application, Klogproc will not automatically call the hardcoded (Go) version of the transform function, so if it is needed, it has to be called explicitly (via `transform_default`):

```lua
-- transform function processes the input record and returns an output record
function transform(input_rec)
    local out = transform_default(input_rec, 0)
    -- now we modify the Path property already set by transform_default
    set_out_prop(
        out, "Path", string.format("%s/modified/value", input_rec.Path))
    return out
end
```
4 changes: 2 additions & 2 deletions load/batch/fileselect.go
```go
// LogFileMatches tests whether the log file specified by filePath matches
// in terms of its first record (whether it is older than the 'minTimestamp').
// If strictMatch is true, then a partially matching file (i.e. its first
// record datetime is older than minTimestamp) is not accepted.
//
// The function expects that the first line of any log file contains a proper
// log record which should be OK (KonText also writes multi-line error dumps
// ...
```
