Update docs
tomachalek committed Dec 10, 2024
1 parent ae3021b commit 877b914
Showing 3 changed files with 195 additions and 78 deletions.
143 changes: 67 additions & 76 deletions README.md
# Klogproc

![build status](https://travis-ci.org/czcorpus/klogproc.svg?branch=master)

*Klogproc* is a service for processing and archiving logs generated by applications
run by the Institute of the Czech National Corpus (CNC).

In general, *Klogproc* continuously reads an application-specific log record format from a file,
parses individual lines and converts them into a target format which is then stored in an
ElasticSearch database.

In the CNC, *Klogproc* replaced LogStash as a less resource-hungry alternative. All the processing
(reading, writing, handling multiple files) is performed concurrently, which makes it quite fast.

## Overview

### Supported applications

| Name       | config code | versions | scripting | note                       |
|------------|-------------|----------|-----------|----------------------------|
| Akalex     | akalex      | :x:      | :x:       | a Shiny app with a custom log (:asterisk:) |
| APIGuard   | apiguard    | :x:      | :x:       | CNC's internal API proxy and watchdog |
| Calc       | calc        | :x:      | :x:       | a Shiny app with a custom log (:asterisk:) |
| CNC-VLO    | vlo         | :x:      | :x:       | a custom CNC node for the [Clarin VLO](https://vlo.clarin.eu/) (JSONL log) |
| Gramatikat | gramatikat  | :x:      | :x:       | a Shiny app with a custom log (:asterisk:) |
| KonText    | kontext     | `0.13`, `0.14`, `0.15`, `0.16`, `0.17`, `0.18` | :white_check_mark: | |
| KorpusDB   | korpus-db   | :x:      | :x:       | |
| Kwords     | kwords      | `1`, `2` | :white_check_mark: | |
| Lists      | lists       | :x:      | :x:       | a Shiny app with a custom log (:asterisk:) |
| Mapka      | mapka       | `1`, `2`, `3` | :white_check_mark: (v3) | using Nginx/Apache access log |
| Morfio     | morfio      | :x:      | :x:       | |
| MQuery-SRU | mquery-sru  | :x:      | :x:       | a [Clarin FCS](https://www.clarin.eu/content/federated-content-search-clarin-fcs-technical-details) endpoint (JSONL log) |
| QuitaUP    | quita-up    | :x:      | :x:       | a Shiny app with a custom log (:asterisk:) |
| SkE        | ske         | :x:      | :x:       | using Nginx/Apache access log |
| SyD        | syd         | :x:      | :x:       | a custom app log |
| Treq       | treq        | current, `v1-api` | :white_check_mark: | a custom app log |
| WaG        | wag         | `0.6`, `0.7` | :white_check_mark: | web access log, currently without user credentials |

(:asterisk:) All the Shiny apps use the same log format.

The program can work in two modes: `batch` and `tail`.

### Batch - ad-hoc processing of a directory or a file

For non-regular imports, e.g. when migrating older data or when debugging log processing routines,
`batch` mode allows importing of multiple files from a single directory. The contents of the directory
can even be changed over time by adding **newer** log records, and *klogproc* will
be able to import only the new items as it keeps a worklog with the newest record
currently processed.

### Tail - listening for changes in multiple files

This is the mode which replaces CNC's LogStash solution and it is a typical
mode of use. One or more log file listeners can be configured to read newly
added lines. The log files are checked at regular intervals (i.e. a change is
not detected immediately). Klogproc remembers the current inode and seek position
for each watched file, so it should be able to continue after outages etc. (as long as
the log files are not overwritten in the meantime due to log rotation).


## Installation

[Install](https://golang.org/doc/install) the *Go* language if it is not already
available on your system.

Clone the `klogproc` project:

`git clone https://github.com/czcorpus/klogproc.git`

Build the project:

`make`

Copy the binary somewhere:

`sudo cp klogproc /opt/klogproc/bin`

Create a config file (e.g. in `/opt/klogproc/etc/klogproc.json`):

```json
{
  "logging": {
    "path": "/opt/klogproc/var/log/klogproc.log"
  },
  "logTail": {
    "intervalSecs": 15,
    "worklogDir": "/opt/klogproc/var/worklog-tail",
    "files": [
      {"path": "/var/log/ucnk/syd.log", "appType": "syd"},
      {"path": "/var/log/treq/treq.log", "appType": "treq"},
      ...
    "scrollTtl": "3m",
    "reqTimeoutSecs": 10
  },
  "geoIPDbPath": "/opt/klogproc/var/GeoLite2-City.mmdb",
  "anonymousUsers": [0, 1, 2]
}
```

Notes:

- Do not forget to create the directories for logging and the worklog, and also to
  download and save the GeoLite2-City database.
- The applied `tzShift` for the *kwords* app is just an example; it should be applied iff the stored
  datetime values provide an incorrect time zone (e.g. if a value looks like UTC time but actually
  represents local time) - see the section Time-zone notes for more info.

Configure systemd (`/etc/systemd/system/klogproc.service`):

```ini
[Unit]
# ...
After=network.target

[Service]
Type=simple
ExecStart=/opt/klogproc/bin/klogproc tail /opt/klogproc/etc/klogproc.json
User=klogproc
Group=klogproc

# ...
```

...

For the tail action, the config is as follows:

```json
{
  "logTail": {
    "intervalSecs": 5,
    "worklogDir": "/path/to/tail-worklog",
    "numErrorsAlarm": 0,
    "errCountTimeRangeSecs": 15,
    "files": [
      ...
```

For the batch mode, the config is like this:

```json
{
  "logFiles": {
    "appType": "korpus-db",
    "worklogDir": "/path/to/batch-worklog",
    "srcPath": "/path/to/log/files/dir",
    "tzShift": 120,
    "partiallyMatchingFiles": false
  }
}
```

Note: setting `partiallyMatchingFiles` to `true` will allow processing of files which are partially
older than the requested minimum datetime (but still, only the matching records will be accepted).

## ElasticSearch compatibility notes

Because ElasticSearch underwent some backward-incompatible changes between versions `5` and `6`,
the configuration contains the `majorVersion` key which specifies how Klogproc stores the data.

### ElasticSearch 5

This version supports multiple data types ("mappings") per index, which was also
the default approach to storing CNC application logs - a single index with multiple document
types (one per application). In this case, the configuration directive `elasticSearch.index`
directly specifies the index name Klogproc works with. Individual document types
can be distinguished either via the ES-internal `_type` property or via the normal `type` property
which is created by Klogproc.

### ElasticSearch 6

In ES6, multiple data mappings per index have been removed. Klogproc in this case
uses its `elasticSearch.index` key as a **prefix for the index name** created for an individual
application. E.g. the index `log_archive` with configured `treq` and `morfio` apps expects
you to have two indices: `log_archive_treq` and `log_archive_morfio`. Please note
that Klogproc does not create the indices for you. The property `type` is still present in documents
for backward compatibility.

## Customizing log processing with Lua scripts

See the [docs/scripting.md](docs/scripting.md) page.
126 changes: 126 additions & 0 deletions docs/scripting.md
# Scripting Klogproc with Lua

For some applications, Klogproc allows customizing their log processing without
recompiling Klogproc itself. The key principle is that a user defines two functions
which are then applied repeatedly to each processed log record of the specified type.
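
A minimal script therefore defines both functions. The following sketch simply delegates
to the default behaviour (both functions and the helper `transform_default` are described
in the sections below):

```lua
-- a minimal sketch of a customization script for some application;
-- both functions are described in detail in the sections below
function preprocess(input_rec, buffer)
    -- no extra preprocessing: pass the record through unchanged
    return {input_rec}
end

function transform(input_rec)
    -- start from the default conversion (here with no time-zone shift)
    return transform_default(input_rec, 0)
end
```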

## Input record


The input record provides access to its attributes in the same way as in the Go language - i.e. the attributes use camel case and start with an uppercase letter.

As each application may provide different date and time encodings, the input record has a `GetTime` method that returns the RFC3339-encoded date and time.
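
For instance, a script for an application whose input record exposes a `Path` attribute
(the concrete attribute set differs per application) may read it and the record's time
like this (a sketch; `transform_default` is described in the next section and the colon
method-call syntax for `GetTime` is an assumption):

```lua
-- a sketch: `Path` is only an example attribute - the available attributes
-- depend on the concrete application's input record type
function transform(input_rec)
    print(input_rec.Path)       -- Go-style, uppercase attribute access
    print(input_rec:GetTime())  -- RFC3339-encoded date and time (colon call syntax assumed)
    return transform_default(input_rec, 0)
end
```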


## Output record

The output record represents a normalized record shared by all logged applications.
Klogproc typically provides a default way of converting the application's own input log
into the output format. If a Lua script is configured for the application, Klogproc will
only call the Lua-defined transformation function, which means that the default conversion
is omitted. For use cases where the default conversion is still required and the purpose
of the Lua script is just to customize it, it can be called explicitly:

```lua
local out = transform_default(input_rec, tz_offset_min)
-- modify the output
-- ...
```

For cases where a new empty record is needed for the script, just use:

```lua
local out = new_out_record()
-- set output properties
-- ...
```

To set a property in the output record, Klogproc requires using the `set_out_prop` function:

```lua
set_out_prop(out, name, value)
```

In case the attribute cannot be set (typically because it does not exist),
the script ends with an error.

To test whether a record (input or output) has a property:

```lua
if record_prop_exists(any_rec, name, value) then
    -- we're sure the attribute `name` can be set
end
```

Once the output record is set, it is necessary to set an ID that will be used as the database ID. Klogproc prefers deterministic IDs, which allow repeated data import without duplicating data with the same content but different IDs. To obtain a deterministic ID for an output record, use:

```lua
local id = out_rec_deterministic_id(out)
```

The function can be called repeatedly; for the same attribute values (the ID itself is always ignored), it will return the same ID (hash).
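
A sketch of how the pieces may be tied together follows; note that exposing the ID via a
property named `ID` is an assumption here - the actual property name may differ for a
concrete output record type:

```lua
function transform(input_rec)
    local out = transform_default(input_rec, 0)
    -- assumption: the database ID is stored in a property called "ID";
    -- the actual property name may differ per output record type
    local id = out_rec_deterministic_id(out)
    set_out_prop(out, "ID", id)
    return out
end
```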


## Debugging, logging

For printing contents of a value, use:

```lua
dump = require('dump')
print(dump(my_value))
```

For logging:

```lua
log.info("message", map_with_args)
-- other available logging levels are: "warn", "debug", "error"
```
The second argument is optional.
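
A minimal sketch (the keys of the argument map are arbitrary):

```lua
-- a minimal sketch of logging from a script; the map keys are arbitrary
log.info("importing file", {path = "/var/log/ucnk/syd.log"})
log.error("failed to parse line", {line_num = 42})
```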


## Global variables

* `app_type` - an id representing a logged application (`kontext`, `wag`, `korpusdb` etc.),
* `app_version` - a string representing a variant of the application; for some applications, versions are not defined,
* `anonymous_users` - a list of CNC database IDs defined in the Klogproc configuration (see the sketch below)
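
A small sketch using the globals (assuming `anonymous_users` behaves like a regular Lua
array, so `#` returns its length):

```lua
-- a sketch: log the context the script runs in
log.info("lua script loaded", {
    app = app_type,
    version = app_version,
    num_anonymous_users = #anonymous_users
})
```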

## Function preprocess()

The preprocess function is called before an input record is transformed into
an output record. Its purpose is to provide the following options:

1. decide whether to process the input_rec at all
   (just return `{}` to skip the record)
1. for applications where a "query request" is hard to define (e.g. `mapka`),
   it allows generating "virtual" input records that are somehow derived from the real
   ones - e.g. in the `mapka v3` application, we search for activity clusters and
   generate a single record for each detected cluster.

Since all the application transformers written in Go define their own `preprocess`
step, and Klogproc will not call it automatically once a Lua script is defined, it can
be invoked manually from Lua via the function `preprocess_default(input_rec, buffer)`.
In case no preprocessing is required, simply return the original record in a table:

```lua
function preprocess(input_rec, buffer)
    return {input_rec}
end
```
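
And a sketch of the delegating variant (assuming the value returned by
`preprocess_default` can be returned from `preprocess` as is):

```lua
-- a sketch: delegate to the built-in (Go) preprocessing
function preprocess(input_rec, buffer)
    return preprocess_default(input_rec, buffer)
end
```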

### Buffer access

`TODO`

## Function transform()

The `transform` function converts an input record into a normalized form.
As mentioned above, if a Lua script is defined for an application, Klogproc will not automatically call the hardcoded (Go) version of the transform function, so if it is needed, it has to be called explicitly (via `transform_default`):

```lua
-- transform function processes the input record and returns an output record
function transform(input_rec)
    local out = transform_default(input_rec, 0)
    -- now we modify the Path property already set by transform_default
    set_out_prop(
        out, "Path", string.format("%s/modified/value", input_rec.Path))
    return out
end
```
4 changes: 2 additions & 2 deletions load/batch/fileselect.go
```go
// LogFileMatches tests whether the log file specified by filePath matches
// in terms of its first record (whether it is older than the 'minTimestamp').
// If strictMatch is true, then a partially matching file (i.e. its first
// record datetime is older than minTimestamp) is not accepted.
//
// The function expects that the first line of any log file contains a proper
// log record which should be OK (KonText also writes multi-line error dumps
// ...
```
