
Kafka Connect Grok Transformation


Description

The Apache Kafka® SMT io.streamthoughts.kafka.connect.transform.Grok parses unstructured text from the entire record key or value into a structured data Struct, using a Grok expression.
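
For example, here is a minimal sketch of a connector configuration using this SMT (assuming LOGLEVEL and GREEDYDATA are among the bundled patterns, and an input value such as ERROR disk full):

transforms=Grok
transforms.Grok.type=io.streamthoughts.kafka.connect.transform.Grok$Value
transforms.Grok.pattern=%{LOGLEVEL:level} %{GREEDYDATA:message}

This would produce a Struct value with the fields level and message.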

Installation

The Grok SMT can be installed using the Confluent Hub client as follows:

  1. Download the distribution ZIP file for the latest available version.

    $ export VERSION=1.1.0
    $ export GITHUB_HUB_REPO=https://github.com/streamthoughts/kafka-connect-transform-grok
    $ curl -sSL -O $GITHUB_HUB_REPO/releases/download/v$VERSION/streamthoughts-kafka-connect-transform-grok-$VERSION.zip

  2. Install it with the confluent-hub CLI.

    $ confluent-hub install streamthoughts-kafka-connect-transform-grok-$VERSION.zip

Alternatively, you can extract the ZIP file into one of the directories listed in the plugin.path worker configuration property.
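
For example, assuming a hypothetical plugin directory /usr/local/share/kafka/plugins, the worker configuration would contain:

plugin.path=/usr/local/share/kafka/plugins

and the plugin would be extracted there:

    $ unzip streamthoughts-kafka-connect-transform-grok-$VERSION.zip -d /usr/local/share/kafka/plugins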

Grok Basics

The syntax for a grok pattern is %{SYNTAX:SEMANTIC} or %{SYNTAX:SEMANTIC:TYPE}.

The SYNTAX is the name of the pattern that should match the input text.

The SEMANTIC is the name of the field that will contain the piece of text being matched.

The TYPE is the target type to which the captured value must be converted (a worked example follows the list of types below).

Supported types are:
  • SHORT

  • INTEGER

  • LONG

  • FLOAT

  • DOUBLE

  • BOOLEAN

  • STRING
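
For example, here is a minimal sketch (assuming an input value such as GET /index.html 200, and assuming WORD, URIPATH, and INT are among the bundled patterns):

transforms=Grok
transforms.Grok.type=io.streamthoughts.kafka.connect.transform.Grok$Value
transforms.Grok.pattern=%{WORD:method} %{URIPATH:path} %{INT:status:INTEGER}

Here method and path are captured as strings, while status is converted to an INTEGER.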

The Kafka Connect Grok transformation ships with many reusable grok patterns. See the complete list of patterns.

Debugging Grok Expressions

You can build and debug your patterns using online tools such as Grok Debug and Grok Constructor.

Regular Expressions

Grok sits on top of regular expressions, so any regular expression is valid in grok as well.

The Grok SMT uses the Joni regular-expression library, the Java port of the Oniguruma regexp library used by the Elastic stack (i.e. Logstash).
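
As an illustration, raw regular expressions can be mixed directly with named patterns. This sketch assumes a record value like [WARN] connection lost; note the doubled backslashes required by the Java properties format, as in the examples further below:

transforms=Grok
transforms.Grok.type=io.streamthoughts.kafka.connect.transform.Grok$Value
transforms.Grok.pattern=\\[%{WORD:level}\\] %{GREEDYDATA:message}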

Custom Patterns

Sometimes the patterns provided by the Grok SMT will not be sufficient to match your data. In that case, you have a few options for defining custom patterns.

Option #1

You can use the Oniguruma syntax for named capture, which allows you to match a piece of text and capture it as a field:

(?<field_name>the regex pattern)

For example, if you need to capture the parts of an email address, you can use the following pattern:

(?<EMAILADDRESS>(?<EMAILLOCALPART>[A-Za-z0-9._%+-]+)@(?<HOSTNAME>[A-Za-z0-9.-]+\.[A-Za-z]{2,6}))
Configuration:

transforms=Grok
transforms.Grok.type=io.streamthoughts.kafka.connect.transform.Grok$Value
transforms.Grok.pattern=(?<EMAILADDRESS>(?<EMAILLOCALPART>[A-Za-z0-9._%+-]+)@(?<HOSTNAME>[A-Za-z0-9.-]+\\.[A-Za-z]{2,6}))

Note: The pattern EMAILADDRESS is already provided by the Grok SMT.
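
As an illustrative sketch, given an input record value of jdoe@example.com, this configuration would produce a Struct with fields along these lines:

EMAILADDRESS=jdoe@example.com
EMAILLOCALPART=jdoe
HOSTNAME=example.com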

Option #2

You can create a custom patterns file that will be loaded the first time the Grok SMT is used:

For example, to define the pattern needed to parse NGINX access logs:
  • Create a directory (e.g. /tmp/grok-patterns) with a file in it called nginx.

  • Then, write the pattern you need in that file as: <the pattern name><a space><the regexp for that pattern>.

$ mkdir /tmp/grok-patterns
$ cat <<EOF > /tmp/grok-patterns/nginx
NGINX_ACCESS %{IPORHOST:remote_addr} - %{USERNAME:remote_user} \[%{HTTPDATE:time_local}\] \"%{DATA:request}\" %{INT:status} %{NUMBER:bytes_sent} \"%{DATA:http_referer}\" \"%{DATA:http_user_agent}\"
EOF
Configuration:

transforms=Grok
transforms.Grok.type=io.streamthoughts.kafka.connect.transform.Grok$Value
transforms.Grok.pattern=%{NGINX_ACCESS}
transforms.Grok.patternsDir=/tmp/grok-patterns
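
As a sketch of the result, assuming a hypothetical access-log line such as:

172.16.0.1 - alice [10/Oct/2020:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326 "-" "curl/7.68.0"

the named captures would become Struct fields roughly as follows: remote_addr=172.16.0.1, remote_user=alice, time_local=10/Oct/2020:13:55:36 +0000, request=GET /index.html HTTP/1.1, status=200, bytes_sent=2326, http_referer=-, http_user_agent=curl/7.68.0.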

Grok Configuration

| Property | Description | Type | Importance | Default |
|---|---|---|---|---|
| breakOnFirstPattern | If true, break on the first successful match; otherwise the transformation will try all configured grok patterns. | boolean | - | true |
| pattern | The grok expression to match and extract named captures (i.e. data fields) with. | string | High | - |
| patterns.<id> | An ordered list of grok expressions to match and extract named captures (i.e. data fields) with. | string | High | - |
| patternDefinitions | Custom pattern definitions. | list | Low | - |
| patternsDir | List of user-defined pattern directories. | list | Low | - |
| namedCapturesOnly | If true, only store named captures from grok. | boolean | Medium | true |
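
To illustrate patterns.<id> and breakOnFirstPattern, here is a minimal sketch (assuming NGINX_ACCESS is defined as above, COMMONAPACHELOG is among the bundled patterns, and the <id> keys are ordered integers):

transforms=Grok
transforms.Grok.type=io.streamthoughts.kafka.connect.transform.Grok$Value
transforms.Grok.patterns.0=%{NGINX_ACCESS}
transforms.Grok.patterns.1=%{COMMONAPACHELOG}
transforms.Grok.patternsDir=/tmp/grok-patterns
transforms.Grok.breakOnFirstPattern=true

The expressions are tried in order; with breakOnFirstPattern=true the first successful match wins, otherwise all configured patterns are applied.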

About

Originally, most of the source code used by the Apache Kafka® SMT io.streamthoughts.kafka.connect.transform.Grok was developed within the Kafka Connect File Pulse connector plugin.

Licence

Copyright 2020-2021 StreamThoughts.

Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.