Kaitai Struct User Guide

1. Introduction

Kaitai Struct is a domain-specific language (DSL) that is designed with one particular task in mind: dealing with arbitrary binary formats.

Parsing binary formats is hard, and there’s a reason for that: such formats were designed to be machine-readable, not human-readable. Even when one’s working with a clean, well-documented format, there are multiple pitfalls that await the developer: endianness issues, in-memory structure alignment, variable-size structures, conditional fields, repetitions, and fields that depend on other fields previously read, to name a few.

Kaitai Struct tries to isolate the developer from all these details and lets them focus on the things that matter: the data structure itself, not particular ways to read or write it.

2. Installation and invocation

Kaitai Struct has a somewhat diverse infrastructure around it. This chapter gives an overview of the options available.

2.1. Web IDE

If you’re going to try Kaitai Struct for the first time, the Web IDE is probably the easiest way to get started. Just open Kaitai Struct Web IDE and you’re ready to go:

Figure 1: Kaitai Struct Web IDE (sample PNG file + png.ksy loaded)

2.2. Desktop / console version

If you don’t fancy using a hex dump in a browser, or have played around with that and want to integrate Kaitai Struct into your project’s build process, you’d want a desktop / console solution. Of course, Kaitai Struct offers that as well.

2.2.1. Installation

Please refer to the official website for installation instructions. After installation, you’re expected to have:

  • ksc (or kaitai-struct-compiler) - command-line Kaitai Struct compiler, a program that translates .ksy into parsing libraries in a chosen target language.

  • ksv (or kaitai-struct-visualizer, optional) - console visualizer

Note
The ksc shorthand might not be available if your system doesn’t support symbolic links; just use the full name then.

If you’re going to invoke ksc frequently, you’d probably want to add it to your executable search PATH, so you don’t have to type the full path to it every time. You’d get that automatically with the .deb package and the Windows .msi install (provided you don’t disable that option), but it might take some extra manual setup if you use a generic .zip package.

2.2.2. Invocation

Invoking ksc is easy:

ksc [options] <file>...

Common options:

  • <file>…​ — source files (.ksy)

  • -t <language> | --target <language> — target languages (cpp_stl, csharp, java, javascript, perl, php, python, ruby, all)

    • all is a special case: it compiles all possible target languages, creating language-specific directories (as per language identifiers) inside output directory, and then creating output module(s) for each language starting from there

  • -d <directory> | --outdir <directory> — output directory (filenames will be auto-generated)

Language-specific options:

  • --dot-net-namespace <namespace> — .NET namespace (C# only, default: Kaitai)

  • --java-package <package> — Java package (Java only, default: root package)

  • --php-namespace <namespace> — PHP namespace (PHP only, default: root package)

Misc options:

  • --verbose — verbose output

  • --help — display usage information and exit

  • --version — output version information and exit

3. Workflow overview

As you might have already read on the project’s website, the main idea of Kaitai Struct is that you create a description of a binary data structure format using a formal language, save it as a .ksy file, and then compile it with the KS compiler into a target programming language.

TODO

4. Kaitai Struct language

With the workflow issues out of the way, let’s concentrate on the Kaitai Struct language itself.

4.1. Fixed-size structures

Probably the simplest thing KS can do is reading fixed-size structures. You might know them as C struct definitions - consider something like this fictional database entry that keeps track of the dog show participants:

struct {
    char uuid[16];       /* 128-bit UUID */
    char name[24];       /* Name of the animal */
    uint16_t birth_year; /* Year of birth, used to calculate the age */
    double weight;       /* Current weight in kg */
    int32_t rating;      /* Rating, can be negative */
} animal_record;

And here is how it would look in .ksy:

meta:
  id: animal_record
  endian: be
seq:
  - id: uuid
    size: 16
  - id: name
    type: str
    size: 24
    encoding: UTF-8
  - id: birth_year
    type: u2
  - id: weight
    type: f8
  - id: rating
    type: s4

It’s a YAML-based format, plain and simple. Every .ksy file is a type description. Everything starts with a meta section: this is where we specify top-level info on the whole structure we describe. There are two important things here:

  • id specifies name of the structure

  • endian specifies default endianness:

    • be for big-endian (AKA "network byte order", AKA Motorola, etc)

    • le for little-endian (AKA Intel, AKA VAX, etc)

With that out of the way, we use seq element with an array (ordered sequence of elements) in it to describe which attributes this structure consists of. Every attribute includes several keys, namely:

  • id is used to give attribute a name

  • type designates attribute type:

    • no type means that we’re dealing with just a raw byte array; size is to be used to designate number of bytes in this array

    • s1, s2, s4, u1, u2, u4, etc. for integers

      • "s" means signed, "u" means unsigned

      • number is the number of bytes

      • if you need to specify non-default endianness, you can force it by appending be or le - i.e. s4be, u8le, etc

    • f4 and f8 for IEEE 754 floating point numbers; 4 and 8, again, designate the number of bytes (single or double precision)

    • str is used for strings; that is almost the same as "no type", but string has a concept of encoding, which must be specified using encoding
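
To make the mapping concrete, here is a minimal hand-written Python sketch of what parsing this record amounts to. It is not the code that ksc generates: the names RECORD and parse_animal_record are made up for illustration, and it relies only on Python’s standard struct module.

```python
import struct

# Big-endian layout mirroring the .ksy above: 16 raw bytes, a 24-byte string,
# u2, f8 and s4. The ">" prefix also disables padding between fields.
RECORD = struct.Struct(">16s24sHdi")

def parse_animal_record(buf):
    """Decode one fixed-size record from the start of buf."""
    uuid, name, birth_year, weight, rating = RECORD.unpack(buf[:RECORD.size])
    return {
        "uuid": uuid,                  # no type: raw byte array
        "name": name.decode("utf-8"),  # type: str; padding bytes are kept
        "birth_year": birth_year,
        "weight": weight,
        "rating": rating,
    }

# "24s" pads the short name with zero bytes up to 24 bytes.
sample = RECORD.pack(bytes(range(16)), b"Rex", 2015, 31.5, -7)
record = parse_animal_record(sample)
```

Note that this sketch keeps the zero padding at the end of name, since the spec above only says "24 bytes, UTF-8" and nothing about trimming.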

YAML-based syntax might look a little more verbose than C-like struct, but there are a few good reasons to use it. It is consistent, it is easily extendable, and it’s easy to parse, so it’s easy to make your own programs/scripts that work with .ksy specs.

A very simple example is that we can add docstrings to every attribute, using syntax like this:

  - id: rating
    type: s4
    doc: Rating, can be negative

These docstrings are not just comments in the .ksy: they’ll actually get exported into the target language as well (for example, in Java they’ll become JavaDoc, in Ruby they’ll become RDoc/YARD, etc). This, in turn, is super helpful when editing the code in various IDEs that generate reminder popups for intelligent completion when you browse through class attributes:

Figure 2: JavaDoc is generated automatically from doc
Note
You can use YAML folded style strings for longer documentation that spans multiple lines:
  - id: opcode
    type: u1
    doc: >
      Operation code that defines which operation should be performed
      by a virtual machine. Subsequent parameters for operation depend
      on the value of opcode.

4.2. Checking for "magic" signatures

Many file formats use some sort of safeguard against a completely different file type being used in place of the required one. A simple way to do so is to include some "magic" bytes (AKA "file signature"): checking that the first bytes of the file are equal to their intended values provides at least some degree of protection against such blunders.

To specify "magic" bytes (i.e. fixed content) in structures, KS includes a special contents key. For example, this is the beginning of a seq for Java .class file:

seq:
  - id: magic
    contents: [0xca, 0xfe, 0xba, 0xbe]

This reads the first 4 bytes and compares them to the 4 bytes CA FE BA BE. If there is any mismatch (or fewer than 4 bytes can be read), it throws an exception and stops parsing at an early stage, before any damage (pointless allocation of huge structures, waste of CPU cycles) is done.
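
As a rough illustration of that check, here is a small Python sketch. The function name check_magic and the choice of ValueError are assumptions made for this example; the exception type actually raised by a generated parser depends on the target language and runtime.

```python
import io

JAVA_CLASS_MAGIC = bytes([0xCA, 0xFE, 0xBA, 0xBE])

def check_magic(stream, expected):
    """Read len(expected) bytes and fail early unless they match exactly."""
    actual = stream.read(len(expected))
    # A short read yields fewer bytes, so it fails the comparison as well.
    if actual != expected:
        raise ValueError(
            f"bad magic: expected {expected.hex()}, got {actual.hex()}"
        )
    return actual

magic = check_magic(io.BytesIO(b"\xca\xfe\xba\xbe\x00\x00\x00\x34"), JAVA_CLASS_MAGIC)
```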

Note that contents is very flexible and you can specify there:

  • A UTF-8 string - bytes from such a string would be used to check against

  • An array with:

    • integers in decimal representation

    • integers in hexadecimal representation, starting with 0x

    • UTF-8 strings

When an array is used, the byte representations of all elements are concatenated and expected in sequence. A few examples:

  - id: magic1
    contents: JFIF
    # expects bytes: 4A 46 49 46
  - id: magic2
    # we can use YAML block-style arrays as well
    contents:
      - 0xca
      - 0xfe
      - 0xba
      - 0xbe
    # expects bytes: CA FE BA BE
  - id: magic3
    contents: [CAFE, 0, BABE]
    # expects bytes: 43 41 46 45 00 42 41 42 45

More extreme examples to illustrate the idea (i.e. possible, but definitely not recommended in real-life specs):

  - id: magic4
    contents: [foo, 0, A, 0xa, 42]
    # expects bytes: 66 6F 6F 00 41 0A 2A
  - id: magic5
    contents: [1, 0x55, '▒,3', 3]
    # expects bytes: 01 55 E2 96 92 2C 33 03
Note
There’s no need to specify type or size for fixed contents data - it all comes naturally from the contents.
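
The concatenation rule can be checked with a few lines of Python. This is only a sketch of the rule as described above, with a made-up helper name (contents_to_bytes); it takes the values as YAML would already have parsed them, i.e. integers and strings.

```python
def contents_to_bytes(contents):
    """Flatten a contents value (a string, or a list of ints and strings)."""
    if isinstance(contents, str):
        return contents.encode("utf-8")
    out = bytearray()
    for item in contents:
        if isinstance(item, int):
            out.append(item)             # one byte per integer, must be 0..255
        else:
            out += item.encode("utf-8")  # strings contribute their UTF-8 bytes
    return bytes(out)

magic3 = contents_to_bytes(["CAFE", 0, "BABE"])
magic4 = contents_to_bytes(["foo", 0, "A", 0xA, 42])
magic5 = contents_to_bytes([1, 0x55, "\u2592,3", 3])  # \u2592 is the shade block character
```

Printing magic3.hex(" ") and friends is a quick way to compare against the "expects bytes" comments in the examples.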

4.3. Variable-length structures

Many protocols and file formats tend to conserve bytes, especially for strings. It would be wasteful to reserve a fixed 512-byte buffer for a string that is typically 3-5 bytes long and only rarely grows to 512 bytes.

One of the most common methods used to mitigate this problem is to use some integer to designate the length of the string, and store only the designated number of bytes in the stream. Unfortunately, this yields a variable-length structure, and it’s impossible to describe such a thing using C-style structs. However, it’s not a problem for KS:

seq:
  - id: my_len
    type: u4
  - id: my_str
    type: str
    size: my_len
    encoding: UTF-8

Note the size field: we use not a constant, but a reference to a field that we’ve just parsed from a stream. Actually, you can do much more than that - you can use a full-blown expression language in size field. For example, what if we’re dealing with UTF-16 string and my_len value designates not a number of bytes, but number of byte pairs?

seq:
  - id: my_len
    type: u4
  - id: my_str
    type: str
    size: my_len * 2
    encoding: UTF-16LE

One can just multiply my_len by 2, and voila, here’s our UTF-16 string. The expression language is very powerful; we’ll be talking more about it later.
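
For comparison, here is roughly what the UTF-16 example does when written by hand in Python. The helper names (read_exact, read_len_prefixed_utf16) are hypothetical, and little-endian byte order for my_len is an assumption, since the snippet above has no meta section.

```python
import io
import struct

def read_exact(stream, n):
    """Read exactly n bytes or raise EOFError."""
    data = stream.read(n)
    if len(data) != n:
        raise EOFError(f"wanted {n} bytes, got {len(data)}")
    return data

def read_len_prefixed_utf16(stream):
    # my_len: u4 (assumed little-endian), counted in byte pairs
    (my_len,) = struct.unpack("<I", read_exact(stream, 4))
    # size: my_len * 2
    return read_exact(stream, my_len * 2).decode("utf-16-le")

stream = io.BytesIO(struct.pack("<I", 3) + "dog".encode("utf-16-le"))
text = read_len_prefixed_utf16(stream)
```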

Last, but not least, we can specify a size that spans automatically to the end of the stream. For that one, we’ll use slightly different syntax:

seq:
  - id: some_int
    type: u4
  - id: string_spanning_to_the_end_of_file
    type: str
    encoding: UTF-8
    size-eos: true

4.4. Enums (named integer constants)

The nature of binary format encoding dictates that in many cases we’ll be using some kind of integer constants to encode certain entities. For example, a widely known IP packet uses a 1-byte integer to encode the protocol type for the payload: 6 would mean "TCP" (which gives us the famous TCP/IP), 17 would mean "UDP" (which yields UDP/IP), and 1 means "ICMP".

It is possible to live with just raw integers, but most programming languages actually provide a way to program using meaningful names instead. This approach is usually dubbed "enums" and it’s totally possible to generate an enum in KS:

seq:
  - id: protocol
    type: u1
    enum: ip_protocol
enums:
  ip_protocol:
    1: icmp
    6: tcp
    17: udp

There are two things that should be done to declare an enum:

  1. We add enums key on the type level (i.e. on the same level as seq and meta). Inside that key, we add a map, keys of it being names of enum (in this example, there’s only one enum declared, ip_protocol) and values being yet another map, which maps integer values into identifiers.

  2. We add enum: …​ parameter to every attribute that’s going to be represented by that enum, instead of just being a raw integer. Note that such attributes must have some sort of integer type in the first place (i.e. type: u* or type: s*).
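
In a target language this typically ends up as a native enum type. The Python sketch below shows the idea with the standard enum module; the class and function names are made up, and the behaviour for values missing from the map (here IntEnum raises ValueError) is a property of this sketch, not a statement about what generated parsers do.

```python
from enum import IntEnum

class IpProtocol(IntEnum):
    # integer value -> identifier, as in the enums map above
    ICMP = 1
    TCP = 6
    UDP = 17

def read_protocol(buf):
    """Interpret the first byte (type: u1) as an IpProtocol member."""
    return IpProtocol(buf[0])

protocol = read_protocol(b"\x06")
```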

4.5. Substructures (subtypes)

What do we do if we need to use many of the strings in such a format? Writing so many repetitive my_len- / my_str-style pairs would be so bothersome and error-prone. Fear not, we can define another type, defining it in the same file, and use it as a custom type in a stream:

seq:
  - id: track_title
    type: str_with_len
  - id: album_title
    type: str_with_len
  - id: artist_name
    type: str_with_len
types:
  str_with_len:
    seq:
      - id: len
        type: u4
      - id: value
        type: str
        encoding: UTF-8
        size: len

Here we define another type named str_with_len, which we reference just by doing type: str_with_len. The type itself is defined using types: key on top-level type. That’s a map, inside it we can define as many subtypes as we want. We define just one, and inside it we just nest the exact same syntax as we use for the type description on the top level - i.e. the same seq designation.

Note
There’s no need for meta here, as the type name is derived from the types key name.

Of course, one can actually have more levels of subtypes:

TODO

4.6. Accessing attributes in other types

Expression language (used, for example, in the size key) allows you to refer not only to attributes in the current type, but also to attributes in other types. Consider this example:

seq:
  - id: header
    type: main_header
  - id: body
    size: header.body_len
types:
  main_header:
    seq:
      - id: magic
        contents: MY-SUPER-FORMAT
      - id: body_len
        type: u4

If body_len attribute was in the same type as body, we could just use size: body_len. However, in this case we’ve decided to split the main header into separate subtype, so we’ll have to access it using . operator - i.e. size: header.body_len.

Obviously, one can chain attributes with . to dig deeper into type hierarchy - i.e. size: header.subheader_1.subsubheader_1_2.field_4. But sometimes we need just the opposite: how do we access upper-level elements from lower-level types? KS provides two options here:

4.6.1. _parent

One can use special pseudo-attribute _parent to access parent structure:

TODO

4.6.2. _root

In some cases, it would be way too impractical to write tons of _parent._parent._parent._parent…​ or just plain impossible (if you’re describing a type which might be used on several different levels, so a different number of _parent would be needed). In this case, we can use the special pseudo-attribute _root to just start navigating from the very top-level type:

TODO

seq:
  - id: header
    type: main_header
types:
  main_header:
    seq:
      - id: magic
        contents: MY-SUPER-FORMAT
      - id: body_len
        type: u4
      - id: subbody_len
        type: u4

4.7. Conditionals

Some protocols and file formats have optional fields, which only exist on some conditions. For example, one can have some byte first that designates if some field exists (1) or not (0). In KS, you can do that using if key:

seq:
  - id: has_crc32
    type: u1
  - id: crc32
    type: u4
    if: has_crc32 != 0

In this example, we again use expression language to specify a boolean expression in the if key. If that expression is true, the field is parsed and we’ll get a result. If that expression is false, the field will be skipped and we’ll get a null or its closest equivalent in our target programming language if we try to get it.
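
A hand-written Python sketch of the same logic might look like this. The function names are hypothetical, little-endian order for crc32 is an assumption (the snippet has no meta section), and None plays the role of null.

```python
import io
import struct

def read_exact(stream, n):
    """Read exactly n bytes or raise EOFError."""
    data = stream.read(n)
    if len(data) != n:
        raise EOFError(f"wanted {n} bytes, got {len(data)}")
    return data

def read_optional_crc32(stream):
    has_crc32 = read_exact(stream, 1)[0]
    crc32 = None                       # stays None when the field is skipped
    if has_crc32 != 0:                 # if: has_crc32 != 0
        (crc32,) = struct.unpack("<I", read_exact(stream, 4))
    return has_crc32, crc32

with_crc = read_optional_crc32(io.BytesIO(b"\x01\x78\x56\x34\x12"))
without_crc = read_optional_crc32(io.BytesIO(b"\x00"))
```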

At this point, you might wonder how that plays together with enums. After you mark some integer as "enum", it’s no longer just an integer, so you can’t compare it directly with the number. Instead you’re expected to compare it to other enum values:

seq:
  - id: my_animal
    type: u1
    enum: animal
  - id: dog_tag
    type: u4
    # Comparing to enum literal
    if: my_animal == animal::dog
enums:
  animal:
    1: cat
    2: dog

There are other enum operations available; we’ll cover them in the expression language guide later.

4.8. Repetitions

Most real-life file formats do not contain only one copy of some element, but might contain several copies, i.e. they repeat the same pattern over and over. Repetition might be:

  • element repeated up to the very end of the stream

  • element repeated a pre-defined number of times

  • element repeated while some condition is satisfied (or until some condition becomes true)

KS supports all these types of repetitions. In all cases, it will create a resizable array (or nearest equivalent available in target language) and populate it with elements.

4.8.1. Repeat until end of stream

This is the simplest kind of repetition, done by specifying repeat: eos. For example:

seq:
  - id: numbers
    type: u4
    repeat: eos

This yields an array of unsigned integers, each 4 bytes long, which spans to the end of the stream. Note that if the number of bytes left in the stream is not divisible by 4 (for example, 7), we’ll end up reading as much as possible, and then the parsing procedure will throw an end-of-stream exception. Of course, you can do that with any type, including user-defined types (subtypes):

seq:
  - id: filenames
    type: filename
    repeat: eos
types:
  filename:
    seq:
      - id: name
        type: str
        size: 8
        encoding: ASCII
      - id: ext
        type: str
        size: 3
        encoding: ASCII

This one defines an array of records of type filename. Each individual filename consists of an 8-byte name and a 3-byte ext string, both in ASCII encoding.
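
The numbers example from the start of this section, including the "7 bytes left" failure case, can be sketched by hand in Python as follows. The function name is made up, little-endian order is an assumption, and EOFError merely stands in for whatever end-of-stream exception the target runtime uses.

```python
import io
import struct

def read_u4_until_eos(stream):
    """Collect 4-byte unsigned integers until the stream is exhausted."""
    numbers = []
    while True:
        chunk = stream.read(4)
        if not chunk:                  # clean end of stream: we are done
            return numbers
        if len(chunk) < 4:             # leftover bytes that do not form a u4
            raise EOFError(f"{len(chunk)} trailing byte(s), need 4")
        numbers.append(struct.unpack("<I", chunk)[0])

numbers = read_u4_until_eos(io.BytesIO(struct.pack("<II", 1, 2)))
```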

4.8.2. Repeat for a number of times

One can repeat an element a certain number of times. For that, we’ll need an expression that will give us number of iterations (which would be exactly the number of items in resulting array). It could be a simple constant to read exactly 12 numbers:

seq:
  - id: numbers
    type: u4
    repeat: expr
    repeat-expr: 12

Or we might reference some attribute here to have an array with length specified inside the format:

seq:
  - id: num_floats
    type: u4
  - id: floats
    type: f8
    repeat: expr
    repeat-expr: num_floats

Or, using expression language, we can even do some more complex math on it:

seq:
  - id: width
    type: u4
  - id: height
    type: u4
  - id: matrix
    type: f8
    repeat: expr
    repeat-expr: width * height

This one specifies width and height of the matrix first, then parses as many matrix elements as needed to fill a width × height matrix (although note that it won’t be a true 2D matrix: it would still be just a regular 1D array, and you’ll need to convert (x, y) coordinate to address in that 1D array manually).
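
The manual conversion mentioned above is a one-liner. The sketch below assumes the elements are stored row by row (row-major order); the format itself decides that, so treat the layout and the helper name matrix_at as assumptions.

```python
def matrix_at(matrix, width, x, y):
    """Return the element at column x, row y of a flat, row-major array."""
    return matrix[y * width + x]

width, height = 3, 2
flat = [10, 11, 12,    # row 0
        20, 21, 22]    # row 1
corner = matrix_at(flat, width, 2, 1)
```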

4.8.3. Repeat until condition is met

Some formats don’t specify the number of elements in array, but instead just use some sort of special element as a terminator that signifies end of data. KS can do that as well using repeat-until syntax, for example:

seq:
  - id: numbers
    type: s4
    repeat: until
    repeat-until: _ == -1

This one reads 4-byte signed integer numbers until encountering -1. On encountering -1, the loop will stop and further sequence elements (if any) will be processed. Note that -1 would still be added to array.
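
Written by hand in Python, that loop could look like the sketch below. Names are hypothetical and little-endian order is assumed. Note how the terminator is appended before the condition is checked, and how the bytes after it are left unread.

```python
import io
import struct

def read_exact(stream, n):
    """Read exactly n bytes or raise EOFError."""
    data = stream.read(n)
    if len(data) != n:
        raise EOFError(f"wanted {n} bytes, got {len(data)}")
    return data

def read_s4_until_minus_one(stream):
    numbers = []
    while True:
        (value,) = struct.unpack("<i", read_exact(stream, 4))
        numbers.append(value)          # the terminator is stored as well
        if value == -1:                # repeat-until: _ == -1
            return numbers

stream = io.BytesIO(struct.pack("<iiii", 5, 7, -1, 9))
numbers = read_s4_until_minus_one(stream)
```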

Underscore (_) is used as a special variable name that refers to the element that we’ve just parsed. When parsing an array of user types, it is possible to write a repeat-until expression that references some attribute inside that user type:

seq:
  - id: records
    type: buffer_with_len
    repeat: until
    repeat-until: _.len == 0
types:
  buffer_with_len:
    seq:
      - id: len
        type: u1
      - id: value
        size: len

4.9. Typical TLV implementation (switching types on an expression)

"TLV" stands for "type-length-value", and it’s a very common staple in many formats. The basic idea is to make the format modular and backward-compatible. On the top level, it’s very simple: we know that the whole format is just an array of records (repeat: eos or repeat: expr). Each record starts the same: there is some marker that specifies the type of the record and an integer that specifies the record’s length. After that, the record’s body follows, and the body format depends on the type marker. One can easily specify that basic record outline in KS like this:

seq:
  - id: rec_type
    type: u1
  - id: len
    type: u4
  - id: body
    size: len

However, how do we specify the format for a body that depends on rec_type? One of the approaches is using conditionals, as we’ve seen before:

seq:
  - id: rec_type
    type: u1
  - id: len
    type: u4
  - id: body_1
    type: rec_type_1
    size: len
    if: rec_type == 1
  - id: body_2
    type: rec_type_2
    size: len
    if: rec_type == 2
  # ...
  - id: body_unidentified
    size: len
    if: rec_type != 1 and rec_type != 2 # and ...

However, it’s easy to see why it’s not a very good solution:

  • We end up writing lots of repetitive lines

  • We create lots of body_* attributes in a type, while in reality only one body would exist: everything else would fail the if comparison and thus would be null

  • If we want to catch the "else" branch, i.e. match everything not matched by our ifs, we have to write the negation of all the if conditions combined manually. For anything more than 1 or 2 types it quickly becomes a mess.

That is why KS offers an alternative solution. We can use switch type operation:

seq:
  - id: rec_type
    type: u1
  - id: len
    type: u4
  - id: body
    size: len
    type:
      switch-on: rec_type
      cases:
        1: rec_type_1
        2: rec_type_2

This is much more concise and easier to maintain, isn’t it? And note that size is specified on attribute level, thus it applies to all possible type values, setting us a good hard limit. What’s even better: even if there is no matching case, as long as you have size specified, you would still parse a body of the given size, but instead of interpreting it with some user type, it will be treated as having no type, thus yielding a raw byte array. This is super useful, as it allows you to work on TLV-like formats step-by-step, starting with support for only 1 or 2 types of records, and gradually adding more and more types.
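
The same dispatch-with-fallback idea can be sketched in Python with a dictionary lookup. Everything here is illustrative: the two body parsers are invented stand-ins for rec_type_1 and rec_type_2 (whose real layouts are not given above), and little-endian order is assumed.

```python
import io
import struct

def read_exact(stream, n):
    """Read exactly n bytes or raise EOFError."""
    data = stream.read(n)
    if len(data) != n:
        raise EOFError(f"wanted {n} bytes, got {len(data)}")
    return data

# Hypothetical body parsers standing in for rec_type_1 and rec_type_2.
def parse_rec_type_1(body):
    return {"kind": 1, "value": struct.unpack("<H", body)[0]}

def parse_rec_type_2(body):
    return {"kind": 2, "text": body.decode("ascii")}

CASES = {1: parse_rec_type_1, 2: parse_rec_type_2}

def read_record(stream):
    rec_type, length = struct.unpack("<BI", read_exact(stream, 5))
    body = read_exact(stream, length)  # size applies whatever the type is
    parser = CASES.get(rec_type)
    # No matching case: keep the raw bytes instead of failing.
    return rec_type, (parser(body) if parser else body)

stream = io.BytesIO(
    struct.pack("<BI", 2, 2) + b"hi"                # known record type
    + struct.pack("<BI", 9, 3) + b"\x01\x02\x03"    # unknown record type
)
known = read_record(stream)
unknown = read_record(stream)
```

Because the length is consumed before the dispatch, an unknown record never desynchronizes the stream: the next record starts right after the raw body.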

4.10. Instances: data beyond the sequence

So far we’ve done all the data specifications in seq - thus they’ll get parsed immediately from the beginning of the stream, one-by-one, in strict sequence. But what if the data you want is located at some other position in the file, or comes not in sequence?

"Instances" are the Kaitai Struct’s answer for that. They’re specified in a key instances on the same level as seq. Consider this example:

meta:
  id: big_file
  endian: le
instances:
  some_integer:
    pos: 0x400000
    type: u4
  a_string:
    pos: 0x500fff
    type: str
    size: 0x11
    encoding: ASCII

Inside instances we need to create a map: keys in that map would be attribute names, and values specify attribute in the very same manner as we would have done it in seq, but there is one important additional feature: using pos: …​ one can specify a position to start parsing that attribute from (in bytes from the beginning of the stream). Just as in size, one may use expression language and reference other attributes in pos. This is used very often to allow accessing file body inside a container file when we have some file index data: file position in container and length:

seq:
  - id: file_name
    type: str
    size: 8 + 3
    encoding: ASCII
  - id: file_offset
    type: u4
  - id: file_size
    type: u4
instances:
  body:
    pos: file_offset
    size: file_size

Another very important difference between seq attribute and instances attribute is that instances are lazy by default. What does it mean? Unless someone would call that body getter method programmatically, no actual parsing of body would be done. This is super useful for parsing larger files, such as images of filesystems. It is impractical for a filesystem user to load all the filesystem data into memory at once: one usually finds a file by its name (traversing file index somehow), and then can access file’s body right away. If that’s the first time this file is being accessed, body will be loaded (and parsed) into RAM. Second and all subsequent times will just return a cached copy from the RAM, avoiding any unnecessary re-loading / re-parsing, thus conserving both RAM and CPU time.
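
The lazy-and-cached behaviour can be imitated in Python with functools.cached_property. This is a sketch of the idea only, not what ksc emits: the class name FileEntry, the body_reads counter (added just to make the caching visible) and the little-endian byte order are all assumptions.

```python
import io
import struct
from functools import cached_property

class FileEntry:
    def __init__(self, stream):
        self._io = stream
        # seq attributes are parsed eagerly, in order
        self.file_name = stream.read(8 + 3).decode("ascii")
        self.file_offset, self.file_size = struct.unpack("<II", stream.read(8))
        self.body_reads = 0

    @cached_property
    def body(self):
        # Runs on first access only; later accesses return the cached bytes.
        self.body_reads += 1
        saved = self._io.tell()
        self._io.seek(self.file_offset)     # pos: file_offset
        data = self._io.read(self.file_size)  # size: file_size
        self._io.seek(saved)                # leave the stream where it was
        return data

image = b"README  TXT" + struct.pack("<II", 19, 5) + b"hello"
entry = FileEntry(io.BytesIO(image))
```

Accessing entry.body twice reads from the stream once; until the first access, nothing beyond the index fields has been touched.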

Note that from the programming point of view (from the target programming languages and from internal Kaitai Struct’s expression language), seq attributes and instances are exactly the same.

4.11. Calculated ("value") instances

Sometimes, it is useful to transform the data (using expression language) and store it as a named value. There’s another sort of instances for that - calculated (AKA "value") instances. They’re very simple to use, there’s only one key in it - value - that specifies expression to calculate:

seq:
  - id: length_in_feet
    type: f8
instances:
  length_in_m:
    value: length_in_feet * 0.3048

A value instance does no actual parsing, and thus does not require a pos key or a type key (the type will be derived automatically).

4.12. Bit-sized integers

Important
Feature available since v0.6.

Quite a few protocols and file formats, especially those that aim to conserve space, pack multiple integers into the same byte, using integer sizes of less than 8 bits. For example, an IPv4 packet starts with a byte that packs both version and header length:

76543210
vvvvllll
  |   |
  |   +- header length
  +----- version

Here’s how it can be parsed with KS:

seq:
  - id: version
    type: b4
  - id: header_len
    type: b4
Note
By convention, KS starts parsing bits from most significant to least significant, so "version" comes first here, and "header_len" second.

Using type: bX (where X is a number of bits to read) is very versatile and can be used to read byte-unaligned data. A more complex example of packing, where value spans two bytes:

76543210 76543210
aaaaabbb bbbbbbcc
seq:
  - id: a
    type: b5
  - id: b
    type: b9
    # 3 bits + 6 bits
  - id: c
    type: b2
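
To see what the aaaaabbb bbbbbbcc layout means in terms of plain shifts and masks, here is a small Python sketch. The function name is made up; it follows the most-significant-bit-first convention described in the note above.

```python
def unpack_a_b_c(two_bytes):
    """Split 16 bits, read MSB first, into 5-bit a, 9-bit b and 2-bit c."""
    word = int.from_bytes(two_bytes, "big")
    a = word >> 11             # top 5 bits
    b = (word >> 2) & 0x1FF    # next 9 bits: 3 from byte 0, 6 from byte 1
    c = word & 0b11            # last 2 bits
    return a, b, c

# 0xAE 0x0E is 10101110 00001110 in binary
fields = unpack_a_b_c(bytes([0xAE, 0x0E]))
```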

Or it can be used to parse completely unaligned bit streams with repetitions. In this example, we parse an arbitrary number of 3-bit values:

           76543210 76543210 76543210 76543210
           nnnnnnnn 00011122 23334445 55666777 ...
           ----+--- ---___----___---____
               |     |  |  |   |  |   |
num_threes ----+     |  |  |   |  |   |
threes[0]  ----------+  |  |   |  |   |
threes[1]  -------------+  |   |  |   |
threes[2]  ----------------+   |  |   |
threes[3]  --------------------+  |   |
threes[4]  -----------------------+   |
threes[5]  ---------------------------+
  ...
seq:
  - id: num_threes
    type: u1
  - id: threes
    type: b3
    repeat: expr
    repeat-expr: num_threes
Important

By default, if you’ll mix "normal" byte-sized integers (i.e. uX, sX) and bit-sized integers (i.e. bX), byte-sized integers will be kept byte-aligned. That means if you do:

seq:
  - id: foo
    type: b6
  - id: bar
    type: u1

two bytes will get parsed like that:

    76543210 76543210
    ffffff   bbbbbbbb
    --+---   ---+----
      |         |
foo --+         |
bar ------------+

i.e. two least significant bits of the first byte would be lost and not parsed due to alignment.

Last, but not least, note that it’s also possible to parse bit-packed integers using old-school methods with value instances. Here’s the very first example with the IPv4 packet start, unpacked manually:

seq:
  - id: packed_1
    type: u1
instances:
  version:
    value: (packed_1 & 0b11110000) >> 4
  header_len:
    value: packed_1 & 0b00001111

This method is useful when you need to do more intricate bit combinations, like a value with its bits scattered sparsely across several bytes.
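
The same masking can be tried out quickly in Python. Per the vvvvllll diagram at the start of this section, version sits in the upper four bits and header length in the lower four; the function name is hypothetical.

```python
def unpack_ipv4_first_byte(packed_1):
    version = packed_1 >> 4             # upper nibble: vvvv
    header_len = packed_1 & 0b00001111  # lower nibble: llll
    return version, header_len

# 0x45 is the usual first byte of an IPv4 packet without options
fields = unpack_ipv4_first_byte(0x45)
```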

4.13. Streams and substreams

TODO

4.14. Processing: dealing with compressed, obfuscated and encrypted data

Some formats obscure the data fully or partially with techniques like compression, obfuscation or encryption. In these cases, incoming data should be pre-processed before the actual parsing takes place, or we’ll just end up parsing garbage. All such pre-processing algorithms have one thing in common: they’re done by some function that takes a stream of bytes and returns a stream of bytes (note that the number of incoming and resulting bytes might be different, especially in case of decompression). While it might be possible to do such a transformation in a declarative manner, it is usually impractical to do so.

KS allows plugging in some predefined "processing" algorithms that perform the mentioned decompression, de-obfuscation and decryption to get a clear stream, ready to be parsed. Consider parsing a file in which the main body is obfuscated by applying XOR with 0xaa to every byte:

seq:
  - id: body_len
    type: u4
  - id: body
    size: body_len
    process: xor(0xaa)
    type: some_body_type # defined normally later

Note that:

  • Applying process: … is available only to raw byte arrays or user types.

  • One might use expression language inside xor(…), thus referencing an XOR obfuscation key that was read earlier from the same format into some other field
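
For reference, single-byte XOR de-obfuscation is a bytes-in, bytes-out function like the Python sketch below. The name process_xor is made up for this example and is not meant to mirror any runtime library’s API.

```python
def process_xor(data, key):
    """XOR every byte of data with a single-byte key."""
    return bytes(b ^ key for b in data)

obfuscated = process_xor(b"hello", 0xAA)
restored = process_xor(obfuscated, 0xAA)   # XOR with the same key undoes it
```

The body is first cut out of the stream using size: body_len, then run through this kind of function, and only then parsed as some_body_type.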

5. Expression language

Expression language is a powerful internal tool inside Kaitai Struct. In a nutshell, it is a simple object-oriented, statically typed language that gets translated/compiled (AKA "transpiled") into any supported target programming language.

The language is designed to follow the principle of least surprise, so it borrows tons of elements from other popular languages, like C, Java, C#, Ruby, Python, JavaScript, Scala, etc.

5.1. Basic data types

Expression language operates on the following primitive data types:

Type                    Attribute specs                        Literals
Integers                type: uX, type: sX, type: bX           1234, -789, 0xfc08, 0b1101
Floating point numbers  type: fX                               123.0, -456.78, 4.1607804e+72
Booleans                type: b1                               true, false
Byte arrays             size: XXX, size-eos: true              [0x20, 65, 66, 67]
Strings                 type: str, type: strz                  'foo bar', "baz\nqux"
Enums                   (type: uX or type: sX) and enum: XXX   opcode::jmp
Streams                 N/A                                    N/A

Integers come from uX, sX, bX type specifications in sequence or instance attributes (i.e. u1, u4le, s8, b3, etc), or can be specified literally. One can use:

  • normal decimal form (i.e. 123)

  • hexadecimal form using 0x prefix (i.e. 0xcafe - both upper case and lower case letters are legal, i.e. 0XcAfE or 0xCAfe will do as well)

  • binary form using 0b prefix (i.e. 0b00111011)

  • octal form using 0o prefix (i.e. 0o755)

It’s possible to use _ as a visual separator in literals — it would be completely ignored by parser. This could be useful, for example, to:

  • visually separate thousands in decimal numbers: 123_456_789

  • show individual bytes/words in hex: 0x1234_5678_abcd

  • show nibbles/bytes in binary: 0b1101_0111

Floating point numbers also follow the normal notation used in the vast majority of languages: 123.456 will work, as well as exponential notation: 123.456e-55. Use 123.0 to force a floating point type on an otherwise integer literal.

Booleans can be specified as literal true and false values as in most languages, but also can be derived by using type: b1. This method parses a single bit from a stream and represents it as a boolean value: 0 becomes false, 1 becomes true. This is very useful to parse flag bitfields, as you can omit flag_foo != 0 syntax and just use something more concise, such as is_foo.

Byte arrays are defined in the attribute syntax when you don’t specify anything as type, but specify size or size-eos instead. Byte array literals use typical array syntax like the one used in Python, Ruby and JavaScript: i.e. [1, 2, 3]. There is a little catch here: the same syntax is used for "true" arrays of objects (see below), so if you try to do stuff like [1, 1000, 5] (1000 obviously won’t fit in a byte), you won’t get a byte array, you’ll get an array of integers instead.

Strings normally come from using type: str or type: strz. Literal strings can be specified using double quotes or single quotes. The meaning of single and double quotes is similar to that in Ruby, PHP and shell scripts:

  • Single quoted strings are interpreted literally, i.e. backslash \, double quotes " and other possible special symbols carry no special meaning, they would be just considered a part of the string. Everything between single quotes is interpreted literally, i.e. there is no way one can include a single quote inside a single quoted string.

  • Double quoted strings support escape sequences and thus allow specifying any character. The supported escape sequences are as follows:

Escape seq   Code (dec)   Code (hex)   Meaning
\a           7            0x7          bell
\b           8            0x8          backspace
\t           9            0x9          horizontal tab
\n           10           0xa          newline
\v           11           0xb          vertical tab
\f           12           0xc          form feed
\r           13           0xd          carriage return
\e           27           0x1b         escape
\"           34           0x22         double quote
\'           39           0x27         single quote (technically not required, but supported)
\\           92           0x5c         backslash
\123                                   ASCII character with octal code 123; one can specify 1..3 octal digits
\u12bf                                 Unicode character with code U+12BF; one must specify exactly 4 hex digits

Note
One of the most widely used control characters, ASCII zero character (code 0) can be specified as \0 - exactly as it works in most languages.
Caution
Octal notation is prone to errors: due to its flexible length, it can swallow decimal digits that appear after the code as part of octal specification. For example, a\0b is three characters: a, ASCII zero, b. However, 1\02 is interpreted as two characters: 1 and ASCII code 2, as \02 is interpreted as one octal escape sequence.
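
Because .ksy files are YAML, a string literal in an expression ends up nested inside YAML's own quoting. Here is a sketch of one way to write this (instance names are hypothetical): wrapping the whole expression in YAML single quotes keeps backslashes untouched at the YAML level, so the escape sequences reach the expression language as written.

```yaml
instances:
  two_lines:
    value: '"baz\nqux"'    # double quoted literal: \n becomes a newline
  with_zero:
    value: '"a\0b"'        # three characters: a, ASCII zero, b
  plain_text:
    value: "'foo bar'"     # single quoted literal, taken as is
```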

TODO: Enums

Streams are internal objects that track the byte stream that we parse and state of parsing (i.e. where’s the pointer at). There is no way to declare a stream-type attribute directly by parsing instructions or specify it as a literal. Typical way to get stream objects is to query _io attribute from a user-defined object: that will give us a stream associated with this particular object.

5.2. Composite data types

There are two composite data types in the expression language (i.e. data types which include other types as components).

5.2.1. User-defined types

Basically, these are the types one defines using .ksy syntax - i.e. the top-level structure and all substructures defined in the types key.

Normally, they are translated into classes (or their closest available equivalent - i.e. storage structure with members + access members) in target language.

5.2.2. Arrays

Array types are just what one might expect from all-purpose, generic array type. Arrays come from either using the repetition syntax (repeat: …​) in attribute specification, or by specifying a literal array. In any case, all KS arrays have underlying data type that they store, i.e. one can’t put strings and integers into the same array. One can do arrays based on any primitive data type or composite data type.

Note
"True" array types (described in this section) and "byte arrays" share the same literal syntax and lots of method API, but they are actually very different types. This is done on purpose, because many target languages use very different types for byte arrays and arrays of objects for performance reasons.

One can use array literals syntax to declare an array (very similar to syntax used in JavaScript, Python and Ruby). Type will be derived automatically based on types of values inside brackets, for example:

  • [123, 456, -789] - array of integers

  • [123.456, 1.234e+78] - array of floats

  • ["foo", "bar"] - array of strings

  • [true, true, false] - array of booleans

  • [a0, a1, b0] - given that a0, a1 and b0 are all objects of the same user-defined type some_type, this would be an array of user-defined type some_type

Warning
Mixing multiple different types in a single array literal would trigger a compile-time error, for example, this is illegal: [1, "foo"]
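
A small sketch of array literals used in value instances (names are hypothetical). The literals are wrapped in YAML quotes here as a precaution, since a bare [ ... ] is also YAML's own list syntax:

```yaml
instances:
  port_numbers:
    value: '[8080, 8443, 9000]'   # array of integers (values don't fit in a byte)
  suffixes:
    value: '["foo", "bar"]'       # array of strings
```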

5.3. Operators

Literals can be connected using operators to make meaningful expressions. Operators are type-dependent: for example, the same + operator applied to two integers would mean arithmetic addition, and applied to two strings would mean string concatenation.

5.3.1. Arithmetic operators

Can be applied to integers and floats:

  • a + b - addition

  • a - b - subtraction

  • a * b - multiplication

  • a / b - division

  • a % b - modulo; note that it’s not a remainder: -5 % 3 is 1, not -2; the result is undefined for negative b.

  • a ** b - exponentiation (a in power of b)

Note
If both operands are integer, result of arithmetic operation is integer, otherwise it is floating point number. For example, that means that 7 / 2 is 3, and 7 / 2.0 is 3.5.

Can be applied to strings:

  • a + b - string concatenation
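
To illustrate, here is a sketch of arithmetic used in a hypothetical bitmap-like layout. All attribute names and the layout itself are assumptions made up for this example:

```yaml
seq:
  - id: width
    type: u2le
  - id: height
    type: u2le
  - id: pixels
    size: width * height * 3   # 3 bytes per pixel
instances:
  half_width:
    value: width / 2           # both operands are integers, so 7 / 2 gives 3
  row_padding:
    # bytes needed to pad a row of width * 3 bytes to a multiple of 4
    value: (4 - (width * 3) % 4) % 4
```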

5.3.2. Relational operators

Can be applied to integers, floats and strings:

  • a < b - true if a is strictly less than b

  • a <= b - true if a is less than or equal to b

  • a > b - true if a is strictly greater than b

  • a >= b - true if a is greater than or equal to b

Can be applied to integers, floats, strings, booleans and enums (does proper string value comparison):

  • a == b - true if a is equal to b

  • a != b - true if a is not equal to b

5.3.3. Bitwise operators

Can be only applied to integers.

  • a << b - left bitwise shift

  • a >> b - right bitwise shift

  • a & b - bitwise AND

  • a | b - bitwise OR

  • a ^ b - bitwise XOR

5.3.4. Logical (boolean) operators

Can be only applied to boolean values.

  • not x - boolean NOT

  • a and b - boolean AND

  • a or b - boolean OR
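
The relational, bitwise and logical operators are often combined in conditions. A sketch, with hypothetical attribute names and bit layout, and with parentheses added so that it does not rely on operator precedence:

```yaml
seq:
  - id: flags
    type: u1
  - id: version
    type: u1
  - id: ext_header
    size: 8
    # present only if the top bit is set and the version is recent enough
    if: (flags & 0x80) != 0 and version >= 2
instances:
  kind:
    value: (flags >> 4) & 0x07   # bits 4..6 of flags
  is_legacy:
    value: not (version >= 2)
```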

5.4. Methods

Just about every value in expression language is an object (including literals), and it’s possible to call methods on it. The common syntax to use is obj.method(param1, param2, …​), which can be abbreviated to obj.method if no parameters are required.

Note that when the obj in question is a user-defined type, you can access all its attributes (both sequence and instances) using the same obj.attr_name syntax. Obviously, one can chain that to traverse a chain of substructures: obj.foo.bar.baz (given that obj is a user-defined type that has foo field, which points to user-defined type that has bar field, and so on).

There are a few pre-defined methods that form kind of a "standard library" for expression language.

5.4.1. Integers

Method name   Return type   Description
to_s          String        Converts integer into a string using decimal representation
to_s(radix)   String        Converts integer into a string using specified radix (i.e. use 16 to get hexadecimal representation, use 8 to get octal, etc)
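
For instance, to_s combines naturally with string concatenation. A sketch with hypothetical attribute names:

```yaml
seq:
  - id: major
    type: u1
  - id: minor
    type: u1
instances:
  version_str:
    # for major = 1 and minor = 4 this is the string 1.4
    value: major.to_s + "." + minor.to_s
```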

5.4.2. Floating point numbers

TODO

5.4.3. Byte arrays

TODO

5.4.4. Strings

TODO

5.4.5. Enums

TODO

5.4.6. User-defined types

All user-defined types can be queried to get attributes (sequence attributes or instances) by their name. In addition to that, there are a few pre-defined internal methods (they all start with an underscore _, so they can’t clash with regular attribute names):

Method name   Return type         Description
_root         User-defined type   Top-level user-defined structure in current file
_parent       User-defined type   Structure that produced this particular instance of user-defined type
_io           Stream              Stream associated with this object of user-defined type
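
A minimal sketch of how these are typically used, assuming a made-up format where a nested block needs a value from the structure that contains it:

```yaml
meta:
  id: container
seq:
  - id: block_size
    type: u2le
  - id: body
    type: block
types:
  block:
    seq:
      - id: data
        # the block is created by the top-level structure, so in this
        # particular layout _root.block_size would refer to the same value
        size: _parent.block_size
```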

5.4.7. Array types

Method name   Return type       Description
first         Array base type   Gets first element of the array
last          Array base type   Gets last element of the array
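
A sketch of first and last applied to an array produced by repetition (attribute names are hypothetical, and the array is assumed to contain at least one element):

```yaml
seq:
  - id: num_offsets
    type: u4le
  - id: offsets
    type: u4le
    repeat: expr
    repeat-expr: num_offsets
instances:
  first_offset:
    value: offsets.first
  last_offset:
    value: offsets.last
```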

5.4.8. Streams

Method name   Return type   Description
eof           Boolean       true if we’ve reached end of the stream (no more data can be read from it), false otherwise
size          Integer       Total size of the stream in bytes
pos           Integer       Current position in the stream, in bytes from the beginning of the stream
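
The stream methods are usually reached through _io. A sketch, assuming a made-up layout with a variable-length header, a body, and a 4-byte trailer:

```yaml
seq:
  - id: header_len
    type: u1
  - id: header
    size: header_len
  - id: body
    # everything that is left at this point, except for the 4-byte trailer
    size: _io.size - _io.pos - 4
  - id: checksum
    type: u4le
  - id: trailing_garbage
    size-eos: true
    if: not _io.eof
```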

6. Advanced techniques

6.1. Importing types from other files

As your project grows in complexity, you might want to have multiple .ksy files: for example, for different file formats, structures, substructures, or to reuse the same subformat in several places. Like most programming languages, Kaitai Struct allows you to have multiple source files and has imports functionality for that.

Using multiple files is very easy. For example, given that you have a date.ksy file that describes the date structure:

meta:
  id: date
seq:
  - id: year
    type: u2le
  - id: month
    type: u2le
  - id: day
    type: u2le

and you want to use it in a file listing specification filelist.ksy. Here’s how to do that:

meta:
  id: filelist
  # this will import "date.ksy"
  imports:
    - date
seq:
  - id: entries
    type: entry
    repeat: eos
types:
  entry:
    seq:
      - id: filename
        type: strz
        encoding: ASCII
      # just use "date" type from date.ksy as if it was declared in
      # current file
      - id: timestamp
        type: date
      # you can access its members too!
      - id: historical_data
        size: 160
        if: timestamp.year < 1970

Generally, you just add an array in meta/imports and list all you want to import there. There are 2 ways to address the files:

Relative

Uses the given path as a path relative to the directory where the main .ksy file resides. It’s useful for including files in the same directory or for navigating to somewhere else in your project. Examples include: foo, foo/bar, ../foo/bar/baz, etc.

Absolute

Looks like /foo or /foo/bar (i.e. starting with a slash), and searches for the given .ksy file in module search path(s). This is usually used to import modules from centralized repositories / ksy libraries. Module search paths are determined by (in order of decreasing priority):

  • Paths given using command-line -I switch.

  • Paths given using KSPATH environment variable (multiple paths can be specified separated with : on Linux/OS X and with ; on Windows)

  • Default platform-dependent search paths, determined at compiler build time and/or during installation

    In Web IDE you obviously don't have environment and command-line
    switches, so absolute path imports are used to reference modules in
    preloaded "kaitai.io" library.
Caution
Please use only forward slashes / in import paths for consistency. Kaitai Struct will convert them automatically to proper platform-dependent path separator (/ or \).
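
To put the two addressing modes side by side, here is a sketch of a meta/imports list. Only date comes from the example above; the other two paths are hypothetical and only show the shape of each form:

```yaml
meta:
  id: filelist
  imports:
    - date                  # relative: date.ksy next to this file
    - ../shared/timestamp   # relative: hypothetical file elsewhere in the project
    - /common/checksum      # absolute: hypothetical module, looked up in search paths
```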

6.2. Opaque types: plugging in external code

Sometimes you’d want KS-generated code to call code in your application to do the parsing, for example, to parse some text- or state-based format. For that, you can instruct ksc to generate code with so-called "opaque" types.

Normally, if a compiler encounters a type which is not declared either in current file or in one of the imported files, for example:

meta:
  id: doc_container
seq:
  - id: doc
    type: custom_encrypted_object

it will output an error:

    /seq/0: unable to find type 'custom_encrypted_object', searching from doc_container

If we want to provide our own implementation of custom_encrypted_object type, first we need to compile our .ksy file with --opaque-types=true option. This will avoid the error, and the compiler will consider all unknown types to be "opaque", i.e. it will treat them as existing in some external space.

Note
Of course, the compiler doesn’t know anything about opaque types, so trying to access any attributes of such a type (i.e. using expression language) will fail.

This will generate the following code (for example, in Java):

public class DocContainer extends KaitaiStruct {
    // ...
    private void _read() {
        this.doc = new CustomEncryptedObject(this._io);
    }
}

As you see, CustomEncryptedObject is instantiated here with a single argument: IO stream. All that’s left is to create a class with a compatible constructor that will allow a call with single argument. For statically typed languages, note that constructor’s argument is of type KaitaiStream.

An example of what can be done (in Java):

import io.kaitai.struct.KaitaiStream;

public class CustomEncryptedObject {
    byte[] buf;

    public CustomEncryptedObject(KaitaiStream io) {
        // read all remaining bytes into our buffer
        buf = io.readBytesFull();

        // implement our custom super Caesar's cipher
        for (int i = 0; i < buf.length; i++) {
            byte b = buf[i];
            if (b >= 'A' && b <= 'Z') {
                int letter = b - 'A';
                letter = (letter + 7) % 26;
                buf[i] = (byte) (letter + 'A');
            }
        }
    }
}
Tip
Alternatively, opaque types can be (ab)used to connect several KS-generated types together without importing. If one type instantiates another, but does not use it in any other way (i.e. doesn’t access its inner attributes using expression language), one can just compile the two .ksy files separately, throw them into the same project and they shall use each other without a problem.

6.3. Enforcing parent type

Every object (except for the top-level object) in a .ksy file has a parent, and that parent has a type, which is some sort of user-defined type. What happens if two or more objects use the same type?

two parents
types:
  opcode_jmp:
    seq:
      - id: target
        type: arg
  opcode_push:
    seq:
      - id: value
        type: arg
  arg:
    seq:
      - id: arg_type
        type: u1
      - id: arg_value
        type: u1

In this example, both opcodes use the same type arg. Given that the two parents are of different types, KS infers that the only thing they have in common is that they are objects generated by Kaitai Struct, and thus they usually implement KaitaiStruct API, so the best common type that will be ok for both parents is KaitaiStruct. Here’s how it looks in a statically typed language, e.g. in Java:

public static class OpcodeJmp extends KaitaiStruct {
    // ...
    private void _read() {
        this.target = new Arg(this._io, this, _root);
    }
    // ...
}
public static class OpcodePush extends KaitaiStruct {
    // ...
    private void _read() {
        this.value = new Arg(this._io, this, _root);
    }
    // ...
}
public static class Arg extends KaitaiStruct {
    public Arg(KaitaiStream _io, KaitaiStruct _parent, TopLevelClass _root) {
        // ...
    }
    // ...
}

Note that both OpcodeJmp and OpcodePush supply this as the _parent argument to the Arg constructor, where it is declared as KaitaiStruct. As both opcode classes are declared with extends KaitaiStruct, this code will compile properly.

6.3.1. Replacing parent

However, in some situations, you might want to replace default this passed as _parent with something else. In some situations this will provide you a clean and elegant solution to relatively complex problems. Consider the following data structure that loosely represents a binary tree:

types:
  tree:
    seq:
      - id: chunk_size
        type: u4
      - id: root_node
        type: node
  node:
    seq:
      - id: chunk
        size: ??? # <= need to reference chunk_size from tree type here
      - id: has_left_child
        type: u1
      - id: has_right_child
        type: u1
      - id: left_child
        type: node
        if: has_left_child != 0
      - id: right_child
        type: node
        if: has_right_child != 0

Everything is pretty simple here. Main tree type has chunk_size and a root_node, which is of node type. Each individual node of this tree carries a chunk of information (of size determined in tree type), some flags (has_left_child and has_right_child) and then calls itself again to parse either left or right child nodes for current node if they exist, according to the flags.

The only problem is how to access chunk_size in each node. You can’t access the tree object starting from _root here, as there could be many different trees in our file, and you need to access the current one. Using _parent directly doesn’t work either: given that the node type is used both by tree and by node itself, it has two different parents, so the Kaitai Struct compiler downgrades node’s parent type to KaitaiStruct, and thus trying to access _parent.chunk_size would result in a compile-time error.

TODO: add more about the error

This situation can be resolved easily by using parent overriding. We modify our code this way:

types:
  tree:
    seq:
      - id: chunk_size
        type: u4
      - id: root_node
        type: node
  node:
    seq:
      - id: chunk
        size: _parent.chunk_size # <= now one can access `tree` with _parent
      - id: has_left_child
        type: u1
      - id: has_right_child
        type: u1
      - id: left_child
        type: node
        parent: _parent # <= override parent to be parent's parent
        if: has_left_child != 0
      - id: right_child
        type: node
        parent: _parent # <= override parent here too
        if: has_right_child != 0

We’ve changed only three lines. We’ve enforced parent of the node in left_child and right_child attributes to be passed as _parent, not this. This, effectively, continues passing reference to original node’s parent, which is a tree type object, over and over the whole recursive structure. This way one can access structure’s root by just using _parent. Naturally, we’ve done exactly that to get ourselves chunk_size by just using size: _parent.chunk_size.

6.3.2. Omitting parent

In some cases, you’d rather want some object not to have any parent at all. The primary use case for that is to make sure that some instantiation does not affect the parent type. In many cases, resorting to this method is a sign that you need to stop and rethink your design, but for some formats, it’s unavoidable and in fact simplifies things a lot.

To omit parent (i.e. pass null reference or something similar as a parent in one particular case), use parent: false.

Note

Language design explanation: while it might seem logical to specify parent: null, there are two catches:

  • KSY is a YAML-based language, and YAML treats parent: null as literally null value, i.e. totally the same as parent:. So, just to allow passing a solitary null as a value, you’d need to wrap it into quotes: parent: 'null'. This would be very awkward for beginners, as we can’t even generate a good error message here, as we can’t distinguish these two.

  • Omitting parent is actually a special case, not just a matter of passing null. In fact, some languages do not have a concept of null, or do not allow passing null as an object reference, so we need to treat it distinctly anyway, and emphasize that.
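
Syntactically, this looks as follows. This is only a sketch of where the key goes, with made-up type names, and not a motivating use case:

```yaml
types:
  record:
    seq:
      - id: stamp
        type: timestamp
        parent: false   # this timestamp object gets no parent
  timestamp:
    seq:
      - id: seconds
        type: u4le
```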

TODO: an example where omitting the parent comes useful

7. Common pitfalls

This section illustrates problems that are encountered frequently by beginner Kaitai Struct users.

7.1. Specifying size creates a substream

TODO

7.2. Applying process without a size

In any case, the size of the data to process must be defined (either with size: … or with size-eos: true). For example, this is legal (there is no process here, so the size will be determined by some_body_type, reading normally from the stream):

seq:
  - id: body
    type: some_body_type

And this is not:

seq:
  - id: body
    process: xor(0xaa)
    type: some_body_type
    # will not compile - lacks size

This is because most processing algorithms need to know the size of the data to process beforehand, and the final size of some_body_type might be determined only at run time, after parsing has taken place.
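
A sketch of a variant that satisfies this requirement. The size of 64 bytes is an arbitrary assumption for illustration; size-eos: true would be the other way to define it:

```yaml
seq:
  - id: body
    size: 64              # hypothetical size, known before processing starts
    process: xor(0xaa)
    type: some_body_type
```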