A simple code generator for serializing and deserializing data.
CatBuffer is a program that generates source code for serializing and deserializing between raw binary buffers and user-defined data structures. The data structures are defined in a simple YAML format. Serialization and deserialization are memory efficient: no extra information or padding is written or read beyond what the user defines (unlike Protobuf, Cap'n Proto, FlatBuffers etc.).
The catbuffer-generator takes the .yaml file as input and generates C++ files, which are compiled into a library. The process is shown below:
.yaml file --> catbuffer-generator --> .cpp/.h files --> C++ lib file
For now only C++ is supported, but Rust and Python implementations are planned.
CatBuffer is especially optimized for reading blockchain data. Currently, .yaml files are provided for the Bitcoin, Symbol and NEM blockchains so that their blockchain files (e.g. Bitcoin blk*.dat files) can be read and written. The same definitions also support serializing and deserializing data structures that are sent and received between network clients.
A simple CatBuffer data structure definition is shown below:
- name: Coordinate
  type: struct
  layout:
    - name: x
      type: uint32
    - name: y
      type: uint32
    - name: z
      type: uint32
The generated code will allow serialization and manipulation of the above data structure in C++ like so:
// Read vector 'a' (deserialize from file)
RawBuffer dataA = read_file( "vector_a.raw" ); // read_file returns a raw byte buffer
Coordinate a;
a.Deserialize( dataA ); // 'Deserialize()' initializes 'Coordinate' members from a raw buffer
// Read vector 'b' (deserialize from file)
RawBuffer dataB = read_file( "vector_b.raw" );
Coordinate b;
b.Deserialize( dataB );
// Compute cross product (initialize 'Coordinate' Catbuffer 'c')
Coordinate c;
c.x = a.y*b.z - a.z*b.y;
c.y = a.z*b.x - a.x*b.z;
c.z = a.x*b.y - a.y*b.x;
// Write vector 'c' (serialize to file)
RawBuffer dataC;
c.Serialize( dataC ); // 'Serialize()' writes 'Coordinate' members to a raw buffer
write_file( dataC, "vector_c.raw" );
For more elaborate data structures see e.g. the bitcoin blockchain .yaml file 'here'
To generate the C++ library file, do the following:
- Clone the catbuffer-generators repository:
git clone https://github.com/0x31313030/CatBuffer
- Generate the .cpp/.h files:
python3 -m generator input_file.yaml output_directory/
- Enter the 'output_directory' where files have been generated:
cd output_directory
- Create a directory to build library:
mkdir _build && cd _build
- Generate CMake files:
cmake ..
- Compile the library:
make
You should now see a file called libcatbuffer.a which you can link to your program in order to serialize/deserialize the data structures you defined in your .yaml file.
generator
: The Python source code for parsing an input YAML file and outputting C++ code.
cpp_source
: Static C++ source code needed for serialization/deserialization, which is independent of the input YAML file.
cpp_build_files
: C++ build files for compiling the code generated by the generator.
unit_tests
: Unit tests for the code in the generator/ folder.
yaml_test_inputs
: YAML input files for testing.
test_vectors
: Test vectors corresponding to the YAML test inputs in the yaml_test_inputs/ folder.
end_to_end_test
: Contains end-to-end tests where serialized inputs are deserialized and then serialized again to check that the output is equal to the input. The test takes the YAML inputs in the 'yaml_test_inputs' folder, generates C++ outputs, takes the test vectors in 'test_vectors', uses the generated code to deserialize the input vectors and then serializes them again to compare the result with the initial input vectors.
The generator includes multiple unit tests located in the unit_tests folder. They can be run by using 'python3 -m unittest' like so:
python3 -m unittest -v unit_tests/TestYamlFieldErrorDetection.py
To test that the generated code deserializes and serializes correctly, some YAML test inputs are included in the yaml_test_inputs folder. To run these tests, execute the following commands from the base folder:
mkdir output-symbol
python3 -m generator yaml_test_inputs/symbol.yaml output-symbol
cd output-symbol
mkdir _build && cd _build
cmake ..
make
cd ../../end-to-end-tests
mkdir _build && cd _build
cmake ..
make
./main
The generator can optionally generate C++ code for printing out deserialized data. It can also generate a command line interface (cli) for deserializing raw files and hex strings. To add support for pretty-printing and the cli, use the '--generate-print' option:
python3 -m generator input_file.yaml output_directory --generate-print
This will add a 'Print()' method to the 'ICatbuffer' interface and an executable called 'cmd' which can be used to deserialize hex strings and raw files like so:
$ ./cmd --hex Coordinate 0D0000000E0000000F000000
Coordinate (12 bytes)
{
uint32_t x: 13 (4 bytes)
uint32_t y: 14 (4 bytes)
uint32_t z: 15 (4 bytes)
}
Data deserialized successfully!
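The byte layout behind this output can be checked by hand. The following Python sketch is only an illustration and is not part of CatBuffer. It assumes little-endian encoding, which is what the sample output implies ('0D000000' decodes to 13).

```python
import struct

# Hex string from the example above: three consecutive uint32 values.
raw = bytes.fromhex("0D0000000E0000000F000000")

# "<III" = little-endian, three unsigned 32-bit integers, no padding.
# Little-endian is an assumption inferred from the sample output.
x, y, z = struct.unpack("<III", raw)

print(f"Coordinate ({len(raw)} bytes): x={x}, y={y}, z={z}")

# Packing the same values again reproduces the original bytes.
repacked = struct.pack("<III", x, y, z)
```

Because the format string carries no padding, the serialized size is exactly the sum of the field sizes (3 x 4 bytes).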
The generator accepts YAML files and outputs C++ files. An example of a simple data structure defined in YAML is shown below:
- name: Coordinate
  type: struct
  comments: a structure for storing a 3D coordinate
  layout:
    - name: x
      type: int32
      comments: the x coordinate
    - name: y
      type: int32
      comments: the y coordinate
    - name: z
      type: int32
      comments: the z coordinate
The above markup defines a data structure called Coordinate with 3 fields called x, y and z of type int32. This markup will be explained in detail below.
Catbuffer supports the following builtin datatypes: 'int8', 'uint8', 'int16', 'uint16', 'int32', 'uint32', 'int64', 'uint64', 'varint'
The 'varint' type is a variable length integer used widely in Bitcoin. It is explained here
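As a rough illustration, the sketch below assumes that 'varint' refers to Bitcoin's CompactSize encoding: values below 0xFD take one byte, and larger values take a one-byte prefix (0xFD, 0xFE or 0xFF) followed by a little-endian uint16, uint32 or uint64. The function names are hypothetical and this is not the code CatBuffer generates.

```python
import struct

def encode_varint(n):
    """Encode a non-negative integer as a Bitcoin CompactSize value."""
    if n < 0 or n > 0xFFFFFFFFFFFFFFFF:
        raise ValueError("value does not fit in 64 bits")
    if n < 0xFD:
        return struct.pack("<B", n)            # single byte, no prefix
    if n <= 0xFFFF:
        return b"\xfd" + struct.pack("<H", n)  # prefix + uint16
    if n <= 0xFFFFFFFF:
        return b"\xfe" + struct.pack("<I", n)  # prefix + uint32
    return b"\xff" + struct.pack("<Q", n)      # prefix + uint64

def decode_varint(buf, offset=0):
    """Return (value, bytes_consumed) for the varint starting at offset."""
    prefix = buf[offset]
    if prefix < 0xFD:
        return prefix, 1
    fmt, width = {0xFD: ("<H", 2), 0xFE: ("<I", 4), 0xFF: ("<Q", 8)}[prefix]
    (value,) = struct.unpack_from(fmt, buf, offset + 1)
    return value, 1 + width
```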
Apart from the builtin types, alias types can be defined like so:
- name: FeeMultiplier
  type: alias uint16
or
- name: Address
  size: 24
  type: alias array uint8
In C++, the above two examples would be equivalent to:
using FeeMultiplier = uint16_t;
and
using Address = struct Address_t { uint8_t data[24]; };
Custom types can be useful for type checking and can be used when defining data structures.
An enum can be defined like so:
- name: NetworkType
  comments: enumeration of network types
  type: enum uint8
  values:
    - name: MAINNET
      comments: public network
      value: 104
    - name: TESTNET
      comments: public test network
      value: 152
which would be equivalent to this in C++:
/**
 * enumeration of network types
 */
enum class NetworkType : uint8_t
{
    MAINNET = 104, //< public network
    TESTNET = 152, //< public test network
};
Structs are the most elaborate custom types in Catbuffer and can contain multiple fields, including other structs. A struct has to define at least three keys: 'name', 'type' and 'layout'. An optional 'comments' key can also be added. The 'type' key has to be set to 'struct'. 'layout' defines the fields in the struct. An example of a 'struct' was shown here. In total there are 9 field types that can appear inside 'layout'. They are listed below:
builtin types
alias types
condition
reserved
const
inline
array
array_sized
array_fill
The subsections below will explain the above field types in more detail.
Builtin types are the simplest types supported in Catbuffer. Builtin type fields can be added like so:
- name: time_elapsed
  type: uint64
Some custom types such as NetworkType, FeeMultiplier, Coordinate were defined here, here and here. Below they are shown as fields in a struct:
- name: network
  type: NetworkType
- name: multiplier
  type: FeeMultiplier
- name: coordinate
  type: Coordinate
Condition fields can be used to add optional fields. If a condition is met then the field is serialized, otherwise it is ignored. Below is an example of a condition field:
- name: msg
  type: Message
  condition: MessageIncluded
  condition_operation: not equals
  condition_value: 0
The name of the field is 'msg' and is of alias type 'Message'. It is only serialized/deserialized if MessageIncluded != 0. Note that MessageIncluded has to be a field in the same struct, defined before 'msg'. The only condition operations supported at the moment are equals and not equals.
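The control flow of such a condition can be sketched in a few lines of Python. The layout used here is an assumption made for illustration only: 'MessageIncluded' is taken to be a uint8 and 'Message' a 4-byte little-endian integer. The function name is hypothetical and this is not the generated C++ code.

```python
import struct

def deserialize_with_condition(buf):
    """Parse MessageIncluded (uint8, assumed), then msg only if the condition holds."""
    fields = {}
    offset = 0

    (fields["MessageIncluded"],) = struct.unpack_from("<B", buf, offset)
    offset += 1

    # condition: MessageIncluded / condition_operation: not equals / condition_value: 0
    if fields["MessageIncluded"] != 0:
        (fields["msg"],) = struct.unpack_from("<I", buf, offset)
        offset += 4

    fields["bytes_read"] = offset
    return fields
```

When the condition is not met, the field takes up no bytes at all in the buffer, so the reader simply moves on to the next field.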
Note that the field itself can also be used as a condition:
- name: Flag
  type: uint16
  condition: Flag
  condition_operation: equals
  condition_value: 256
In this case 'Flag' is deserialized/serialized, if and only if, it is equal to 256. This type of field is used in Bitcoin.
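One plausible reading of this behaviour is "peek, then consume": the next two bytes are inspected, and the read position only advances if they match. The Python sketch below illustrates that reading under the assumption of little-endian encoding; the helper name is hypothetical. With that byte order the value 256 corresponds to the bytes 0x00 0x01, which matches the marker and flag bytes of Bitcoin's segregated witness transaction format.

```python
import struct

def read_optional_flag(buf, offset):
    """Peek a uint16 at offset; consume it only if it equals 256.

    Returns (flag_or_None, new_offset).
    """
    if offset + 2 <= len(buf):
        (candidate,) = struct.unpack_from("<H", buf, offset)
        if candidate == 256:
            return candidate, offset + 2   # condition met: field is present
    return None, offset                    # field absent: position unchanged
```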
Reserved fields are useful for when a field is reserved for future use and should have a specific value that can not be set by the user. It is also useful for adding padding. Reserved fields are defined like builtin fields but with the keyword reserved added to the type field and a value field. Below is an example of how to define a reserved field for padding:
- name: padding
  type: reserved uint32
  value: 0
  comments: reserved padding to align next field on 8-byte boundary
Note that when serializing/deserializing, if the value read for a reserved field does not equal the value key, it is considered an error and the serialization/deserialization will fail.
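The check described above amounts to comparing the bytes that were read against the declared value. A minimal Python sketch, assuming little-endian encoding and using a hypothetical exception type and function name:

```python
import struct

class ReservedValueError(ValueError):
    """Raised when a reserved field does not hold its declared value (hypothetical)."""

def read_reserved_uint32(buf, offset, expected):
    """Validate a reserved uint32 at offset and return the offset of the next field."""
    (value,) = struct.unpack_from("<I", buf, offset)
    if value != expected:
        raise ReservedValueError(
            f"reserved field at offset {offset}: expected {expected}, got {value}"
        )
    return offset + 4
```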
Inline fields can be used to inline structs into other structs, so that instead of doing OuterStruct.InnerStruct.my_variable, one can do OuterStruct.my_variable. An example of how to do an inline field is shown below
- type: inline Coordinate
It is possible to define constants in Catbuffer. Although they are not read or written when serializing, they are included as class members when generating code. They can be defined like so:
- name: VERSION
  type: const uint8
  value: 14
Which would generate a C++ class member similar to this:
const uint8_t VERSION = 14;
An array field is a plain array whose elements all have the same type. Its number of elements is given by the size key:
- name: amounts
  size: amount_size
  type: array uint64
Note that amount_size has to be a field in the same struct which appears before the array field. Furthermore, note that type can also be a struct defined type. Finally note that if the same size field is used for multiple arrays, the arrays have to be of equal size when serializing/deserializing.
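The relationship between the size field and the array can be sketched as follows. The sketch assumes that 'amount_size' is a uint8 placed immediately before the array and that values are little-endian; both are assumptions for illustration, and the function names are hypothetical.

```python
import struct

def write_amounts(amounts):
    """Serialize amount_size (uint8, assumed) followed by the uint64 elements."""
    count = len(amounts)
    return struct.pack("<B", count) + struct.pack(f"<{count}Q", *amounts)

def read_amounts(buf):
    """Read amount_size first, then exactly that many uint64 elements."""
    (amount_size,) = struct.unpack_from("<B", buf, 0)
    return list(struct.unpack_from(f"<{amount_size}Q", buf, 1))
```

The size field has to be read before the array, which is why it must appear earlier in the struct.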
An array_sized field is an array where the number of elements is not known, but where the total array size in bytes is known. This is useful for arrays where each element is of a different type and size. The array elements in this case are of user defined custom types, however, all elements must share a common header struct, which in turn contains a field which indicates what the element type is. The header type is indicated with the header key and the field within the header containing the element type, is indicated with the header_type_field key. An example of this is shown below:
- name: Transactions
  type: array_sized EmbeddedTransaction #<--- Header common to all elements in array.
  header_type_field: elem_type #<--- Name of field in 'EmbeddedTransaction' which contains type of element.
  size: numElements
  align: 8 #<--- Optional alignment of array elements in bytes.
In the above example the total size of the array in bytes is given in the size key. 'EmbeddedTransaction' is the header which is common to all elements in the 'Transactions' array. The field in 'EmbeddedTransaction' which indicates the type of an element is called 'elem_type'. The type of the 'elem_type' field itself has to be an enum. An optional alignment for the array elements can be indicated by specifying an align field. This will add padding at the end of each array element so that the subsequent element is memory aligned. If alignment is specified then the size field must include the size of the padding.
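The loop below sketches how such an array could be walked. Everything about the element layout is hypothetical: the header is reduced to a single uint8 'elem_type', two made-up element types carry a uint32 and a uint64 payload, and padding is computed relative to the start of each element. It only illustrates how the byte size and the alignment interact; it is not the generated code.

```python
import struct

HEADER_FORMAT = "<B"                      # hypothetical header: uint8 elem_type
PAYLOAD_FORMATS = {1: "<I", 2: "<Q"}      # hypothetical elem_type -> payload format

def padding_for(size, align):
    """Number of padding bytes needed to round size up to a multiple of align."""
    return (-size) % align

def read_array_sized(buf, offset, total_size, align=8):
    """Read elements until total_size bytes (padding included) are consumed."""
    end = offset + total_size
    elements = []
    while offset < end:
        start = offset
        (elem_type,) = struct.unpack_from(HEADER_FORMAT, buf, offset)
        offset += struct.calcsize(HEADER_FORMAT)
        fmt = PAYLOAD_FORMATS[elem_type]   # the header decides how to parse the rest
        (payload,) = struct.unpack_from(fmt, buf, offset)
        offset += struct.calcsize(fmt)
        offset += padding_for(offset - start, align)  # skip trailing padding
        elements.append((elem_type, payload))
    if offset != end:
        raise ValueError("elements do not add up to the declared array size")
    return elements, offset
```

In this sketch a 5-byte element is followed by 3 padding bytes and a 9-byte element by 7, so the declared size for the two elements is 24 bytes.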
Given an array_sized field, Catbuffer will automagically know how to serialize and deserialize. For this to happen, the elements in the array also need to be defined with a specific field called struct_type as shown below:
- type: struct_type TransactionType #<---'TransactionType' is an Enum defined somewhere else.
  value: MOSAIC_DEFINITION @3 #<--- MOSAIC_DEFINITION is an Enumerator in 'TransactionType', and '@3' is the version.
  header: EntityBody #<--- header where version and type of struct is stored.
  version_field: version #<--- the name of the field within the header defining the version.
  type_field: type #<--- the name of the field within the header defining the type.
MOSAIC_DEFINITION is an enumerator in the TransactionType enum, which also has to be the type of the elem_type field mentioned above. This enum gives the type of the struct within the group TransactionType. The @3 part indicates the version of the struct. This way Catbuffer can support evolving structures over time.
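Conceptually, the type and version pair acts as a lookup key for the struct to instantiate. The Python sketch below illustrates that idea with a plain dictionary. The enumerator value, the class and the decision to raise an error for unknown combinations are all assumptions made for the example, not documented CatBuffer behaviour.

```python
from enum import IntEnum

class TransactionType(IntEnum):
    MOSAIC_DEFINITION = 1   # hypothetical numeric value

class MosaicDefinitionV3:
    """Hypothetical stand-in for a generated struct class."""

# (type enumerator, version) -> class to instantiate
REGISTRY = {
    (TransactionType.MOSAIC_DEFINITION, 3): MosaicDefinitionV3,
}

def create_struct(type_value, version):
    """Return a new instance for the given type/version pair."""
    try:
        cls = REGISTRY[(TransactionType(type_value), version)]
    except (ValueError, KeyError):
        # Assumption: unknown combinations are treated as an error.
        raise LookupError(f"no struct for type {type_value} version {version}") from None
    return cls()
```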
[//]: # ( TODO: header, version_field and type_field should not be defined here since its the same for all structs of type TransactionType )
[//]: # ( TODO: what should happen if a version and type combination does not exist? )
[//]: # ( TODO: Need to think about this a bit more. What if there are two fields of type EmbeddedTransaction? What if its not inline? )
An 'array fill' is a normal array with elements of the same type, but where the number of elements is computed from how much data is still pending to be serialized/deserialized. For example, if the total amount of data to deserialize is 'n' bytes and 'm' bytes have already been deserialized when the field is reached, then the size of the array is 'n-m' bytes. An 'array fill' field is defined like so:
- name: signatures
  type: array_fill Signature
Note that an 'array fill' field has to be the last field in the outermost struct, otherwise it is an error.
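The idea can be sketched in Python as slicing whatever is left of the buffer into equally sized elements. The element size and the function name are hypothetical; the sketch assumes fixed-size elements and is not the generated code.

```python
def read_array_fill(buf, offset, element_size):
    """Split all bytes from offset to the end of buf into fixed-size elements."""
    remaining = len(buf) - offset
    if remaining % element_size != 0:
        raise ValueError("remaining bytes are not a whole number of elements")
    return [buf[i:i + element_size] for i in range(offset, len(buf), element_size)]
```

Since the array consumes everything up to the end of the buffer, no field placed after it could ever be read, which is why it has to come last.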
[//]: # (TODO: implement check)
The generator code, located in the generator/ folder, contains multiple classes for converting YAML inputs to C++ code. Some classes generate C++ declaration code, which goes into .h files, others generate definition code, which goes into .cpp files, and a third group validates the YAML input. Below is a quick overview of the main classes:
Declaration Classes | Description |
---|---|
CppClassMemberGenerator | Takes fields defined in YAML and converts them to C++ class members. |
CppClassDeclarationGenerator | Generates C++ class declarations which go into .h files. |
CppTypesGenerator | Converts enums and alias types defined in YAML and outputs them in types.h. |
Definition Classes | Description |
---|---|
CppSerializationGenerator | Takes a field defined in YAML and generates C++ code to serialize it into a raw byte buffer. |
CppDeserializationGenerator | Takes a field defined in YAML and generates C++ code to deserialize it from a raw byte buffer. |
CppClassDefinitionGenerator | Generates C++ class definitions which go into .cpp files. |
CppEnumeratorToClassGenerator | Generates C++ functions to convert from enums to class instances. |
Yaml Checker Classes | Description |
---|---|
YamlFieldChecker | Contains checks to ensure that the different fields contain the necessary YAML keys |
YamlDependencyChecker | Contains checks to ensure that the dependencies defined in the YAML fields are valid |
The above classes are documented in more detail in the source code.
When the generator is done parsing a YAML input file, three kinds of C++ files are generated. First, a types.h file is generated, which contains all alias types and enums. Then, for each defined struct type, a .cpp/.h pair is generated with a C++ class that contains the defined fields as class members and implements the ICatbuffer interface, which enables serialization/deserialization. The ICatbuffer interface is explained below. Lastly, the files converters.h/.cpp contain the functions necessary to convert an enumerator to an instance of a struct (represented as an ICatbuffer pointer) as explained here.
The ICatBuffer interface declares methods for serializing and deserializing raw byte buffers. It also declares a method for getting the total size of all fields in serialized form. All structs declared in the input YAML file are converted to C++ classes that inherit from ICatbuffer. This allows structs to be initialized by deserialization. The 'ICatBuffer.h' header file is defined in the cpp_source/ folder.
RawBuffer is the buffer type that the ICatBuffer interface declares as input for the serializer and deserializer methods. It is therefore compiled into the output C++ library file. RawBuffer implements simple buffer handling with out-of-bounds protection.
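To illustrate what out-of-bounds protection means in practice, here is a minimal Python sketch of a bounds-checked reader. It is a hypothetical stand-in and does not mirror the actual RawBuffer API: the class name, its methods and the choice of exception are assumptions.

```python
import struct

class BoundedReader:
    """Hypothetical reader that refuses to read past the end of its data."""

    def __init__(self, data):
        self._data = data
        self._pos = 0

    def remaining(self):
        """Number of bytes that have not been read yet."""
        return len(self._data) - self._pos

    def read(self, fmt):
        """Read one value described by a struct format such as '<I'."""
        size = struct.calcsize(fmt)
        if size > self.remaining():
            # Fail before touching the data; the read position is unchanged.
            raise IndexError(f"need {size} bytes, only {self.remaining()} left")
        (value,) = struct.unpack_from(fmt, self._data, self._pos)
        self._pos += size
        return value
```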