XML Specification Reference Manual
The XML specification for a Myriad-based data generator project is located under src/config/${dgen-name}-prototype.xml
, where ${dgen-name}
is the name of the Myriad project as configured in the .myriad-settings
file. When you create a new project, the src/config
folder is initially empty. In order to bootstrap a new XML specification, you can run the following assistant task:
./myriad-assistant initialize:prototype $prototype_name
This will populate the src/config
folder with a sample XML specification identified by the given $prototype_name
. Currently, the only supported prototype names are empty
- which initializes an empty specification, and customer
- which contains the sample specification of a customer domain type that is used for the example snippets in this manual.
In order to invoke the Myriad prototype compiler you have to execute the compile:prototype task in the assistant CLI tool:
./myriad-assistant compile:prototype
If you are working from the build
folder, you can use the enclosing make target shortcut instead:
make prototype
When the Myriad compiler is invoked for the first time, it will generate three groups of C++ sources:
- a family of domain types (located under src/cpp/record),
- an associated family of PRDG functions (also called setter chains, located under src/cpp/runtime/setter), and
- a generator configuration that manages all information and auxiliary objects required by the PRDG functions (located under src/cpp/config).
All generated C++ classes follow a hierarchy consisting of a base parent class and an actual derived class, and all auto-generated code is placed in the base class. This mechanism allows the user to implement code-level extensions directly in the generated C++ classes by simply extending the appropriate methods in the derived class. Subsequent invocations of the compiler will not touch already existing derived classes, which means that users can modify and re-compile the XML specification even after adding custom logic at the code level. Code-level extensions are therefore not an alternative, but rather a complementary way to specify your data generator programs.
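The base/derived split described above can be sketched in a few lines of C++. This is only an illustration of the pattern, not the code that the Myriad compiler emits: the class names follow the Base{Type}SetterChain naming used later in this manual, while the describe() method is a hypothetical placeholder.

```cpp
#include <string>

// Hypothetical stand-in for an auto-generated base class. In a real project
// the compiler may overwrite this class on every run.
class BaseCustomerSetterChain
{
public:
    virtual ~BaseCustomerSetterChain() {}

    // Auto-generated default behavior (placeholder method, not a Myriad API).
    virtual std::string describe() const
    {
        return "generated";
    }
};

// Hypothetical stand-in for the derived class. It is created once and then
// left untouched, so hand-written code placed here survives recompilation.
class CustomerSetterChain : public BaseCustomerSetterChain
{
public:
    // User extension: reuse the generated behavior and add custom logic.
    std::string describe() const override
    {
        return BaseCustomerSetterChain::describe() + "+custom";
    }
};
```

Code that works with the base type picks up the extension through virtual dispatch, so regenerating the base class does not require touching the custom logic.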
When you are ready with the XML and C++ modifications and want to build or re-build your project, go to the build
folder and execute the following commands:
make prototype # rebuild the prototype extensions
make cleanall # clean all artifacts from previous compilations
make all # build the whole project from scratch
This section contains an introduction to the base structure of a Myriad XML specification and the set of core types supported by the Myriad compiler.
The base structure of a Myriad XML specification consists of four main sections:
<?xml version="1.0" encoding="UTF-8"?>
<generator_prototype xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://www.dima.tu-berlin.de/myriad/prototype">
<!-- configurable parameters -->
<parameters>
...
</parameters>
<!-- function configuration -->
<functions>
...
</functions>
<!-- enumerated sets specification -->
<enum_sets>
...
</enum_sets>
<!-- record sequences specification -->
<record_sequences>
...
</record_sequences>
</generator_prototype>
The <parameters>
section contains a list of parameter values (e.g. base cardinalities) that can be used in the other sections. The <functions>
section specifies the probability distribution functions used to populate the values of the different domain model fields. The <enum_sets>
section specifies domain-specific enumerated sets (e.g. customer gender), and the <record_sequences>
section contains the specification of the structure (i.e. the type) and the PRDG function (i.e. the setter chain) of all domain types comprising the generated domain model.
Many of the object types contained in the <functions>
and <record_sequences>
sections depend on arguments with simple types. The Myriad Toolkit currently supports the following set of simple types:
Simple Type | Description |
---|---|
I16 | 16-bit signed integer |
I32 | 32-bit signed integer |
I64 | 64-bit signed integer |
I16u | 16-bit unsigned integer |
I32u | 32-bit unsigned integer |
I64u | 64-bit unsigned integer |
Enum | the type of all enumerated sets |
Decimal | decimal numbers |
Date | Gregorian calendar date, formatted as YYYY-MM-DD |
String | null-terminated ASCII string |
In the next sections, we discuss the syntax and semantics of the four main XML specification sections in detail. To illustrate the function and the interplay between the various syntactic elements, we will use an intuitive retailer use-case where the domain model consists of four types - customers, orders, products and lineitems.
The <parameters>
section contains a list of parameter values (e.g. base cardinalities) that can be used in the specification of the other generator prototype elements. Per convention, parameter keys are prefixed with the key of the record sequence that they describe, e.g.
<parameters>
...
<parameter key="customer.sequence.base_cardinality">1000</parameter>
...
</parameters>
defines a base cardinality parameter for the customer record sequence.
Parameters contained in the <parameters>
section of an XML specification are also referred to as explicit parameters, because they are an explicit part of the XML specification. Besides those, the system also offers a number of implicit parameters. Although these are not explicitly listed in the <parameters>
section, you can use them in parametrized expressions the same way you would use explicit parameters.
Here is a list of the implicit parameters you might want to use in your prototype specifications.
Parameter Name | Type | Description |
---|---|---|
ENV.config-dir | String | The configuration directory used by the data generator executable. This parameter is always available at runtime. If you do not use a custom `config-dir` runtime parameter, the contents of this directory will be copied from `src/config`. |
{seq_key}.sequence.cardinality | I64u | The concrete cardinality of the record sequence identified by {seq_key}. This parameter is always available at runtime. |
In order to refer to parameters (explicit or implicit) in the XML specification, the user has to use the dedicated %parameter_name%
syntax. For example, to pass the value of the customer.sequence.base_cardinality
parameter to the base_cardinality
argument in the cardinality estimator for the customer record sequence, use the following code fragment (the exact semantics of the cardinality_estimator
components will be explained later):
<cardinality_estimator type='linear_scale_estimator'>
<argument key='base_cardinality' type='I64u' value='%customer.sequence.base_cardinality%' />
</cardinality_estimator>
Besides the simple value expressions that refer to the value of a specific parameter, you can also combine multiple parameters in an arithmetic expression, which in Myriad is denoted by ${ <parameter_expression> }
brackets. Assume that you want to generate a fixed number of orders per customer, and let this value be given by the order.sequence.orders_per_customer
parameter. You can multiply the two parameters customer.sequence.base_cardinality
and order.sequence.orders_per_customer
to derive the correct base_cardinality
value for the order cardinality estimator:
<cardinality_estimator type='linear_scale_estimator'>
<argument key='base_cardinality' type='I64u' value='${ %(I64u)customer.sequence.base_cardinality% *
%(I64u)order.sequence.orders_per_customer% }' />
</cardinality_estimator>
Note that when we expand parameter values inside an expression, we use an optional cast hint inside round brackets directly after the opening %
literal, e.g. %(I16u)foo.param%
to cast the foo.param
parameter as a 16-bit unsigned integer. This is important because otherwise the XML compiler will not know how to convert the parameters and will always interpret them as strings, even if the expression context (e.g. multiplication) expects some other type.
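To make the %parameter_name% and %(Type)parameter_name% syntax more concrete, here is a minimal C++ sketch of a purely textual expansion step. It is an assumption-laden illustration and not the Myriad compiler's implementation: the function name expandParameters is hypothetical, the cast hint is recognized but ignored, and the value 10 used for order.sequence.orders_per_customer in the usage note is made up.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

// Replaces every %name% or %(Type)name% reference in `text` with the value
// stored under `name`. The optional cast hint is stripped and ignored, since
// this sketch only performs textual substitution (hypothetical helper).
std::string expandParameters(const std::string& text,
                             const std::map<std::string, std::string>& parameters)
{
    std::string result;
    std::size_t pos = 0;
    while (pos < text.size()) {
        const std::size_t open = text.find('%', pos);
        if (open == std::string::npos) {
            // No more references: copy the remaining text verbatim.
            result.append(text, pos, std::string::npos);
            break;
        }
        const std::size_t close = text.find('%', open + 1);
        if (close == std::string::npos) {
            throw std::runtime_error("unterminated parameter reference");
        }
        result.append(text, pos, open - pos);

        std::string name = text.substr(open + 1, close - open - 1);
        if (!name.empty() && name[0] == '(') {
            // Strip a leading cast hint such as "(I64u)".
            const std::size_t hintEnd = name.find(')');
            if (hintEnd == std::string::npos) {
                throw std::runtime_error("unterminated cast hint");
            }
            name.erase(0, hintEnd + 1);
        }

        const std::map<std::string, std::string>::const_iterator it = parameters.find(name);
        if (it == parameters.end()) {
            throw std::runtime_error("unknown parameter: " + name);
        }
        result += it->second;
        pos = close + 1;
    }
    return result;
}
```

With customer.sequence.base_cardinality set to 1000 and order.sequence.orders_per_customer set to 10, the expression from the example above expands to `${ 1000 * 10 }`. Evaluating the arithmetic and applying the cast hints would be separate steps.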
The <functions>
section contains declarations of the probability distribution functions (PDFs) that are used to sample value populations for the generated record fields. In order to use a particular value distribution in a setter chain, you first have to declare it as a PDF in the <functions>
section.
To illustrate the base syntax of a <function>
declaration, let us assume that the sequence of customers you want to generate will have an age
field and the values of this field should be normally distributed with a mean of 48 years and a standard deviation of 12 years. You can specify the corresponding probability distribution function as follows:
<function key='Pr[customer.age]' type='normal_probability[Decimal]'>
<argument key='mean' type='Decimal' value='48' />
<argument key='stddev' type='Decimal' value='12' />
</function>
The function can be later referenced by its unique key - Pr[customer.age]
. In general, each <function>
declaration has key
and type
attributes, and one or more argument
children that specify the parameters of the function. The following table gives a brief overview of the supported functions and their expected arguments:
Function Type | Allowed T-s | Arguments (type: name) |
---|---|---|
normal_probability[T] | Decimal | T: mean - the distribution mean; T: stddev - the distribution standard deviation |
pareto_probability[T] | Decimal | T: alpha - the distribution shape; T: x_min - the distribution scale |
uniform_probability[T] | I16, I32, I64, I16u, I32u, I64u, Enum, Decimal, Date | T: x_min - the minimal support value (inclusive); T: x_max - the maximal support value (exclusive) |
combined_probability[T] | I16, I32, I64, I16u, I32u, I64u, Enum, Decimal, Date | String: path - path to the file containing the distribution specification |
conditional_combined_probability[T1;T2] | I16, I32, I64, I16u, I32u, I64u, Enum, Decimal, Date | String: path - path to the file containing the distribution specification |
Note that the domain of the distribution function, and consequently the type of either the arguments or the values provided in the *.distribution
file, is a template parameter T
. When you use a template function type, you have to substitute the template parameter T
with a concrete simple type, e.g. uniform_probability[I32u]
. Please note that the set of allowed template parameter substitutions depends on the actual function type.
You might have already noticed that alongside some well-known analytical probability distribution functions, the Myriad Toolkit also provides two combined PDF types that allow you to specify custom probabilities using point-wise and bucket-wise probability mass mappings.
Assume that you want to specify a distribution of the product prices, and you know the exact probabilities for some prices and average probabilities for the remaining value ranges. You can use a combined_probability[Decimal]
function to capture this information:
<function key='Pr[product.price]' type='combined_probability[Decimal]'>
<argument key='path' type='String' value='${%ENV.config-dir% + "/distributions/price.distribution"}' />
</function>
Combined PDFs read their configuration from a PDF specification file located at the given path
. Per convention, all PDF specifications have a .distribution
extension and are placed in the %ENV.config-dir%/distributions
folder or one of its subfolders. Since the contents of src/config
folder are transparently copied to the default %ENV.config-dir%
when you build your Myriad project, developers can (and in most cases should) maintain all *.distribution
specifications as part of the source tree under src/config/distributions
. In the above example, the PDF specification should be located at src/config/distributions/price.distribution
.
Combined probabilities have a specification format that consists of three blocks: a header, a list of point-wise mappings, and a list of bucket-wise mappings. So, for instance, a specification of the Pr[product.price]
function could look like this:
# header
@numberofexactvals = 2
@numberofbins = 3
@nullprobability = 0.04
# exact probabilities block
p(X) = 0.10 for X = { 1.0 } # exactly 1 euro
p(X) = 0.12 for X = { 2.0 } # exactly 2 euro
# bucket probabilities block
p(X) = 0.24 for X = { x in [3.0 , 10.0 ) } # between 3 and 10 euro
p(X) = 0.15 for X = { x in [10.0 , 100.0) } # between 10 and 100 euro
p(X) = 0.35 for X = { x in [100.0 , 600.0) } # between 100 and 600 euro
The header block assigns the NULL value probability <null_p>
and specifies the number of entries in the second and the third blocks (N1
and N2
respectively):
@nullprobability = <null_p>
@numberofexactvals = <N1>
@numberofbins = <N2>
The exact probabilities block contains exactly N1
entries (lines) and comes after the header. Each line assigns a probability mass <p>
to a point from the T
domain <x>
using the syntax:
p(X) = <p> for X = { <x> }
The PDF specification ends with N2
bucket probability entries, each one assigning a probability mass <p>
to the half-open T
-interval [<x_min>, <x_max>)
using the syntax:
p(X) = <p> for X = { x in [<x_min>, <x_max>) }
An optional comment can be placed at the end of each line with the special comment delimiter #
.
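One way to read such a specification is as a partition of the unit interval: first the NULL probability, then the exact values, then the buckets. The C++ sketch below illustrates this reading with inverse-CDF sampling. It rests on assumptions that the manual does not confirm: the type and function names are hypothetical, the ordering of the masses is chosen for illustration, and values inside a bucket are assumed to be spread uniformly.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical in-memory form of a combined probability specification.
struct ExactEntry  { double p; double x; };
struct BucketEntry { double p; double xMin; double xMax; };
struct CombinedPdf
{
    double nullProbability;
    std::vector<ExactEntry> exact;
    std::vector<BucketEntry> buckets;
};

// Sums the NULL probability and all point-wise and bucket-wise masses.
double totalMass(const CombinedPdf& pdf)
{
    double sum = pdf.nullProbability;
    for (std::size_t i = 0; i < pdf.exact.size(); ++i) sum += pdf.exact[i].p;
    for (std::size_t i = 0; i < pdf.buckets.size(); ++i) sum += pdf.buckets[i].p;
    return sum;
}

// A well-formed specification should distribute a total mass of 1.
bool isNormalized(const CombinedPdf& pdf)
{
    return std::fabs(totalMass(pdf) - 1.0) < 1e-9;
}

// Maps u in [0, 1) to a sample. Returns false for NULL, otherwise stores
// the sampled value in `value` and returns true.
bool sample(const CombinedPdf& pdf, double u, double& value)
{
    double cumulative = pdf.nullProbability;
    if (u < cumulative) return false;
    for (std::size_t i = 0; i < pdf.exact.size(); ++i) {
        if (u < cumulative + pdf.exact[i].p) {
            value = pdf.exact[i].x;
            return true;
        }
        cumulative += pdf.exact[i].p;
    }
    for (std::size_t i = 0; i < pdf.buckets.size(); ++i) {
        const BucketEntry& b = pdf.buckets[i];
        if (u < cumulative + b.p) {
            // Assumption: values are uniformly distributed inside a bucket.
            const double fraction = (u - cumulative) / b.p;
            value = b.xMin + fraction * (b.xMax - b.xMin);
            return true;
        }
        cumulative += b.p;
    }
    throw std::out_of_range("u lies outside the specified probability mass");
}

// The Pr[product.price] example from above.
CombinedPdf priceExample()
{
    CombinedPdf pdf;
    pdf.nullProbability = 0.04;
    const ExactEntry e1 = { 0.10, 1.0 };
    const ExactEntry e2 = { 0.12, 2.0 };
    pdf.exact.push_back(e1);
    pdf.exact.push_back(e2);
    const BucketEntry b1 = { 0.24, 3.0, 10.0 };
    const BucketEntry b2 = { 0.15, 10.0, 100.0 };
    const BucketEntry b3 = { 0.35, 100.0, 600.0 };
    pdf.buckets.push_back(b1);
    pdf.buckets.push_back(b2);
    pdf.buckets.push_back(b3);
    return pdf;
}
```

A check such as isNormalized is a cheap way to catch specifications whose masses do not add up to 1 before they are used for sampling.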
The simple combined probabilities covered above describe independent univariate distributions. Sometimes however, you want to describe dependencies between random variables in terms of a conditional distribution. The standard way to describe custom dependencies between values is to use the conditional_combined_probability[T1,T2]
PDF type.
Conceptually, a conditional_combined_probability[T1,T2]
specification consists of a family of combined_probability[T1]
specifications, where each member describes the PDF for a certain range of the conditioned domain T2
.
Consider a scenario where you want to condition the product price on the product class, and you have an enumerated set of product classes with two elements { 0: commodity, 1: high-end }
(you will learn more about the Myriad support for enumerated sets in the next section). Naturally, the prices of the commodity goods should be lower than the prices of the high-end goods. You can achieve this by changing the type of the Pr[product.price]
function to conditional_combined_probability[Decimal,Enum]
and providing the following PDF specification:
# conditional probability header
@numberofconditions = 2
# case 1: "0: commodity"
@condition = [0, 1)
# header
@numberofexactvals = 0
@numberofbins = 3
@nullprobability = 0.05
# bucket probabilities block
p(X) = 0.70 for X = { x in [1.0 , 100.0 ) } # between 1 and 100 euro
p(X) = 0.20 for X = { x in [100.0 , 1000.0 ) } # between 100 and 1000 euro
p(X) = 0.05 for X = { x in [1000.0 , 10000.0) } # between 1000 and 10000 euro
# case 2: "1: high-end"
@condition = [1, 2)
# header
@numberofexactvals = 0
@numberofbins = 3
@nullprobability = 0.05
# bucket probabilities block
p(X) = 0.05 for X = { x in [1.0 , 100.0 ) } # between 1 and 100 euro
p(X) = 0.20 for X = { x in [100.0 , 1000.0 ) } # between 100 and 1000 euro
p(X) = 0.70 for X = { x in [1000.0 , 10000.0) } # between 1000 and 10000 euro
As you can see, a conditional combined probability specification contains a single-line header that declares the number of distinct cases for the conditioned variable
@numberofconditions = <N>
followed by a sequence of combined_probability[T1]
specifications - one for each case. The range of conditioned variable values covered by a particular combined probability is given additionally as a half-open interval in a condition statement preceding each of the N
specification blocks:
@condition = [<x2_min>, <x2_max>)
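The lookup implied by the @condition intervals can be sketched as follows. This is a hedged illustration, not Myriad's implementation: the struct and function names are hypothetical, and the sketch only shows how a condition value selects one of the N blocks and how the selected block assigns mass to a price.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical in-memory form of one "@condition = [min, max)" block.
struct Bucket { double p; double xMin; double xMax; };
struct ConditionBlock
{
    long condMin;                 // inclusive lower bound of the condition
    long condMax;                 // exclusive upper bound of the condition
    double nullProbability;
    std::vector<Bucket> buckets;
};

// Returns the block whose half-open condition interval covers `condition`.
const ConditionBlock& selectBlock(const std::vector<ConditionBlock>& blocks, long condition)
{
    for (std::size_t i = 0; i < blocks.size(); ++i) {
        if (condition >= blocks[i].condMin && condition < blocks[i].condMax) {
            return blocks[i];
        }
    }
    throw std::out_of_range("no block covers the condition value");
}

// Returns the mass of the bucket containing x, or 0 if no bucket contains it.
double bucketMass(const ConditionBlock& block, double x)
{
    for (std::size_t i = 0; i < block.buckets.size(); ++i) {
        const Bucket& b = block.buckets[i];
        if (x >= b.xMin && x < b.xMax) return b.p;
    }
    return 0.0;
}

// Builds one block with the three price buckets used in the example above.
ConditionBlock makeBlock(long condMin, long condMax, double pLow, double pMid, double pHigh)
{
    ConditionBlock block;
    block.condMin = condMin;
    block.condMax = condMax;
    block.nullProbability = 0.05;
    const Bucket low  = { pLow,  1.0,    100.0 };
    const Bucket mid  = { pMid,  100.0,  1000.0 };
    const Bucket high = { pHigh, 1000.0, 10000.0 };
    block.buckets.push_back(low);
    block.buckets.push_back(mid);
    block.buckets.push_back(high);
    return block;
}

// The product price example: case 0 is "commodity", case 1 is "high-end".
std::vector<ConditionBlock> productPriceExample()
{
    std::vector<ConditionBlock> blocks;
    blocks.push_back(makeBlock(0, 1, 0.70, 0.20, 0.05));
    blocks.push_back(makeBlock(1, 2, 0.05, 0.20, 0.70));
    return blocks;
}
```

For a price of 5000 euro, the commodity block assigns a bucket mass of 0.05 while the high-end block assigns 0.70, which is the dependency the example specification is meant to express.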
The core simple types listed above might be enough to describe simple domains, but most of the time you would want to use custom simple types. In Myriad, this can be achieved by adding enumerated sets to the <enum_sets>
section. An enumerated set is declared with an <enum_set>
element associated with a unique key
attribute. Each <enum_set>
declaration has a single path
argument of type String
that defines the path to the file containing the enumerated set definition.
To illustrate the <enum_set>
syntax, assume that the generated customers have a gender
field that assumes values from the { male
, female
} domain. To make the custom demographics.gender
domain available in your prototype specification, you should add the following lines to the <enum_sets>
section:
<enum_set key='demographics.gender'>
<argument key='path' type='String' value='${%ENV.config-dir% + "/domains/gender.domain"}' />
</enum_set>
In addition, you also need to specify the demographics.gender
set contents in a plain text file located at %ENV.config-dir%/domains/gender.domain
(again, the straightforward way is to keep the *.domain file under src/config/domains, since the contents of src/config
will be copied to the default %ENV.config-dir%
when you build the project):
@numberofvalues = 2
0 ... "M" # Male
1 ... "F" # Female
Each enumerated set specification starts with a header consisting of a single line - @numberofvalues = X
, where X is the number of elements in the set. The body of the file consists of exactly X
lines, each one associating an integer i
from the [0, X-1]
interval to a unique quoted value x
from the enumerated set. The association follows the syntax i .. "x"
, where the integer and the element are separated by at least two dots, similar to an index file of a book. An optional comment at the end of the line can be placed with the special comment delimiter #
.
Per convention, enumerated set specifications have a .domain
extension and are placed in the config/domains
folder or one of its subfolders.
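The file format described above is simple enough to parse in a few lines. The following C++ sketch is a hypothetical reader written for illustration only and is not the parser shipped with Myriad. It assumes that values never contain a double quote and that blank or comment-only lines may be skipped.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Parses an enumerated set specification and returns the values ordered by
// their integer index (hypothetical helper, not part of Myriad).
std::vector<std::string> parseEnumSet(std::istream& in)
{
    const std::string headerKey = "@numberofvalues";
    std::vector<std::string> values;
    std::size_t expected = 0;
    std::size_t parsed = 0;
    bool headerSeen = false;
    std::string line;

    while (std::getline(in, line)) {
        // Skip blank and comment-only lines.
        const std::size_t first = line.find_first_not_of(" \t\r");
        if (first == std::string::npos || line[first] == '#') continue;

        if (!headerSeen) {
            // Header: "@numberofvalues = X"
            const std::size_t eq = line.find('=', first);
            if (line.compare(first, headerKey.size(), headerKey) != 0 || eq == std::string::npos) {
                throw std::runtime_error("expected '@numberofvalues = X' header");
            }
            expected = std::stoul(line.substr(eq + 1));
            values.assign(expected, std::string());
            headerSeen = true;
            continue;
        }

        // Body line: integer, at least two dots, quoted value, optional comment.
        const std::size_t dots = line.find("..", first);
        const std::size_t open = (dots == std::string::npos) ? std::string::npos : line.find('"', dots);
        const std::size_t close = (open == std::string::npos) ? std::string::npos : line.find('"', open + 1);
        if (close == std::string::npos) {
            throw std::runtime_error("malformed entry: " + line);
        }
        const std::size_t index = std::stoul(line.substr(first, dots - first));
        if (index >= expected) {
            throw std::runtime_error("index out of range: " + line);
        }
        values[index] = line.substr(open + 1, close - open - 1);
        ++parsed;
    }
    if (!headerSeen || parsed != expected) {
        throw std::runtime_error("entry count does not match @numberofvalues");
    }
    return values;
}

// Convenience overload for specifications held in memory.
std::vector<std::string> parseEnumSetFromString(const std::string& text)
{
    std::istringstream in(text);
    return parseEnumSet(in);
}
```

Applied to the gender.domain contents shown above, the sketch returns a two-element vector with "M" at index 0 and "F" at index 1.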
Please note that if you define distributions over Enum
types, the domain of your distribution is the set of integers encoding the enumerated set, and not the actual domain values. So a 60-40 distribution over the demographics.gender
domain might have the following specification:
@numberofexactvals = 2
@numberofbins = 0
@nullprobability = 0.00
p(X) = 0.60 for X = { 0 } # probability for 'M'
p(X) = 0.40 for X = { 1 } # probability for 'F'
Until now, we have discussed the parameters, probability distribution functions, and the enumerated sets components of a Myriad specification, or in other words - all auxiliary components managed by the generator configuration. Next, we turn our focus to the core of the data generator specification - the <record_sequences>
section that defines the generated domain model as a set of domain-specific record sequences.
The specification of each sequence is provided in a corresponding element, which is uniquely identified by its key
attribute value. Per convention, record sequence keys should be expressed in singular and written in a lowercased and underscored style.
Currently, Myriad supports only sequences consisting of pseudo-random records, but in future versions we plan to add support for other sequence types (e.g. a fixed sequence of statically defined records). Each pseudo-random record sequence is defined in a <random_sequence>
element that has the following main components:
<random_sequence key='my_type'>
<!-- record type specification -->
<record_type>
...
</record_type>
<!-- setter chain (PRDG) specification -->
<setter_chain>
...
</setter_chain>
<!-- cardinality estimator specification -->
<cardinality_estimator type='linear_scale_estimator'>
<argument key='base_cardinality' type='I64u' value='%customer.sequence.base_cardinality%' />
</cardinality_estimator>
<!-- sequence iterator specification (optional) -->
<sequence_iterator type='partitioned_iterator' />
</random_sequence>
The <record_type>
section defines the structure of the records in the enclosing sequence. The specification consists of a sequence of <field>
elements followed by a sequence of <reference>
elements. A <field>
describes a record field (i.e. an attribute), while a <reference>
describes a reference (i.e. an N:1 relationship) to a specific record from another record sequence.
Both elements have the same mandatory attributes - name
and type
. Similar to the sequence keys, the <field>
and <reference>
names should also be written in a lowercased and underscored style. Field entries may use any Myriad simple type as type
, while references may only reference an existing <random_sequence>
key. For Enum
fields, the additional attribute enumref
referencing the key of the actual enum_set
is also required.
The following (abridged) code fragment shows how the structure of a domain model with four types - customers, orders, lineitems, and products, can be encoded in a Myriad XML specification:
<record_sequences>
<!-- customers sequence -->
<random_sequence key='customer'>
<record_type>
<field name='pk' type='I64u' />
<field name='first_name' type='Enum' enumref='demographics.first_name' />
<field name='last_name' type='Enum' enumref='demographics.last_name' />
<field name='gender' type='Enum' enumref='demographics.gender' />
<field name='age' type='I16u' />
<field name='country' type='Enum' enumref='demographics.country' />
<field name='orders_count' type='I32u' />
</record_type>
...
</random_sequence>
<!-- orders sequence -->
<random_sequence key='order'>
<record_type>
<field name='pk' type='I64u' />
<field name='status' type='Enum' enumref='retail.order.status' />
<field name='total_price' type='Decimal' />
<field name='order_date' type='Date' />
<field name='lineitems_count' type='I32u' />
<reference name='customer' type='customer' />
</record_type>
...
</random_sequence>
<!-- products sequence -->
<random_sequence key='product'>
<record_type>
<field name='pk' type='I64u' />
<field name='name' type='String' />
<field name='type' type='Enum' enumref='retail.product.type' />
<field name='retail_price' type='Decimal' />
</record_type>
...
</random_sequence>
<!-- lineitems sequence -->
<random_sequence key='lineitem'>
<record_type>
<field name='pk' type='I64u' />
<field name='quantity' type='I16u' />
<field name='price' type='Decimal' />
<field name='tax' type='Decimal' />
<field name='discount' type='Decimal' />
<field name='ship_date_offset' type='I16u' />
<reference name='order' type='order' />
<reference name='product' type='product' />
</record_type>
...
</random_sequence>
</record_sequences>
In Myriad, pseudo-random sequences of domain records are generated by user-defined pseudo-random domain generator (PRDG) functions. Since PRDGs are constructed as a chain of elementary setter functions, they are also referred to as setter chains.
Each <random_sequence>
defines its own PRDG in the corresponding <setter_chain>
subsection. A <setter_chain>
specification contains configurations of one or more <setter>
components, each one attaching a modular data generation logic snippet to a particular component in the generated record. Depending on the type of the attached component, the setter type
can be either field_setter
or reference_setter
.
Field setters have the following general syntax snippet (attribute values written in braces, like {var_name}, denote syntax snippet variables):
<setter key='set_{field_name}' type='field_setter'>
<argument key='field' type='field_ref' ref='{type_key}:{field_name}' />
<argument key='value' type='{value_provider_type}'>
<!-- value provider arguments -->
</argument>
</setter>
while reference setters have slightly different syntax:
<setter key='set_{reference_name}' type='reference_setter'>
<argument key='reference' type='reference_ref' ref='{type_key}:{reference_name}' />
<argument key='value' type='{reference_provider_type}'>
<!-- reference provider arguments -->
</argument>
</setter>
As you can see from the code snippets above, field and reference setters have a similar structure, with the actual data generating logic encapsulated in the value and reference providers configured as value
arguments. Harnessing the expressive power of Myriad therefore is possible only if the user understands the semantics of the value and reference provider types supported by the toolkit. In the remainder of this section we present the syntax and the semantics of each provider type.
Value providers do what their name suggests - they provide the values to be bound to the record fields in the enclosing setter chain. The returned values might depend on the current context record (i.e. the currently generated record), the position on the underlying pseudo-random number sequence, or both.
The simplest value provider is the const_value_provider
, which always returns the same value, independent of the execution context. Assume that you want all order:status
fields in the generated orders to have the value 0
. The configured field setter will look like that:
<setter key='set_status' type='field_setter'>
<argument key='field' type='field_ref' ref='order:status' />
<argument key='value' type='const_value_provider'>
<argument key='value' type='Enum' value='0' />
</argument>
</setter>
The most useful value provider for simple domains with a lot of independence assumptions is the random_value_provider
. A random_value_provider
will consume one seed from the underlying PRNG sequence and transform it to a sample from the specified probability distribution. Assume that you have specified a probability distribution function Pr[lineitem.quantity]
that describes the relative frequencies of the lineitem:quantity
attribute values. The configured field setter will look as follows:
<setter key='set_quantity' type='field_setter'>
<argument key='field' type='field_ref' ref='lineitem:quantity' />
<argument key='value' type='random_value_provider'>
<argument key='probability' type='function_ref' ref='Pr[lineitem.quantity]' />
</argument>
</setter>
If you use a conditional distribution in your random_value_provider
, you'll have to specify an additional condition_field
argument that tells the provider from which context field to get the evidence for the conditioned variable. Assume that the lineitem:quantity
distribution should depend on the lineitem:product:type
value. The above field setter configuration should be adapted as follows:
<setter key='set_quantity' type='field_setter'>
<argument key='field' type='field_ref' ref='lineitem:quantity' />
<argument key='value' type='random_value_provider'>
<argument key='probability' type='function_ref' ref='Pr[lineitem.quantity]' />
<argument key='condition_field' type='field_ref' ref='lineitem:product:type' />
</argument>
</setter>
In order to ensure that the referenced record fulfills certain conditions, reference providers might have to perform a value-based selection on the referenced sequence before they select the appropriate reference. For example, in an online shop, expensive gadgets are more likely to be ordered by male customers, whereas expensive cosmetics are more likely to be ordered by female customers. A customer reference provider in the order setter chain should therefore pick a customer with a suitable gender based on the order contents. For each order, the reference provider will first determine the required gender, and then pick a random referenced record from all customers with that gender.
Eager evaluation of the required selection is a feasible, but very inefficient strategy to implement the above logic. To enable fast value-based selections, Myriad offers a special value provider type, whose inverse can be used to efficiently evaluate selections over the associated field. Invertibility in this setting essentially means that given a specific value x
for a field A.a
, the value provider inverse will tell you the range of sequence IDs [i,j]
for which A[i].a = x
holds.
Currently, the only non-trivial value provider that is invertible is the clustered_value_provider
. Assume that you want to generate the lineitem:quantity
values with the same distribution as above, but you also want to have quantity
-based selection of referenced lineitems in the order_return
sequence. You can specify this behavior like that:
<setter key='set_quantity' type='field_setter'>
<argument key='field' type='field_ref' ref='lineitem:quantity' />
<argument key='value' type='clustered_value_provider'>
<argument key='probability' type='function_ref' ref='Pr[lineitem.quantity]' />
<argument key='cardinality' type='const_range_provider'>
<argument key='min' type='I64u' value='0' />
<argument key='max' type='I64u' value='%lineitem.sequence.cardinality%' />
</argument>
</argument>
</setter>
Besides the required probability distribution, the clustered_value_provider
expects as an additional argument cardinality
a range provider that provides the subsequence range over which the probability
function domain should be clustered. The most common cardinality
configuration for clustered value providers is a const_range_provider
configured with an interval [0, X)
, where X
is the cardinality of the enclosing sequence.
Note that unlike random_value_provider
, the clustered_value_provider
does not support conditional probabilities.
To illustrate the functionality that a clustered_value_provider
implements, assume that the %lineitem.sequence.cardinality%
is 300, and let Pr[lineitem.quantity]
be specified as P(1) = 0.50, P(2) = 0.30, P(3) = 0.20. The clustered_value_provider
will spread the PDF domain in a clustered way over all lineitems occurring between position 0 (inclusive) and 300 (exclusive) - that is, over the whole lineitem subsequence. Due to the configuration of Pr[lineitem.quantity]
, lineitems with position in the [0, 150) interval will have quantity = 1
, lineitems in [150, 240) will have quantity = 2
, and lineitems [240, 300) will have quantity = 3
.
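The clustering and its inverse can be reproduced with a short C++ sketch. It is an illustration under stated assumptions and not the clustered_value_provider implementation: the names Cluster, buildClusters, valueAt and rangeOf are hypothetical, cluster boundaries are obtained by rounding the cumulative probability times the cardinality, and ranges are returned as half-open intervals.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

// One contiguous run of positions [begin, end) that all carry `value`.
struct Cluster { long value; std::uint64_t begin; std::uint64_t end; };

// Splits the positions [0, cardinality) into one cluster per PMF entry,
// sized proportionally to the entry's probability.
std::vector<Cluster> buildClusters(const std::vector<std::pair<long, double> >& pmf,
                                   std::uint64_t cardinality)
{
    std::vector<Cluster> clusters;
    double cumulative = 0.0;
    std::uint64_t begin = 0;
    for (std::size_t i = 0; i < pmf.size(); ++i) {
        cumulative += pmf[i].second;
        // Round to the nearest position to absorb floating point noise.
        const std::uint64_t end = static_cast<std::uint64_t>(
            std::llround(cumulative * static_cast<double>(cardinality)));
        const Cluster cluster = { pmf[i].first, begin, end };
        clusters.push_back(cluster);
        begin = end;
    }
    return clusters;
}

// Forward direction: the value assigned to the record at `position`.
long valueAt(const std::vector<Cluster>& clusters, std::uint64_t position)
{
    for (std::size_t i = 0; i < clusters.size(); ++i) {
        if (position >= clusters[i].begin && position < clusters[i].end) {
            return clusters[i].value;
        }
    }
    throw std::out_of_range("position outside of the clustered range");
}

// Inverse direction: the half-open position range whose records carry `value`.
std::pair<std::uint64_t, std::uint64_t> rangeOf(const std::vector<Cluster>& clusters, long value)
{
    for (std::size_t i = 0; i < clusters.size(); ++i) {
        if (clusters[i].value == value) {
            return std::make_pair(clusters[i].begin, clusters[i].end);
        }
    }
    throw std::out_of_range("value not in the distribution domain");
}
```

For the example above (cardinality 300 and probabilities 0.50, 0.30, 0.20), rangeOf for quantity 2 yields [150, 240). A selection on quantity therefore becomes a constant-size range lookup instead of a scan over all lineitems.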
A context field value provider can be used to provide values obtained from a field path reachable from the current context. Consider that your customer has an associated address, and you want the value of the customer:country
field to match the value of the country in this address. You can achieve this with a context_field_value_provider
as follows:
<setter key='set_country' type='field_setter'>
<argument key='field' type='field_ref' ref='customer:country' />
<argument key='value' type='context_field_value_provider'>
<argument key='field' type='field_ref' ref='customer:address:country' />
</argument>
</setter>
Although context_field_value_provider
types are not commonly used in field setters, they can often be seen as reference provider arguments in a reference setter. We discuss this usage pattern in the next section.
The last value provider type we discuss is the callback_value_provider
. Callback value providers provide you with a mechanism to declare code-level extensions for value generation at the XML specification level. This might be required if none of the other value providers can do what your application requires. For example, consider the total price of a lineitem, which should be computed using the fixed formula
${lineitem:price} = ${lineitem:quantity} *
${lineitem:product:price} * (1 + ${lineitem:tax} - ${lineitem:discount})
Since Myriad currently does not provide a value provider for this functionality, you can use callback_value_provider
as a fallback solution. To do so, define the set_price
field setter as follows:
<setter key='set_price' type='field_setter'>
<argument key='field' type='field_ref' ref='lineitem:price' />
<argument key='value' type='callback_value_provider'>
<argument key='type' type='String' value='Decimal' />
<argument key='name' type='String' value='setLineitemPrice' />
<argument key='arity' type='I16u' value='0' />
</argument>
</setter>
This configuration will generate a pure virtual callback method with signature
virtual Decimal setLineitemPrice(const AutoPtr<Lineitem>& recordPtr, RandomStream& random) = 0;
in the generated BaseLineitemSetterChain
C++ class. You will then have to implement this virtual method in the derived LineitemSetterChain
class before you can compile your project. The implementation for the example above could look like this:
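// Implements the callback declared by the 'set_price' setter (arity 0: no random numbers are consumed).
virtual Decimal setLineitemPrice(const AutoPtr<Lineitem>& recordPtr, RandomStream& random)
{
    const Decimal quantity = recordPtr->quantity();
    const Decimal priceFactor = 1 + recordPtr->tax() - recordPtr->discount();
    const Decimal productPrice = recordPtr->product()->price();
    // price = quantity * product price * (1 + tax - discount)
    return quantity * priceFactor * productPrice;
}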
Callback value provider configurations expect three arguments. The type
argument is a string that matches the return type of the callback function. The name
argument is the name of the callback function that you will implement as part of the enclosing setter chain. The arity
argument is an integer that gives the exact number of pseudo-random numbers that your callback function consumes from the given random
parameter in each invocation.
WARNING: you must make sure that every code path in your callback consumes exactly arity
random numbers; otherwise you will break the parallelization features integrated in the Myriad runtime. If your callback logic consumes a variable number of pseudo-random numbers, set arity
to the upper bound and use the RandomStream::skip(I64u pos)
method to skip all unconsumed numbers up to this bound before you return from the callback.
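The padding rule in the warning can be sketched with a toy stream. This is a minimal illustration, not Myriad code: ToyRandomStream, toyCallback and kArity are hypothetical names, and the toy skip takes a relative count, which may differ from the semantics of the pos argument of RandomStream::skip.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a random stream; NOT Myriad's RandomStream.
// It counts consumed numbers so that the padding logic is observable.
class ToyRandomStream
{
public:
    explicit ToyRandomStream(std::uint64_t seed) : _state(seed), _consumed(0) {}

    // Draw the next pseudo-random number in [0, 1) (simple LCG, illustration only).
    double next()
    {
        _state = _state * 6364136223846793005ULL + 1442695040888963407ULL;
        ++_consumed;
        return static_cast<double>(_state >> 11) / 9007199254740992.0; // divide by 2^53
    }

    // Discard `count` numbers without using them (relative skip; an assumption).
    void skip(std::uint64_t count)
    {
        for (std::uint64_t i = 0; i < count; ++i) next();
    }

    // Number of positions the stream has advanced so far.
    std::uint64_t consumed() const { return _consumed; }

private:
    std::uint64_t _state;
    std::uint64_t _consumed;
};

// Declared upper bound of random numbers consumed per invocation.
const std::uint64_t kArity = 2;

// Hypothetical callback with two code paths that draw one or two numbers.
double toyCallback(bool applyDiscount, ToyRandomStream& random)
{
    const std::uint64_t start = random.consumed();

    double price = 100.0 * random.next();       // every path draws one number
    if (applyDiscount)
    {
        price *= 1.0 - 0.5 * random.next();     // only this path draws a second one
    }

    // Pad up to the declared arity so that every path advances the stream equally.
    const std::uint64_t used = random.consumed() - start;
    assert(used <= kArity);
    random.skip(kArity - used);
    return price;
}
```

With this padding, two streams that start from the same seed stay aligned after the call regardless of which code path was taken, which is the property the warning asks you to preserve.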
When a new record is generated, all of its referenced records are initially set to NULL. In order to instantiate them, you need to add reference providers to the setter chain of the generated record. In this section, we give an overview of the supported reference provider types and discuss some common usage patterns for each one. In the following discussion, we refer to referenced records as parents and referencing records as children.
The random_reference_provider
type generates associations between the current child and a parent chosen randomly from the set of parents that satisfy the configured selection predicate.
As an example, consider a scenario where the product referenced by a lineitem should be selected uniformly at random from the set of all products. A random_reference_provider
specification that implements this logic will look as follows:
<setter key='set_product' type='reference_setter'>
<argument key='reference' type='reference_ref' ref='lineitem:product' />
<argument key='value' type='random_reference_provider'>
<argument key='predicate' type='equality_predicate_provider'>
<argument key='binder' type='predicate_value_binder'>
<argument key='field' type='field_ref' ref='product:pk' />
<argument key='value' type='random_value_provider'>
<argument key='probability' type='function_ref' ref='Pr[lineitem.product_pk]' />
</argument>
</argument>
</argument>
</argument>
</setter>
During execution, a random_reference_provider
evaluates a predicate-based selection of record indexes. For that purpose, an equality_predicate_provider
is used to construct disjunctive predicates consisting of equality predicate atoms: the atom literals x_i are obtained from the configured value providers and bound to the specified parent fields X_i. Upon evaluation, one of the indexes from the result set is picked uniformly at random, and the corresponding record is instantiated using its setter chain and set as the parent reference in the current child.
To ensure the semantic correctness of a random_reference_provider
configuration, the following conditions must hold:
- the value providers for the parent fields occurring in the predicate argument must be invertible, and
- the domain of the value providers used in the predicate binders must not contain values outside the domain of the corresponding parent field.
The first condition ensures that the required selections can be computed, and the second that the result sets will never be empty (i.e. that there will always be at least one reference candidate).
In the above example, we sample values from a Pr[lineitem.product_pk]
PDF and bind them to the product:pk
field in the selection predicate. Assuming that the value provider of the product:pk
field is invertible, and that the Pr[lineitem.product_pk]
domain is a subset of the Pr[product.pk]
domain, the specification is semantically correct.
Note also that in the above example we have a selection over a unique key from the parent sequence. This means that for each possible binding the result set will contain exactly one record, and this record will be picked as the reference for the child. This pattern is useful if you want a reference pattern that does not depend on any property of the parent sequence.
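The two correctness conditions can be illustrated with a toy unique-key lookup. This is a sketch under stated assumptions: the key layout (pk = 1000 + position), the sequence size, and the names productPkAt and productIndexOf are all hypothetical and do not reflect how Myriad generates or inverts keys.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical parent sequence: 50 products, position i gets pk = 1000 + i.
const std::uint64_t kProductCount = 50;
const std::uint64_t kFirstPk = 1000;

// Forward direction: sequence position -> primary key.
std::uint64_t productPkAt(std::uint64_t index)
{
    assert(index < kProductCount);
    return kFirstPk + index;
}

// Inverse direction: primary key -> sequence position (condition 1, invertibility).
// Returns false when pk lies outside the parent key domain, which is the
// situation condition 2 rules out: the result set would be empty.
bool productIndexOf(std::uint64_t pk, std::uint64_t& index)
{
    if (pk < kFirstPk || pk >= kFirstPk + kProductCount) return false;
    index = pk - kFirstPk;
    return true;
}
```

Because the key is unique, every value inside the parent domain maps to exactly one position, so the result set of the selection has a single candidate.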
In general, the random_reference_provider
can generate more realistic associations using descriptive fields from the parent sequence and more complex value providers for the predicate binders. As an alternative to the above example, consider a case where you have three types of products (gadgets, cosmetics, and other) and you want to correlate the product types purchased by customers with their gender. You can specify the following conditional_combined_probability[Enum;Enum]
configuration:
# configuration for Pr[lineitem.product_type]
@numberofconditions = 2
# case 1: gender = male
@condition = [0, 1)
@numberofexactvals = 3
@numberofbins = 0
@nullprobability = 0.00
p(X) = 0.35 for X = { 0 } # gadgets
p(X) = 0.05 for X = { 1 } # cosmetics
p(X) = 0.60 for X = { 2 } # other
# case 2: gender = female
@condition = [1, 2)
@numberofexactvals = 3
@numberofbins = 0
@nullprobability = 0.00
p(X) = 0.05 for X = { 0 } # gadgets
p(X) = 0.35 for X = { 1 } # cosmetics
p(X) = 0.60 for X = { 2 } # other
and use it in a random_reference_provider
that selects the lineitem products based on their type, conditioned on the customer gender:
<setter key='set_product' type='reference_setter'>
<argument key='reference' type='reference_ref' ref='lineitem:product' />
<argument key='value' type='random_reference_provider'>
<argument key='predicate' type='equality_predicate_provider'>
<argument key='binder' type='predicate_value_binder'>
<argument key='field' type='field_ref' ref='product:type' />
<argument key='value' type='random_value_provider'>
<argument key='probability' type='function_ref' ref='Pr[lineitem.product_type]' />
<argument key='condition_field' type='field_ref' ref='lineitem:order:customer:gender' />
</argument>
</argument>
</argument>
</argument>
</setter>
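To make the effect of the conditioned distribution concrete, the following sketch maps a uniform number to a product type by inverting the cumulative distribution for the given gender. The probabilities are copied from the Pr[lineitem.product_type] configuration above; the function name sampleProductType and the table layout are illustrative assumptions, not part of the Myriad API.

```cpp
#include <cassert>

// Rows: gender (0 = male, 1 = female).
// Columns: product type (0 = gadgets, 1 = cosmetics, 2 = other).
const double kProductTypeGivenGender[2][3] = {
    {0.35, 0.05, 0.60},
    {0.05, 0.35, 0.60},
};

// Map a uniform number u in [0, 1) to a product type for the given gender
// by walking the cumulative probabilities of the matching row.
int sampleProductType(int gender, double u)
{
    assert(gender == 0 || gender == 1);
    assert(u >= 0.0 && u < 1.0);

    double cumulative = 0.0;
    for (int type = 0; type < 3; ++type)
    {
        cumulative += kProductTypeGivenGender[gender][type];
        if (u < cumulative) return type;
    }
    return 2; // guard against floating-point rounding in the last bin
}
```

The sampled type would then play the role of the literal bound to product:type in the equality predicate, and one of the products of that type is picked uniformly at random.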
While the random_reference_provider
type is sufficient in most scenarios, sometimes you might want more control over the association between a certain parent and its children. This might be the case if the number of children for each parent is fixed a priori (e.g. order #1 should have 6 lineitems, order #2 should have 3 lineitems, and so forth), or if the parent has to access all of its children (e.g. if the total price of the order should be computed as the sum over the lineitem prices).
To support this functionality the Myriad Toolkit implements another type of reference provider - the clustered_reference_provider
- which generates the sequence of children clustered on their parent.
More precisely, for each parent the clustered_reference_provider
reserves a block of exactly N children in the referenced sequence, where N is the maximum number of children per parent. Each parent also has a dedicated field of type I32u
that holds the exact number of associated children M (obviously, M <= N is a sequence invariant). The clustered_reference_provider
then iterates through each N-slot block and instantiates children at the first M positions, where M is the current value read from the associated parent. The remaining N - M positions are skipped and the generation process continues from the beginning of the next block.
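The block layout described above can be sketched as plain index arithmetic. This is an illustration only: blockStart and instantiatedPositions are hypothetical helpers, and the sketch assumes that parent p owns the positions p * N up to (p + 1) * N - 1, which the text implies but does not state explicitly.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// First position of the N-slot block reserved for the parent at `parentIndex`.
std::uint64_t blockStart(std::uint64_t parentIndex, std::uint32_t maxChildren)
{
    return parentIndex * maxChildren;
}

// List the child positions that are actually instantiated: for each parent,
// the first M slots of its block, where M = childrenCount[parent] and M <= N.
// The remaining N - M slots of each block are skipped.
std::vector<std::uint64_t> instantiatedPositions(
    const std::vector<std::uint32_t>& childrenCount,
    std::uint32_t maxChildren)
{
    std::vector<std::uint64_t> positions;
    for (std::uint64_t parent = 0; parent < childrenCount.size(); ++parent)
    {
        const std::uint32_t m = childrenCount[parent];
        assert(m <= maxChildren); // sequence invariant M <= N
        for (std::uint32_t slot = 0; slot < m; ++slot)
        {
            positions.push_back(blockStart(parent, maxChildren) + slot);
        }
    }
    return positions;
}
```

For three parents with 2, 0 and 3 children and N = 4, the instantiated positions are 0, 1, 8, 9 and 10; positions 2 to 7 and 11 are skipped.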
As an example, consider a configuration that generates the association between a customer and its orders:
<setter key='set_customer' type='reference_setter'>
<argument key='reference' type='reference_ref' ref='order:customer' />
<argument key='value' type='clustered_reference_provider'>
<argument key='children_count' type='context_field_value_provider'>
<argument key='field' type='field_ref' ref='customer:orders_count' />
</argument>
<argument key='children_count_max' type='const_value_provider'>
<argument key='value' type='I32u' value='%customer.sequence.max_orders_per_customer%' />
</argument>
</argument>
</setter>
As illustrated in the above code snippet, the clustered_reference_provider
has two value provider arguments. The children_count
argument is a context_field_value_provider
executed with the parent as context to get the current number of children, and the children_count_max
is a const_value_provider
that provides the maximum number of children per parent.
Before we conclude the discussion of the clustered_reference_provider
, we advise caution when combining clustered_value_providers
with a clustered_reference_provider
. The reason is that the value distributions generated by clustered_value_providers
are guaranteed only if the whole sequence is generated. Since the use of clustered_reference_providers
enforces skips of certain positions, the value distributions generated by clustered_value_providers
will most probably differ from the ones that are specified. An important exception is uniform unique value distributions (e.g. for primary keys), which can be safely used together with clustered_reference_provider
components.
TODO
TODO
TODO
[1]: For details about project setup, please read the Getting Started Guide.