aalexandrov edited this page Jan 22, 2013 · 56 revisions

Myriad Data Generator Toolkit

Myriad is a development toolkit for scalable data generators. Generating large, synthetic datasets with a given schema and a set of statistical properties is a challenging yet increasingly important task, especially when benchmarking and testing systems designed to manage web-scale data (e.g. Hadoop) or parallel RDBMSs (e.g. DB2). The Myriad Toolkit aims to simplify this process by offering a fast and easy way to develop data generators that can generate dependent data in parallel on a set of independently running nodes.

Core Features

The Myriad Toolkit consists of two main components: a generic C++ runtime library for scalable data generation, and a Python prototype compiler that generates library extensions from a user-defined data generator specification written in XML.

The XML specification describes the structure of the generated domain model as a family of user-defined domain types, and the data generation logic as a corresponding family of pseudo-random domain type generators (PRDGs): functions that generate a sequence of pseudo-random domain records from an underlying sequence of pseudo-random numbers. PRDGs are realized as chains of setter functions. Applying a setter to a generated record assigns (i.e. sets) a specific value to one or more of its components. The Myriad Toolkit provides a range of primitive setters that implement various statistical properties (e.g. the value distribution of a record field or value dependencies between several record fields).
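The idea of a PRDG as a chain of setters can be illustrated with a minimal Python sketch. This is not Myriad's C++ runtime or its generated code: the `Customer` type, the `set_age` and `set_segment` setters, and the `uniform_at` helper are hypothetical names chosen for illustration, and the mixing function is a SplitMix64-style stand-in for whatever pseudo-random number source the toolkit actually uses.

```python
from dataclasses import dataclass
from typing import Callable, List

MASK64 = (1 << 64) - 1
GAMMA = 0x9E3779B97F4A7C15  # SplitMix64-style increment (assumption, not Myriad's PRNG)
SLOTS_PER_RECORD = 2        # one random number per setter in this sketch


def mix64(z: int) -> int:
    """SplitMix64-style finalizer: scrambles a 64-bit integer."""
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return z ^ (z >> 31)


def uniform_at(position: int, slot: int) -> float:
    """Uniform value in [0, 1) that depends only on the record position and slot."""
    z = ((position * SLOTS_PER_RECORD + slot + 1) * GAMMA) & MASK64
    return (mix64(z) >> 11) / float(1 << 53)


@dataclass
class Customer:  # hypothetical domain type
    position: int
    age: int = 0
    segment: str = ""


Setter = Callable[[Customer], None]


def set_age(rec: Customer) -> None:
    # Value distribution of a single field: uniform over the integers 18..80.
    rec.age = 18 + int(uniform_at(rec.position, 0) * 63)


def set_segment(rec: Customer) -> None:
    # Value dependency between fields: the segment probability depends on age.
    p_premium = 0.6 if rec.age >= 40 else 0.2
    rec.segment = "premium" if uniform_at(rec.position, 1) < p_premium else "basic"


CHAIN: List[Setter] = [set_age, set_segment]


def generate(position: int, setters: List[Setter]) -> Customer:
    """Build the record at `position` by applying each setter in order."""
    rec = Customer(position)
    for setter in setters:
        setter(rec)
    return rec
```

Calling `generate(5, CHAIN)` twice yields equal records, because every value is derived from the record's position rather than from hidden generator state.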

In addition to offering a simple specification language, the Myriad runtime library transparently builds parallelization support into all compiled data generators. To do so, the framework ensures that the following two conditions always hold. First, each domain record is identified by a unique position (i.e. a concrete seed) in the generating pseudo-random number sequence. Second, the sequence of pseudo-random numbers is generated by a pseudo-random number generator (PRNG) function that supports skipping to an arbitrary position in the sequence in constant time. These runtime-level decisions are critical for efficient parallelization, as they allow us to (A) partition the generated PRDG sequences across an arbitrary number of data generator nodes in a shared-nothing environment, and (B) use function shipping (i.e. recomputing the record locally) instead of data shipping (i.e. transferring it over the network) to get the contents of a referenced record generated on a remote node.
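The two conditions, and the partitioning and function shipping they enable, can be sketched in a few lines of Python. The source does not say which PRNG Myriad uses, so this sketch swaps in a SplitMix64-style counter-based generator, whose output at position i depends only on the seed and i. The names `SkipAheadPRNG`, `partition`, `generate_partition`, and `fetch_record`, as well as the two-field record layout, are assumptions made for illustration.

```python
MASK64 = (1 << 64) - 1
GAMMA = 0x9E3779B97F4A7C15   # SplitMix64-style increment (stand-in, not Myriad's PRNG)
FIELDS_PER_RECORD = 2        # each record consumes two random numbers in this sketch


def mix64(z: int) -> int:
    """SplitMix64-style finalizer: scrambles a 64-bit integer."""
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return z ^ (z >> 31)


class SkipAheadPRNG:
    """Counter-based generator: skipping is just setting the counter."""

    def __init__(self, seed: int) -> None:
        self.seed = seed & MASK64
        self.position = 0

    def skip_to(self, position: int) -> None:
        # No intermediate values are computed, regardless of the distance skipped.
        self.position = position

    def next(self) -> int:
        z = (self.seed + (self.position + 1) * GAMMA) & MASK64
        self.position += 1
        return mix64(z)


def read_record(prng: SkipAheadPRNG) -> tuple:
    """Consume FIELDS_PER_RECORD numbers and turn them into a toy record."""
    return (prng.next() % 100, prng.next() % 1000)


def partition(total: int, nodes: int, node_id: int) -> range:
    """(A) Contiguous slice of record positions assigned to one node."""
    return range(total * node_id // nodes, total * (node_id + 1) // nodes)


def generate_partition(seed: int, total: int, nodes: int, node_id: int) -> dict:
    """Generate one node's records, skipping once to the start of its slice."""
    part = partition(total, nodes, node_id)
    prng = SkipAheadPRNG(seed)
    prng.skip_to(part.start * FIELDS_PER_RECORD)
    return {pos: read_record(prng) for pos in part}


def fetch_record(seed: int, position: int) -> tuple:
    """(B) Function shipping: recompute any record locally from its position."""
    prng = SkipAheadPRNG(seed)
    prng.skip_to(position * FIELDS_PER_RECORD)
    return read_record(prng)
```

With this layout, a node that needs a record owned by another node calls `fetch_record(seed, position)` and obtains the same tuple the owning node produced in `generate_partition`, without any network transfer. The trade-off is extra computation per reference in exchange for no inter-node communication.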

First Steps

If you want to learn more about the Myriad Toolkit, please read the Quick Start Guide and the XML Specification Reference Manual.

To get a running demo of a simple data generator, please check the vldb-demo package.

Publications

Here is a list of publications that describe the Myriad Toolkit:

Contact

For questions about the Myriad Data Generator Toolkit or related topics, please use the mailing list.

Acknowledgements

The Myriad Toolkit is developed as part of the Stratosphere Project at the Fachgebiet Datenbanksysteme und Informationsmanagement, TU Berlin under the supervision of Prof. Dr. rer. nat. Volker Markl.

The project is funded by the Deutsche Forschungsgemeinschaft, the European Institute of Innovation and Technology, and the IBM Centre for Advanced Studies, Toronto.
