Home
Myriad is a development toolkit for scalable data generation. Generating large synthetic datasets that conform to a given schema and a set of statistical properties is a challenging yet increasingly important task, especially for benchmarking and testing systems that manage web-scale data, such as Hadoop, or parallel RDBMSs such as DB2. The Myriad Toolkit aims to ease this process by offering a fast and easy way to develop data generators that can generate dependent data on independently running nodes.
The Myriad Toolkit has two main components: a generic C++ runtime library for scalable data generation, and a Python specification compiler that generates library extensions from a user-defined data generator specification written in XML.
An XML specification describes the structure of the generated domain model: a family of user-defined domain types and a corresponding family of pseudo-random domain type generators (PRDGs).
Essentially, a PRDG is a function that generates a sequence of pseudo-random domain type records from an underlying sequence of pseudo-random numbers. In the XML specification PRDGs are realized as a chain of setter functions. Applying a setter to a generated record assigns (i.e. sets) a specific value to one or more of its components. The Myriad Toolkit provides a range of primitive setters that can be used to construct PRDGs with different statistical properties (e.g. value distributions for specific record fields or value dependencies between several record fields).
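The idea of a PRDG as a chain of setters can be illustrated with a minimal Python sketch. This is not Myriad's API or its XML syntax: the `Customer` type, the setter functions, and `generate` are hypothetical names chosen for illustration, and Python's standard `random.Random`, seeded with the record position, stands in for the toolkit's own number sequence.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Customer:
    """Hypothetical domain type with three fields."""
    id: int = 0
    age: int = 0
    segment: str = ""


# A setter assigns one or more fields of a record, drawing from the RNG.
Setter = Callable[[Customer, random.Random], None]


def set_age(record: Customer, rng: random.Random) -> None:
    # Value distribution for a single field: uniform over [18, 90].
    record.age = rng.randint(18, 90)


def set_segment(record: Customer, rng: random.Random) -> None:
    # Value dependency between fields: the segment weights depend on age.
    weights = [0.7, 0.3] if record.age < 30 else [0.3, 0.7]
    record.segment = rng.choices(["online", "retail"], weights=weights)[0]


def generate(position: int, setters: List[Setter]) -> Customer:
    """Build the record at `position` by applying the setters in order."""
    # Assumption: the record position doubles as the seed for this record.
    rng = random.Random(position)
    record = Customer(id=position)
    for setter in setters:
        setter(record, rng)
    return record


CHAIN: List[Setter] = [set_age, set_segment]

if __name__ == "__main__":
    for pos in range(3):
        print(generate(pos, CHAIN))
```

Because the generator state for each record is derived only from its position, calling `generate(5, CHAIN)` twice yields the same record. The order of the chain matters here: `set_segment` reads the `age` assigned by `set_age`.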
In addition to the simple specification language, the Myriad Toolkit transparently provides parallelization support for all compiled data generators through its runtime library. The underlying sequence of pseudo-random numbers is partitioned in a way that identifies each pseudo-random record with a unique position (i.e. a concrete seed) in the number sequence. In addition, the sequence of pseudo-random numbers is generated by a pseudo-random number generator (PRNG) that supports arbitrary skips to any position in constant time. These runtime-level decisions are critical for efficient parallelization, as they allow us to (A) partition the generated PRDG sequences across an arbitrary number of data generator nodes in a shared-nothing environment, and (B) use function shipping (i.e. re-computing the record locally) instead of data shipping (i.e. sending the record over the network) to get the contents of a referenced record generated on a remote node.
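The two runtime properties described above, constant-time skips and position-based partitioning, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the PRNG that the Myriad runtime actually uses: a SplitMix64-style counter-based generator is swapped in because its state advances by a fixed increment, which makes jumping to any position a single multiplication. The names `random_at`, `record_numbers`, `node_range`, and the constant `NUMBERS_PER_RECORD` are hypothetical.

```python
MASK64 = (1 << 64) - 1
GAMMA = 0x9E3779B97F4A7C15  # SplitMix64 state increment
NUMBERS_PER_RECORD = 4      # assumption: each record owns a fixed-size chunk


def mix64(z: int) -> int:
    """SplitMix64 output function (bit mixer)."""
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9 & MASK64
    z = (z ^ (z >> 27)) * 0x94D049BB133111EB & MASK64
    return z ^ (z >> 31)


def sequential(seed: int, count: int) -> list:
    """Generate the first `count` numbers by stepping the state one by one."""
    state, out = seed, []
    for _ in range(count):
        state = (state + GAMMA) & MASK64
        out.append(mix64(state))
    return out


def random_at(seed: int, position: int) -> int:
    """Return the number at `position` directly, without visiting earlier ones."""
    return mix64((seed + (position + 1) * GAMMA) & MASK64)


def record_numbers(seed: int, record_id: int) -> list:
    """Numbers owned by one record: a chunk starting at a unique position."""
    base = record_id * NUMBERS_PER_RECORD
    return [random_at(seed, base + k) for k in range(NUMBERS_PER_RECORD)]


def node_range(total_records: int, node_count: int, node_id: int) -> range:
    """Half-open range of record ids assigned to one generator node."""
    begin = total_records * node_id // node_count
    end = total_records * (node_id + 1) // node_count
    return range(begin, end)


if __name__ == "__main__":
    seed = 42
    # (A) Each node generates only its own slice of the record sequence.
    print(list(node_range(10, 3, 1)))
    # (B) Function shipping: node 0 re-computes record 7, which node 2 owns,
    # from its position alone instead of requesting it over the network.
    print(record_numbers(seed, 7))
```

In this sketch, direct access and sequential stepping agree by construction, so any node can reproduce any record given only the shared seed and the record id.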
If you want to learn more about the Myriad Toolkit, please read the Getting Started Guide and the Quick Tour.
To get a running demo of a simple data generator, please check the vldb-demo package.
Here is a list of publications that describe the Myriad Toolkit:
- Myriad: Scalable and Expressive Data Generation - Alexander Alexandrov, Kostas Tzoumas, Volker Markl; PVLDB, 5(12), 2012: pp. 1890-1893
- Myriad - Parallel Data Generation on Shared-Nothing Architectures - Alexander Alexandrov, Berni Schiefer, John Poelman, Stephan Ewen, Thomas Bodner, Volker Markl; Proceedings of the First Workshop on Architectures and Systems for Big Data (ASBD), 2011
For questions about the Myriad Data Generator Toolkit or related topics, please use the mailing list.
- Prof. Dr. rer. nat. Volker Markl, FG DIMA, TU Berlin - principal investigator
- Alexander Alexandrov, FG DIMA, TU Berlin - lead developer
- Thomas Bodner, FG DIMA, TU Berlin - general assistance
- Christoph Brücke, FG DIMA, TU Berlin - general assistance
The Myriad Toolkit is developed as part of the Stratosphere Project at the Fachgebiet Datenbanksysteme und Informationsmanagement, TU Berlin under the supervision of Prof. Dr. rer. nat. Volker Markl.
The project is funded by the Deutsche Forschungsgemeinschaft, the European Institute of Innovation and Technology, and the IBM Centre for Advanced Studies, Toronto.