Just define your data DNA and generate your dataset.
You only need to define the schema model configuration in a JSON file. Each field of the schema needs a name, a data type and a set of rules.
The following is an example `schema.json` definition:
{
"instances": 10,
"output": "result",
"format": "csv",
"fields":[
{
"name":"name",
"dataType":"string",
"rules":[
{
"type": "set",
"values": ["dave","simon"],
"distribution": 1.0
}
]
},
{
"name":"age",
"dataType":"int",
"rules":[
{
"type":"set",
"values":[
100
],
"distribution":0.1
},
{
"type":"range",
"min":10,
"max":99,
"distribution":0.9
}
]
},
{
"name":"budget",
"dataType":"decimal",
"rules":[
{
"type":"set",
"values":[
100
],
"distribution":0.5
},
{
"type":"range",
"min":1,
"max":10,
"distribution":0.5
}
]
},
{
"name":"married",
"dataType":"boolean",
"rules":[
{
"type":"boolean",
"false":0.0,
"true":1.0
}
]
}
]
}
Example output in `result.csv`:
dave,64,1.3272667719937015,true
dave,66,100.0,true
simon,16,7.887171701724464,true
simon,100,100.0,true
dave,50,4.378826132850798,true
simon,48,100.0,true
simon,24,1.2484780989173947,true
simon,100,100.0,true
dave,37,100.0,true
dave,48,100.0,true
simon,81,9.032302178134143,true
Supported data types:
- string
- int
- decimal
- boolean
- date (TODO)
Every field comes with a set of rules, and every rule comes with a distribution. The generation engine uses the distributions you define to decide how to model your data. The rule distributions for a single field should sum to 1.
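The sum-to-1 constraint can be checked before running the generator. The following is a minimal sketch, not part of chromosoma: `check_distributions` is a hypothetical helper, and it assumes that for boolean rules the `false` and `true` weights play the role of the distribution.

```python
import json
import math


def check_distributions(schema, tol=1e-9):
    """Return the names of fields whose rule distributions do not sum to 1."""
    bad = []
    for field in schema["fields"]:
        total = 0.0
        for rule in field["rules"]:
            if rule["type"] == "boolean":
                # Assumption: boolean rules carry their weights in "false" and "true".
                total += rule["false"] + rule["true"]
            else:
                total += rule["distribution"]
        if not math.isclose(total, 1.0, abs_tol=tol):
            bad.append(field["name"])
    return bad


# A trimmed-down schema with two of the fields used in this document.
schema = json.loads("""
{
  "fields": [
    {"name": "age", "dataType": "int", "rules": [
      {"type": "set", "values": [100], "distribution": 0.1},
      {"type": "range", "min": 10, "max": 99, "distribution": 0.9}
    ]},
    {"name": "married", "dataType": "boolean", "rules": [
      {"type": "boolean", "false": 0.0, "true": 1.0}
    ]}
  ]
}
""")

print(check_distributions(schema))  # prints [] because every field sums to 1
```

A non-empty result lists the fields whose rules need to be adjusted.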
- String set: the generated string is randomly selected from the `values` set
{
"name": "first name",
"dataType": "string",
"rules": [
{
"type": "set",
"values": ["dave","simon"],
"distribution": 1.0
}
]
}
In the example above, all the first names will be equal to `dave` or `simon`.
- Integer set: the generated integer is randomly selected from the `values` set
- Range: the generated integer is randomly selected between `min` and `max`
{
"name":"age",
"dataType":"int",
"rules":[
{
"type":"set",
"values":[
100
],
"distribution":0.1
},
{
"type":"range",
"min":10,
"max":99,
"distribution":0.9
}
]
}
In the example above ~10% of your ages will be equal to 100 and ~90% of your ages will be between 10 and 99.
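One way to picture how several rules combine is a two-step draw: first pick a rule with probability equal to its distribution, then draw a value from that rule. The sketch below illustrates the idea only and is not chromosoma's actual engine. `generate_int` is a hypothetical name, and treating both range bounds as inclusive is an assumption.

```python
import random


def generate_int(rules, rng):
    """Pick one rule according to its distribution, then draw a value from it."""
    weights = [rule["distribution"] for rule in rules]
    rule = rng.choices(rules, weights=weights, k=1)[0]
    if rule["type"] == "set":
        return rng.choice(rule["values"])
    if rule["type"] == "range":
        # Assumption: both bounds are inclusive.
        return rng.randint(rule["min"], rule["max"])
    raise ValueError(f"unsupported rule type: {rule['type']}")


age_rules = [
    {"type": "set", "values": [100], "distribution": 0.1},
    {"type": "range", "min": 10, "max": 99, "distribution": 0.9},
]

rng = random.Random(42)  # fixed seed so the run is repeatable
ages = [generate_int(age_rules, rng) for _ in range(10_000)]
share_of_100 = sum(age == 100 for age in ages) / len(ages)
print(f"share of ages equal to 100: {share_of_100:.3f}")
```

With enough samples the printed share should land close to 0.1, matching the `distribution` of the set rule.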
Decimal rules are the same as integer rules (specialisation will be implemented soon).
- Boolean: just define the `false` and `true` distribution values
{
"name":"married",
"dataType":"boolean",
"rules":[
{
"type":"boolean",
"false":0.0,
"true":1.0
}
]
}
In the example above, all the `married` values will be equal to `true`.
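A boolean rule can be read as a weighted coin flip. The following sketch assumes that `true` and `false` are used directly as sampling weights. `generate_boolean` is a hypothetical helper and not chromosoma code.

```python
import random


def generate_boolean(rule, rng):
    """Draw True or False using the weights stored under "true" and "false"."""
    return rng.choices([True, False], weights=[rule["true"], rule["false"]], k=1)[0]


married_rule = {"type": "boolean", "false": 0.0, "true": 1.0}

rng = random.Random(0)
values = [generate_boolean(married_rule, rng) for _ in range(1000)]
print(all(values))  # prints True: the "false" weight is 0.0, so False is never drawn
```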
Supported output formats:
- CSV (with `,` separator)
- AVRO
- JSON (TODO)
- JDBC (TODO)
- REST (TODO)
The only running mode currently supported is standalone.
A more interactive running mode will be developed soon.
git clone https://github.com/holydrinker/chromosoma.git
cd chromosoma
sbt assembly
java -jar target/scala-2.12/chromosoma-assembly-0.1.0.jar <path-to-schema>.json