SimRunner is a tool that combines:
- a powerful data generator for MongoDB
- a declarative and highly scalable workload generator
You can think of SimRunner as (a bit more than) the sum of mgeneratejs and POCDriver. Generate documents as you would with mgenerate, and inject them into MongoDB with a super-fast multithreaded workload framework.
Workloads are declarative in that you describe them in JSON. Just write your queries, your threading options, and you're set. No code to write. Just make sure you close all your curly brackets. Of course, since MongoDB queries are themselves BSON Documents, you can use the same expression language as in the data generator to introduce some variability. Workloads have a few tricks up their sleeve - for example you can build a "dictionary" of known or generated values that you can reuse in your queries.
Thanks to all this, SimRunner can reproduce realistic MongoDB workloads fairly closely - so you can model a given workload as accurately as possible and test infrastructure changes, or workload changes, before going live.
It should be considered a "work in progress" and comes without any support or guarantee, either from MongoDB, Inc. or myself.
Build with `mvn package` and run with `java -jar SimRunner.jar <config file>`. Needs at least Java 11 (tested with 17 as well).
The config file specifies:
- a connection string to MongoDB - if it starts with `$`, SimRunner will look it up in the environment variables
- a number of `templates`, which specify document shapes and map them to collections
- a number of `workloads`, which specify actions to take on those templates
Optionally, the config file may also specify:
- a reporting interval (default 1000ms)
- an HTTP reporting server configuration
Look at the provided `sample.json` for a more or less complete example of what you can do.
If you enable the HTTP interface in the config file, point your browser at http://(host):(port) to view a dynamic graph of throughput and latency.
For distributed metrics collection (aggregate results from multiple SimRunners, if you have a very intensive workload) take a look at https://github.com/schambon/SimRunner-Collector
For easy setup in EC2, a quick and dirty script to provision a machine etc. is at https://github.com/schambon/launchSimRunner
If you want to run it as a Docker container, a Dockerfile is provided. In this case, create a config file named `config.json` in the `bin` directory and then build your Docker image.
| Date | Change |
|---|---|
| 2023-09-08 | `%toInt` / `%toLong` / `%toDouble` expressions. `float` is now aliased to `double` as a convenience. |
| 2023-09-01 | Timeseries support |
| 2023-08-22 | Environment variable support for connection strings |
The config file is parsed as Extended JSON - so you can specify things like dates or specific numeric types if you want to. In addition, any string value starting with `$` gets substituted with an environment variable lookup (if the lookup returns nothing, the original string, `$` included, is returned).
Let's look at an example:
{
"connectionString": "$MONGO_URI",
"reportInterval": 1000,
"http": {
"enabled": false,
"port": 3000,
"host": "localhost"
},
"mongoReporter": {
"enabled": true,
"connectionString": "mongodb://localhost:27017",
"database": "simrunner",
"collection": "report",
"drop": false,
"runtimeSuffix": false
},
"templates": [
{
"name": "person",
"database": "test",
"collection": "people",
"drop": false,
"template": {
"_id": "%objectid",
"first": "%name.firstName",
"last": "%name.lastName",
"birthday": "%date.birthday"
},
"remember": ["_id", { "field": "first", "preload": false}],
"indexes": []
}
],
"workloads": [
{
"name": "Insert some people",
"template": "person",
"op": "insert",
"threads": 1,
"pace": 100,
"batch": 1000
},
{
"name": "Find people by key",
"template": "person",
"op": "find",
"params": {
"filter": { "_id": "#_id" }
},
"threads": 4
}
]
}
This config takes the MongoDB connection string from the `MONGO_URI` environment variable, defines a simple "person" template that maps to the `test.people` collection, and two workloads: one that inserts new people into the database at a rate of one batch of 1000 every 100ms, and one that looks up a single person by `_id`.
A few things can be seen already:
- templates are named and referenced by name in the workloads.
- workloads are also named. This is for collecting statistics.
- templates are pretty flexible. You can use all normal EJSON expressions (like `{"$date": "2021-10-27T00:00:00Z"}`) as well as generator expressions prefixed by `%`. Generator expressions allow you to randomly generate ObjectIds, dates, and almost everything that is supported by JavaFaker.
- templates can `remember` fields they have generated, in order to create libraries of valid data. This is useful for generating workloads later on. When the system starts, the `remember`ed fields are pre-populated from the existing collection (if it exists) by default. See further down for advanced remember features.
- templates can specify indexes (using normal MongoDB syntax) to create at startup (see the sketch after this list).
- workloads run independently, each in its own thread. They can also be multi-threaded, if you want to model specific parallelism conditions. If omitted, `threads` defaults to 1.
- workloads can be `pace`d, that is, you can specify that the operation should run every `n` milliseconds. For instance, if you want an aggregation to run every second and it takes 300ms, the thread will sleep for 700ms before running again. Note that pacing is on a per-thread basis: if you have 4 threads running ops at a 100ms pace, you should expect roughly 40 operations per second (10 per thread). If omitted, `pace` defaults to 0, i.e. the thread will never sleep.
- workloads can use the same template language as templates. They can also refer to `remember`ed fields.
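For instance, a minimal template sketch with startup index creation might look like this (the field names and index keys are illustrative; index definitions use the usual MongoDB key-specification syntax the doc mentions):

{
    "name": "person",
    "database": "test",
    "collection": "people",
    "template": {
        "_id": "%objectid",
        "first": "%name.firstName",
        "last": "%name.lastName",
        "birthday": "%date.birthday"
    },
    "indexes": [
        { "last": 1, "first": 1 },
        { "birthday": -1 }
    ]
}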
The following template expressions are supported:
Binary values
- `%objectid`: generate a new ObjectId
- `%binary`: create a random blob of bytes. Use the form `{"%binary": {"size": 1024}}` to specify the size (default 512 bytes). Use `"as": "hex"` to encode as a hex string rather than a binary array
- `%uuidString`: random UUID, as a String
- `%uuidBinary`: random UUID, as a native MongoDB UUID (binary subtype 4)
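For illustration, a template fragment combining these expressions might look like this (the field names are made up):

{
    "uuid": "%uuidBinary",
    "payload": { "%binary": { "size": 1024 } },
    "checksum": { "%binary": { "size": 16, "as": "hex" } }
}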
Numbers
- `%integer` / `%number`: generate an int. Optionally use the form `{"%integer": {"min": 0, "max": 50}}` to specify bounds
- `%natural`: generate a positive int. Optionally use the form `{"%natural": {"min": 400, "max": 5000}}` to specify bounds
- `%long`, `%double`, and `%decimal` work like `%number` and yield longs, doubles, and decimals. Note that BSON doesn't have a 32-bit float type, so we don't support floats. Instead, `%float` is an alias for `%double`.
- `%gaussian`: generate a number following an approximate Gaussian distribution. Specify `mean` and `sd` for the mean / standard deviation of the Gaussian. Optionally, set `type` to `int` or `long` for integer values (any other value is understood as double)
- `%product`: product of the number array specified by `of`. The `type` parameter (either `long` or `double`, default `long`) specifies how to cast the result
- `%sum`: like `%product`, but sums the `of` array
- `%abs`: absolute value
- `{"%mod": {"of": number, "by": number}}`: modulus (`of` mod `by`)
- `%toInt`, `%toLong`, `%toDouble` (or `%toFloat`): parse a string and return a number
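A quick sketch of a template fragment using a few of these generators (field names are illustrative):

{
    "quantity": { "%natural": { "min": 1, "max": 100 } },
    "score": { "%gaussian": { "mean": 50, "sd": 10, "type": "int" } },
    "total": { "%sum": { "of": [10, 20, 30] } }
}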
Strings and text
{"%stringTemplate": {"template": "some string}}
: string based on a template, where&
is a random digit,?
is a random lowercase letter and!
is a random uppercase letter. All other characters in the template are copied as-is.{"%stringConcat": {"of": [x, y, z, ...], "sep": " "}}
: concatenate as string the list of inputs.{"%toString": {"of": template}}
: make "template" into a string (eg long -> string){"%stringTrim": {"of": "string expression"}}
: removes leading and trailing spaces fromstring expression
%name.firstName
,%name.lastName
,%name.femaleFirstName
,%name.maleFirstName
,%name.name
: generate names%address.state
,%address.latitude
,%address.longitude
,%address.zipCode
,%address.country
,%address.city
,%address.fullAddress
: generate addresses{"%ngram": {"of": "string expression", "min": 3, "split": true, "lowercase": true}}
: compute ngrams of min size "min" (default 3)
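For example, a fragment that builds a reference code and a display name (field names are illustrative):

{
    "reference": { "%stringTemplate": { "template": "ORD-&&&&-??" } },
    "displayName": { "%stringConcat": { "of": ["%name.firstName", "%name.lastName"], "sep": " " } },
    "city": "%address.city"
}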
Dates
- `%now`: current date
- `%date`: create a date between the Unix Epoch and 10 years in the future, or specify `min` / `max` bounds in a subdocument, either as ISO8601 or as EJSON dates (hint: `{$date: "2021-01-01"}` works, but `"2021-01-01"` doesn't, as it's not valid ISO8601).
- `{"%plusDate": {"base": date, "plus": amount, "unit": unit}}`: adds some time to a date. `unit` is one of: `year`, `month`, `day`, `minute`
- `{"%ceilDate": {"base": date, "unit": unit}}`: align the date to the next unit (e.g. next hour, day...) - the default unit is `day`
- `{"%floorDate": {"base": date, "unit": unit}}`: truncate the date to the unit (e.g. hour, day...) - the default unit is `day`
- `{"%extractDate": {"minute": date}}`: extract a UTC time field (second, minute, hour, day, month, year) from a date.
Sequential values
- `%sequence`: create a sequential number from a global sequence.
- `%threadSequence`: create a sequential number from a per-thread sequence.
Arrays and objects
{"%array": {"size": integer, "min": integer, "max": integer, "of": { template }}}
: variable-length array (min/max elements, of subtemplate). Ifsize
is present,min
/max
are ignored.{"%keyValueMap": {"min": integer, "max": integer, "key": { template resolving to string }, "value": { template } }}
: variable-length subdocument with keys/values generated from the provided templates. Key uniqueness is enforced at generation time.{"%head": {"of": "expression"}}
: first element ofexpression
(should resolve to array or string){"%arrayElement": {"from": [...], "at": integer}}
:at
th element (0-based) ofarray
Alternative values
{"%oneOf": {"options": [ ... list of options ...], "weights": [ ... list of weights ...]}}
: pick among options.weights
is optional; only use positive ints (or expressions that resolve to ints).{"%dictionary": {"name": "dictionary name"}}
: pick a value from a dictionary (synonym with"#dictionary name"
){"%dictionaryConcat": {"from": "dictionary name", "length": (number), "sep": "separator}}
: string length values from a dictionary, separated by sep.{"%dictionaryAt": {"from": "dictionary name", "at": (integer)}}
: get the nth element of a dictionary.
Geospatial
{"%longlat": {"countries": ["FR", "DE"], "jitter": 0.5}}
: create a longitude / latitude pair in one of the provided countries.jitter
adds some randomness - there are only 30ish places per country at most in the dataset, so if you want locations to have a bit of variability, this picks a random location withinjitter
nautical miles (1/60th of a degree) of the raw selection. A nautical mile equals roughly 1800 metres.{"%coordLine": {"from": [x, y], "to": [x, y]}}
: create a long,lat pair (really an x,y pair) that is on the line betweenfrom
andto
.
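As an illustration, and assuming `%longlat` resolves to a `[longitude, latitude]` array that can be embedded in a GeoJSON point (the field name is made up):

{
    "location": {
        "type": "Point",
        "coordinates": { "%longlat": { "countries": ["FR", "DE"], "jitter": 0.5 } }
    }
}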
Utility
{"%descend": {"in": {document}, "path": "dot.delimited.path"}}
is used to traverse documents. This should be mostly legacy, as#document.dot.delimited.path
is equivalent.%workloadName
: name of the current workload%threadNumber
: number of the thread in the current workload{"%head": {"of": xxx}}
: get the first element ofof
. Supports Strings, UUIDs, ObjectId, Binary, BSON arrays (aka Java Lists).
Any other expression is passed to JavaFaker - to call `lordOfTheRings().character()`, just write `%lordOfTheRings.character`. You can only call faker methods that take no arguments. The best way to generate random text is to use `%lorem.word` or `%lorem.sentence`.
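For example, a template fragment mixing faker calls and lorem text:

{
    "character": "%lordOfTheRings.character",
    "keyword": "%lorem.word",
    "summary": "%lorem.sentence"
}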
It is possible to create variables, which are evaluated once and reused multiple times in a template.
For example, look at this `templates` section:
"templates": [
{
"name": "19th century people",
"database": "simrunner",
"collection": "19th_century",
"variables": {
"birthday": {"%date": {"min": {"$date": "1800-01-01"}, "max": {"$date": "1900-01-01"}}}
},
"template": {
"first": "%name.firstName",
"last": "%name.lastName",
"birth": "#birthday",
"death": {"%date": {"min": "#birthday", "max": {"$date": "1950-01-01"}}}
}
}
]
This creates records like this one:
{
_id: ObjectId("61815f70cb4ef14d9a4a28f5"),
first: 'Zenaida',
last: 'Barton',
birth: ISODate("1807-06-12T17:28:35.949Z"),
death: ISODate("1865-04-15T15:05:13.892Z")
}
... and ensures that `death` is in fact later than `birth`. Such cross-field dependencies (within a single document) are possible by creating a variable (here `birthday`, using the normal template expressions) and generating the `death` field by referencing it (with the `#` prefix) in the parameters of the `%date` generator.
Note: you can't reference a variable in another variable declaration.
Note: variables can also be defined in a workload definition, and used in templated expressions within that workload.
Note: variables can also be defined in a workload definition, and used in templated expressions within that workload.
Dictionaries let you create custom sets of data that the template system will pick from. A dictionary can be a static list, a JSON file read from disk, a plain text file read from disk, a fixed number of objects generated through a template expression, or even a MongoDB collection from your cluster.
Example:
"dictionaries": {
"names": ["John", "Jill", "Joe", "Janet"],
"statuses": ["RUNNING", {"status": "DONE", "substatus": "OK"}, {"status": "DONE", "substatus": "NOK"}],
"characters": {"file": "characters.json", "type": "json"},
"locations": {"file": "locations.txt", "type": "text"},
"dates": {"type": "templateUnique", "size": 10, "template": "%natural"}
"identifiers": {"type": "collection", "collection": "referenceIdentifiers", "db": "refdb", "query": {"valid": true}, "limit": 1000, "attribute": "name"}
}
This creates six dictionaries:
- `names` is an inline list of strings
- `statuses` is an inline list of BSON values - this shows strings and documents, but it could be anything that is expressible in Extended JSON
- `characters` is a JSON file read from disk. The file must contain a single document with an array field called `data` that contains the dictionary (similar to inline dictionaries)
- `locations` is a plain text file, with one dictionary entry per line (only strings, no other or mixed types)
- `dates` is a set of 10 unique values generated from the `%natural` template expression (the `templateUnique` type)
- `identifiers` is read from the `refdb.referenceIdentifiers` collection on the cluster. Only `collection` is mandatory; the other parameters default as follows:
  - `db`: inherited from the template definition
  - `query`: `{}`
  - `limit`: 1,000,000 (same as remembered field prefetching)
  - `attribute`: attribute to use for the dictionary
Dictionaries can be used in templates:
- either directly (pick a word in the dict) with the `"#dict"` or `{"%dictionary": {"name": "dict"}}` syntaxes,
- or by concatenating multiple entries of a dictionary. This is useful to create variable-length text based on real words, rather than Lorem Ipsum. Most UNIX/Linux systems (including macOS) have a dictionary for spell checking at `/usr/share/dict/words`, which SimRunner can read directly to make (nonsensical) text that you can query, for example using Atlas Search (see the sketch below).
SimRunner can "remember" a library of values that exist in the data set. These values can then be used in query templates, exactly like a dictionary. They cannot however be used in document generation.
These values can have two origins:
- at initialization time, SimRunner will preload values from the configured collection (if it exists)
- every time a document is generated, SimRunner will extract values to remember
At its simplest, you can create a library of values by specifying `"remember": ["value"]` in the template. This turns on value collection both at init time (so-called preloading) and at generation time. "value" can be a top-level or nested field (with `dotted.path` syntax). If the field resolves to an array, it is recursively unwound until we get to a scalar value.
Note that dots in the field path are replaced by underscores to name the field in the resulting dictionary. For example, if you have this template definition:
{
"template": {
"top": {
"bottom": "%number"
}
},
"remember": [ "top.bottom" ]
}
Then you can use the following in workloads:
{
"op": "find",
"filter": { "top.bottom": "#top_bottom" }
}
With this, you are guaranteed that `#top_bottom` will resolve to the value of an actual `bottom` field.
For more control, you can use the following long form: `"remember": [ {"field": "x", "compound": ["x", "y"], "name": "name", "preload": true, "number": 10000} ]`. This long form provides the following features:
- `field`: field name or field path, as in the simple form of `remember`
- `compound`: instead of remembering a single field, generate a document by compounding several fields. For example, `"compound": [ "x", "y.z" ]` will remember a value of the form `{"x": ..., "y_z": ...}`. The behaviour is the same as for the simple syntax: paths are descended and dots (.) are replaced with underscores (_) in field names. If `compound` is present, `field` is ignored.
- `name`: the name of the value library, which will be used in queries. By default, it is the same as `field` with dots replaced by underscores. If using `compound`, specifying a name is mandatory.
- `preload`: should values be loaded from the existing collection at startup? (default: true)
- `number`: how many distinct values should be preloaded from the existing collection at startup? (default: one million)
Compounding is useful when you want to run complex queries and still ensure they do match some existing records. For example, with the following template:
{
"name": "person",
"template": {
"_id": "%objectid",
"first": "%name.first",
"last": "%name.last",
"date of birth": "%date.birthday"
(...)
},
"remember": ["first", "last"]
}
If you want to run a query on both `first` and `last`, you could do that by `remember`ing both fields and running a workload like:
{
"template": "person",
"op": "find",
"params": {
"filter": {
"first": "#first",
"last": "#last"
}
}
}
but it would pick a first name and a last name at random - most of the time yielding a combination that doesn't actually exist in the database. Instead, use a compound remember specification:
{
"name": "person",
"template": {
"_id": "%objectid",
"first": "%name.first",
"last": "%name.last",
"date of birth": "%date.birthday"
(...)
},
"remember": [{"compound": ["first", "last"], "name": "firstAndLast"}]
}
and query it like this:
{
"template": "person",
"op": "find",
"variables": {
"compoundVar": "#firstAndLast"
},
"params": {
"filter": {
"first": "#compoundVar.first",
"last": "#compoundVar.last"
}
}
}
Arrays are supported when remembering values: they are unwound so that only scalar values are remembered. If you have a compound remember specification, the cartesian product of the arrays is unwound. For example, consider the following document:
{
"a": [ "one", "two" ],
"b": [ "three", "four" ]
}
If you have a remember specification of `{ "compound": ["a", "b"], "name": "cmp"}`, then the system will generate a dictionary called "cmp" with the following values:
[
{ "a": "one", "b": "three" },
{ "a": "one", "b": "four" },
{ "a": "two", "b": "three" },
{ "a": "two", "b": "four" }
]
When the system encounters a `#name` token, it is resolved in the following order:
- Variables
- Remembered values
- Dictionaries
For compatibility's sake, the `##name` syntax can still be used to refer to variables.
When a token like `#name.sub.document` is found, documents are descended as expected. Arrays are descended too, yielding a list of values.
A template can define a `createOptions` document with the same syntax as the create database command. This is useful to create capped collections, time series collections, or validation.
Note that views and collation options are not supported.
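For example, a template could ask SimRunner to create its collection as a time series collection with a sketch like this (the `createOptions` content follows the standard `create` command; the names are illustrative):

{
    "name": "metrics",
    "database": "test",
    "collection": "metrics",
    "createOptions": {
        "timeseries": {
            "timeField": "ts",
            "metaField": "sensorId",
            "granularity": "minutes"
        }
    },
    "template": { ... }
}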
A template can define basic sharding options:
- shard key
- presplit chunks or a custom presplit class
Example configuration:
"sharding": {
"key": { "first": 1 },
"presplit": [
{ "point": {"first": "A"}, "shard": "shard01" },
{ "point": {"first": "M"}, "shard": "shard02" }
]
}
Example configuration using a custom presplit class:
"sharding": {
"key": {
"first": 1
},
"presplit": "org.schambon.loadsimrunner.sample.SamplePreSplitter"
}
Copy the sample presplit class to create your own.
Presplit is optional. It is not possible to presplit a hashed sharded collection.
Some notes:
- the first chunk (from minKey to the first split point) remains on the primary shard of the collection.
- if a collection is already sharded, all sharding options are ignored (no resplitting or anything like that).
- if a collection is sharded, even if sharding isn't configured, the `drop` option is reinterpreted as `deleteMany({})` (delete all documents rather than dropping). This is because dropping a sharded collection requires flushing the cache on all routers, which is not practical from the client side.
- it is up to the user to make sure the cluster configuration (like the number and names of the shards) aligns with the sharding configuration here. No checks are made.
If you want to test a set of identical workloads across multiple collections, you can use template instances. In your template definition, add an `instances` parameter (and optionally `instancesStartAt` for more control); this will clone your template definition and all associated workload definitions the specified number of times.
For example:
{
"name": "myTemplate",
"database": "db",
"collection": "coll",
"instances": 20,
"template": { ... }
}
This will create collections `coll_0` through `coll_19` and apply duplicate workloads (also named `workload_name_0` through `workload_name_19`) to these collections.
If you don't want the numbering to start at 0, use `"instancesStartAt": n` to specify your starting point. This can be useful when running a clustered SimRunner to spread the instances - for example, SimRunner A targets collections 0 through 99 and SimRunner B targets collections 100 through 199.
Note that the template instances are created in parallel; by default, up to 500 worker threads are allocated to initialize templates. If this number doesn't suit you, you can change it with the `"numWorkers": n` parameter.
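Putting these options together, an instanced template might look like this sketch:

{
    "name": "myTemplate",
    "database": "db",
    "collection": "coll",
    "instances": 100,
    "instancesStartAt": 100,
    "template": { ... }
}

This would create and target collections `coll_100` through `coll_199`.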
Be aware that all workloads are duplicated! So if you run 100 instances and have a workload that defines 100 threads, you will end up with 10000 client threads! Be sure to adjust thread counts and pacing so you don’t overwhelm your hardware.
- threads: number of worker threads for this workload
- batch: (insert or updateOne only) bulk write batch size
- pace: operations should run every pace milliseconds (on each thread)
- readPreference: primary, secondary, primaryPreferred, secondaryPreferred, nearest (no tag sets)
- readConcern: majority, local, available, linearizable, snapshot
- writeConcern: majority, w1, w2 (no tag sets)
- stopAfter: stop after n iterations, for each thread.
Note that `stopAfter` counts full iterations of the workload on a single thread - e.g. if you're inserting documents in batches of 100 on 10 threads and you want 1,000,000 documents in the collection, then you need to set `"stopAfter": 1000`. Said another way, total docs = stopAfter * threads * batch.
{
"name": "Insert a person",
"template": "person",
"op": "insert",
"threads": 1,
"batch": 0
}
Inserts a new instance of the named template.
Options:
- batch: bulk insert. Omit or specify `"batch": 0` for unit inserts.
{
"name": "Find by first name",
"template": "person",
"op": "find",
"params": {
"filter": { "first": "#first" },
"sort": { "birthday": -1 },
"project": { "age": { "$subtract": [{"$year": "$$NOW"}, {"$year": "$birthday"}]}},
"limit": 100
}
}
Executes a simple query. `filter` can use the template language (`{"first": "%name.first"}` looks for a random first name), including references to `remember`ed fields (`{"first": "#first"}`). The result is passed to MongoDB as-is, so you can use normal operators (`$lt`, `$gte`, etc.), including `$expr` for pretty complex searches.
`find` workloads always fetch all the results that match. Use `limit` to simulate `findOne`.
Options:
- filter: the query filter, can use templates
- sort: sorting specification, does not use templates
- project: projection specification, does not use templates
- limit: number of documents to fetch
{
"name": "Update last name",
"template": "person",
"op": "updateOne",
"params": {
"filter": { "_id": "#_id"},
"update": { "$set": { "last": "%name.lastName"}},
"upsert": true
}
}
Performs an update (one or many).
Options:
- filter: the query filter, can use templates
- update: the update specification, must include mutation operators (`$`). `update` can be a pipeline for expressive updates.
- upsert: is this an upsert? (defaults to `false`)
{
"name": "Replace one person",
"template": "person",
"op": "replaceOne",
"params": {
"filter": { "_id": "#_id" },
"update": { "first": "%name.firstName", "last": "#last" },
"upsert": true
}
}
Replaces a document with an inline template. Note that if the `update` field specifies an `_id`, it is stripped. Also note that fields generated inline will not be remembered.
Options:
- filter: the query filter, can use templates
- update: the replacement document, can use templates (including references)
- upsert: is this an upsert? (defaults to `false`)
{
"name": "Replace one person with a new template instance",
"template": "person",
"op": "replaceWithNew",
"params": {
"filter": { "first": "%name.firstName", "last": "%name.lastName" },
"upsert": true
}
}
Replaces a document with a new instance of its original template. The same notes as for `replaceOne` apply.
{
"name": "sum numbers of random name",
"template": "person",
"op": "aggregate",
"params": {
"pipeline": [
{"$match": {"first": "#first"}},
{"$group": {"_id": null, "count": {"$sum": "$embed.number"}}}
]
}
}
Run an aggregation pipeline.
Options:
- pipeline: the pipeline to run. No particular restrictions (on Atlas, for example, this can use Atlas Search `$search` indexes). All stages are run through the template generator.
This is a special workload type for timeseries insertion.
{
"comment": "Insert historical data",
"name": "Insert",
"template": "metrics",
"op": "timeseries",
"threads": 1,
"batch": 200,
"pace": 10,
"comments": [
"Threads should always be 1 or absent for timeseries",
"Batch and Pace work as expected"
],
"params": {
"meta": {
"metaField": "sensorId",
"dictionary": "sensors",
"generate": "all",
"comment": "At each iteration of the workload, iterate over the full dictionary"
},
"time": {
"timeField": "ts",
"start": {"$date": "2023-01-01"},
"stop": {"$date": "2023-01-31"},
"step": 300000,
"jitter": 30000,
"comment": [
"Increment a timer based on step / jitter. This is useful for backfilling history.",
"For ongoing, use 'value': '%now' (or other template) instead of start/stop/step/jitter"
]
},
"workers": 1000
}
}
NOTE: Timeseries workloads do not check whether the underlying collection is a timeseries collection. They actually work very well with regular collections too! They just will not leverage MongoDB 5+'s timeseries optimizations. Ensure (in the template or in an out-of-band init script) that the collection is set up as you want.
Each record in a time series has a `meta` field (which identifies the series) and a time field. In a timeseries workload:
- you should always set `threads` to 1. Setting threads to more than 1 will just run the workload several times in parallel, each with its own "timeline", which is probably not what you want. Parallelization is achieved through the `workers` parameter instead. `pace` works as usual.
- the associated template should NOT contain time or meta fields; these are set by the workload.
- you must provide a dictionary that contains the `meta` values.
Each time the workload runs, it generates documents for a number of series (controlled by the `generate` option; supported values are `"all"` and `{"random": "expression"}`). It then generates a timestamp in one of two ways:
- `start` / `stop` / `step`: the workload keeps a clock initialized at `start`. At each run, the clock advances by `step` milliseconds. When the clock reaches `stop`, the workload stops. This clock is used for all series generated in the run. You can add per-series noise using the `jitter` option (in milliseconds): each individual record's timestamp will be the global clock plus or minus `jitter` milliseconds. `stop` and `jitter` are optional.
- `value`: pass in an expression that resolves to a date.
In general, you will use `start`/`step` to populate historical data, then use `value` (often with `%now`) to simulate ongoing activity.
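For the ongoing case, the `time` section of the params could be as simple as this sketch (same `timeField` as in the example above):

"time": {
    "timeField": "ts",
    "value": "%now"
}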
Once all records are generated, they are inserted according to the `batch` option:
- a numeric value causes insertMany statements with `batch` documents per statement
- any other value causes single insertOne statements to be issued for each document
Write operations are spread out over a number of worker threads (the `workers` parameter, defaulting to 1).
Sometimes you just want to write JSON documents with a certain template to a Kafka topic. This workload type will do just that.
{
"comment": "Insert a price record in Kafka every 1000ms",
"name": "Price",
"template": "price",
"op": "kafkaInsert",
"params": {
"bootstrap-servers": "localhost:9092",
"topic": "prices"
},
"threads": 1,
"pace": 1000
}
Params:
- `bootstrap-servers`: your Kafka brokers
- `topic`: the topic to write to
No support is built in (yet) for security, authentication, and other advanced options.
Important note: you still need a MongoDB instance, even if it doesn't do anything. SimRunner won't start if the `connectionString` at the top of the config file is invalid. Templates are mapped to collections, so even though you are writing to Kafka, an empty collection will be created in MongoDB for your template. (Fixing this would require a deep refactoring of SimRunner; contributions welcome.)
SimRunner has basic support for Queryable Encryption (QE) using automatic encryption (explicit encryption is not supported).
To set up QE, add an `encryption` stanza to the configuration file, with the following structure:
"encryption": {
"enabled": true,
"sharedlib": "<Shared lib path>",
"keyVaultUri": "<Key vault URI>",
"keyProviders": {
"local": {
"key": "<path to your master key>"
}
},
"keyVaultNamespace": "encryption.__keyvault",
"collections": [
{
"database": "db",
"collection": "coll",
"kmsProvider": "local",
"fields": [
// encrypted field map
]
}
]
}
A full example is in tests/queryable-enc.json.
Noteworthy:
- you MUST install the MongoDB Crypt Shared library, downloadable as part of MongoDB Enterprise Advanced distributions. In the config file, provide the full path to the `mongo_crypt_v1` library (for instance, on a Mac this is `mongo_crypt_v1.dylib`) as the `sharedlib` parameter.
- you can use a different MongoDB instance as a key vault. If you do not provide a `keyVaultUri` parameter, SimRunner will use the global `connectionString`.
- you MUST provide a `keyVaultNamespace` (it can be anything you like).
- only the `local` key provider is supported currently. AWS, GCP, Azure and KMIP will be added in the future (PRs are welcome ^_^)
This has been tested against MongoDB 7.0 in Atlas on a Mac using Rosetta. If you are using another configuration and encounter issues, feel free to file a bug report.
A report is printed on stdout (as well as to a `simrunner.log` file) at every reporting interval (1 second by default). For each specified workload, it prints the following:
18:18:59.403 [main] INFO org.schambon.loadsimrunner.Reporter - 43 - Insert:
584 ops per second
58400 records per second
3.311301 ms mean duration
3.000000 ms median
4.000000 ms 95th percentile
(sum of batch durations): 19338.000000
100.000000 / 100.000000 / 100.000000 Batch size avg / min / max
Line by line, this consists of:
- timestamp, workload name (here: "Insert")
- total number of operations per second (if doing bulk writes, it's the number of batches we sent)
- total number of records per second - for insert, that's number of records inserted, for find it's the number of records fetched, etc.
- mean duration of an operation
- median duration of an operation
- 95th percentile duration of an operation
- average / min / max batch size - this is mostly useful for `find` and `updateMany`; it tells you how many records are returned / updated per operation. For `insert` it should be exactly equal to your specified batch size.
- util% - approximately the percentage of time spent interacting with the database (with multiple threads running, it can exceed 100%). This is useful to decide whether apparent poor performance is due to the DB or to the test harness itself.
In the `http` section you can configure a REST interface to get periodic reports as JSON.
"http": {
"enabled": false,
"port": 3000,
"host": "localhost"
}
- `enabled`: boolean, enable the HTTP server (default: false)
- `port`: int, which port to listen on (default: 3000)
- `host`: string, which host/IP to bind to (default: "localhost"). If you want to listen on the wider network, set a hostname / IP here, or "0.0.0.0".
Once the system has started, use `curl host:port/report` for a list of all reports since the beginning, or `curl host:port/report\?since=<ISO date>` for a list of all reports since the provided date. This responds with JSON similar to:
[
{
"report": {
"Find by key": {
"95th percentile": 1.0,
"client util": 96.99823425544469,
"max batch size": 1.0,
"mean batch size": 0.98914696529794,
"mean duration": 0.2278446011336935,
"median duration": 0.0,
"min batch size": 0.0,
"ops": 4339,
"records": 4292
},
"Insert": {
"95th percentile": 83.3,
"client util": 1.5597410241318423,
"max batch size": 1.0,
"mean batch size": 1.0,
"mean duration": 15.9,
"median duration": 1.0,
"min batch size": 1.0,
"ops": 1,
"records": 1
}
},
"time": "2021-11-02T12:46:23.908Z"
}
]
The reports SimRunner builds can add up to a bunch of data. Of course you can visualise them in the http interface (and download as CSV to work in your favourite spreadsheet program), but at some point it would be nice to send it to something more robust, like... a database?
If you add a `mongoReporter` section to your config file, SimRunner will write all reports to the provided MongoDB collection. Note that if the collection doesn't exist (or if you specify `"drop": true`), it will be created as a time series collection, which requires at least MongoDB 5.0. If you're running an older version, just create your report collection manually.
Configuration:
"mongoReporter": {
"enabled": true,
"connectionString": "$REPORTER_MONGO_URI",
"database": "simrunner",
"collection": "report",
"drop": false,
"runtimeSuffix": true
}
- `enabled`: should results be logged to MongoDB? (default: `false`)
- `connectionString`: MongoDB connection string. Does not have to be the same as the tested cluster (arguably, it should not).
- `database`: database in which to store the reports
- `collection`: collection in which to store the reports
- `drop`: drop the collection? (default: `false`)
- `runtimeSuffix`: if `true`, the UTC date and time when SimRunner starts up is appended to the collection name, in effect creating one collection per test run (default: `false`)
Note that the HTTP interface doesn't need to be running for the MongoReporter to work. They are two completely different subsystems.
If `field` is a remembered field, the `#field` expression takes a value from the "field" bag at random every time it is evaluated. If you need the same value twice, use a variable (which is set once per template). For instance, to create a workload that finds a document by first name and creates a new field by appending a random number to the first name, you can do this:
"workloads": [{
"name": "fancy update",
"template": "sometemplate",
"variables": {
"first": "#first"
},
"op": "updateOne",
"params": {
"filter": { "first": "#first" },
"update": { "$set": {"newfield": {"%stringConcat": ["#first", " - ", "%natural"]}}}
}
}]
JSON has no syntax for comments... but SimRunner will happily ignore configuration keys it doesn't recognise. We consider it good practice to add a `comment` key to your workload definitions.
If you use bulk writes (the `batch` argument in an `insert` or `update` workload), note that workload variables are set once per batch, whereas template variables are evaluated once per document. Also, any variables defined in the workload are inherited by the template generator (for inserts). This lets you create once-per-batch values.
For instance, if you are dealing with time series data, you may want to insert a batch of values for a single sensor:
"templates": [{
"template": {
"sensorId": "#sensor",
"values": (...)
}
}],
"workloads": [{
"op": "insert",
"variables": {
"sensor": "%number"
},
"batch": 100
}]
In the above example, for each batch of 100 values, the `sensor` variable is set once and inherited by template generation. Any variables set at the template level would, however, be re-evaluated every time a document is generated.
- Does not support arrayFilters or hint
- No declarative support for transactions or, indeed, multi-operation workflows ("read one doc and update another") - you have to use custom runners for that.