Generating a large amount of data with the standard Python interpreter can take a long time. In that case, run datafaker under PyPy, for example:
pypy -m datafaker hbase localhost:9090 PIGONE 50000 --meta hbase.txt
Alternatively, use multiple threads. The following command generates data with 8 threads and writes to PostgreSQL in batches of 2000 rows:
datafaker mysql postgresql+psycopg2://postgres:postgres@localhost/testpg pig_fnumbe_test 100000 --meta meta.txt --worker 8 --batch 2000
Writes to HBase can time out because the hbase.thrift.server.socket.read.timeout parameter is too small; the default is 60 seconds.
To raise it, add the following configuration to conf/hbase-site.xml:
<property>
<name>hbase.thrift.server.socket.read.timeout</name>
<value>600000</value>
<description>unit: milliseconds</description>
</property>
Then restart HBase and the Thrift server.
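Since the value is given in milliseconds, it is easy to be off by a factor of 1000. The following is a minimal sketch for checking what a configuration fragment actually sets. It parses an inline sample string rather than a real file; the `<configuration>` wrapper and the helper name `read_property` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Sample fragment wrapped in <configuration>, the root element
# that hbase-site.xml normally uses (assumed here for illustration).
SAMPLE = """<configuration>
  <property>
    <name>hbase.thrift.server.socket.read.timeout</name>
    <value>600000</value>
  </property>
</configuration>"""


def read_property(xml_text, name):
    """Return the <value> text of the named property, or None if absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


timeout_ms = int(read_property(SAMPLE, "hbase.thrift.server.socket.read.timeout"))
# Convert milliseconds to seconds for a quick sanity check.
print(timeout_ms / 1000, "seconds")
```

With the value 600000 this reports 600 seconds, that is, 10 minutes instead of the default 60 seconds.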
Most of the examples use MySQL.
However, any relational database supported by SQLAlchemy can be used, such as PostgreSQL, Oracle, TiDB, and Redshift.
In every case the database type argument is rdb, for example:
datafaker rdb postgresql+psycopg2://postgres:postgres@localhost/testpg pig_fnumbe_test 100000 --meta meta.txt --worker 8 --batch 2000
To write to Oracle:
datafaker rdb oracle://root:root@127.0.0.1:1521/helowin stu 10 --meta meta.txt
The SQLAlchemy connection string must use the oracle: scheme, as in the command above.
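To see how the connection strings used above break down, here is a small sketch that splits them with the Python standard library only. It is not how SQLAlchemy itself parses URLs (SQLAlchemy has its own `make_url` helper); the function name `describe` and the dictionary keys are chosen for illustration.

```python
from urllib.parse import urlsplit


def describe(url):
    """Split a connection string into its visible components."""
    parts = urlsplit(url)
    # The scheme has the form "dialect" or "dialect+driver".
    dialect, _, driver = parts.scheme.partition("+")
    return {
        "dialect": dialect,
        "driver": driver or None,
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port,
        # The path component after the host, without the leading slash.
        "database": parts.path.lstrip("/"),
    }


pg = describe("postgresql+psycopg2://postgres:postgres@localhost/testpg")
ora = describe("oracle://root:root@127.0.0.1:1521/helowin")
print(pg)
print(ora)
```

The PostgreSQL string names both a dialect and a driver (postgresql and psycopg2), while the Oracle string names only the dialect and adds an explicit port.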
Operating system | Python version | Test result | Remarks |
---|---|---|---|
Mac OS X | Python 2.7 / 3.5+ | pass | |
Linux | Python 2.7 | pass | |
Windows 10 | Python 3.6 | pass | |
To write data at a controlled rate, set the interval and batch parameters, for example:
datafaker rdb postgresql+psycopg2://postgres:postgres@localhost/testpg pig_fnumbe_test 100000 --meta meta.txt --interval 0.5 --batch 1
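A rough way to estimate how long such a run takes is sketched below. It assumes that interval is a pause in seconds applied once per batch, and it ignores the time spent generating and writing the data, so the result is only a lower bound. The function name `min_duration_seconds` is hypothetical and not part of datafaker.

```python
import math


def min_duration_seconds(total_rows, batch, interval):
    """Lower bound on run time: one pause of `interval` seconds per batch."""
    batches = math.ceil(total_rows / batch)
    return batches * interval


# Values from the command above: 100000 rows, batch 1, interval 0.5.
print(min_duration_seconds(100000, 1, 0.5))
```

Under that assumption, 100000 rows written one at a time with a 0.5 second pause need at least 50000 seconds, so a larger batch or a smaller row count may be preferable for quick tests.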
If you need to write to other data sources, please open an issue.