Python script to upload files to Elasticsearch
To run the uploader, first download the files in the repo https://github.com/SmartcitySantiagoChile/elasticsearchInstaller/ and follow the instructions provided in the Readme.
After elasticsearch and cerebro are running, go to the dataUploader folder and execute:
pip install -r requirements.txt
To load a file, the usage is:
python3 datauploader/loadData.py path/to/file
file: path to the file that contains the data to load.
--host: the IP where ES is running. Default:"127.0.0.1".
--port: the port that application is listening on. Default: 9200.
--index: name of the index to create/use. If an index with the same name does not exist, the application will create it. Otherwise, it will use the existing index. By default, the name of the index is the extension of the file being uploaded.
--chunk: number of docs to send to ES in one chunk. Default: 5000.
--threads: number of threads to use. Default: 4.
--timeout: timeout parameter of the ES client. Default:30 (in seconds.)
The default chunk size and number of threads are the ones that gave the best results when experimenting with different files, so using the defaults is recommended. Nevertheless, sometimes loading a big file can cause a timeout error; in this case, raising the timeout value should solve the issue (elastic/elasticsearch-py#231)
The mappings must be in a folder called mappings
in the same directory where loadData.py
is. The script uses
the datafile extension to know what mapping to use so, if the mapping is called ext-template.json
, then the file
must have extension ext
. For the mappings, only JSON format is accepted.
As the ES team is planning to deprecate doctypes, the mappings used to load the file should all use the default doctype
that is doc
. If a property is a date
, then expliciting the format is recommended since ES is not able to
parse all date formats, and it may cause an exception. E.g.:
{
"mappings": {
"doc": {
"properties": {
"id": {
"type": "long"
},
"dateTime": {
"type": "date",
"format": "yyyy-MM-dd HH:mm:ss"
},
"type": {
"type": "text"
},
"description": {
"type": "text"
}
}
}
}
}
Additionally, is a property is listed as boolean only true
and false
values are accepted, using 0 and 1 will
cause a MapperParsingException
. In some cases when the data is not properly parsed the
option ignore malformed
can be
used (https://www.elastic.co/guide/en/elasticsearch/reference/current/ignore-malformed.html,
check stop-template.json
.)
The extensions that this script loads are: expedition
, general
, od
, profile
, shape
, speed
, stop
and trip
.
In order to load a file that has a different extension its mapping must be included in the mappings folder and its
header must be added to datafile.py
. If not every column of the file is going to be loaded, if more columns need
to be added or if it's a nested file (meaning it has more than a line per document, like shape
), then a Python
file that inherits from datafile.py
must be created and the getHeader
and makeDocs
methods must be
overwritten.
Before to run tests you need to install test dependencies with the command:
pip install -r requirements-dev.txt
After that, you can execute:
python -m unittest
This test the process to upload to elasticsearch.
Build command:
docker-compose -p datauploader -f docker\docker-compose.yml build
Run command:
docker-compose -p datauploader -f docker\docker-compose.yml up --abort-on-container-exit
To create a release of this project, you must follow the next steps:
- Add a new version to
CHANGELOG.md
file with the following format:
vX.Y.Z YYYY-mm-DD
- Comment 1
- Comment 2
...
The most recent versions go at the top of the file.
- Change version code in
setup.py
with the same version added inCHANGELOG.md
- Commit changes of
CHANGELOG.md
file andsetup.py
file - Create a tag with the same version added in
CHANGELOG.md
. Use the commandgit tag vX.Y.Z
, avoid adding comments because it will be ignored it - Push tag to GitHub with the command:
git push origin vX.Y.Z
After these steps GitHub Actions take control, it will create a release.