This application is a server for on-demand synchronization of vector datasources to a PostgreSQL/PostGIS database. It offers a JSON-based HTTP REST webservice which external applications can call to trigger the synchronization of geodata into a PostgreSQL database.
The name "batyr" originates from an Asian elephant who lived in a zoo in Kazakhstan and was claimed to be able to speak with a vocabulary of about 20 words. This ability, together with the fact that the official PostgreSQL logo is also an elephant, made the name of the animal somewhat fitting for an application whose main purpose is communicating with remote datasources to read their data into a PostgreSQL database. For more information on this elephant, refer to the corresponding Wikipedia article.
The synchronization process can be divided into six steps:
- batyr creates a new temporary table in the database which uses the same schema definition as the target table.
- data is pulled from the source and gets written to the new temporary table.
- batyr uses the primary key definition of the target table to update the contents of the target table with the newly fetched contents of the temporary table. The update only affects rows where the data actually differs, reducing the number of writes and the number of times any triggers defined on the table fire (see the sketch below). The only current limitation is that multi-geometries (ST_GeometryCollection, ST_Multi*) cannot be compared using the PostGIS ST_Equals function, so rows containing such geometries are compared using the binary representation of their geometries.
- batyr checks the temporary table for rows which are missing in the target table using the primary key and inserts these into the target table.
- batyr deletes all rows from the target table which are not part of the new data. This step is optional and may be disabled via the `allow_feature_deletion` setting; it is also generally deactivated when a filter is used.
- The temporary table gets dropped again.
These six steps are performed inside a transaction and will all get rolled back in case of an error.
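The following minimal sketch illustrates the idea behind the update and insert steps as plain SQL run from Python. It is not batyr's actual implementation, and the table and column names (`test.africa`, `batyr_tmp_africa`, `id`, `name`, `geom`) are hypothetical; batyr derives them from the layer configuration and the table schema.

```python
# Illustrative sketch only -- not batyr's actual code. Table and column names
# are hypothetical.
import psycopg2

UPDATE_CHANGED = """
    UPDATE test.africa AS t
       SET name = s.name, geom = s.geom
      FROM batyr_tmp_africa AS s
     WHERE t.id = s.id                          -- match on the primary key
       AND (t.name IS DISTINCT FROM s.name      -- only touch rows that differ
            OR NOT ST_Equals(t.geom, s.geom))
"""

INSERT_MISSING = """
    INSERT INTO test.africa (id, name, geom)
    SELECT s.id, s.name, s.geom
      FROM batyr_tmp_africa AS s
      LEFT JOIN test.africa AS t ON t.id = s.id
     WHERE t.id IS NULL                         -- rows missing in the target
"""

conn = psycopg2.connect("dbname=batyr user=batyr password=batyr host=localhost")
with conn, conn.cursor() as cur:   # one transaction, rolled back on error
    cur.execute(UPDATE_CHANGED)
    cur.execute(INSERT_MISSING)
```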
Instead of the step-by-step synchronization described above, an alternative "bulk mode" can be used, which simply truncates the target table and copies all data from the source. This can be useful for very large tables where the differential synchronization is too expensive.
The handling of different coordinate systems relies on the PostGIS geometry_columns view or - in older versions - table. batyr will use the SRID information from there to transform incoming geometries to the coordinate system of the target table. Incoming data without coordinate system information will be assigned this SRID without any transformation.
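As a hedged illustration (again not batyr's own code), the target SRID can be looked up in geometry_columns and applied with ST_Transform like this; the schema, table and example geometry are hypothetical:

```python
# Illustrative sketch: look up the SRID of the target geometry column and
# reproject an incoming geometry to it. Names and values are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=batyr user=batyr password=batyr host=localhost")
with conn, conn.cursor() as cur:
    # SRID of the target table's geometry column, as registered in PostGIS
    cur.execute(
        """SELECT srid FROM geometry_columns
            WHERE f_table_schema = %s AND f_table_name = %s""",
        ("test", "africa"),
    )
    (target_srid,) = cur.fetchone()

    # Incoming geometries that carry an SRID are reprojected with ST_Transform;
    # geometries without an SRID would simply be assigned target_srid instead.
    cur.execute(
        "SELECT ST_AsText(ST_Transform(ST_GeomFromText(%s, 4326), %s))",
        ("POINT(8.5 47.4)", target_srid),
    )
    print(cur.fetchone()[0])
```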
The synchronization itself is performed asynchronously. This means that after an external application sends a request to pull a layer, the data in the database might not yet be updated when the request has finished. Instead, the request gets queued internally and is worked on in the background. The main purpose of this behavior is to avoid blocking the requesting application in case there is a lot of data to fetch or many other requests batyr has to handle.
To build batyr from source, the following dependencies (Debian Wheezy/Ubuntu packages) are required:
sudo apt-get install libpoco-dev libpocofoundation9 libpoconet9 libpocoutil9 libgdal1 libgdal1-dev cmake g++ build-essential libpq-dev discount python
The software can be built by the following commands:
cmake .
make
usage: batyrd -c=CONFIGFILE [OPTIONS]
Server for on-demand synchronization of vector datasources to
a PostgreSQL/PostGIS database.
version: 0.1.2 [git: 5e8d43ce50]
--daemon run application as a daemon
--pidfile=path write PID to given file
-h, --help Display help information and exit.
-cfile, --configfile=file Path to the configuration file.
The batyr server includes an integrated web interface which allows pulling the configured layers and provides a status page as well as a page listing the state of the current job queue.
The web interface is available on the port configured with the `port` setting in the `HTTP` section of the configuration file. The default port is 9090.
The server is configured via a configuration file which is specified using the `-c` switch on the command line:
batyrd -c=/etc/my-batyr-config-file.cfg
The valid values for each setting are documented in the example file below. For boolean values, the following values are valid:
- True: yes, true, y, t, 1
- False: no, false, n, f, 0
# Configuration for the batyr server
#
# Syntax hints:
# - valid values for boolean settings are yes,no,true,false,1,0
# - multiline values are possible by starting the line after the key with a "+"
[MAIN]
# The number of worker threads to launch
#
# Optional
# Type: integer; must be > 1
# Default: 2
num_worker_threads = 4
# The time after which finished and failed jobs are removed
# As all jobs are kept in memory this time should not be set too
# high.
#
# The units are seconds
#
# Optional
# Type: integer; must be > 1;
# Default: 600
max_age_done_jobs = 600
# Connection to the postgresql database
# A connection string containing all parameters needed to connect
# to the database. The syntax is described in the postgresql manual
# on
# http://www.postgresql.org/docs/9.2/static/libpq-connect.html#LIBPQ-PARAMKEYWORDS
#
# Mandatory
dsn = "dbname=batyr user=batyr password=batyr host=localhost"
# keep database connections open even when there are currently
# no jobs to handle
#
# Optional
# Type: boolean
# Default: yes
use_persistent_connections = yes
# Logging settings
[LOGGING]
# The loglevel of the application
#
# valid values are: "error", "warning", "information" and
# "debug", although "debug" may only be used with DEBUG
# builds
#
# Optional
# Default: "information"
loglevel = debug
# logfile
# If this setting is not set or empty, all log messages will be written to stdout
#
# Optional
# Default: <not set>
logfile =
# Settings for the HTTP-Interface
[ HTTP ]
# The port to listen on
#
# Optional
# Default: 9090
port = 9091
# Value for the Access-Control-Allow-Origin header to be sent with the HTTP
# API to allow cross-site HTTP requests from JavaScript clients.
#
# For more information see https://developer.mozilla.org/en-US/docs/HTTP/Access_control_CORS
#
# When empty the header will be omitted. To allow requests from everywhere use "*"
#
# Optional
# Default: <not set>
access_control_allow_origin =
# the layer configurations
[LAYERS]
# Example layer. The name in the "[[" "]]" will be used as the unique
# identifier for the layer.
[[africa]]
# enable/disable this layer. Each layer is enabled by default
#
# Optional
# Default: yes
enabled = true
# An optional description to show in the web interface
# has to be in one line
#
# Optional
# Default: empty
description = Countries of africa based on http://www.mapmakerdata.co.uk.s3-website-eu-west-1.amazonaws.com/library/stacks/Africa/index.htm
# The source the contents for the layer should be pulled from
# This corresponds with the "datasource_name" parameter the ogrinfo utility uses. So
# ogrinfo might be used to test the value here for correctness
#
# Mandatory
source = testdata/shapes/Africa.shp
# the name of the layer in the datasource
# This corresponds with the "layer" parameter of the ogrinfo program
#
# Mandatory
source_layer = Africa
# The name of the table the layer's contents should be written to
# The schema has to be omitted here.
#
# Mandatory
target_table_name = africa
# The name of the schema of the table referenced in target_table_name
#
# Mandatory
target_table_schema = test
# Allow features to be deleted during the pull.
# Features which are not part of the source will be removed from the target.
# This setting is always disabled when a filter is used
#
# Optional
# Type: boolean
# Default: no
allow_feature_deletion = yes
# Filter the features of the source by criteria to pull only matching
# features into the database.
#
# The features might be further filtered by the "filter" parameter
# of a pull request
#
# Optional
# Type: Unquoted String. Attribute filter as described in the OGR documentation
# at http://www.gdal.org/ogr/ogr_sql.html in the section of the
# WHERE-clause
# Default: empty
# Example: myattribute=3
filter = 1 = 1
# Ignore features with attributes which can not be cast to the datatype
# of the target column. The default behaviour is to fail the whole pull when a
# single feature can not be cast.
#
# Optional
# Type: boolean
# Default: no
ignore_failures = no
# Override/set the primary key for the table of this layer.
#
# Optional. The normal behaviour would be analyzing the schema of the table to find
# the column(s) of the primary key
# Type: comma-separated list of column names
# Default: empty
primary_key_columns = id
# Enable/disable bulk mode. If bulk mode is enabled, all records in the target table
# will be deleted or truncated (depending on the bulk_delete_method setting) and all
# records from the source table will be copied to the target table. This can increase
# performance on large datasets.
#
# Optional. If it is not set, the default behaviour for pulling tables will be used.
# See the MANUAL.md file.
# Type: boolean
# Default: false
bulk_mode = true
# Which SQL command to use for the bulk mode (if bulk mode is turned on): DELETE or
# TRUNCATE.
#
# Optional.
# Valid values are: "delete" and "truncate".
# Default: "delete".
bulk_delete_method = truncate
The layer section may be repeated for each layer with a unique name.
batyr will neither create tables nor make any modifications to the database schema by itself, so the target tables have to be created by the user.
In general, the mapping of fields in the source data to the columns of the target table is based on the names of the fields: an attribute of an incoming feature will be written to the column with the corresponding name. If no such column exists, the attribute is simply ignored. Columns of the target table which are not part of the source features are left untouched. batyr will attempt to cast values to the datatype of the target column. This approach works in most cases, but has a few limitations: writing an attribute of type text to a target column of type integer will work as long as the source only contains numeric values; when the source attribute may also contain letters or other non-numeric characters, the sync will fail. This is something to keep in mind when designing the schema of the target table.
The target table is required to have a primary key which is also part of the source data - this is quite important as the values of the primary key are the basis of the synchronization.
The geometry column of the table should have a constraint which defines the spatial reference system of the column, to allow batyr to reproject incoming geometries to the correct coordinate system. Earlier PostGIS versions require this information to also be stored in the `geometry_columns` table, where it may be added automatically using the `Populate_Geometry_Columns` function.
Columns may be added or removed from the target table without restarting batyr.
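As a hedged example, a target table matching the `africa` layer from the configuration above could look as follows. The `name` column and the geometry type are assumptions, and the typmod syntax requires PostGIS 2.x (older versions would use AddGeometryColumn or Populate_Geometry_Columns instead).

```python
# Hypothetical target table for the "africa" example layer. It satisfies the
# requirements described above: a primary key that also exists in the source
# data and a geometry column with a declared SRID for reprojection.
import psycopg2

DDL = """
    CREATE TABLE test.africa (
        id   integer PRIMARY KEY,           -- must also be present in the source
        name text,                          -- mapped from the source by name
        geom geometry(MultiPolygon, 4326)   -- declares geometry type and SRID
    )
"""

conn = psycopg2.connect("dbname=batyr user=batyr password=batyr host=localhost")
with conn, conn.cursor() as cur:
    cur.execute(DDL)
```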
Current limitations are:
- Only tables with one geometry column are supported
- Views are not supported as there is no primary key information available for them. Resolving this is on the roadmap.
The provided HTTP-API is the same one the integrated web interface uses and provides the methods described in this part of the documentation. Data is strictly exchanged in the form of JSON objects, and the API requires a `Content-Type: application/json` header if data is POSTed to it.
The basic object the API deals with is called a `job` and possesses the following attributes:

- `id`: Identifier of the job. This value is always present.
- `type`: Type of the job. Always present. Possible values are `pull` and `remove-by-attributes`.
- `timeAdded`: Timestamp when the job was received. Always present.
- `status`: Status of the job. Possible values are `queued`, `in_progress`, `finished` and `failed`. Always present.
- `layerName`: Name of the layer the job wants to pull. Always present.
- `filter`: Attribute filter. Optional. Only used with pull jobs.
- `message`: A message from the server regarding this job. Mostly empty, but will contain an error message in case something went wrong.
- `numPulled`: Number of features pulled/read from the source. Available when `status` is `finished` or `failed`.
- `numCreated`: Number of newly created features in the database. Available when `status` is `finished` or `failed`.
- `numUpdated`: Number of existing features in the database which have been updated. Features will only be updated if they show differences. Available when `status` is `finished` or `failed`.
- `numIgnored`: Number of features ignored because one or more of their attributes have a type incompatible with the table in the database. This behaviour has to be enabled in the config file. Available when `status` is `finished` or `failed`.
- `numDeleted`: Number of features deleted by this job. Available when `status` is `finished` or `failed`.
{
"id": "c94a6c77c18649668fd780744ea745a645a6",
"type": "pull",
"timeAdded": "2013-10-08T13:58:51Z",
"status": "finished",
"layerName": "africa",
"filter": "id=\"4\"",
"message": "",
"numCreated": 0,
"numUpdated": 0,
"numDeleted": 0,
"numIgnored": 0,
"numPulled": 0
}
A list of all jobs which are known to the server.
{
"maxAgeDoneJobsSeconds": 600,
"jobs": []
}
Returns all currently configured layers with their names and description.
{
"layers": [
{
"name": "africa",
"description": "Countries of africa based on http://www.mapmakerdata.co.uk.s3-website-eu-west-1.amazonaws.com/library/stacks/Africa/index.htm"
},
{
"name": "dataset1",
"description": "testing different values"
}
]
}
This method returns an object with the configuration and the current state of the server. The key `numFailedJobs` might be used to monitor the server with a tool such as Nagios (see the sketch after the example below).
{
"appName": "batyrd",
"appVersion": "0.1.0",
"appGitVersion": "96835a2f99",
"numLayers": 2,
"numQueuedJobs": 0,
"numFinishedJobs": 0,
"numFailedJobs": 0,
"numInProcessJobs": 0,
"numWorkers": 4
}
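A minimal Nagios-style check could be built on this document. The URL path used below is an assumption; substitute the actual status endpoint of your batyr instance.

```python
# Monitoring sketch based on the status document shown above.
# The URL path is an assumption -- adjust it to your batyr instance.
import json
import sys
import urllib.request

STATUS_URL = "http://localhost:9090/api/status.json"   # hypothetical path

with urllib.request.urlopen(STATUS_URL, timeout=10) as resp:
    status = json.load(resp)

failed = status.get("numFailedJobs", 0)
if failed > 0:
    print(f"CRITICAL - {failed} failed batyr job(s)")
    sys.exit(2)
print("OK - no failed batyr jobs")
sys.exit(0)
```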
Fetch a job object by its id.
GET /api/job/1ab8c197ed014a4cbc20a6dfc98a1b101b10.json
{
"id": "1ab8c197ed014a4cbc20a6dfc98a1b101b10",
"type": "pull",
"timeAdded": "2013-10-10T07:23:49Z",
"timeFinished": "2013-10-10T07:23:49Z",
"status": "finished",
"layerName": "dataset1",
"filter": "",
"message": "",
"numCreated": 0,
"numUpdated": 2,
"numDeleted": 0,
"numIgnored": 0,
"numPulled": 2
}
{
"message": "No job with the id 1ab8c197ed014a4cbc20a6dfc98a1b101b10 found"
}
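Since jobs are processed asynchronously, a client typically polls this endpoint until the job is done. A minimal sketch, assuming the default port 9090 on localhost:

```python
# Poll a job via the documented /api/job/<id>.json endpoint until it is done.
import json
import time
import urllib.request

BASE_URL = "http://localhost:9090"                  # assumed host and port
job_id = "1ab8c197ed014a4cbc20a6dfc98a1b101b10"     # id from the example above

while True:
    with urllib.request.urlopen(f"{BASE_URL}/api/job/{job_id}.json") as resp:
        job = json.load(resp)
    if job["status"] in ("finished", "failed"):
        break
    time.sleep(2)                                   # jobs are handled in the background

print(job["status"], job.get("message", ""))
```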
Allows starting a new job by POSTing a JSON document to this URL. The `layerName` parameter is mandatory while the `filter` parameter is optional. The request will return a job object with the properties of the newly created job. Returns HTTP status `200` if the request was successful and `400` if the sent data was incorrect. A client sketch follows the request and response examples below.
{
"layerName":"africa",
"filter":"id=\"4\""
}
{
"id": "c94a6c77c18649668fd780744ea745a645a6",
"type": "pull",
"timeAdded": "2013-10-08T13:58:51Z",
"status": "queued",
"layerName": "africa",
"filter": "id=\"4\"",
"message": ""
}
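A minimal client sketch for this method. The URL path is an assumption (use the endpoint the integrated web interface calls), while the JSON body and the required application/json content type follow the documentation above.

```python
# Enqueue a pull job. The URL path is an assumption; the payload matches the
# request document shown above.
import json
import urllib.request

PULL_URL = "http://localhost:9090/api/pull"         # hypothetical path
payload = {"layerName": "africa", "filter": 'id="4"'}

req = urllib.request.Request(
    PULL_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    job = json.load(resp)

print(job["id"], job["status"])                     # e.g. "...", "queued"
```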
Remove features from the database by matching their columns to attributes of JSON objects. It is possible to specify multiple criteria in one request.
This request is more or less an additional feature for applications which need to selectively remove features from the database. In general, performing a full sync using the `pull` API method is the preferred way of ensuring consistent data.
{
"layerName":"africa",
"attributeSets": [
{
"column1": "some value",
"column2": null
},
{
"column3": "some other value"
}
]
}
This request will remove all features where column1 is equal to "some value" and column2 is null, as well as all features where column3 equals "some other value". Attributes within one set are combined with AND, while the sets themselves are combined with OR. Values for attributes have to be strings or null.
{
"id": "c94a6c77c18649668fd780744ea745a645a6",
"type": "remove-by-attributes",
"timeAdded": "2013-10-08T13:58:51Z",
"status": "queued",
"layerName": "africa",
"attributeSets": [
{
"column1": "some value",
"column2": null
},
{
"column3": "some other value"
}
],
"message": ""
}
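A corresponding client sketch; the URL path is again an assumption, while the body mirrors the request document above (Python's None becomes JSON null).

```python
# Enqueue a remove-by-attributes job. The URL path is an assumption.
import json
import urllib.request

REMOVE_URL = "http://localhost:9090/api/remove-by-attributes"   # hypothetical path
payload = {
    "layerName": "africa",
    "attributeSets": [
        {"column1": "some value", "column2": None},
        {"column3": "some other value"},
    ],
}

req = urllib.request.Request(
    REMOVE_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["status"])                            # e.g. "queued"
```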
- The GDAL/OGR library
- The libpq library to connect to PostgreSQL.
- POCO Framework for its embedded HTTP server as well as a generic application framework.
- rapidjson for parsing and generating JSON.
- cmake to build the software.
- AngularJS as the client-side JavaScript framework.
- jQuery for animations.
- Moment.js for date and time formatting.
- Pure CSS for its CSS styles.
- Font Awesome as iconset.
The development of batyr has been sponsored by the trafimage project of Swiss Federal Railways SBB.
For development related notes see the included DEVELOPMENT.md file.
Copyright (c) 2013, geOps
All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
- Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
- Neither the name of the geOps nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.