Moved to new docs
Moved to new docs
- The extractor configuration has 3 parts -
api
,config
andcache
- The
api
section defines the API behavior such as authentication method, pagination, API's base URI etc - The
config
section should contain actual authentication information (tokens etc), as well as individual endpoints in thejobs
section - The
cache
section enables private transparent proxy cache for caching HTTP responses
Moved to new docs
Moved to new docs
The most important part of configuration, the API url (should end with a /
)
- Must be either a string or user function (allows custom domains, see examples)
Example:
https://yourDomain.zendesk.com/api/v2/
-- OR --
{
"api": {
"function": "concat",
"args": [
"https://",
{ "attr": "domain" },
".zendesk.com/api/v2/"
]
},
"config": {
"domain": "yourDomain"
}
}
- for https://yourDomain.zendesk.com/api/v2/
- uses
config
part, where attribute domain would containyourDomain
Moved to new docs
Set the retry limit, rate limit reset header and HTTP codes to retry if the API returns an error
- retryConfig.headerName: (string)
Retry-After
- Name of the header with information when can we access the API again
- retryConfig.httpCodes: (array)
[500, 502, 503, 504, 408, 420, 429]
- HTTP codes on which to retry
- retryConfig.curlCodes: (array)
[6, 7, 28, 35, 52]
- CURL error codes on which to retry
- retryConfig.maxRetries: (int)
10
- Maximum retry attempts (useful for exponential backoff, if the limit reset header is not present)
Moved to new docs
-
Headers required to be set in the config section
-
Should be an array, eg:
App-Key,X-User-Email
-
http.headers.{Header-Name} attribute in config section (eg:
http.headers.App-Key
){ "api": { "http": { "requiredHeaders": [ "App-Key", "X-User-Email" ] } }, "config": { "http": { "headers": { "App-Key": "asdf1234", "X-User-Email": "some@email.com" } } } }
Moved to new docs
- Headers to be sent with all requests from all configurations
- eg: http.headers.Accept-Encoding:
gzip
Moved to new docs
- Define the default request options, that will be included in all requests
- eg: http.defaultOptions.params.queryParameter:
value
Moved to new docs
Moved to new docs
-
authentication.type:
basic
-
use username and password or #password attributes in the config section.
-
password takes preference over #password, if both are set
{ "api": { "authentication": { "type": "basic" } }, "config": { "username": "whoever", "password": "soSecret" } }
Moved to new docs
-
Supports signature function as a value
-
Values should be described in
api
section -
Example bucket attributes:
-
authentication.type:
query
-
authentication.query.apiKey:
{"attr": "apiKey"}
- this will look for the apiKey query parameter value in the config attribute named apiKey
-
authentication.query.sig:
{ "function": "md5", "args": [ { "function": "concat", "args": [ { "attr": "apiKey" }, { "attr": "secret" }, { "function": "time" } ] } ] }
-
this will generate a sig parameter value from MD5 of merged configuration table attributes apiKey and secret, followed by current time() at the time of the request (time() being the PHP function)
-
Allowed functions are listed below in the User functions section
-
If you're using any config parameter by using
"attr": "parameterName"
, it has to be identical string to the one in the actual config, including eventual#
if KBC Docker's encryption is used.{ "api": { "authentication": { "type": "url.query", "query": { "apiKey": { "attr": "apiKey" }, "sig": { "function": "md5", "args": [ { "function": "concat", "args": [ { "attr": "apiKey" }, { "attr": null }, { "function": "time" } ] } ] } } } }, "config": { "apiKey": "asdf1234" } }
-
-
Data available for building the signature:
- attr: An attribute from
config
(first level only) - query: A value from a query parameter
- Ex.:
{ "query": "par1" }
will returnval1
if the query contains?par1=val1
- Ex.:
- request: Information about the request
- Available information:
url
path
queryString
method
hostname
port
resource
- Available information:
- attr: An attribute from
Moved to new docs
-
Log into a web service to obtain a token, which is then used for signing requests
-
authentication.type:
login
-
authentication.loginRequest: Describe the request to log into the service
- endpoint:
string
(required) - params:
array
- method:
string
: [GET
|POST
|FORM
] - headers:
array
- endpoint:
-
authentication.apiRequest: Defines how to use the result from login
- headers: Use values from the response in request headers
[$headerName => $responsePath]
- query: Use values from the response in request query
[$queryParameter => $responsePath]
- headers: Use values from the response in request headers
-
authentication.expires (optional):
- If set to an integer, the login action will be performed every
n
seconds, wheren
is the value - If set to an array, it must contain
response
key with its value containing the path to expiry time in the response-
relative
key sets whether the expiry value is relative to current time. False by default.{ "api": { "authentication": { "type": "login", "loginRequest": { "endpoint": "Security/Login", "headers": { "Content-Type": "application/json" }, "method": "POST", "params": { "UserName": { "attr": "username" }, "PassWord": { "attr": "password" } } }, "apiRequest": { "headers": { "Ticket": "Ticket" } } } }, "config": { "username": "whoever", "password": "soSecret" } }
-
- If set to an integer, the login action will be performed every
Moved to new docs
- Use OAuth 1.0 tokens
- Using OAuth in ex-generic-v2 in KBC currently requires the application to be registered under the API's component ID and cannot be configured in Generic extractor itself
This requires the authorization.oauth_api.credentials
object in configuration to contain #data
, appKey
and #appSecret
, where #data
must contain a JSON encoded object with oauth_token
and oauth_token_secret
properties. appKey
must contain the consumer key, and #appSecret
must contain the consumer secret.
Use Keboola Docker and OAuth API integration to generate the authorization configuration section.
- authentication.type:
oauth10
Example minimum config.json
:
{
"authorization": {
"oauth_api": {
"credentials": {
"#data": {"oauth_token":"userToken","oauth_token_secret":"tokenSecret"},
"appKey": 1234,
"#appSecret": "asdf"
}
}
},
"parameters": {
"api": {
"authentication": {
"type": "oauth10"
}
}
}
}
Moved to new docs
Uses User functions to use tokens in headers or query. Instead of attr
or time
parameters, you should use authorization
to access the OAuth data. If the data is a raw token string, use authorization: data
to access it. If it's a JSON string, use authentication.format: json
and access its values isong the .
annotation, like in example below (authorization: data.access_token
).
The query and request information can also be used just like in the querry
authentication method.
- authentication.type:
oauth20
Example config for Bearer token use:
{
"authorization": {
"oauth_api": {
"credentials": {
"#data": {"status": "ok","access_token": "testToken"}
}
}
},
"parameters": {
"api": {
"authentication": {
"type": "oauth20",
"format": "json",
"headers": {
"Authorization": {
"function": "concat",
"args": [
"Bearer ",
{
"authorization": "data.access_token"
}
]
}
}
}
}
}
}
Example for MAC authentication:
- Assumes the user token is in the OAuth data JSON in
access_token
key, and MAC secret is in the same JSON inmac_secret
key.
{
"authorization": {
"oauth_api": {
"credentials": {
"#data": {"status": "ok","access_token": "testToken", "mac_secret": "iAreSoSecret123"},
"appKey": "clId",
"#appSecret": "clScrt"
}
}
},
"parameters": {
"api": {
"baseUrl": "http://private-834388-extractormock.apiary-mock.com",
"authentication": {
"type": "oauth20",
"format": "json",
"headers": {
"Authorization": {
"function": "concat",
"args": [
"MAC id=",
{
"authorization": "data.access_token"
},
", ts=",
{
"authorization": "timestamp"
},
", nonce=",
{
"authorization": "nonce"
},
", mac=",
{
"function": "md5",
"args": [
{
"function": "hash_hmac",
"args": [
"sha256",
{
"function": "implode",
"args": [
"\n",
[
{
"authorization": "timestamp"
},
{
"authorization": "nonce"
},
{
"request": "method"
},
{
"request": "resource"
},
{
"request": "hostname"
},
{
"request": "port"
},
"\n"
]
]
},
{
"authorization": "data.mac_secret"
}
]
}
]
}
]
}
}
}
}
}
}
Moved to new docs
Configured in api.pagination.method
Moved to new docs
Moved to new docs
-
pagination.method:
offset
-
pagination.limit: integer
- If a limit is set in configuration's params field, it will be overriden by its value
- If the API limits the results count to a lower value than this setting, the scrolling will stop after first page, as it stops once the results count is lower than configured count
-
pagination.limitParam(optional)
- sets which query parameter should contain the limit value (default to
limit
)
- sets which query parameter should contain the limit value (default to
-
pagination.offsetParam(optional)
-
sets which query parameter should contain the offset value (default to
offset
)"api": { "pagination": { "method": "offset", "limit": 1000, "limitParam": "limit", "offsetParam": "offset" } }
-
-
pagination.firstPageParams(optional)
- Whether or not include limit and offset params in the first request (default to
true
)
- Whether or not include limit and offset params in the first request (default to
-
pagination.offsetFromJob(optional)
- Use offset specified in job config for first request (false by default)
{ "api": { "pagination": { "method": "offset", "limit": 1000, "offsetFromJob": true } }, "config": { "jobs": [ { "endpoint": "resource", "params": { "offset": 100 } } ] } }
Moved to new docs
-
pagination.method:
response.param
-
pagination.responseParam:
- path within response that points to a value used for scrolling
- pagination ends if the value is empty
-
pagination.queryParam:
- request parameter to set to the value from response
-
pagination.includeParams:
false
- whether params from job configuration are used in next page request
-
pagination.scrollRequest:
- can be used to override settings (endpoint, method, ...) of the initial request
"api": { "pagination": { "method": "response.param", "responseParam": "_scroll_id", "queryParam": "scroll_id", "scrollRequest": { "endpoint": "_search/scroll", "method": "GET", "params": { "scroll": "1m" } } } }
Moved to new docs
-
pagination.method:
response.url
-
pagination.urlKey:
next_page
- path within response object that points to the URL
- if value of that key is empty, pagination ends
-
pagination.paramIsQuery:
false
- Enable if the response only contains a query string to use with the same endpoint
-
pagination.includeParams:
false
- whether or not to add "params" from the configuration to the URL's query from response
- if enabled and the next page URL has the same query parameters as the "params" field, values from the "params" are used
"api": { "pagination": { "method": "response.url", "urlKey": "nextPage", "includeParams": true } }
Moved to new docs
simple page number increasing 1 by 1
-
pagination.method:
pagenum
-
pagination.pageParam:(optional)
page
by default -
pagination.limit:(optional) integer
- define the page size
- if limit is omitted, the pagination will end once an empty page is received. Otherwise it stops once the reply contains less entries than the limit.
-
pagination.limitParam:(optional)
- query parameter name to use for limit
"api": { "pagination": { "method": "pagenum", "pageParam": "page", "limit": 500, "limitParam": "count" } }
-
pagination.firstPage: (optional)
1
by default. Set the first page number. -
pagination.firstPageParams(optional)
- Whether or not include limit and page params in the first request (default to
true
)
- Whether or not include limit and page params in the first request (default to
Moved to new docs
Looks within the response data for an ID which is then used as a parameter for scrolling.
The intention is to look for identifiers within data and in the next request, use a parameter asking for IDs higher than the highest found (or the opposite, lower than the lowest using the reverse
parameter)
-
pagination.method:
cursor
-
pagination.idKey: (required)
- Path within response data (ie the array which is parsed into CSV) containing an identifier of each object, which is then used in the next request's query
-
pagination.param: (required)
- Parameter name in which to pass the value in the next request
-
pagination.increment: (optional) integer
- A number by which to increment the highest(/lowest) found value.
- Can be a negative number, ie if the lowest ID in data is
10
, andincrement
is set to-1
, the next request parameter value will be9
-
pagination.reverse: (optional) bool,
false
by default- If set to true, the scroller will look for the lowest number instead of the highest(default)
-
Example:
"pagination": { "method": "cursor", "idKey": "id", "param": "max_id", "increment": -1, "reverse": true }
-
Data:
"results": [ {"id": 11}, {"id": 12} ]
-
Request:
http://api.example.com/resource?max_id=10
Moved to new docs
Allows setting scrollers per endpoint.
-
pagination.method:
multiple
-
pagination.default: (optional)
- Set a default scroller to use, if none is specified for the endpoint (if not set, no scrolling is used)
-
pagination.scrollers: (required)
- An object where each item represents one of the supported scrollers with their respective configuration
- The key of each item is then used as identifier for the scroller and must be used in the
scroller
parameter of a job
-
Example configuration:
"pagination": {
"method": "multiple",
"scrollers": {
"param_next_cursor": {
"method": "response.param"
},
"param_next_results": {
"method": "response.param"
},
"cursor_timeline": {
"method": "cursor",
"idKey": "id",
"param": "max_id",
"reverse": true,
"increment": -1
}
}
}
"jobs": [
{
"endpoint": "statuses/user_timeline",
"scroller": "cursor_timeline"
},
{
"endpoint": "search",
"scroller": "param_next_results",
"params": {
"q": "...(twitter search query)"
}
}
]
Moved to new docs
Looks within responses to find a boolean field determining whether to continue scrolling or not.
Usage:
"pagination": {
"nextPageFlag": {
"field": "hasMore",
"stopOn": false,
"ifNotSet": false
}
}
- The extractor loads start time of its previous execution into its metadata. This can then be used in user functions as
time: previousStart
. - Current execution start is also available at
time: currentStart
. - This can be used to create incremental exports with minimal overlap, using for example
[start_time: [time: previousStart], end_time: [time: currentStart]]
- It is advised to use both
previousStart
andcurrentStart
as since>until pair to ensure no gap and no overlap in data. - Both values are stored as Unix timestamp.
date
function can be used to reformat it.
Moved to new docs
Moved to new docs
Attributes must be configured accordingly to the api
configuration (eg auth, pagination, http.requiredHeaders). They are under the config
section of the configuration. (see example below)
-
outputBucket: Name of the bucket to store the output data
-
id: Optional, if outputBucket is set. Otherwise the id is used to generate the output bucket name
-
debug: If set to
true
, the extractor will output detailed information about it's run, including all API requests. Warning, this may reveal your tokens or other sensitive data in the events in your project! It is intended only to help solving issues with configuration. -
userData: A set of
key:value
pairs that will be added to theroot
of all endpoints' results- Example:
"config": { "userData": { "some": "tag", "another": "identifier" } }
-
incrementalOutput: (boolean) Whether or not to write the result incrementally
- Example:
"config": { "incrementalOutput": true }
Moved to new docs
- Columns:
-
endpoint (required): The API endpoint
-
params: Query/POST parameters of the api call, JSON encoded
- Each parameter in the JSON encoded object may either contain a string, eg:
{""start_date"": ""2014-12-26""}
- OR contain an user function as described below, for example to load value from parameters:
"start_date": { "function": "date", "args": [ "Y-m-d+H:i", { "function": "strtotime", "args": [ { "attr": "job.1.success" } ] } ] }
- Each parameter in the JSON encoded object may either contain a string, eg:
-
dataType: Type of data returned by the endpoint. It also describes a table name, where the results will be stored
-
dataField: Allows to override which field of the response will be exported.
- If there's multiple arrays in the response "root" the extractor may not know which array to export and fail
- If the response is an array, the whole response is used by default
- If there's no array within the root, the path to response data must be specified in dataField
- Can contain a path to nested value, dot separater (eg
result.results.products
) dataField
can also be an object containingpath
-
children: Array of child jobs that use the jobs' results to iterate
-
The endpoint must use a placeholder enclosed in
{}
-
The placeholder can be prefixed by a number, that refers to higher level of nesting. By default, data from direct parent are used. The direct parent can be referred as
{id}
or{1:id}
. A "grandparent" result would then be{2:id}
etc. -
Results in the child table will contain column(s) containing parent data used in the placeholder(s), prefixed by parent_. For example, if your placeholder is
{ticket_id}
, a column parent_ticket_id containing the value of current iteration will be appended to each row. -
placeholders array must define each placeholder. It must be a set of
key: value
pairs, where key is the placeholder (eg"1:id"
) and the value is a path within the response object - if nested, use.
as a separator.- Example:
"endpoint": "tickets.json", "children": [ { "endpoint": "tickets/{id}/comments.json", "placeholders": { "id": "id" }, "children": [ { "endpoint": "tickets/{2:ticket_id}/comments/{comment_id}/details.json", "placeholders": { "comment_id": "id", "2:ticket_id": "id" } } ] } ]
- You can also use an user function on the value from a parent using an object as the placeholder value
- That object MUST contain a 'path' key that would be the value of the placeholer, and a function. To access the value in the function arguments, use
{"placeholder": "value"}
- Example:
"placeholders": { "1:id": { "path": "id", "function": "urlencode", "args": [ { "placeholder": "value" } ] } }
-
recursionFilter:
- Can contain a value consisting of a name of a field from the parent's response, logical operator and a value to compare against. Supported operators are "==", "<", ">", "<=", ">=", "!="
- Example:
type!=employee
orproduct.value>150
- The filter is whitespace sensitive, therefore
value == 100
will look intovalue␣
for a␣100
value, instead ofvalue
and100
as likely desired. - Further documentation can be found at https://github.com/keboola/php-filter
-
-
method: GET (default), POST or FORM
-
responseFilter: Allows filtering data from API response to leave them from being parsed.
- Filtered data will be imported as a JSON encoded string.
- Value of this parameter can be either a string containing path to data to be filtered within response data, or an array of such values.
- Example:
"results": [ { "id": 1, "data": "scalar" }, { "id": 2, "data": {"object": "can't really parse this!"} } ]
- To be able to work with such response, set
"responseFilter": "data"
- it should be a path within each object of the response array, not including the key of the response array - To filter values within nested arrays, use
"responseFilter": "data.array[].key"
- Example:
"results": [ { "id": 1, "data": { "array": [ { "key": "value" }, { "key": {"another": "value"} } ] } } ]
- This would be another unparseable object, so the filter above would just convert the
{ 'another': 'value' }
object to a string - To filter an entire array, use
array
as the value for responseFilter. To filter each array item individually, usearray[]
.
-
responseFilterDelimiter: Allows changing delimiter if you need nesting in responseFilter, for instance if your data contains keys containing
.
, which is the default delimiter.- Example:
"results": [ { "data.stuff": { "something": [1,2,3] } } ]
- Use
'responseFilter': 'data.stuff/something'
together with'responseFilterDelimiter': '/'
to filter the array insomething
-
Noved to new docs
mappings
attribute can be used to force the extractor to map the response into columns in a CSV file as described in the JSON to CSV Mapper documentation.
Each property in the mappings
object must follow the mapper settings, where the key is the dataType
of a job
. Note that if a dataType
is not set, it is generated from the endpoint and might be confusing if ommited.
If there's no mapping for a dataType
, the standard JSON parser processes the result.
In a recursive job, the placeholer prepended by parent_
is available as type: user
to link the child to a parent. See example below:
Jobs:
"jobs": [
{
"endpoint": "orgs/keboola/repos",
"dataType": "repos",
"children": [
{
"endpoint": "repos/keboola/{1:name}/issues",
"placeholders": {
"1:name": "name"
},
"dataType": "issues"
}
]
}
]
Mappings (of the child):
"mappings": {
"issues": {
"parent_name": {
"type": "user",
"mapping": {
"destination": "repo_name"
}
},
"title": {
"mapping": {
"destination": "title"
}
},
"id": {
"mapping": {
"destination": "id",
"primaryKey": true,
"propertyOrder": 1
}
}
}
}
The parent_name
is the parent_
prefix together with the value of placeholder 1:name
.
"mappings": {
"get": {
"id": {
"mapping": {
"destination": "id",
"primaryKey": true
}
},
"status": {
"mapping": {
"destination": "st"
}
}
}
},
"jobs": [
{
"endpoint": "basic",
"dataType": "get"
}
]
The configuration can be run multiple times with some (or all) values in config
section being overwritten. For example, you can run the same configuration for multiple accounts, overriding values of the authentication settings.
Warning:
- If you use
userData
in iterations, make sure they all contain the same set of keys! - Overriding
incrementalOutput
will only use the setting from the last iteration that writes to eachoutputBucket
This way you can download the same data from two different accounts into a single output table, adding the owner
column to help you recognize which iteration of the config brought in each row in the result.
{
"api": {
"baseUrl": "http://example.com/api",
"authentication": {
"type": "basic"
}
},
"config": {
"outputBucket": "bunchOfResults",
"jobs": [
{
"endpoint": "data"
}
]
},
"iterations": [
{
"username": "chose",
"password": "potato",
"userData": {
"owner": "Chose's results"
}
},
{
"username": "joann",
"password": "beer",
"userData": {
"owner": "Joann's results"
}
}
]
}
Can currently be used in query type authentication or endpoint parameters
Moved to new docs
Moved to new docs
md5
: Generate a md5 key from its argument valuesha1
: Generate a sha1 key from its argument valuetime
: Return time from the beginning of the unix epoch in seconds (1.1.1970)date
: Return date in a specified formatstrtotime
: Convert a date string to number of seconds from the beginning of the unix epochbase64_encode
hash_hmac
: See PHP documentationsprintf
: See PHP documentationconcat
: Concatenate its arguments into a single stringimplode
: Concatenate an array from the second argument, using glue string from the first argifempty
: Return first argument if is not empty, otherwise return second argument
Moved to new docs
The function must be specified in JSON format, which may contain one of the following 4 objects:
-
String:
"something"
-
Function: One of the allowed functions above
-
Example (this will return current date in this format:
2014-12-08+09:38
:"function": "date", "args":[ "Y-m-d+H:i" ]
-
Example with a nested function (will return a date in the same format from 3 days ago):
"function": "date", "args":[ "Y-m-d+H:i", { "function": "strtotime", "args": [ "3 days ago" ] } ]
-
-
Config Attribute:
"attr": "attributeName"
or"attr": "nested.attribute.name"
-
Metadata:
time: previousStart
ortime: currentStart
- only useable in job params. -
Query parameter: TODO
Moved to new docs
{
"parameters": {
"api": {
"baseUrl": {
"function": "concat",
"args": [
"https://",
{
"attr": "domain"
},
".zendesk.com/api/v2/"
]
},
"authentication": {
"type": "basic"
},
"pagination": {
"method": "response.url"
},
"name": "zendesk"
},
"config": {
"id": "test_docker",
"domain": "yours",
"username": "you@wish.com/token",
"password": "ohIdkSrsly",
"jobs": [
{
"endpoint": "exports/tickets.json",
"params": {
"start_time": {
"time": "previousStart"
},
"end_time": {
"function": "strtotime",
"args": [
"2015-07-20 00:00"
]
}
},
"dataType": "tickets_export",
"dataField": "",
"children": [
{
"endpoint": "tickets/{id}/comments.json",
"recursionFilter": "status!=Deleted",
"dataType": "comments",
"placeholders": {
"id": "id"
}
}
]
},
{
"endpoint": "users.json",
"params": {},
"dataType": "users",
"dataField": ""
},
{
"endpoint": "tickets.json",
"params": {},
"dataType": "tickets",
"dataField": ""
}
]
}
}
}
Use private proxy cache for HTTP responses. This is useful for local jobs configuration development.
"parameters": {
"api": "...",
"config": "...",
"cache": true
}
- Caches only responses with one of [
200
,203
,300
,301
,410
] HTTP Status codes - Cache TTL
- Count time from
Cache-Control
andExpires
reponse headers. - If counted value is
null
, extractor will use own default value (30 days) - Default
ttl
value can be overridden by custom config value (time in seconds)
- Count time from
"parameters": {
"api": "...",
"config": "...",
"cache": {
"ttl": 3600
}
}
Moved to new docs
Best way to create and test new configurations is run extractor in docker container.
- Clone this repository
git clone https://github.com/keboola/generic-extractor.git
- Switch to extractor directory
cd generic-extractor
- Build container
docker compose build
- Install dependencies locally
docker compose run --rm dev composer install
- Create data folder for configuration
mkdir data
-
Create
config.json
indata
folderSample configuration which downloads list of Keboola Developers from githhub
data/config.json
:{ "parameters": { "api": { "baseUrl": "https://api.github.com", "http": { "Accept": "application/json", "Content-Type": "application/json;charset=UTF-8" } }, "config": { "debug": true, "jobs": [ { "endpoint": "/orgs/keboola/members", "dataType": "members" } ] } } }
-
Run extraction
docker compose run --rm dev
-
You will find extracted data in folder
data/out
-
Clear
data/out
by runningdocker compose run --rm dev rm -rf data/out
-
Repeat :)
docker compose run --rm tests
or (with local source code and vendor copy)
docker compose run --rm tests-local
MIT licensed, see LICENSE file.