- ingestMethod: "WEB"
- ingestURL - [String] The URL to retrieve data from. Acts as the argument to the
preprocessingScript
field, if it's specified. - parserType - [String] ['xml' or 'json']
- documentElement - [String] the name of the field/element in the raw data under which a data record is contained. For XML type raw data,
documentElement
must be direct descendant of thetopElement
. For JSON type raw data,documentElement
field can match anywhere the raw data JSON hierarchy. - topElement - [String] The top level element in a raw XML document under which all the document (data record) level elements can be found. Only used for XML type remote content.
- filterJsonPath - [String] JSON Path to a field in a raw JSON data record. The value of matching field will be used to select data records for ingestion based on the
filterValue
configured. (Only applicable forparserType: "json"
) - filterXMLPath - [String] A valid XPath expression to a tag text or attribute in a raw XML data record. The value of matching field will be used to select data records for ingestion based on the
filterValue
configured. (Only applicable forparserType: "xml"
). ThefilterValue
can be a Java regular expression also (otherwise exact match) identified by are::
prefix (e.g.re::^Build
matches any value starting with 'Build').
An example source descriptor setup (only important parameters are shown) to filter XML records is shown below;
ingestMethod: "WEB"
ingestURL : "file:///var/data/data-cache/SCR_006549-Flybase/files/"
parserType: "xml"
documentElement: "stock"
topElement: "chado"
filenamePattern: ".+\\.xml$"
filterXMLPath: "//contact/name/text()"
filterValue: "Kyoto"
- filterValue - [String] The exact value or a regular expression for the raw data record field identified by the
filterJsonPath
orfilterXMLPath
configuration. A regular expression string needs to be prefixed byre::
such asfilterValue: "re::^10\\.1"
. - useCache - ["false", "true"] If true uses previously retrieved data for ingestion.
- cacheFilename - Only used if
useCache: "true"
. If specified, the filename/prefix to which theingestURL
contents are cached. If not specified anduseCache
is set, Foundry creates a cache file name from theingestURL
. - filenamePattern - [String] A java regular expression to filter file names extracted from a zip file or tar ball retrieved from
ingestURL
. - offsetParam - [String] the name of the offset query parameter for the GET request specified by
ingestURL
for pagination. - limitParam - [String] the name of the limit query parameter for the GET request specified by
ingestURL
for pagination. - limitValue - [int] the value of the the limit query parameter for the GET request specified by
ingestURL
for pagination. This specifies how many records to retrieve per page. - mergeIngestURLTemplate - For list-detail type of web APIs, this field specifies the template URL with a single placeholder to retrieve the content of each data record for the list of data record ids retrieved from the
ingestURL
. The record ids are detected using the specifiedidJsonPath
and used to replace the placeholder value in themergeIngestURLTemplate
to create the URL to get the individual record. For example:mergeIngestURLTemplate: "http://neurovault.org/api/collections/${id}/images/"
- mergeFieldName - The JSON field for the part of the detail document record subtree that will be merged with the summary record.
- mergeDocElName - The JSON field in the summary data record under which the subtree identified by the
mergeFieldName
in the detail record will be merged - idJsonPath - [String] JSON Path to the id of the summary records retrieved by
ingestURL
for themergeIngestURLTemplate
placeholder. - normalize - ["false", "true"] Default is "false". If set normalizes the generated JSON data record to be ingested. The normalization involves correcting array inside a single element kind of JSON generation errors.
- preprocessScript - The full path to a Bash wrapper script to execute web-scrapping scripts. The wrapper script takes a single argument for the data files to be saved after the retrieval/scraping etc is finished. The argument is defined via the
ingestURL
field. - waitTime - [long] wait time in milliseconds (default is no wait between mergeIngestURL calls). (optional).
- sampleMode - ["false", "true"] Default is "false". If set to true only up to
sampleSize
data records will be ingested. - sampleSize - [int] The number of data records that will be ingested if
sampleMode: "true"
Source Descriptor:
ingestURL: "file:///var/data/data-cache/SCR_010490-Protocols.io-Protocols"
preprocessScript: "/home/ubuntu/dev/java/Foundry-Data/OriginalData/SCR_010490/wrapper.sh"
wrapper.sh:
#! /bin/bash
python3 SCR_010490-ProtocolsIO-Ingest.py 'latest' $1