Configuration Apache Airflow for DLME
- csv
- iiif_json
- oai_xml
- xml*
Each new data provider must be added in the catalog.yaml file under sources like so:
aub:
  args:
    path: /opt/airflow/catalogs/aub.yaml
  description: "American University of Beirut"
  driver: intake.catalog.local.YAMLFileCatalog
  metadata: {}
catalog.yaml is read to find out where to fetch the configuration catalog for each provider. args.path and description differ for each provider and must be specified by the user. driver and metadata can be copied as above.
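As a rough illustration of which parts of an entry change per provider, the sketch below builds the entry as a plain Python dict. The helper name `provider_entry` is hypothetical, and the default directory is an assumption taken from the single aub example above; other providers may keep their catalogs elsewhere.

```python
def provider_entry(key, description, catalog_dir="/opt/airflow/catalogs"):
    """Build the mapping for one provider under `sources` in catalog.yaml.

    Hypothetical helper: only args.path and description vary per provider;
    driver and metadata are copied from the aub example.
    """
    return {
        key: {
            # Assumed layout: one <key>.yaml file per provider in catalog_dir.
            "args": {"path": f"{catalog_dir}/{key}.yaml"},
            "description": description,
            "driver": "intake.catalog.local.YAMLFileCatalog",
            "metadata": {},
        }
    }


entry = provider_entry("aub", "American University of Beirut")
```

Dumping `entry` with a YAML library should give a block equivalent to the aub example.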
For each provider, create a separate configuration file and update args.path with its location. Here are the contents of an example configuration file:
metadata:
  version: 1
sources:
  aco:
    driver: oai_xml
    args:
      collection_url: https://libraries.aub.edu.lb/xtf/oai
      metadata_prefix: oai_dc
      set: "aco"
    metadata:
      fields:
        id:
          path: "//header:identifier"
          namespace:
            header: "http://www.openarchives.org/OAI/2.0/"
          optional: True
Each collection is nested under sources, and the specific configurations for that collection are nested within it. driver specifies the intake driver you wish to use to map the source data to a pandas dataframe. The args vary slightly across drivers; these variations are listed below, under each driver heading. For all driver types, metadata.fields.id must be filled out with the path to the record identifier.
To do: What does optional mean? Is it always set to optional?
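To make the role of path and namespace in metadata.fields.id more concrete, here is a minimal sketch that resolves the id of one OAI record with Python's standard-library ElementTree. It is not the driver's own code: the sample record and the helper `extract_id` are invented for illustration, and ElementTree is used only because it ships with Python. The sketch does not try to interpret optional.

```python
import xml.etree.ElementTree as ET

# A trimmed OAI-PMH record, invented for illustration.
RECORD = """
<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:example.org:aco-0001</identifier>
  </header>
</record>
""".strip()

# Mirrors metadata.fields.id from the example configuration file.
ID_FIELD = {
    "path": "//header:identifier",
    "namespace": {"header": "http://www.openarchives.org/OAI/2.0/"},
}


def extract_id(record_xml, field):
    """Return the text of the element matched by the field's path, or None."""
    root = ET.fromstring(record_xml)
    path = field["path"]
    # ElementTree rejects absolute paths on an element, so anchor it with ".".
    if path.startswith("/"):
        path = "." + path
    # The namespace mapping ties the "header" prefix to the OAI namespace URI.
    node = root.find(path, field["namespace"])
    return None if node is None else node.text


record_id = extract_id(RECORD, ID_FIELD)
```

Without the namespace mapping the prefixed path cannot be resolved, which is why the configuration carries both keys together.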
For csv files, the pandas dtype must be specified. When it is not, intake will attempt to guess the dtype and may guess incorrectly.
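The following sketch shows the kind of wrong guess this refers to, using pandas directly rather than intake. The CSV content is made up, and how the dtype is written in a provider's configuration file is not shown here.

```python
import io

import pandas as pd

# Invented sample data: identifiers with leading zeros.
CSV_TEXT = "id,title\n00123,First item\n00456,Second item\n"

# Without a dtype, pandas infers an integer column and drops the zeros.
guessed = pd.read_csv(io.StringIO(CSV_TEXT))

# With an explicit dtype, the identifiers are kept as written.
typed = pd.read_csv(io.StringIO(CSV_TEXT), dtype={"id": str, "title": str})

print(guessed["id"].tolist())
print(typed["id"].tolist())
```

Identifiers are a common casualty because they often look numeric, so stating the dtype for the id column is the safest choice.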
The iiif_json driver will fetch all contents nested under the metadata field of a IIIF manifest. Objects that are not nested under metadata must be explicitly listed in the configuration file. Here is an example:
alexandria_bombardment:
  description: "Alexandria Bombardment of 1882 Photograph Album"
  driver: iiif_json
  metadata:
    data_path: auc/iiif/alexandria_bombardment
    config: auc_iiif_config_csv
    fields:
      context:
        path: '@context'
        optional: true
      description_top:
        path: 'description'
        optional: true
      id:
        path: '@id'
        optional: true
      iiif_format:
        path: 'sequences..format'
      profile:
        path: 'sequences..profile'
      resource:
        path: 'sequences..resource.@id'
      thumbnail:
        path: 'thumbnail..@id'
        optional: true
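The paths in this example can be read as follows, assuming that a single dot selects a child key and a double dot searches at any depth below the current node, as in JSONPath. That reading is an assumption; the resolver below is a hand-written stand-in, not the driver's implementation, and the manifest is invented.

```python
import re


def child(node, key):
    """Yield node[key] when node is a dict holding that key."""
    if isinstance(node, dict) and key in node:
        yield node[key]


def descend(node, key):
    """Yield every value stored under key anywhere below node."""
    if isinstance(node, dict):
        for k, value in node.items():
            if k == key:
                yield value
            yield from descend(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from descend(item, key)


def resolve(manifest, path):
    """Return all values matched by a path such as 'sequences..format'."""
    nodes, step = [manifest], child
    # Split into keys and separators, e.g. ['sequences', '..', 'format'].
    for token in re.split(r"(\.\.|\.)", path):
        if token == "..":
            step = descend
        elif token == ".":
            step = child
        else:
            nodes = [match for n in nodes for match in step(n, token)]
    return nodes


# Invented, heavily trimmed IIIF Presentation 2 style manifest.
MANIFEST = {
    "@id": "https://example.org/iiif/album/manifest",
    "description": "A photograph album",
    "thumbnail": {"@id": "https://example.org/thumb.jpg"},
    "sequences": [{"canvases": [{"images": [{"resource": {
        "@id": "https://example.org/img1.jpg",
        "format": "image/jpeg",
    }}]}]}],
}

FIELDS = {
    "id": "@id",
    "iiif_format": "sequences..format",
    "resource": "sequences..resource.@id",
    "thumbnail": "thumbnail..@id",
}
row = {name: resolve(MANIFEST, path) for name, path in FIELDS.items()}
```

Each field resolves to a list because a deep search can match more than once, for example one resource per canvas.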
The xml driver is intended to be a generic driver for parsing any xml file. As such, it cannot safely make assumptions about the data, such as shape or naming conventions. These must be specified in the configuration file. The record_selector must be specified in order to identify the xml element denoting a new record. All fields must be listed in the configuration file (with the correct path and namespace) in order to be mapped. Here is an example:
metadata:
  version: 1
sources:
  aims:
    description: "American Institute for Maghrib Studies"
    driver: xml
    metadata:
      data_path: aims
      config: aims_config
      record_selector:
        path: "//item"
        namespace:
      fields:
        id:
          path: ".//guid"
          namespace:
          optional: false
        title:
          path: ".//title"
          namespace:
          optional: true
    args:
      collection_url: https://feed.podbean.com/themaghribpodcast/feed.xml
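How record_selector and fields work together can be sketched with the standard-library ElementTree: the selector splits the document into records, and each field path is then evaluated relative to one record. The feed content and the helper names are invented, and the real driver may use a different XML library. The sketch stores None for a missing element, which is its own simplification and says nothing about how the driver treats optional.

```python
import xml.etree.ElementTree as ET

# Invented two-item feed; the second item has no title.
FEED = """<rss><channel>
  <item><guid>ep-1</guid><title>First episode</title></item>
  <item><guid>ep-2</guid></item>
</channel></rss>"""

# Mirrors record_selector and fields from the aims example.
CONFIG = {
    "record_selector": {"path": "//item", "namespace": None},
    "fields": {
        "id": {"path": ".//guid", "namespace": None},
        "title": {"path": ".//title", "namespace": None},
    },
}


def relative(path):
    """ElementTree rejects absolute paths on an element, so prefix '.'."""
    return "." + path if path.startswith("/") else path


def parse_records(xml_text, config):
    """Return one dict per element matched by record_selector."""
    root = ET.fromstring(xml_text)
    selector = config["record_selector"]
    rows = []
    for record in root.findall(relative(selector["path"]), selector["namespace"]):
        row = {}
        for name, field in config["fields"].items():
            node = record.find(relative(field["path"]), field["namespace"])
            row[name] = None if node is None else node.text
        rows.append(row)
    return rows


rows = parse_records(FEED, CONFIG)
```

A list of dicts like `rows` can be handed to `pandas.DataFrame` to get one row per record and one column per configured field.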