DIT-ES ResourceSync protocol

General description

The DIT (the harvesting component of the Data Interoperability Toolkit - https://github.com/openminted/omtd-publisher-connector-harvester) will continuously update an ‘omtd-resourcesync’ index in Elasticsearch (ES). This index will contain two types:

  • resource: keeps track of the current state of the resources to be synchronized
  • change: logs the changes to the resources to be synchronized

Every time the DIT updates one of its resources, say a file F (e.g. it downloads a new file, so change = ‘created’), it will perform two writes in ES:

  • resource document:
    • If change == ‘created’: create new document with F’s metadata
    • If change == ‘updated’: update F’s metadata
    • If change == ‘deleted’: delete F’s document in ES
  • change document: create a new document logging the change that occurred
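The two writes described above can be sketched as a pure function that plans the operations without talking to Elasticsearch. This is only an illustration: the function name, the `doc_id` argument and the `(operation, type, id, body)` tuple layout are hypothetical and are not part of the DIT.

```python
def plan_writes(doc_id, change, metadata):
    """Return the two ES operations implied by one resource change.

    doc_id and the (operation, type, id, body) tuple layout are
    hypothetical; they only illustrate the kind and order of writes.
    """
    if change not in ("created", "updated", "deleted"):
        raise ValueError("unknown change: %r" % (change,))

    # 1) resource document: mirror the current state of the resource.
    if change == "created":
        resource_op = ("index", "resource", doc_id, dict(metadata))
    elif change == "updated":
        resource_op = ("update", "resource", doc_id, {"doc": dict(metadata)})
    else:
        resource_op = ("delete", "resource", doc_id, None)

    # 2) change document: always a new document, so the log only grows.
    change_body = {
        "resource_set": metadata["resource_set"],
        "location": metadata["location"],
        "change": change,
        "lastmod": metadata["lastmod"],
    }
    change_op = ("index", "change", None, change_body)
    return [resource_op, change_op]
```

Keeping the planning step separate from the client calls makes it easy to check that a deletion removes the resource document while still leaving a trace in the change log.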

In this way, the resource type will always hold a snapshot of the current state of the resources, from which a resource list can easily be generated. Likewise, change lists can be created and updated by querying Elasticsearch for the changes within a given time interval. The ResourceSync source will simply treat the Elasticsearch index as the reference for the resources' state.
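A change list for a time interval could be retrieved with a query along the following lines. This is a minimal sketch under two assumptions: that the interval is matched against the `timestamp` field of the change type (the mapping is given later in this document), and that the Elasticsearch version in use supports the `bool` query with a `filter` clause (2.0 or later). The function name is hypothetical.

```python
def change_list_query(resource_set, since, until):
    """Build a query body for the changes of one resource set
    whose timestamp falls in the half-open interval [since, until)."""
    return {
        "query": {
            "bool": {
                "filter": [
                    # exact match: resource_set is a not_analyzed field
                    {"term": {"resource_set": resource_set}},
                    {"range": {"timestamp": {"gte": since, "lt": until}}},
                ]
            }
        },
        # oldest change first, as expected in a change list
        "sort": [{"timestamp": {"order": "asc"}}],
    }
```

The returned dictionary would be sent as the body of a search request restricted to the change type of the ‘omtd-resourcesync’ index. A half-open interval lets consecutive change lists be chained without reporting the same change twice.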

Why is this useful?

ResourceSync is a flexible and powerful tool for synchronizing very large sets of resources, which may or may not be physical files. Elasticsearch or another data storage system, assisted by an update layer on top of it, makes it possible to keep track of the state of the resources without time-consuming processes (e.g. checking for changes across several million files). Moreover, this enables more sophisticated pagination techniques, which avoid regenerating a whole resource list when only a few changes have occurred (e.g. it may be sufficient to regenerate a single sitemap instead of 50k).

Elasticsearch mappings

resource type

Here's the mapping for the resource type of the ‘omtd-resourcesync’ index:

{
  "resource": {
    "properties": {
      "resource_set": {
        "type": "string",
        "index": "not_analyzed"
      },
      "location": {
        "type": "nested",
        "properties": {
          "value": {
            "type": "string",
            "index": "not_analyzed"
          },
          "type": {
            "type": "string",
            "index": "not_analyzed"
          }
        }
      },
      "length": {
        "type": "integer",
        "index": "not_analyzed"
      },
      "md5": {
        "type": "string",
        "index": "not_analyzed"
      },
      "mime": {
        "type": "string",
        "index": "not_analyzed"
      },
      "lastmod": {
        "type": "date",
        "format": "yyyy-MM-dd'T'HH:mm:ssZ"
      },
      "ln": {
        "type": "nested",
        "index_name": "link",
        "properties": {
          "href": {
            "type": "nested",
            "properties": {
              "value": {
                "type": "string",
                "index": "not_analyzed"
              },
              "type": {
                "type": "string",
                "index": "not_analyzed"
              }
            }
          },
          "rel": {
            "type": "string",
            "index": "not_analyzed"
          },
          "mime": {
            "type": "string",
            "index": "not_analyzed"
          }
        }
      },
      "timestamp": {
        "type": "date",
        "format": "yyyy-MM-dd'T'HH:mm:ssZ"
      }
    }
  }
}

For each resource, the following fields will be filled out:

  • resource_set: the name of the resource set the resource will belong to
  • timestamp: timestamp automatically generated by Elasticsearch when the document is created/updated
  • location: where the resource can be found; its type is one of
    • url: complete resource address; the url_prefix parameter won't be used
    • abs_path: absolute path, which will be resolved with respect to the resource_root_dir parameter and then attached to the url_prefix
    • rel_path: relative path, which will be attached to the url_prefix
  • length: length of the resource
  • md5: md5 hash of the resource (may become an array of hashes to support different hashing techniques)
  • mime: mime type of the resource
  • lastmod: last modification time of the resource
  • ln: links to other resources, each of which can have three fields
    • rel: description of the relationship (e.g. describes, described by)
    • href: link to the resource, similar to location (url/abs_path/rel_path)
    • mime: mime type of the linked resource
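The three location types can be turned into a public URL roughly as follows. This is a sketch, not the DIT's actual code: the function name is hypothetical, and it assumes that "resolved with respect to resource_root_dir" means making the absolute path relative to that directory before attaching it to url_prefix.

```python
import posixpath


def resolve_location(location, url_prefix, resource_root_dir):
    """Turn a location object ({"type": ..., "value": ...}) into a URL."""
    kind, value = location["type"], location["value"]
    if kind == "url":
        return value  # already complete, url_prefix is ignored
    if kind == "abs_path":
        # Assumption: strip resource_root_dir from the absolute path.
        rel = posixpath.relpath(value, resource_root_dir)
        if rel == ".." or rel.startswith("../"):
            raise ValueError("%s is outside %s" % (value, resource_root_dir))
    elif kind == "rel_path":
        rel = value
    else:
        raise ValueError("unknown location type: %r" % (kind,))
    # Join with exactly one slash between prefix and path.
    return url_prefix.rstrip("/") + "/" + rel.lstrip("/")
```

The same function would apply to the href of each ln entry, since it uses the same url/abs_path/rel_path convention.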

change type

Here's the mapping for the change type:

{
  "change": {
    "properties": {
      "resource_set": {
        "type": "string",
        "index": "not_analyzed"
      },
      "location": {
        "type": "nested",
        "properties": {
          "value": {
            "type": "string",
            "index": "not_analyzed"
          },
          "type": {
            "type": "string",
            "index": "not_analyzed"
          }
        }
      },
      "change": {
        "type": "string",
        "index": "not_analyzed"
      },
      "lastmod": {
        "type": "date",
        "format": "yyyy-MM-dd'T'HH:mm:ssZ"
      },
      "datetime": {
        "type": "date",
        "format": "yyyy-MM-dd'T'HH:mm:ssZ"
      },
      "timestamp": {
        "type": "date",
        "format": "yyyy-MM-dd'T'HH:mm:ssZ"
      }
    }
  }
}

For each change, the following fields will be filled out:

  • resource_set: the name of the resource set the resource will belong to
  • timestamp: timestamp automatically generated by Elasticsearch when the document is created/updated
  • location: where the resource can be found; its type is one of
    • url: complete resource address; the url_prefix parameter won't be used
    • abs_path: absolute path, which will be resolved with respect to the resource_root_dir parameter and then attached to the url_prefix
    • rel_path: relative path, which will be attached to the url_prefix
  • change: type of change that occurred; one of created, updated or deleted
  • lastmod: last modification time of the resource
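As an illustration of the shape of a change document, the sketch below builds one and renders dates so that they match the yyyy-MM-dd'T'HH:mm:ssZ format of the mapping (a numeric UTC offset without a colon). Both function names are hypothetical; the timestamp field is left out on the assumption that it is filled in on the Elasticsearch side, as stated above.

```python
from datetime import datetime, timezone

# Python equivalent of yyyy-MM-dd'T'HH:mm:ssZ, e.g. 2017-01-31T12:00:00+0000
ES_DATE_FORMAT = "%Y-%m-%dT%H:%M:%S%z"


def to_es_date(moment):
    """Format a timezone-aware datetime for the date fields of the mapping."""
    if moment.tzinfo is None:
        raise ValueError("naive datetime has no UTC offset")
    return moment.strftime(ES_DATE_FORMAT)


def make_change_doc(resource_set, location_type, location_value, change, lastmod):
    """Build the body of a change document (hypothetical helper)."""
    if change not in ("created", "updated", "deleted"):
        raise ValueError("unknown change: %r" % (change,))
    return {
        "resource_set": resource_set,
        "location": {"type": location_type, "value": location_value},
        "change": change,
        "lastmod": to_es_date(lastmod),
    }


example = make_change_doc(
    "example-set",            # made-up resource set name
    "rel_path",
    "files/a.pdf",            # made-up path
    "updated",
    datetime(2017, 1, 31, 12, 0, 0, tzinfo=timezone.utc),
)
```

Rejecting naive datetimes up front avoids documents whose dates lack the offset that the mapping's format requires.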

Note: the current mapping will be extended with further metadata and updated to follow new versions of the ResourceSync specification.