META-SHARE Harvesting Protocol v1.0

cspurk edited this page Oct 17, 2012 · 1 revision

Introduction

This document describes the protocol spoken between a META-SHARE Managing Node (henceforth the “client”) and a normal META-SHARE Node (henceforth the “server”) during the synchronization of META-SHARE language resource descriptions that conform to version 3.0 of the META-SHARE metadata schema (henceforth “LR”/“LRs”). The synchronization is in fact a harvesting process: the client only pulls LRs from the server, and no LRs are ever sent from the client to the server.

Overview

The harvesting process works as follows:

  1. The client authenticates on the server.
  2. The client requests an LR inventory from the server.
  3. The server responds with its LR inventory.
  4. For every server LR which is different or missing in the LR storage of the client:
    1. the client requests the LR from the server.
    2. the server responds with the requested LR.
    3. the client updates its LR storage with the server LR and takes note of where the LR came from.
  5. For every LR in the LR storage of the client which originally came from the server but which is no longer in the LR inventory of the server, the client deletes the LR from its LR storage.
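
The comparison in steps 4 and 5 can be sketched as a pure function. This is only an illustration: the function name plan_sync and the shapes of the two client-side dictionaries are assumptions, since this specification does not prescribe how a client stores checksums or LR origins.

```python
def plan_sync(server_inventory, local_checksums, local_origins, server_url):
    """Return (ids_to_fetch, ids_to_delete) for one harvesting run.

    server_inventory: LR storage ID -> LR checksum, as sent by the server.
    local_checksums:  LR storage ID -> LR checksum in the client storage (assumed shape).
    local_origins:    LR storage ID -> URL the LR was harvested from (assumed shape).
    """
    # Step 4: LRs that are missing locally or whose checksum differs.
    to_fetch = sorted(
        lr_id
        for lr_id, checksum in server_inventory.items()
        if local_checksums.get(lr_id) != checksum
    )
    # Step 5: LRs that came from this server but are no longer in its inventory.
    to_delete = sorted(
        lr_id
        for lr_id, origin in local_origins.items()
        if origin == server_url and lr_id not in server_inventory
    )
    return to_fetch, to_delete
```

LRs harvested from other servers are left untouched because only entries whose recorded origin equals the given server URL are considered for deletion.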

This document only specifies the request/response protocol which is spoken between the client and the server. Of the above steps, it therefore covers only steps 1 (see Section “Client Authentication”), 2/3 (see Section “LR Inventory Transfer”) and 4.1/4.2 (see Section “LR Transfer”).

In all cases where this specification talks about HTTP URLs, a compliant client/server must also support HTTPS URLs.

In the following, the server is assumed to be available via HTTP under the base URL “%SERVER_URL%” (e.g., http://metashare.dfki.de).

Client Authentication

For the client to authenticate with the server, the client has to send an HTTP GET request to “%SERVER_URL%/login/”. The server has to answer this request with an HTTP response with status 200 containing an HTTP Set-Cookie header field according to RFC 2965 with a value for the name csrftoken (henceforth “%CSRF_TOKEN_VALUE%”). This value should be chosen by the server in a way which helps to prevent CSRF attacks.

Next, the client has to send an HTTP POST request to “%SERVER_URL%/login/” with the following parameters:

  • username: %SERVER_USER%
  • password: %SERVER_PASSWORD%
  • this_is_the_login_form: 1
  • csrfmiddlewaretoken: %CSRF_TOKEN_VALUE%

“%SERVER_USER%” and “%SERVER_PASSWORD%” are authentication credentials. The HTTP POST request must additionally have an HTTP Cookie header field according to RFC 2965 with the “%CSRF_TOKEN_VALUE%” value for the name csrftoken.

If the authentication request from the client was valid, then the server must answer with an HTTP response with status 200 containing an HTTP Set-Cookie header field according to RFC 2965 with a value for the name sessionid (henceforth “%SESSION_ID%”); in addition, the content of the HTTP response must contain the string Logout. Otherwise the server should return an HTTP response with status 403.
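
As a rough sketch of the second step, the login POST request could be assembled with the Python standard library as shown here. The helper names build_login_request and login_succeeded are hypothetical, and the sketch assumes that the parameters are sent as an application/x-www-form-urlencoded body, which this specification does not state explicitly.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_login_request(server_url, user, password, csrf_token):
    """Build (but do not send) the login POST request."""
    # Assumption: the four parameters are form-encoded in the request body.
    body = urlencode({
        "username": user,
        "password": password,
        "this_is_the_login_form": "1",
        "csrfmiddlewaretoken": csrf_token,
    }).encode("ascii")
    request = Request(server_url.rstrip("/") + "/login/", data=body, method="POST")
    # The CSRF token must also be sent back as a cookie named csrftoken.
    request.add_header("Cookie", "csrftoken=" + csrf_token)
    return request


def login_succeeded(status, body_text, session_id):
    """Check the three conditions for a valid login response."""
    return status == 200 and session_id is not None and "Logout" in body_text
```

The request object can then be passed to urllib.request.urlopen; extracting the sessionid value from the Set-Cookie header of the response is left out of this sketch.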

LR Inventory Transfer

For the client to obtain the LR inventory from the server, the client has to send an HTTP GET request to “%SERVER_URL%/sync/?sync_protocol=1.0”. The HTTP GET request must additionally have an HTTP Cookie header field according to RFC 2965 with the “%SESSION_ID%” value (from the authentication procedure) for the name sessionid.

The server should respond with an HTTP response with status 403 in case of authentication problems. If the sync_protocol parameter is missing or if its version is not 1.0, then the server should respond with an HTTP response with status 501.

If there are no problems with the request from the client, then the server must answer with an HTTP response with status 200 containing an HTTP Sync-Protocol header field with the value 1.0; in addition, the content of the response must be an inventory ZIP file attachment as defined below.
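
The inventory request and the three response cases described in this section can be sketched as follows. The function names and the returned labels are illustrative assumptions and not part of the protocol.

```python
from urllib.request import Request


def build_inventory_request(server_url, session_id):
    """Build (but do not send) the GET request for the LR inventory."""
    request = Request(server_url.rstrip("/") + "/sync/?sync_protocol=1.0")
    # The session cookie obtained during authentication is sent back as sessionid.
    request.add_header("Cookie", "sessionid=" + session_id)
    return request


def classify_inventory_response(status, sync_protocol_header):
    """Map a response status and Sync-Protocol header value to a short label."""
    if status == 403:
        return "authentication problem"
    if status == 501:
        return "unsupported protocol version"
    if status == 200 and sync_protocol_header == "1.0":
        return "ok"
    return "unexpected response"
```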

Inventory ZIP File

The inventory ZIP file returned by the server has to have the following characteristics:

  • It is a .ZIP File according to the .ZIP File Format Specification.
  • It contains exactly one file named inventory.json which has the format defined below.

The inventory.json file has to have the following characteristics:

  • It is a UTF-8-encoded file whose content text is a JSON object conforming to RFC 4627.
  • The JSON object has one member (key/value pair) per non-internal LR in the LR storage of the server.
  • Each JSON object member has the LR storage ID string (defined below) as the key.
  • The value of each JSON object member is an LR checksum string for the corresponding LR as defined below.
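
A client might unpack and check a received inventory ZIP file along these lines. This is a minimal sketch using only the Python standard library; the function name read_inventory and the choice to raise ValueError on malformed input are assumptions.

```python
import io
import json
import re
import zipfile

STORAGE_ID = re.compile(r"[0-9a-f]{64}")
CHECKSUM = re.compile(r"[0-9a-f]{32}")


def read_inventory(zip_bytes):
    """Return the inventory as a dict of LR storage ID -> LR checksum."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        # The archive must contain exactly one file named inventory.json.
        if archive.namelist() != ["inventory.json"]:
            raise ValueError("expected exactly one file named inventory.json")
        inventory = json.loads(archive.read("inventory.json").decode("utf-8"))
    if not isinstance(inventory, dict):
        raise ValueError("inventory.json must contain a JSON object")
    for lr_id, checksum in inventory.items():
        if not STORAGE_ID.fullmatch(lr_id):
            raise ValueError("invalid LR storage ID: %r" % lr_id)
        if not isinstance(checksum, str) or not CHECKSUM.fullmatch(checksum):
            raise ValueError("invalid LR checksum for %s" % lr_id)
    return inventory
```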

LR Storage ID

Every LR has to have an LR storage ID which is unique across the META-SHARE Network. Each LR storage ID must be a 64-character hexadecimal (lower-case) string.

A new LR storage ID is created by concatenating two 32-character hexadecimal (lower-case) strings comprising one version 1 UUID and one version 4 UUID according to RFC 4122.
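
For illustration, such an ID could be generated with the uuid module of the Python standard library. The order of the two parts (version 1 UUID first, version 4 UUID second) is an assumption based on the order in which they are mentioned above, and both function names are hypothetical.

```python
import re
import uuid


def new_storage_id():
    """Create a 64-character lower-case hexadecimal LR storage ID."""
    # uuid.UUID.hex is the 32-character lower-case hexadecimal form without dashes.
    # Assumption: the version 1 UUID comes first, the version 4 UUID second.
    return uuid.uuid1().hex + uuid.uuid4().hex


def is_valid_storage_id(value):
    """Check only the syntactic form required for an LR storage ID."""
    return re.fullmatch(r"[0-9a-f]{64}", value) is not None
```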

LR Checksum

The LR checksum is a 32-character hexadecimal (lower-case) string which serves as a checksum for a specific LR. The string has to be an MD5 message digest as defined in RFC 1321 for the string which is the concatenation of the following LR-specific substrings:

  1. the content of the metadata XML file (as defined below) for the LR as a Unicode string
  2. the content of the global storage JSON file (as defined below) for the LR as a Unicode string
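
The checksum computation can be sketched as follows. MD5 operates on bytes, so the Unicode strings have to be encoded first; UTF-8 is assumed here because the specification does not name the encoding used for digesting. The function name lr_checksum is hypothetical.

```python
import hashlib


def lr_checksum(metadata_xml, storage_global_json):
    """Return the MD5 hex digest of the two file contents, concatenated in order."""
    digest = hashlib.md5()
    # Assumption: both Unicode strings are encoded as UTF-8 before digesting.
    digest.update(metadata_xml.encode("utf-8"))
    digest.update(storage_global_json.encode("utf-8"))
    return digest.hexdigest()
```

Feeding the two parts to the digest one after the other is equivalent to digesting their concatenation.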

LR Transfer

For the client to obtain the LR with LR storage ID “%LR_ID%” from the server, the client has to send an HTTP GET request to “%SERVER_URL%/sync/%LR_ID%/metadata/”. The HTTP GET request must additionally have an HTTP Cookie header field according to RFC 2965 with the “%SESSION_ID%” value (from the authentication procedure) for the name sessionid.

The server should respond with an HTTP response with status 403 in case of authentication problems.

If there are no problems with the request from the client, then the server must answer with an HTTP response with status 200; the content of the response must be a resource ZIP file attachment (as defined below) for the LR with LR storage ID “%LR_ID%”.

The client should verify that the LR checksum from the LR inventory is the same as the locally calculated LR checksum for the LR in the received resource ZIP file.

Resource ZIP File

A resource ZIP file has to have the following characteristics:

  • It is a .ZIP File according to the .ZIP File Format Specification.
  • It contains exactly two files.
  • The first file in the resource ZIP file is a metadata XML file (as defined below) which is named metadata.xml.
  • The second file in the resource ZIP file is a global storage JSON file (as defined below) which is named storage-global.json.
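
A client could unpack a resource ZIP file roughly as sketched here. The function name read_resource is hypothetical, and decoding metadata.xml as UTF-8 is an assumption, since this specification only fixes the encoding of the JSON file.

```python
import io
import zipfile


def read_resource(zip_bytes):
    """Return (metadata_xml, storage_global_json) as Unicode strings."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        # Exactly two files are expected, in this order.
        if archive.namelist() != ["metadata.xml", "storage-global.json"]:
            raise ValueError("unexpected resource ZIP content: %r" % archive.namelist())
        # Assumption: the metadata XML file is UTF-8 encoded.
        metadata_xml = archive.read("metadata.xml").decode("utf-8")
        storage_global_json = archive.read("storage-global.json").decode("utf-8")
    return metadata_xml, storage_global_json
```

The two returned strings are exactly the inputs needed to recompute the LR checksum and compare it with the value from the LR inventory.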

Metadata XML File

A metadata XML file must be a valid XML file with regard to the META-SHARE metadata XML Schema in version 3.0. The metadata XML file represents a single LR.

Global Storage JSON File

A global storage JSON file has to have the following characteristics:

  • It is a UTF-8-encoded file whose content text is a JSON object conforming to RFC 4627. The file must not contain any whitespace which is optional according to RFC 4627.
  • The JSON object has exactly eight members (key/value pairs) which appear alphabetically sorted by their keys:
    1. created: a JSON string representing the date and time on which the corresponding LR was created. Format: “YYYY-MM-DD hh:mm:ss” – example: 2012-05-21 17:00:23
    2. deleted: one of the JSON literals true and false, depending on whether the LR is marked as deleted or not
    3. identifier: a JSON string with the LR storage ID (as defined above)
    4. metashare_version: a JSON string with the META-SHARE software version with which the LR was created, e.g., 3.0-SNAPSHOT
    5. modified: a JSON string representing the date and time on which the corresponding LR was last modified. The format is the same as for the created member.
    6. publication_status: a one-character JSON string representing the publication status of the LR; either g for an ingested LR or p for a published LR.
    7. revision: a positive integer represented as a JSON number. The integer represents the revision of the LR; the revision number must be incremented each time the LR is changed.
    8. source_url: a JSON string with the base URL of the origin node, e.g., “%SERVER_URL%” for LRs that originate from the server.
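
One way to produce such a file content is the json module of the Python standard library, with sorted keys and separators that omit optional whitespace. This is a sketch only; the function name dump_storage_global and the ValueError on a wrong key set are assumptions.

```python
import json

EXPECTED_KEYS = {
    "created", "deleted", "identifier", "metashare_version",
    "modified", "publication_status", "revision", "source_url",
}


def dump_storage_global(record):
    """Serialize the eight members, sorted by key and without optional whitespace."""
    if set(record) != EXPECTED_KEYS:
        raise ValueError("record must have exactly the eight specified members")
    # sort_keys gives the alphabetical member order; the separators drop the
    # spaces which json.dumps would otherwise insert after "," and ":".
    return json.dumps(record, sort_keys=True, separators=(",", ":"))
```

With the default settings of json.dumps, non-ASCII characters are escaped, so the resulting text is plain ASCII and therefore also valid UTF-8 when written to the file.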