Skip to content

Latest commit

 

History

History
55 lines (47 loc) · 2.85 KB

resourcesync.adoc

File metadata and controls

55 lines (47 loc) · 2.85 KB

ResourceSync source

Overview

ResourceSync is a synchronization framework for the web consisting of various capabilities that allow third-party systems to remain synchronized with a server’s evolving resources. It is based on the Sitemaps protocol, extending its capabilities in order to keep track of resources' state.

The purpose of our work is to develop a fully compliant, reliable and scalable implementation of a ResourceSync server. While our use case is motivated by the need to continuously aggregate the metadata and full text content of academic papers, our solution should be applicable to other use cases too.

Requirements

  1. Maximizing response time to resync requests

  2. Managing separate sets of resources belonging to different data providers

  3. Optimization for continuous additions and occasional deletes/updates

  4. Dealing with sitemaps limitations (max 50k items/document, max 50MB)

  5. Metadata and content files related to a publication should be linked

Architecture

resync arch
Figure 1. Architecture

Approach to develop a scalable ResourceSync implementation

  • Resource state will be kept into ElasticSearch indexes. For each resource, we will store metadata that will be showed in sitemaps docs (last mod, hash, size, etc)

  • Whenever a new ResourceList has to be generated, its pagination will come directly from the ElasticSearch index, and each xml document will be cached and regenerated after a given amount of detected changes

  • When a change in the folder is detected, the event will be registered into ElasticSearch, in a way that a ChangeList related to a given period of time will be generated taking into account the events occurred in that time interval. ChangeList caching will be implemented as well.

  • Every time a new ResourceList is generated, all the changes stored into the "change" ES type will be erased or marked as "fetched"

  • Periodically, a new ResourceList will be generated and the current ChangeList will be “closed”, therefore a new ChangeList will be initialized. There is no indication for the appropriate time interval in the ResourceSync specification, so we will tune this parameter based on the metadata/pdfs download rate. More information about how metadata will be stored into Elasticsearch are available here.

Further improvements

  • Push protocol based on PubSubHubbub

    • Change Notification

      • Notifies about changes to particular resources e.g., resource A has been updated | created | deleted

    • Framework Notification

      • Notifies about changes to capabilities e.g., a Change List has been updated | created | deleted

  • Resources and Changes dumps