EPrints - Archivematica Integration

Digital Preservation through EPrints-Archivematica Integration - An EPrints export plugin for contents to be preserved with Archivematica

The EPrints-Archivematica integration proposal was first presented at Archivematica Camp and OR2018 in Bozeman. The OR2018 presentation is available here:

Neugebauer, Tomasz, Simpson, Justin and Bradley, Justin (2018) Digital Preservation through EPrints-Archivematica Integration. In: International Conference on Open Repositories, June 3-7, 2018, Bozeman, Montana, USA. https://spectrum.library.concordia.ca/983933/

Summary

The following is a summary of the proposed workflow for EPrints-Archivematica integration:

  • A “Digital Preservation Export” batch script runs periodically, identifies new or updated items to export, and generates the exports in a directory structure optimized for Archivematica transfers, described below.

  • The export plugin will create a transfer for each eprint. Each transfer includes:

    • An objects directory containing the uploaded digital files that are part of the eprint as well as any derivative access files generated by EPrints
    • An objects/documents folder containing all uploaded digital files that are a part of the eprint
    • An objects/derivatives folder containing any derivative access files that were generated by EPrints, such as thumbnail images, audio access files, and video access files
    • A metadata folder with Dublin Core metadata (in JSON format), EPrints XML metadata, EPrints-generated "revision" XML files, and an md5deep-style checksum manifest for digital files in the objects directory
    • A metadata/revisions folder containing all EPrints-generated “revision” XML files

Transfers are moved to a specified shared storage location.
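The folder layout of a single transfer, as listed above, can be sketched in Perl as follows. This is only an illustration: the helper name make_transfer_skeleton and the transfer name eprint-123 are hypothetical and not part of the specification, and a temporary directory stands in for the shared storage location.

```perl
use strict;
use warnings;
use File::Path qw(make_path);
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper: create the empty folder layout for one eprint transfer.
sub make_transfer_skeleton {
    my ($base_dir, $transfer_name) = @_;
    my $root = File::Spec->catdir($base_dir, $transfer_name);
    my @dirs = map { File::Spec->catdir($root, @$_) } (
        [ 'objects',  'documents' ],     # uploaded digital files
        [ 'objects',  'derivatives' ],   # EPrints-generated access files
        [ 'metadata', 'revisions' ],     # EPrints "revision" XML files
    );
    make_path(@dirs);   # also creates the parent objects/ and metadata/ folders
    return $root;
}

# A temporary directory stands in for the shared storage location.
my $root = make_transfer_skeleton(tempdir(CLEANUP => 1), 'eprint-123');
print "$root\n";
```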

[Figure: Eprint Export Folder Structure]

  • The following would be the structure of the documents folder:

[Figure: Eprint Export Folder Structure - Documents]

  • The following would be the structure of the derivatives folder:

fileid-XXXXX -> folder# -> filename

  • Archivematica's Automation Tools monitors the shared storage location for new bags, creates transfers/ingests in Archivematica according to a user-defined processing configuration, and then stores the AIPs in archival storage.

This integration is currently in the technical specification phase.

Implementation details

Derivatives

$c->{DPExport}->{include_derivatives}=1;

Setting this to 0 excludes derivative files such as thumbnail images and web-accessible versions of audio and video files.

Checksum manifest

The metadata/checksum.md5 file should follow the specifications detailed in the Archivematica documentation for creating a transfer with existing checksums.

Specifically, in this implementation, each line of the checksum.md5 manifest should contain the MD5 hash value for a file in the objects directory, followed by a space, followed by the path to the file relative to the checksum.md5 file itself.

Example:

2121dca88ad7f701d3f3e2d041004a56 ../objects/documents/my-doc.pdf
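A manifest line in this format could be produced along the following lines. This is a minimal sketch using the core Digest::MD5 and File::Spec modules; the function names md5_of_file and manifest_line are hypothetical, and the sketch assumes the manifest lives in the transfer's metadata folder as described above.

```perl
use strict;
use warnings;
use Digest::MD5;
use File::Spec;

# Return the hex MD5 digest of a file, read in binary mode.
sub md5_of_file {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    binmode($fh);
    my $hex = Digest::MD5->new->addfile($fh)->hexdigest;
    close($fh);
    return $hex;
}

# Build one manifest line: the hash, a space, then the file path
# relative to the metadata folder that holds checksum.md5.
sub manifest_line {
    my ($transfer_root, $file_path) = @_;
    my $metadata_dir = File::Spec->catdir($transfer_root, 'metadata');
    my $relative     = File::Spec->abs2rel($file_path, $metadata_dir);
    return md5_of_file($file_path) . ' ' . $relative;
}

# Usage: perl manifest_line.pl /path/to/transfer /path/to/transfer/objects/documents/my-doc.pdf
print manifest_line(@ARGV), "\n" if @ARGV == 2;
```

Reading the file in binary mode keeps the digest independent of platform newline handling.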

For files with MD5 values already recorded in the EPrints database, use those values in the manifest. Each recorded value should first be checked (i.e., recalculated from the file on disk and compared to what is stored in EPrints), signalling an error if there is a mismatch. Such errors indicate that file corruption may already have taken place. There should be a configuration option to control what happens in case of a checksum mismatch:

$c->{DPExport}->{'on-checksum-mismatch'} = 'skip-proceed'; # or 'halt'

skip-proceed should be the default, meaning that the problematic eprint is flagged with an error in the eprint's digital preservation errors field, but the batch job continues. If 'halt' is chosen, the entire batch job that the problematic eprint is a part of halts.
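The two mismatch policies might be handled as in the sketch below. The helper check_stored_md5 is hypothetical, and a plain array stands in for the eprint's digital preservation errors field; how the plugin actually records errors is not specified here.

```perl
use strict;
use warnings;

# Hypothetical helper: compare the MD5 stored in EPrints with a freshly
# computed one and apply the configured mismatch policy.
# Returns 'ok' on a match and 'skip' under skip-proceed; dies under halt.
sub check_stored_md5 {
    my ($stored, $computed, $policy, $errors) = @_;
    return 'ok' if lc($stored) eq lc($computed);

    my $message = "checksum mismatch: stored $stored, computed $computed";
    die "$message\n" if $policy eq 'halt';   # stops the whole batch job

    push @$errors, $message;   # stands in for the eprint's errors field
    return 'skip';             # this eprint is not exported; the batch continues
}

my @errors;
my $result = check_stored_md5(
    '2121dca88ad7f701d3f3e2d041004a56',
    '2121dca88ad7f701d3f3e2d041004a56',
    'skip-proceed', \@errors,
);
print "$result\n";   # prints "ok"
```

Letting the halt policy raise an exception means the batch loop needs no special case: an uncaught die ends the job, while skip-proceed returns normally.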

In addition, there should be an option to communicate checksum-mismatch errors by email:

$c->{DPExport}->{'on-checksum-mismatch-email-notification'} = 0; # set to 1 to enable

It should be set to 0 by default, and if set to 1, in addition to the problematic eprint not exporting, an email with the error information is sent to the address selected in the following config:

$c->{DPExport}->{'DP-admin-email'} = "[email address]";

For files with no MD5 value in the EPrints database:

  • Generate a new MD5 from the file on disk
  • Write the MD5 to the EPrints database
  • Write the MD5 to the checksum.md5 manifest
  • Note in the eprint's digital preservation warnings field that the MD5 was generated for the given file
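The steps for files without a stored MD5 can be sketched as follows. Everything here is a stand-in: ensure_md5 is a hypothetical helper, a plain hash with hash and path keys represents the EPrints file record, and arrays represent the manifest and the warnings field.

```perl
use strict;
use warnings;

# Hypothetical sketch: $file is a plain hash standing in for an EPrints
# file record; {path} is the path as it should appear in the manifest.
sub ensure_md5 {
    my ($file, $computed_md5, $manifest, $warnings) = @_;
    if (!defined $file->{hash} || $file->{hash} eq '') {
        $file->{hash} = $computed_md5;   # stands in for the database write
        push @$warnings, "MD5 generated at export for $file->{path}";
    }
    push @$manifest, "$file->{hash} $file->{path}";
    return $file->{hash};
}

my (@manifest, @warnings);
my %file = (path => '../objects/documents/my-doc.pdf');
ensure_md5(\%file, '2121dca88ad7f701d3f3e2d041004a56', \@manifest, \@warnings);
print "$_\n" for @manifest;
```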

Preservation Management Screen

An EPrintsArchivematica preservation management screen allows the administrator to browse eprints, including by last exported date and export status (success or failure, with reason). For example, an eprint may fail the checksum recheck prior to export and so not be exported, or it may be exported with a new checksum.

Preservation Triggers

The plugin configuration file will include a list of metadata elements whose change would flag an eprint as in need of preservation. For example:

$c->{DPExport}->{trigger_fields} = [{ meta_fields => [ "title" ] }];

In addition, the configuration file will include a list of trigger_events that take place on an eprint which flag it as in need of preservation. For example:

$c->{DPExport}->{trigger_events} = [{ events => [ "FilesModified" ] }];
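One way the trigger settings above might be evaluated is sketched below. The helper needs_preservation and the way changed fields and events are passed in are assumptions made for illustration; the actual plugin would receive this information from EPrints.

```perl
use strict;
use warnings;

# Mirrors the structure of the trigger settings; values are examples only.
my %dp_export = (
    trigger_fields => [ { meta_fields => [ 'title' ] } ],
    trigger_events => [ { events => [ 'FilesModified' ] } ],
);

# Hypothetical helper: true if a changed field or the event is a trigger.
sub needs_preservation {
    my ($config, $changed_fields, $event) = @_;
    my %fields = map { ($_ => 1) }
                 map { @{ $_->{meta_fields} || [] } }
                 @{ $config->{trigger_fields} || [] };
    my %events = map { ($_ => 1) }
                 map { @{ $_->{events} || [] } }
                 @{ $config->{trigger_events} || [] };

    return 1 if defined $event && $events{$event};
    return 1 if grep { $fields{$_} } @$changed_fields;
    return 0;
}

print needs_preservation(\%dp_export, ['title'], undef) ? "flag\n" : "no change\n";   # prints "flag"
```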

It should be possible to configure the plugin to flag an item for preservation every time it is moved to "archive" (detected either from an event or through the appearance of a "datestamp" meta_field).

There should be a command line bin script that exports either the entire "live" archive dataset or a given list of eprint IDs.

The configuration file should specify if the preservation actions (preservation of eprints flagged as in need of preservation) should be performed in batches or asap (as soon as change occurs):

$c->{DPExport}->{perform_preservation} = 'asap'; # or 'batch'

In addition, an option should specify when the batch processing should take place:

$c->{DPExport}->{perform_preservation_batch} = '<cron time string>'; # use cron time string format to specify time

For example, batch process all eprints flagged for preservation at the same time each night, week, month, or year.
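Taken together, the options proposed in this document might look as follows in a configuration file. This is an illustrative fragment only: the values shown are examples, the cron string is an assumed nightly schedule, and the email address is left as a placeholder.

```perl
use strict;
use warnings;

# In an EPrints configuration file $c is supplied by EPrints; it is
# declared here only so that the fragment compiles on its own.
my $c = {};

$c->{DPExport}->{include_derivatives} = 1;
$c->{DPExport}->{'on-checksum-mismatch'} = 'skip-proceed';           # or 'halt'
$c->{DPExport}->{'on-checksum-mismatch-email-notification'} = 0;     # 1 to enable
$c->{DPExport}->{'DP-admin-email'} = '[email address]';              # placeholder
$c->{DPExport}->{trigger_fields} = [ { meta_fields => [ 'title' ] } ];
$c->{DPExport}->{trigger_events} = [ { events => [ 'FilesModified' ] } ];
$c->{DPExport}->{perform_preservation} = 'batch';                    # or 'asap'
$c->{DPExport}->{perform_preservation_batch} = '0 2 * * *';          # example: 02:00 every night
```

Keys that contain hyphens are quoted because Perl would otherwise parse an unquoted hyphen inside the braces as subtraction.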