Digital Preservation through EPrints-Archivematica Integration - An EPrints export plugin for contents to be preserved with Archivematica
The EPrints-Archivematica integration proposal was first presented at Archivematica Camp and OR2018 in Bozeman. The OR2018 presentation is available here:
Neugebauer, Tomasz, Simpson, Justin and Bradley, Justin (2018) Digital Preservation through EPrints-Archivematica Integration. In: International Conference on Open Repositories, June 3-7, 2018, Bozeman, Montana, USA. https://spectrum.library.concordia.ca/983933/
The following is a summary of the proposed workflow for EPrints-Archivematica integration:
- A “Digital Preservation Export” batch script runs periodically; it identifies new or updated items to export and generates the exports in a directory structure optimized for Archivematica transfers, described below.
- The export plugin will create a transfer for each eprint. Each transfer includes:
  - An objects directory containing the uploaded digital files that are part of the eprint as well as any derivative access files generated by EPrints
    - An objects/documents folder containing all uploaded digital files that are a part of the eprint
    - An objects/derivatives folder containing any derivative access files that were generated by EPrints, such as thumbnail images, audio access files, and video access files
  - A metadata folder with Dublin Core metadata (in JSON format), EPrints XML metadata, EPrints-generated "revision" XML files, and an md5deep-style checksum manifest for digital files in the objects directory
    - A metadata/revisions folder containing all EPrints-generated “revision” XML files
- Transfers are moved to a specified shared storage location.
- The following would be the structure of the documents folder:
- The following would be the structure of the derivatives folder: fileid-XXXXX -> folder# -> filename
- Archivematica's Automation Tools monitors the shared storage for new bags, creates transfers/ingests in Archivematica according to a user-defined processing configuration, and then stores AIPs in archival storage.
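The transfer layout described in the list above can be sketched as follows. This is a minimal illustration in Python, not the plugin's implementation (an EPrints plugin would be written in Perl). The function names and the eprint-<id> transfer naming scheme are hypothetical; the include_derivatives flag mirrors the configuration option of the same name.

```python
import tempfile
from pathlib import Path


def create_transfer_skeleton(root: Path, eprint_id: int,
                             include_derivatives: bool = True) -> Path:
    """Create the empty folder layout for one eprint's transfer."""
    # Hypothetical naming scheme: one transfer folder per eprint.
    transfer = root / f"eprint-{eprint_id}"
    subdirs = ["objects/documents", "metadata/revisions"]
    if include_derivatives:
        subdirs.append("objects/derivatives")
    for sub in subdirs:
        (transfer / sub).mkdir(parents=True, exist_ok=True)
    return transfer


def list_layout(transfer: Path) -> list:
    """Return the transfer's folders as sorted, transfer-relative paths."""
    return sorted(p.relative_to(transfer).as_posix()
                  for p in transfer.rglob("*") if p.is_dir())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for folder in list_layout(create_transfer_skeleton(Path(tmp), 42)):
            print(folder)
```

Creating the folders up front keeps the later steps (copying files, writing metadata, writing the checksum manifest) independent of one another.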
This integration is currently in the technical specification phase.
$c->{DPExport}->{include_derivatives}=1;
Setting this to 0 would exclude all derivatives, such as thumbnail images and web-accessible versions of audio and video files.
The metadata/checksum.md5 file should follow the specifications detailed in the Archivematica documentation for creating a transfer with existing checksums.
Specifically, in this implementation, each line of the checksum.md5 manifest should contain the MD5 hash value for a file in the objects directory, followed by a space, followed by the relative path to the file from the checksum.md5 file itself.
Example:
2121dca88ad7f701d3f3e2d041004a56 ../objects/documents/my-doc.pdf
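A manifest in this format could be produced along the following lines. This is an illustrative Python sketch only (the plugin itself would be Perl); the function names are hypothetical, and the single-space separator follows the description above.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_lines(transfer: Path) -> list:
    """Build one '<md5> <relative path>' line per file under objects/."""
    metadata_dir = transfer / "metadata"  # where checksum.md5 lives
    objects_dir = transfer / "objects"
    lines = []
    for path in sorted(p for p in objects_dir.rglob("*") if p.is_file()):
        # Paths are relative to the manifest's own folder, e.g. ../objects/...
        rel = Path(os.path.relpath(path, metadata_dir)).as_posix()
        lines.append(f"{md5_of_file(path)} {rel}")
    return lines


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        doc = Path(tmp) / "objects" / "documents" / "my-doc.pdf"
        doc.parent.mkdir(parents=True)
        doc.write_bytes(b"example content")
        print("\n".join(manifest_lines(Path(tmp))))
```

Reading in chunks keeps memory use flat for large audio and video files.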
For files with MD5 values already recorded in the EPrints database, use these values in the manifest. The recorded values should first be verified (i.e., recalculated from the file on disk and compared with what is stored in EPrints), signalling an error if there is a mismatch. Such errors indicate that file corruption may have already taken place. There should be a configuration option to control what happens in case of a checksum mismatch:
$c->{DPExport}->{'on-checksum-mismatch'} = 'skip-proceed'; # or 'halt'
skip-proceed should be the default, meaning that the problematic eprint is flagged with an error in the eprint's digital preservation errors field, but the batch job continues. If 'halt' is chosen, the entire batch job that the problematic eprint is a part of halts.
In addition, there should be an option to communicate checksum-mismatch errors by email:
$c->{DPExport}->{'on-checksum-mismatch-email-notification'} = 0; # set to 1 to enable
It should be set to 0 by default. If set to 1, in addition to the problematic eprint not being exported, an email with the error information is sent to the address selected in the following config:
$c->{DPExport}->{'DP-admin-email'} = "[email address]";
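The interaction between the mismatch policy and the email option might look like the sketch below. It is written in Python for illustration only and rests on several assumptions: the ExportResult class, the run_batch function, and the shape of the eprints mapping are hypothetical, and notifications are collected in a list rather than actually sent by email.

```python
from dataclasses import dataclass, field


@dataclass
class ExportResult:
    exported: list = field(default_factory=list)       # eprint IDs exported
    errors: dict = field(default_factory=dict)         # eprint ID -> message
    notifications: list = field(default_factory=list)  # emails that would be sent
    halted: bool = False


def run_batch(eprints: dict, config: dict, recompute_md5) -> ExportResult:
    """Process a batch; eprints maps eprint ID -> [(filename, stored_md5)]."""
    result = ExportResult()
    policy = config.get("on-checksum-mismatch", "skip-proceed")
    notify = config.get("on-checksum-mismatch-email-notification", 0)
    for eprint_id, files in eprints.items():
        mismatched = [name for name, stored in files
                      if recompute_md5(name) != stored]
        if not mismatched:
            result.exported.append(eprint_id)
            continue
        # Record the error against the eprint; it is never exported.
        message = "checksum mismatch: " + ", ".join(mismatched)
        result.errors[eprint_id] = message
        if notify:
            result.notifications.append(
                (config.get("DP-admin-email"), eprint_id, message))
        if policy == "halt":
            result.halted = True
            break  # stop the whole batch
    return result
```

With the default skip-proceed policy, only the problematic eprint is held back; under halt, eprints later in the batch are left unprocessed until the problem is resolved.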
For files with no MD5 value in the EPrints database:
- Generate a new MD5 from the file on disk
- Write the MD5 to the EPrints database
- Write the MD5 to the checksum.md5 manifest
- Note that the MD5 was generated for the given file in the eprint's digital preservation warnings field
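The steps for files without a recorded MD5 can be sketched as a single helper. This Python illustration is an assumption-laden sketch: resolve_md5 is a hypothetical name, the file_record dictionary stands in for the EPrints database record, and the warnings list stands in for the eprint's digital preservation warnings field.

```python
import hashlib


def resolve_md5(file_record: dict, content: bytes, warnings: list) -> str:
    """Return the MD5 to write to the manifest, generating it if missing."""
    stored = file_record.get("md5")
    if stored:
        # A recorded value exists; verifying it is a separate step.
        return stored
    generated = hashlib.md5(content).hexdigest()
    file_record["md5"] = generated  # stand-in for the database write
    warnings.append(f"MD5 generated for {file_record['filename']}")
    return generated
```

The returned value is what would be written to the checksum.md5 manifest in either case.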
An EPrintsArchivematica preservation management screen allows the administrator to browse eprints, including by last exported date and by export status (success or failure, with reason). For example, an eprint may fail the checksum recheck prior to export and so not be exported, or it may be exported with a new checksum.
The plugin configuration file will include a list of metadata elements whose change would flag an eprint as in need of preservation. For example:
$c->{DPExport}->{trigger_fields} = [{ meta_fields => [ "title" ] }];
In addition, the configuration file will include a list of trigger_events that take place on an eprint which flag it as in need of preservation. For example:
$c->{DPExport}->{trigger_events} = [{ events => [ "FilesModified" ] }];
It should be possible to configure the plugin to flag an item for preservation every time it is moved to "archive" (triggered either by an event or by the appearance of a "datestamp" meta_field).
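As a rough illustration of how the two trigger lists might be evaluated, the Python sketch below flattens trigger_fields and trigger_events and checks a change against them. The needs_preservation function and its arguments are hypothetical; only the configuration keys come from the examples above.

```python
def needs_preservation(config: dict, changed_fields=(), event=None) -> bool:
    """Decide whether a change or event flags an eprint for preservation."""
    # Flatten the nested config lists into plain sets of names.
    fields = {name
              for group in config.get("trigger_fields", [])
              for name in group.get("meta_fields", [])}
    events = {name
              for group in config.get("trigger_events", [])
              for name in group.get("events", [])}
    return bool(fields & set(changed_fields)) or event in events


# Python mirror of the two Perl configuration examples.
EXAMPLE_CONFIG = {
    "trigger_fields": [{"meta_fields": ["title"]}],
    "trigger_events": [{"events": ["FilesModified"]}],
}
```

Under this sketch, flagging on a move to "archive" would amount to adding "datestamp" to the meta_fields list or the relevant event name to the events list.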
There should be a command-line bin script that will export either the entire "live" archive dataset or a given list of eprint IDs.
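One possible shape for such a script's interface is sketched below in Python, purely for illustration (EPrints bin scripts are Perl). The option names --all-live and --eprintids and the repository_id argument are assumptions, not part of the specification.

```python
import argparse


def parse_args(argv=None):
    """Parse the (hypothetical) command line of the export script."""
    parser = argparse.ArgumentParser(
        description="Export eprints for preservation with Archivematica.")
    parser.add_argument("repository_id", help="ID of the EPrints repository")
    # Exactly one of the two export scopes must be chosen.
    scope = parser.add_mutually_exclusive_group(required=True)
    scope.add_argument("--all-live", action="store_true",
                       help="export the entire live archive dataset")
    scope.add_argument("--eprintids", type=int, nargs="+", metavar="ID",
                       help="export only the listed eprint IDs")
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args(["myrepo", "--eprintids", "101", "102"])
    print(args.repository_id, args.eprintids)
```

Making the two scopes mutually exclusive avoids ambiguity about whether a supplied ID list narrows or extends a full export.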
The configuration file should specify if the preservation actions (preservation of eprints flagged as in need of preservation) should be performed in batches or asap (as soon as change occurs):
$c->{DPExport}->{perform_preservation} = 'batch'; # allowed values: 'asap' or 'batch'
In addition, an option should specify when the batch processing should take place:
$c->{DPExport}->{perform_preservation_batch} = "[cron time string]"; # use cron time string format to specify the time
This allows all eprints flagged for preservation to be batch processed at the same time each night, week, month, or year.
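The difference between the two modes can be sketched as follows. This Python illustration makes several assumptions: on_flagged and run_scheduled_batch are hypothetical names, the in-memory list stands in for whatever persistent flag the plugin uses, the fallback to 'batch' is arbitrary, and the cron schedule itself is left to the system scheduler.

```python
def on_flagged(eprint_id, config: dict, export_now, queue: list) -> None:
    """Handle an eprint that has just been flagged for preservation."""
    mode = config.get("perform_preservation", "batch")  # assumed fallback
    if mode == "asap":
        export_now(eprint_id)    # export as soon as the change occurs
    elif mode == "batch":
        queue.append(eprint_id)  # wait for the next scheduled run
    else:
        raise ValueError(f"unknown perform_preservation value: {mode!r}")


def run_scheduled_batch(queue: list, export_now) -> int:
    """Export everything queued; meant to be invoked by the scheduler."""
    count = 0
    while queue:
        export_now(queue.pop(0))
        count += 1
    return count
```

In batch mode, the scheduler would invoke the second function at the time given by perform_preservation_batch.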