Accessioning complex content

Andrew Berger edited this page Mar 19, 2024 · 13 revisions

Note: this type of accessioning used to be specific to media but is now generalized with a few changes.

The default settings for content types such as "book", "image", and "file" are appropriate for a large percentage of the content accessioned into SDR, but some content requires special handling. Complex content is any content where you need more granular control over how your files are organized within each object. One example of complex objects accessioned through pre-assembly is multimedia objects, which hold audio and video content. Media content files must be grouped in a particular way in order to stream in Stanford's embed player:

  • video files must be grouped into video resources
  • audio files must be grouped into audio resources
  • still image files, if not being used as video or audio thumbnails, must be grouped into image resources
  • additional file types, such as PDFs, must be part of file resources

The use of a manifest supports this custom grouping, and also supports other custom objects. In order to accession complex content, an additional manifest must be submitted along with the generic preassembly manifest.csv file. This manifest makes it possible to control:

  • how files are grouped into resources*
  • whether each file should be accessible from the PURL or only stored in preservation
  • how resources are labeled
  • the order in which resources appear in the viewer
  • the role of a file

* definition of a resource: a resource is a grouping of files that all fill the same purpose. All files in an object must be part of at least one resource. Examples of resources are:

  • a single "video" resource in a media object may contain: the high-quality archival copy of a video plus the web-accessible streaming copy of the same video
  • a single "page" resource in a book object may contain: the high-quality archival scan of a page, the web-accessible JP2 version of that scan, the OCR text from that page, and a PDF of a page

Accessioning with a file manifest

To use a file manifest to accession complex content when starting a new job, configure everything as normal, selecting the options below:

  • "Content structure" drop-down menu == Choose the content type, be sure to select "Media" for media objects.
  • "Content metadata creation" drop-down menu == "Default"
  • "I have a file manifest" checkbox == checked
  • "Preserve, Shelve, Publish Settings" radio button == "Default"

Then you must create a manifest called "file_manifest.csv" listing all of the files in all of your objects, and place this manifest file in the same location as the "manifest.csv" file with your content. The structure of this manifest file is described below.

The file manifest

The file_manifest.csv must be formatted as follows:

druid,filename,resource_label,sequence,publish,preserve,shelve,resource_type
oo000oo0001,filename,human-readable label,number,yes/no,yes/no,yes/no,video/audio/image/file/object/media

Note that the second line of the CSV above only describes what data goes in each field. Remove it before submitting the CSV.
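If the manifest is assembled by script rather than by hand, the header above can be written with Python's standard csv module. This is only a sketch: the filenames and labels are hypothetical, and write_file_manifest is a helper name invented for this example, not part of pre-assembly.

```python
import csv
import io

# Required columns, in the order given in the header above.
COLUMNS = ["druid", "filename", "resource_label", "sequence",
           "publish", "preserve", "shelve", "resource_type"]


def write_file_manifest(rows, out):
    """Write rows (dicts keyed by COLUMNS) to an open text stream as CSV."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical rows for illustration only; substitute your own druids and files.
example_rows = [
    {"druid": "oo000oo0001", "filename": "interview_pm.mov",
     "resource_label": "Video file", "sequence": 1,
     "publish": "no", "preserve": "yes", "shelve": "no",
     "resource_type": "video"},
    {"druid": "oo000oo0001", "filename": "interview_sl.mp4",
     "resource_label": "Video file", "sequence": 1,
     "publish": "yes", "preserve": "yes", "shelve": "yes",
     "resource_type": "video"},
]

buffer = io.StringIO()
write_file_manifest(example_rows, buffer)
print(buffer.getvalue())
```

When writing to a file on disk instead of a string buffer, open it with newline="" as the csv module documentation recommends.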

The required columns must be labeled exactly as below and are defined as follows:

  • druid: the druid corresponding to the files for the given object. The files must be in a folder where the folder name is the same as the druid.
  • filename: the name of the file found in the object identified in the first column; can be multiple rows per object
  • resource_label: a human-readable label for the resource (note: all files in the same resource should have the same resource_label)
  • sequence: a number, starting from 1, giving the order in which the resources are listed on the PURL. This column also defines where a new resource begins: a row with the next sequence number (e.g. 2) starts a new resource. Number the sequences consecutively (e.g. 1,2,3,...)
  • publish: whether the file should be listed on the PURL (yes/no)
  • preserve: whether the file should be stored in preservation (yes/no)
  • shelve: whether the file should be stored on the digital Stacks (yes/no)
  • resource_type: the type that should be given to the resource (note: as with the label, all files in the same resource should have the same resource_type)

In addition, there are two optional columns you can add. They can be left off the file_manifest.csv if not needed, but if you do add these columns to the header of the csv, leave these fields blank for all the files where you do not wish to assign a value.

  • role: A file in a resource can have a specified "role", with the allowed values of "transcription", "caption", "thumbnail", "annotations", "derivative", or "master". Any other values will be ignored. This can be useful for OCR or Geo support.
  • file_language: The language of the file. Must be in the BCP-47 format. (See the Wikipedia page for IETF language tags for examples.)
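The column rules above lend themselves to a quick check before submitting a manifest. The following is a minimal sketch, assuming the manifest has been read into a string; validate_manifest and its messages are invented for this example and do not reflect how pre-assembly itself validates input.

```python
import csv
import io

REQUIRED = ["druid", "filename", "resource_label", "sequence",
            "publish", "preserve", "shelve", "resource_type"]
ROLES = {"transcription", "caption", "thumbnail",
         "annotations", "derivative", "master"}


def validate_manifest(text):
    """Return a list of problems found in file_manifest.csv text."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    problems = [f"missing required column: {name}"
                for name in REQUIRED if name not in header]
    if problems:
        return problems
    for row in reader:
        line = reader.line_num
        # Short rows yield None for missing fields, so normalize to "".
        value = {key: (row.get(key) or "").strip() for key in header}
        for name in ("druid", "filename"):
            if not value[name]:
                problems.append(f"line {line}: {name} is blank")
        for name in ("publish", "preserve", "shelve"):
            if value[name] not in ("yes", "no"):
                problems.append(f"line {line}: {name} must be yes or no")
        # Blank sequence values are tolerated (see Caveats); others must be digits.
        if value["sequence"] and not value["sequence"].isdigit():
            problems.append(f"line {line}: sequence is not a number")
        role = value.get("role", "")
        if role and role not in ROLES:
            problems.append(f"line {line}: role '{role}' would be ignored")
    return problems


HEADER = ",".join(REQUIRED) + "\n"
good = HEADER + "vd000bj0000,file123.jpg,Video File 1,1,yes,yes,yes,video\n"
bad = HEADER + "vd000bj0000,file245.jpg,Video File 1,one,maybe,yes,yes,video\n"
print(validate_manifest(good))
print(validate_manifest(bad))
```

The sketch does not check file_language, since validating BCP-47 tags properly needs more than a simple pattern match.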

Checksums

If you have generated checksums for any files in your object, and you place the standard MD5 checksum value in a file that has the exact same filename, but with an .md5 extension, the checksum for that file will be picked up automatically and added to content metadata.

For example, if you have the following files:

file123.jpg
file245.jpg
file245.jpg.md5  <-- a standard md5 text file with "MD5 (file245.jpg) = ee4e90be549c5614ac6282a5b80a506b" in it

with the following file_manifest.csv,

druid,filename,resource_label,sequence,publish,preserve,shelve,resource_type
vd000bj0000,file123.jpg,Video File 1,1,yes,yes,yes,video
vd000bj0000,file245.jpg,Video File 1,1,yes,yes,yes,video

it will generate the following content metadata:

<contentMetadata objectId="vd000bj0000" type="media">
   <resource sequence="1" id="vd000bj0000_1" type="video">
      <label>Video File 1</label>
      <file id="file123.jpg" preserve="yes" publish="yes" shelve="yes"/>
      <file id="file245.jpg" preserve="yes" publish="yes" shelve="yes">
           <checksum type="md5">ee4e90be549c5614ac6282a5b80a506b</checksum>
      </file>
   </resource>
</contentMetadata>

Note that the .md5 file is NOT listed in the file_manifest.csv but is found, and the checksum is added to the appropriate file based on the matching filename.
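Before staging content, it can be worth confirming that each .md5 file still matches its content file. The sketch below assumes the sidecar uses the one-line "MD5 (name) = digest" form shown above; whether pre-assembly accepts other layouts is not covered here, and the function names are hypothetical.

```python
import hashlib
import re
import tempfile
from pathlib import Path

# Matches the line format shown above: "MD5 (name) = <32 hex digits>"
MD5_LINE = re.compile(r"^MD5 \((?P<name>.+)\) = (?P<digest>[0-9a-fA-F]{32})$")


def read_md5_sidecar(path):
    """Return (filename, digest) parsed from a .md5 sidecar file."""
    match = MD5_LINE.match(Path(path).read_text().strip())
    if match is None:
        raise ValueError(f"unrecognized checksum format in {path}")
    return match["name"], match["digest"].lower()


def md5_of(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sidecar_matches(content_path):
    """True if <content_path>.md5 names this file and its digest agrees."""
    content_path = Path(content_path)
    sidecar = content_path.with_name(content_path.name + ".md5")
    name, expected = read_md5_sidecar(sidecar)
    return name == content_path.name and md5_of(content_path) == expected


# Demonstration with a throwaway file; the bytes are a stand-in, not an image.
with tempfile.TemporaryDirectory() as folder:
    content = Path(folder) / "file245.jpg"
    content.write_bytes(b"placeholder bytes")
    sidecar = Path(folder) / "file245.jpg.md5"
    sidecar.write_text(f"MD5 ({content.name}) = {md5_of(content)}\n")
    print(sidecar_matches(content))
```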

Caveats

  • You must list any and all files in the file_manifest.csv that you want to appear in the content metadata. Files not listed in the manifest will NOT appear in content metadata even if they exist in the object folder and are staged with the object.
  • Conversely, if you list files in the manifest and they do NOT exist in the object (for example, because of a typo in the filename), they will STILL be listed. The content metadata is built from the manifest and NOT from scanning the object folder, which is different from what happens when you do not provide a file_manifest.csv.
  • If you do not have a sequence of "1" in the first row after the header, content metadata will NOT be generated correctly, so list the files in sequence order when adding them to the manifest.
  • Although you do not need to repeat the resource_label and sequence value for each file in the same resource, it is recommended that you do to avoid any unexpected problems. If you leave repeated resource_label and sequence values blank in following rows, the value in the resource_label column on the line starting the new resource will be used, and the value of the sequence will be assumed to be the same until it is incremented.
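Because content metadata is built from the manifest alone, a typo or an omitted file goes unnoticed unless the manifest is compared with the object folders. Here is a rough sketch of such a comparison, assuming each filename is a path relative to its druid-named folder; check_manifest_against_disk is a hypothetical helper, and it is no substitute for running a discovery report.

```python
import csv
import tempfile
from pathlib import Path


def check_manifest_against_disk(manifest_path, content_root):
    """Report filenames that are listed but missing, or present but unlisted."""
    listed = {}  # druid -> set of filenames from the manifest
    with open(manifest_path, newline="") as handle:
        for row in csv.DictReader(handle):
            listed.setdefault(row["druid"], set()).add(row["filename"])

    report = {"listed_but_missing": [], "present_but_unlisted": []}
    for druid in sorted(listed):
        folder = Path(content_root) / druid
        on_disk = set()
        if folder.is_dir():
            # .md5 sidecar files are not listed in the manifest, so skip them.
            on_disk = {path.relative_to(folder).as_posix()
                       for path in folder.rglob("*")
                       if path.is_file() and path.suffix != ".md5"}
        for name in sorted(listed[druid] - on_disk):
            report["listed_but_missing"].append((druid, name))
        for name in sorted(on_disk - listed[druid]):
            report["present_but_unlisted"].append((druid, name))
    return report


# Demonstration: the manifest says file245.jpg, but the folder holds file254.jpg.
with tempfile.TemporaryDirectory() as root:
    root = Path(root)
    (root / "vd000bj0000").mkdir()
    (root / "vd000bj0000" / "file123.jpg").write_bytes(b"x")
    (root / "vd000bj0000" / "file254.jpg").write_bytes(b"x")
    manifest = root / "file_manifest.csv"
    manifest.write_text("druid,filename\n"
                        "vd000bj0000,file123.jpg\n"
                        "vd000bj0000,file245.jpg\n")
    print(check_manifest_against_disk(manifest, root))
```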

Example

A real-world example of a file manifest matching https://purl.stanford.edu/mp830rp2787 is:

druid,filename,resource_label,sequence,publish,preserve,shelve,resource_type,role
mp830rp2787,mp830rp2787_SC1473_s2_Hayman_Warren_Video_pm.mov,"Video file",1,no,yes,no,video
mp830rp2787,mp830rp2787_SC1473_s2_Hayman_Warren_Video_sl.mp4,"Video file",1,yes,yes,yes,video
mp830rp2787,mp830rp2787_SC1473_s2_Hayman_Warren_Video_thumb.jp2,"Video file",1,yes,yes,yes,video,thumbnail
mp830rp2787,mp830rp2787_SC1473_s2_Hayman_Warren_Audio_pm.wav,"Audio file",2,no,yes,no,audio
mp830rp2787,mp830rp2787_SC1473_s2_Hayman_Warren_Audio_sl.m4a,"Audio file",2,yes,yes,yes,audio
mp830rp2787,mp830rp2787_SC1473_s2_Hayman_Warren_Release.pdf,"Release",3,no,yes,no,file
mp830rp2787,mp830rp2787_SC1473_s2_Hayman_Warren_Transcript.pdf,"Transcript",4,yes,yes,yes,file,transcription

In this example you can see that:

  • for the resource labeled "Video file"
    • its resource type is "video"
    • its sequence number is 1
    • it has three files
    • the MOV file is the high-quality archival copy and is stored only in preservation
    • the MP4 file is the streaming copy and is available on the PURL and stored in preservation
    • the JP2 file is the thumbnail that appears over the video player as the default image when the player loads
  • for the resource labeled "Audio file"
    • its resource type is "audio"
    • its sequence number is 2
    • it has two files
    • the WAV file is the high-quality archival copy and is stored only in preservation
    • the M4A file is the streaming copy and is available on the PURL and stored in preservation
  • for the resource labeled "Release"
    • its resource type is "file"
    • its sequence number is 3
    • it has only one file
    • this PDF is the release form; it is stored only in preservation and so is not available to the public on the PURL
  • for the resource labeled "Transcript"
    • its resource type is "file"
    • its sequence number is 4
    • it has only one file
    • this PDF is the transcript and is available on the PURL and stored in preservation

Note that this is only an example to describe how media accessioning works. Many combinations are possible, depending on the nature of your content. The media player requires at least one video or audio resource. An object can contain only video or only audio, and it can contain more than one resource of each type.
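To see how rows turn into resources, the example manifest can be grouped by druid and sequence number. This is an illustrative sketch only: Resource and group_resources are names invented here, and the fill-down of blank sequence values follows the behavior described under Caveats rather than pre-assembly's actual code.

```python
import csv
import io
from dataclasses import dataclass, field

EXAMPLE = """\
druid,filename,resource_label,sequence,publish,preserve,shelve,resource_type,role
mp830rp2787,mp830rp2787_SC1473_s2_Hayman_Warren_Video_pm.mov,"Video file",1,no,yes,no,video
mp830rp2787,mp830rp2787_SC1473_s2_Hayman_Warren_Video_sl.mp4,"Video file",1,yes,yes,yes,video
mp830rp2787,mp830rp2787_SC1473_s2_Hayman_Warren_Video_thumb.jp2,"Video file",1,yes,yes,yes,video,thumbnail
mp830rp2787,mp830rp2787_SC1473_s2_Hayman_Warren_Audio_pm.wav,"Audio file",2,no,yes,no,audio
mp830rp2787,mp830rp2787_SC1473_s2_Hayman_Warren_Audio_sl.m4a,"Audio file",2,yes,yes,yes,audio
mp830rp2787,mp830rp2787_SC1473_s2_Hayman_Warren_Release.pdf,"Release",3,no,yes,no,file
mp830rp2787,mp830rp2787_SC1473_s2_Hayman_Warren_Transcript.pdf,"Transcript",4,yes,yes,yes,file,transcription
"""


@dataclass
class Resource:
    sequence: int
    label: str
    resource_type: str
    files: list = field(default_factory=list)


def group_resources(text):
    """Group manifest rows into resources keyed by (druid, sequence)."""
    resources = {}
    sequence = None
    for row in csv.DictReader(io.StringIO(text)):
        # A blank sequence means "same resource as the previous row".
        if (row.get("sequence") or "").strip():
            sequence = int(row["sequence"])
        if sequence is None:
            raise ValueError("the first row needs a sequence value")
        key = (row["druid"], sequence)
        if key not in resources:
            # Label and type come from the row that starts the resource.
            resources[key] = Resource(sequence, row["resource_label"],
                                      row["resource_type"])
        resources[key].files.append(row["filename"])
    return resources


for resource in group_resources(EXAMPLE).values():
    print(resource.sequence, resource.label, resource.resource_type,
          len(resource.files))
```

Run against the example, this yields four resources holding three, two, one, and one files, matching the breakdown above.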

Accessioning content in batches

Content can be accessioned in batches. For a batch of druids, the file_manifest.csv must have a line for each file to be accessioned. This list must correspond to the list of objects in the manifest.csv. Furthermore, the object folders must be named for the corresponding druids.
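For a batch, one quick check is that both CSVs refer to the same set of druids. The sketch below assumes that manifest.csv identifies objects in a column named druid, which this page does not state; the helper names are hypothetical and the druids are sample values.

```python
import csv
import io


def druids_in(text, column="druid"):
    """Collect the distinct non-blank values of one column from CSV text."""
    return {row[column] for row in csv.DictReader(io.StringIO(text))
            if row.get(column)}


def compare_batches(manifest_text, file_manifest_text):
    """Report druids that appear in only one of the two manifests."""
    objects = druids_in(manifest_text)          # assumed "druid" column
    with_files = druids_in(file_manifest_text)
    return {
        "no_files_listed": sorted(objects - with_files),
        "not_in_manifest": sorted(with_files - objects),
    }


# Sample data; a real manifest.csv has more columns than shown here.
manifest_text = "druid\nvd000bj0000\nmp830rp2787\n"
file_manifest_text = (
    "druid,filename,resource_label,sequence,publish,preserve,shelve,resource_type\n"
    "vd000bj0000,file123.jpg,Image 1,1,yes,yes,yes,image\n"
    "oo000oo0001,file245.jpg,Image 1,1,yes,yes,yes,image\n"
)
print(compare_batches(manifest_text, file_manifest_text))
```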

Discovery reports

Please run a discovery report before accessioning any complex content. The discovery report will help identify possible mismatches between the manifest CSVs and the files to be accessioned.
