Andrew Berger edited this page Oct 19, 2023 · 12 revisions

Staging Content

"Staging" content is the process of organizing files for deposit. Depending on the volume and complexity of your data, and the specifics of your workflow, the staging process could consist of:

  • Copying folders and files to a shared mount using a file manager (such as the Mac's Finder or Windows' Explorer)
  • Uploading folders and files to the Globus file transfer service
  • Copying folders and files to a shared mount using SFTP or command-line tools

There is one prerequisite to staging your content: you must have a digital repository identifier (druid) for each item you are depositing or updating. If you do not yet have these identifiers, you must create them through a process called registration.

1. Organizing your files

No matter how you transfer files into the SDR, when staging your files you should always follow the same organizational structure:

  • Create a single folder to contain all items in your deposit
  • Within that folder, create an individual folder for each item
    • Each folder should contain the file or files to be deposited for that item
  • Create a manifest file that lists all of the items and folders to be deposited
    • This manifest file must be a CSV and must be named manifest.csv
  • If necessary, create a second manifest file called a "file manifest" that lists all files in the batch of items being deposited
    • This file must be a CSV and must be named file_manifest.csv
    • While not required for every Preassembly deposit, the file manifest is required when:
      • Depositing media items
      • Applying custom settings to specific files in a deposit
      • Updating existing items

Instructions for creating the manifest file(s) are given below.

Example of a deposit folder for a batch of image items

In this example:

Each of the folders with names like jh486mk1405 represents a single SDR item. In this case, each item consists of a single image: the TIFF file inside that item's folder. Note that the manifest.csv is placed "alongside" the item folders within the larger deposit folder.

[folder containing the entire batch]/
├── jh486mk1405
│   └── ErbWE.tif
├── manifest.csv
├── mw438sy2326
│   └── AkremiF.tif
├── sv928qy8859
│   └── BerrySmithJ.tif
├── vb063xr4527
│   └── RoachJM.tif
├── vz805mb3344
│   └── ShillinglawDT.tif
├── ww805gw8199
│   └── RobertsCE.tif
└── yg789dz9935
    └── RobertsCET.tif

Using the command line to batch-organize files into folders

Some users have found the following advice helpful for automating the organization of files into folders using the Bash/Linux command line.

  • To prepare content from the command line on the mounted drive, create two files in the folder that contains the content folder, either by uploading them or by using a command-line text editor such as nano. The first file should be called druids.txt and contain a list of the druids being prepared, one per line. The second should be called filenames.txt and contain one druid-filename pair per line, separated by a tab.

    • Both druids.txt and filenames.txt should have Unix (\n) style line endings. If the file was created in a Windows program that uses \r\n, use the dos2unix command to convert the line endings to \n: dos2unix /path/to/file.txt. If it was created in a Mac program that uses \r, use the companion mac2unix command instead: mac2unix /path/to/file.txt

    • To create the sub-folders: while read -r druid; do mkdir -p content/"$druid"; done <druids.txt

    • To move the files into the corresponding druid directories inside the content directory, substitute the path to the folder containing the files in the following: while read -r druid filename; do mv /{stagingMount}/{projectFolder}/{fileFolder}/"$filename" content/"$druid"; done <filenames.txt
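The two loops above can be combined into one reusable function. The following is a minimal sketch, not an official Preassembly tool: the function name organize_batch and its four arguments are hypothetical, and it assumes filenames.txt uses a single tab between the druid and the filename.

```shell
# Sketch only: organize_batch and its arguments are hypothetical names.
# Usage: organize_batch DRUIDS_FILE FILENAMES_FILE SRC_DIR DEST_DIR
# Creates DEST_DIR/<druid>/ for each druid, then moves each listed file
# from SRC_DIR into the folder for its druid.
organize_batch() (
  druids_file=$1; filenames_file=$2; src_dir=$3; dest_dir=$4
  tab=$(printf '\t')

  # Create one folder per druid, skipping blank lines.
  while IFS= read -r druid || [ -n "$druid" ]; do
    [ -n "$druid" ] || continue
    mkdir -p "$dest_dir/$druid"
  done <"$druids_file"

  # Each line is "<druid><TAB><filename>"; filenames may contain spaces.
  while IFS="$tab" read -r druid filename || [ -n "$druid" ]; do
    [ -n "$druid" ] || continue
    if [ -f "$src_dir/$filename" ] && [ -d "$dest_dir/$druid" ]; then
      mv -- "$src_dir/$filename" "$dest_dir/$druid/"
    else
      # Report rather than stop, so one bad row does not block the rest.
      echo "skipped $druid: $filename" >&2
    fi
  done <"$filenames_file"
)
```

A call such as organize_batch druids.txt filenames.txt /{stagingMount}/{projectFolder}/{fileFolder} content would reproduce the two commands above. Rows whose file or druid folder cannot be found are reported on standard error and left untouched.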

2. Creating the manifest file(s)

Creating the manifest.csv file

A manifest.csv file is required for all Preassembly deposits. This file has a dual purpose: it is an inventory of all items to be deposited in a single batch, and it is used to match druid identifiers to specific folders.

The manifest.csv file always follows the same structure:

  • the column headings are druid and object
  • the first column is a list of druids (without the "druid:" prefix)
  • the second column is the name of the folder corresponding to the item connected with the druid on the same row

A manifest.csv for the batch of items in the staging example above would look like this:

  druid,object
  mw438sy2326,mw438sy2326
  sv928qy8859,sv928qy8859
  jh486mk1405,jh486mk1405
  vb063xr4527,vb063xr4527
  ww805gw8199,ww805gw8199
  yg789dz9935,yg789dz9935
  vz805mb3344,vz805mb3344

In this case, the folders are each named for a druid. Please note that the folder name does not necessarily have to be a druid, but it is often easier to process a Preassembly job when each folder is named for a druid, as that leaves no doubt as to which items correspond to which folders.

That said, there may be projects where folders are created for items prior to the creation of unique item identifiers (the druids). In that situation, it may be possible to leave the folder names alone rather than rename them en masse when the druids are created. This can be helpful when working with third-party vendors who may not know the druids for a set of items, for example.

The manifest.csv for a batch of items where folder names follow a non-druid pattern could look like:

  druid,object
  {druid},Pamphlet 01
  {druid},Pamphlet 02

In this example, the folders with the files in them are named "Pamphlet 01" and "Pamphlet 02". The manifest.csv tells Preassembly to look in these folders for the files matching the repository identifiers in the first column.
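Because the manifest.csv is what ties each druid to a folder, it can be worth confirming that every folder it names actually exists before running a job. The following is a minimal sketch under stated assumptions: check_object_folders is a hypothetical helper (not part of Preassembly), and it assumes a plain two-column CSV with no quoted fields.

```shell
# Sketch only: check_object_folders is a hypothetical helper name.
# Usage: check_object_folders BATCH_DIR
# Reads BATCH_DIR/manifest.csv and reports object folders that are
# missing or empty. Returns non-zero if any problem was found.
check_object_folders() (
  batch_dir=$1
  # Skip the header row and strip any Windows carriage returns.
  tail -n +2 "$batch_dir/manifest.csv" | tr -d '\r' | {
    status=0
    # Split each row at the first comma: druid, then folder name.
    while IFS=, read -r druid object || [ -n "$druid" ]; do
      [ -n "$druid" ] || continue
      if [ ! -d "$batch_dir/$object" ]; then
        echo "$druid: folder not found: $object"
        status=1
      elif [ -z "$(ls -A "$batch_dir/$object")" ]; then
        echo "$druid: folder is empty: $object"
        status=1
      fi
    done
    exit "$status"
  }
)
```

Run against the deposit folder, for example check_object_folders "/path/to/batch", the function prints nothing when every row points at a non-empty folder. Folder names containing spaces, such as "Pamphlet 01", are handled because only the comma is used as a separator.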

Creating the file_manifest.csv

As noted above, the file_manifest.csv is an additional manifest file that lists every file in a deposit, in contrast to the manifest.csv which lists only the folders and items in a deposit. The manifest.csv file is adequate for deposits that make use of Preassembly's default settings for processing simple items, such as images, books, PDF documents, and files. The manifest.csv is not adequate for more complex processing where you need more granular control over how each file is treated. In those cases, a file_manifest.csv must be included in the deposit, as this file manifest contains instructions for how each file should be processed.

Types of deposit where a file manifest is required:

  • Media deposits (audio and/or video) or deposits of other complex types of content (such as disk images) (detailed instructions)
  • Updates to existing objects where you are adding or modifying specific files (detailed instructions)
  • Anything else requiring customized metadata, such as providing captions for images (detailed instructions)

Detailed instructions for how to prepare file manifests for these scenarios are provided on their own wiki pages, linked from the list above or in the documentation sidebar.

Manifest formatting guidelines

  • Make sure each druid is listed only once
  • Do not leave any lines blank
  • Make sure to include the "druid,object" header
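The three guidelines above can be checked mechanically. The following is a minimal sketch, assuming a plain two-column CSV without quoted fields; check_manifest is a hypothetical helper name and not a Preassembly command.

```shell
# Sketch only: check_manifest is a hypothetical helper name.
# Usage: check_manifest MANIFEST_CSV
# Prints one message per problem and returns non-zero if any were found.
check_manifest() (
  manifest=$1
  status=0

  # The first line must be exactly the expected header.
  header=$(head -n 1 "$manifest" | tr -d '\r')
  if [ "$header" != "druid,object" ]; then
    echo "missing or wrong header: $header"
    status=1
  fi

  # No line may be blank (empty or whitespace only).
  if grep -q '^[[:space:]]*$' "$manifest"; then
    echo "blank line found"
    status=1
  fi

  # Each druid (first column, header excluded) may appear only once.
  dupes=$(tail -n +2 "$manifest" | cut -d, -f1 | sort | uniq -d)
  if [ -n "$dupes" ]; then
    echo "duplicate druids: $dupes"
    status=1
  fi

  exit "$status"
)
```

For example, check_manifest content/manifest.csv prints nothing and returns 0 for a well-formed file.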

Using the command line to create a manifest.csv

Some users have found the following advice helpful for automating the creation of the manifest.csv using the Bash/Linux command line.

The manifest can be generated from a list of druids. With a file called druids.txt containing one druid per line in the current folder, run: sed 's/\(^.*$\)/\1,\1/' <druids.txt >>content/manifest.csv This writes one row per druid to manifest.csv within the content folder, using the druid as the folder name. Because >> appends, running the command a second time will duplicate the rows. The command does not write the druid,object header, which must be added manually as the first line of the file.
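A variation that writes the header as well is sketched below. It assumes, as in the sed command above, that each folder is named for its druid; make_manifest is a hypothetical helper name. As a convenience it also drops blank lines, carriage returns, and any "druid:" prefix.

```shell
# Sketch only: make_manifest is a hypothetical helper name.
# Usage: make_manifest DRUIDS_FILE >manifest.csv
# Writes the header, then one "druid,druid" row per non-blank input line.
make_manifest() {
  printf 'druid,object\n'
  # Strip carriage returns, delete blank lines, remove any "druid:" prefix,
  # then duplicate each druid into both columns.
  tr -d '\r' <"$1" |
    sed -e '/^[[:space:]]*$/d' -e 's/^druid://' -e 's/.*/&,&/'
}
```

Running make_manifest druids.txt >content/manifest.csv overwrites any existing manifest rather than appending to it, so the command is safe to repeat.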

3. Copying folders and files to a staging location

Preassembly has been set up to "get" files from certain shared storage locations, known as "staging locations". Once files are placed on a staging location, Preassembly then accesses those files and copies them into the SDR. This makes it possible to process batches of items by placing the items in a single staging location rather than having to upload them individually through the browser.

Choosing a staging location

Preassembly is integrated with multiple servers in the library, which means that you have a choice as to where to stage your content:

  • Departmental shared mount
    • Staff in certain departments have shared file storage mounts that are connected to Preassembly. These mounts are generally accessible from staff computers (personal or shared workstations) in those departments
    • The full list of these storage mounts can be found on Consul at: content mount paths
    • If you are a member of one of these departments and are unsure of how to use these mounts, please contact the Repository Manager
  • Globus
    • All staff can use Globus, which is a file transfer service managed by the library and the university
    • Globus supports both browser-based uploads and file syncing through a client application
    • Globus also integrates with Stanford Google Drive accounts, making it possible to deposit content from Google Drive without having to download it first
  • File storage on the Preassembly server itself
    • Access to this server is via the command line or SFTP only

The choice of a staging location ultimately comes down to a number of factors:

  • If you are a member of a department with access to a shared mount, you likely already have access to your department's staging location without needing to install or configure any additional software
  • If you are not a member of a department with access to a shared mount, you must use Globus or the Preassembly server
  • If you do not want to use SFTP or the command line, you should use Globus
  • If you have content on Google Drive, you should use Globus
  • If no other option works best for you, you can still use the Preassembly server

Using a departmental share

If you are a member of one of the following departments, you may be able to use a departmental file share as a Preassembly staging source:

  • Archive of Recorded Sound
  • Maps
  • Music Library
  • Special Collections and University Archives

Due to the nature of file shares, the exact name of the share on your workstation or laptop may vary. Under Windows, it may be mounted as a specific drive letter (like "Z:") rather than as a named location like "/ars" (for Archive of Recorded Sound and Music Library materials). Please review the list of staging locations on Consul to see if one of them applies to your situation, and contact the Repository Manager if you have questions.

Once you've identified a staging location to use with Preassembly, the final step for staging your content is to copy your files to the staging location, making sure that you follow the guidelines above regarding manifests and folder organization. You can of course organize your files after copying them to the staging location. The important point is that the files must be organized and present on a staging location before you can run Preassembly.

Using Globus

Globus is a file transfer service that is available to anyone with an active Stanford SUNet ID. Similar to cloud file management services, Globus supports either uploading files using the browser or uploading files by installing a client application on your machine. Unlike commercial cloud services, the library's installation of Globus is managed by the library itself.

To use Globus with Preassembly, follow these steps:

  1. Go to the Preassembly application and click on the button labeled "Request Globus Link".
    • This will create a location in Globus where you can place your files.
    • It will also open a new browser tab showing that location.
    • Note: you may be prompted to allow popups from the Preassembly website in order to view the link
  2. Transfer files to the Globus location, making sure that you follow the guidelines above regarding manifests and folder organization
    • Detailed descriptions of how to copy files into the Globus staging location will be documented on their own page
  3. Once your files have been staged in Globus, you are ready to move on to Preassembly

Using the Preassembly server

If neither a departmental share nor Globus will work for you, you have the option to connect directly to the Preassembly server. Doing so requires some familiarity with SFTP or command-line copying tools like rsync, and requires special configuration. Instructions for connecting to Preassembly are available on Consul.

Once on the Preassembly server:

  1. Start at this path: /dor/staging
  2. Create a folder for yourself if you haven't done so already.
    • This is a shared location, so it's better to keep all your work within a folder associated with your name or SUNet ID.
    • Example: [my-name]-accessions, a folder at the full path /dor/staging/[my-name]-accessions
  3. Using SFTP or rsync, copy your manifests and item folders onto that path
    • Example: If I had a folder named "pamphlets-batch-1" that contained item folders and a manifest, then I would end up with a folder at "/dor/staging/[my-name]-accessions/pamphlets-batch-1" as my staging location
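The copy in step 3 might look like the following with rsync. This is a sketch under stated assumptions only: staging_dest and copy_batch are hypothetical helper names, the "-accessions" folder suffix simply follows the example above, and the server's host name and connection settings are not given here (see the instructions on Consul), so the host is passed in as an argument.

```shell
# Sketch only: staging_dest and copy_batch are hypothetical helper names.

# Usage: staging_dest OWNER BATCH_DIR
# Prints the staging path for a batch, following the example naming above.
staging_dest() {
  printf '/dor/staging/%s-accessions/%s\n' "$1" "$(basename "$2")"
}

# Usage: copy_batch HOST OWNER BATCH_DIR
# Copies the batch folder, manifest included, to the server over SSH.
# -a preserves the folder structure and timestamps; -v lists each file.
# The /dor/staging/OWNER-accessions folder must already exist (step 2).
copy_batch() {
  rsync -av "${3%/}/" "$1:$(staging_dest "$2" "$3")/"
}
```

For instance, staging_dest jdoe ./pamphlets-batch-1 prints /dor/staging/jdoe-accessions/pamphlets-batch-1, which is the path to give Preassembly as the staging location. Adding rsync's --dry-run option to the command in copy_batch shows what would be transferred without copying anything.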