Content Staging
"Staging" content is the process of organizing files for deposit. Depending on the volume and complexity of your data, and the specifics of your workflow, the staging process could consist of:
- Copying folders and files to a shared mount using a file manager (such as the Mac's Finder or Windows' Explorer)
- Uploading folders and files to the Globus file transfer service
- Copying folders and files to a shared mount using SFTP or command-line tools
There is one prerequisite to staging your content: you must have a digital repository identifier for each item you are depositing or updating. If you do not yet have these identifiers, you must create them through a process called registration.
No matter how you transfer files into the SDR, when staging your files you should always follow the same organizational structure:
- Create a single folder to contain all items in your deposit
- Within that folder, create an individual folder for each item
- Each folder should contain the file or files to be deposited for that item
- Create a manifest file that lists all of the items and folders to be deposited
- This manifest file must be a CSV and must be named manifest.csv
- If necessary, create a second manifest file called a "file manifest" that lists all files in the batch of items being deposited
- This file must be a CSV and must be named file_manifest.csv
- While not required for every Preassembly deposit, the file manifest is required when:
- Depositing media items
- Applying custom settings to specific files in a deposit
- Updating existing items
Instructions for creating the manifest file(s) are given below.
In this example, each of the folders with names like jh486mk1405 represents a single SDR item. Each item consists of a single image: the TIFF file inside the item's folder. Note that the manifest.csv is placed "alongside" the item folders within the larger deposit folder.
[folder containing the entire batch]/
├── jh486mk1405
│   └── ErbWE.tif
├── manifest.csv
├── mw438sy2326
│   └── AkremiF.tif
├── sv928qy8859
│   └── BerrySmithJ.tif
├── vb063xr4527
│   └── RoachJM.tif
├── vz805mb3344
│   └── ShillinglawDT.tif
├── ww805gw8199
│   └── RobertsCE.tif
└── yg789dz9935
    └── RobertsCET.tif
Some users have found the following advice helpful for automating the organization of files into folders using the Bash/Linux command line.
- To prepare content from the command line on the mounted drive, create two files in the folder that contains the content folder, either by uploading them or by using a command-line text editor such as nano. The first file should be called druids.txt and contain a list of the druids being prepared, one per line. The second should be called filenames.txt and contain one druid-filename pair per line, separated by a tab.
- Both druids.txt and filenames.txt should have Unix-style (\n) line endings. If a file was created in a Windows program that uses \r\n, use the dos2unix command to convert the line endings to \n. For a file from an older Mac program that uses \r, use the mac2unix command, which is installed alongside dos2unix:
dos2unix /path/to/file.txt
- To create the sub-folders:
while read -r druid; do mkdir content/"$druid"; done <druids.txt
- To move the files from the content directory into the corresponding druid directories, substitute the path of the folder containing the files in the following:
while read -r druid filename; do mv /{stagingMount}/{projectFolder}/{fileFolder}/"$filename" content/"$druid"; done <filenames.txt
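Because the mv loop above acts on every row of filenames.txt, it can help to confirm first that each listed file is actually present. The following is a minimal sketch of such a check, not part of Preassembly; the function name preflight and the demonstration folder are made up for illustration, and the druid-filename pairs are borrowed from the staging example.

```shell
# Sketch: list every row in filenames.txt whose file is absent from the source folder.
# "preflight" is a hypothetical helper name.
preflight() {
  local source_dir="$1"
  local filenames_file="$2"
  local druid filename
  # Each row is druid<TAB>filename; splitting on tab keeps spaces in filenames intact.
  while IFS="$(printf '\t')" read -r druid filename; do
    [ -f "$source_dir/$filename" ] || echo "missing: $filename (for $druid)"
  done <"$filenames_file"
}

# Demonstration in a throwaway directory: one file present, one absent.
demo_dir="$(mktemp -d)"
mkdir "$demo_dir/files"
: >"$demo_dir/files/ErbWE.tif"
printf 'jh486mk1405\tErbWE.tif\nmw438sy2326\tAkremiF.tif\n' >"$demo_dir/filenames.txt"
preflight "$demo_dir/files" "$demo_dir/filenames.txt"
```

If the function prints nothing, every file named in filenames.txt was found and the move can proceed.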
A manifest.csv file is required for all Preassembly deposits. This file has a dual purpose: it is an inventory of all items to be deposited in a single batch, and it is a device used to match druid identifiers to specific folders.
The manifest.csv file always follows the same structure:
- the column headings are druid and object
- the first column is a list of druids (without the "druid:" prefix)
- the second column is the name of the folder corresponding to the item connected with the druid on the same row
A manifest.csv for the batch of items in the staging example above would look like this:
druid,object
mw438sy2326,mw438sy2326
sv928qy8859,sv928qy8859
jh486mk1405,jh486mk1405
vb063xr4527,vb063xr4527
ww805gw8199,ww805gw8199
yg789dz9935,yg789dz9935
vz805mb3344,vz805mb3344
In this case, the folders are each named for a druid. Please note that the folder name does not necessarily have to be a druid, but it is often easier to process a Preassembly job when each folder is named for a druid, as that leaves no doubt as to which items correspond to which folders.
That said, there may be projects where folders are created for items prior to the creation of unique item identifiers (the druids). In that situation, it may be possible to leave the folder names alone rather than rename them en masse when the druids are created. This can be helpful when working with third-party vendors who may not know the druids for a set of items, for example.
The manifest.csv for a batch of items where folder names follow a non-druid pattern could look like:
druid,object
{druid},Pamphlet 01
{druid},Pamphlet 02
In this example, the folders with the files in them are named "Pamphlet 01" and "Pamphlet 02". The manifest.csv tells Preassembly to look in these folders for the files matching the repository identifiers in the first column.
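Since Preassembly looks up each folder by the name in the object column, a quick check that those folders exist can catch spelling mismatches before a job is run. The following is a minimal sketch rather than an official tool; the function name list_missing_folders is hypothetical, the pairing of druids with "Pamphlet" folders is invented for the demonstration, and the manifest is assumed to use Unix line endings.

```shell
# Sketch: report manifest rows whose "object" folder does not exist in the batch folder.
# "list_missing_folders" is a hypothetical helper name.
list_missing_folders() {
  local batch_dir="$1"
  # Skip the druid,object header, then split each row at the first comma.
  tail -n +2 "$batch_dir/manifest.csv" | while IFS=, read -r druid object || [ -n "$druid" ]; do
    [ -d "$batch_dir/$object" ] || echo "$druid: no folder named '$object'"
  done
}

# Demonstration in a throwaway directory: only the first folder is created.
demo_dir="$(mktemp -d)"
mkdir "$demo_dir/Pamphlet 01"
printf 'druid,object\nmw438sy2326,Pamphlet 01\nsv928qy8859,Pamphlet 02\n' >"$demo_dir/manifest.csv"
list_missing_folders "$demo_dir"
```

An empty result means every folder named in the manifest was found alongside it.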
As noted above, the file_manifest.csv is an additional manifest file that lists every file in a deposit, in contrast to the manifest.csv, which lists only the folders and items in a deposit. The manifest.csv file is adequate for deposits that make use of Preassembly's default settings for processing simple items, such as images, books, PDF documents, and files. The manifest.csv is not adequate for more complex processing where you need more granular control over how each file is treated. In those cases, a file_manifest.csv must be included in the deposit, as this file manifest contains instructions for how each file should be processed.
Types of deposit where a file manifest is required:
- Media deposits (audio and/or video) or deposits of other complex types of content (such as disk images) (detailed instructions)
- Updates to existing objects where you are adding or modifying specific files (detailed instructions)
- Anything else requiring customized metadata, such as providing captions for images (detailed instructions)
Detailed instructions for how to prepare file manifests for these scenarios are provided on their own wiki pages, linked from the list above or in the documentation sidebar.
- Make sure each druid is listed only once
- Do not leave any lines blank
- Make sure to include the "druid,object" header
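The three checks above can be automated. The following is a minimal sketch under the assumption that the manifest uses Unix line endings (a Windows-style header line would be reported as missing); the function name check_manifest is hypothetical and not part of Preassembly.

```shell
# Sketch: check a manifest.csv for the header, blank lines, and repeated druids.
# "check_manifest" is a hypothetical helper name.
check_manifest() {
  local manifest="$1"
  local rc=0
  local dupes
  # The first line must be exactly the required header.
  if [ "$(head -n 1 "$manifest")" != "druid,object" ]; then
    echo "missing druid,object header"
    rc=1
  fi
  # No line may be empty or contain only whitespace.
  if grep -q '^[[:space:]]*$' "$manifest"; then
    echo "blank line found"
    rc=1
  fi
  # Each druid (first column, header row excluded) may appear only once.
  dupes="$(tail -n +2 "$manifest" | cut -d, -f1 | sort | uniq -d)"
  if [ -n "$dupes" ]; then
    echo "duplicate druid: $dupes"
    rc=1
  fi
  return "$rc"
}

# Demonstration in a throwaway directory.
demo_dir="$(mktemp -d)"
printf 'druid,object\nmw438sy2326,mw438sy2326\nsv928qy8859,sv928qy8859\n' >"$demo_dir/good.csv"
printf 'druid,object\nmw438sy2326,a\n\nmw438sy2326,b\n' >"$demo_dir/bad.csv"
check_manifest "$demo_dir/good.csv" && echo "good.csv passes"
check_manifest "$demo_dir/bad.csv" || echo "bad.csv fails"
```

The function returns a non-zero status when any check fails, so it can be used as a gate in a larger staging script.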
Some users have found the following advice helpful for automating the creation of the manifest.csv using the Bash/Linux command line.
The manifest can be generated from a list of druids. With a file called druids.txt containing one druid per line in the current folder:
sed 's/\(^.*$\)/\1,\1/' <druids.txt >>content/manifest.csv
This will append one row per druid to a file called manifest.csv within the content folder (creating the file if it does not already exist), but without the druid,object header, which needs to be added to the file manually.
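As a variation on the sed command, the header can be written first so that no manual edit is needed. This is a minimal sketch, assuming every folder is named for its druid; the function name build_manifest is hypothetical.

```shell
# Sketch: build manifest.csv, header included, from a list of druids.
# "build_manifest" is a hypothetical helper name.
build_manifest() {
  local druids_file="$1"
  local manifest="$2"
  # Write the required header first, replacing any earlier manifest.
  printf 'druid,object\n' >"$manifest"
  # Drop Windows carriage returns, skip blank lines, and emit "druid,druid" rows.
  tr -d '\r' <"$druids_file" | awk 'NF { print $1 "," $1 }' >>"$manifest"
}

# Demonstration in a throwaway directory.
demo_dir="$(mktemp -d)"
mkdir "$demo_dir/content"
printf 'mw438sy2326\nsv928qy8859\n' >"$demo_dir/druids.txt"
build_manifest "$demo_dir/druids.txt" "$demo_dir/content/manifest.csv"
cat "$demo_dir/content/manifest.csv"
```

Unlike the append-only sed command, this version overwrites the manifest on each run, so running it twice does not produce duplicate rows.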
Preassembly has been set up to "get" files from certain shared storage locations, known as "staging locations". Once files are placed on a staging location, Preassembly then accesses those files and copies them into the SDR. This makes it possible to process batches of items by placing the items in a single staging location rather than having to upload them individually through the browser.
Preassembly is integrated with multiple servers in the library, which means that you have a choice as to where to stage your content:
- Departmental shared mount
- Staff in certain departments have shared file storage mounts that are connected to Preassembly. These mounts are generally accessible from staff computers (personal or shared workstations) in those departments
- The full list of these storage mounts can be found on Consul at: content mount paths
- If you are a member of one of these departments and are unsure of how to use these mounts, please contact the Repository Manager
- Globus
- All staff can use Globus, which is a file transfer service managed by the library and the university
- Globus supports both browser-based uploads and file syncing through a client application
- Globus also integrates with Stanford Google Drive accounts, making it possible to deposit content from Google Drive without having to download it first
- File storage on the Preassembly server itself
- Access to this server is via the command line or SFTP only
The choice of a staging location ultimately comes down to a number of factors:
- If you are a member of a department with access to a shared mount, you likely already have access to your department's staging location without needing to install or configure any additional software
- If you are not a member of a department with access to a shared mount, you must use Globus or the Preassembly server
- If you do not want to use SFTP or the command line, you should use Globus
- If you have content on Google Drive, you should use Globus
- If no other option works best for you, you can still use the Preassembly server
If you are a member of one of the following departments, you may be able to use a departmental file share as a Preassembly staging source:
- Archive of Recorded Sound
- Maps
- Music Library
- Special Collections and University Archives
Due to the nature of file shares, the exact name of the share on your workstation or laptop may vary. Under Windows, it may be mounted as a specific drive letter (like "Z:") rather than as a named location like "/ars" (for Archive of Recorded Sound and Music Library materials). Please review the list of staging locations on Consul to see if one of them applies to your situation, and contact the Repository Manager if you have questions.
Once you've identified a staging location to use with Preassembly, the final step for staging your content is to copy your files to the staging location, making sure that you follow the guidelines above regarding manifests and folder organization. You can of course organize your files after copying them to the staging location. The important point is that the files must be organized and present on a staging location before you can run Preassembly.
Globus is a file transfer service that is available to anyone with an active Stanford SUNet ID. Similar to cloud file management services, Globus supports either uploading files using the browser or uploading files by installing a client application on your machine. Unlike commercial cloud services, the library's installation of Globus is managed by the library itself.
To use Globus with Preassembly, follow these steps:
- Go to the Preassembly application and click on the button labeled "Request Globus Link".
- This will create a location in Globus where you can place your files.
- It will also open a new browser tab showing that location.
- Note: you may be prompted to allow popups from the Preassembly website in order to view the link
- Transfer files to the Globus location, making sure that you follow the guidelines above regarding manifests and folder organization
- Detailed descriptions of how to copy files into the Globus staging location will be documented on their own page
- Once your files have been staged in Globus, you are ready to move on to Preassembly
If neither a departmental share nor Globus will work for you, you have the option to connect directly to the Preassembly server. Doing so requires some familiarity with SFTP or command-line copying tools like rsync, and requires special configuration. Instructions for connecting to Preassembly are available on Consul.
Once on the Preassembly server:
- Start at this path: /dor/staging
- Create a folder for yourself if you haven't done so already.
- This is a shared location, so it's better if you keep all your work within a folder associated with your name or SUNet ID.
- Example: [my-name]-accessions, a folder at the full path /dor/staging/[my-name]-accessions
- Using SFTP or rsync, copy your manifests and item folders onto that path
- Example: If I had a folder named "pamphlets-batch-1" that contained item folders and a manifest, then I would end up with a folder at /dor/staging/[my-name]-accessions/pamphlets-batch-1 as my staging location
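For the rsync route, the copy could look roughly like the following. This is only a sketch: the host name, the user name, and the use of a SUNet ID as the folder prefix are placeholders and assumptions, and the real connection details are the ones given in the Consul instructions. The script only prints the command so that it can be reviewed before being run.

```shell
# Sketch: assemble an rsync command for copying one batch folder to the staging path.
# All three values below are placeholders, not real server details.
remote_host="preassembly.example.org"   # hypothetical host name
sunet_id="jdoe"                         # hypothetical SUNet ID
batch="pamphlets-batch-1"               # local folder holding item folders and manifest.csv

dest="/dor/staging/${sunet_id}-accessions/${batch}/"
# -a copies folders recursively and keeps timestamps; -v lists each file as it is copied.
# The trailing slash on the source copies the folder's contents into dest.
rsync_cmd="rsync -av ${batch}/ ${sunet_id}@${remote_host}:${dest}"
echo "$rsync_cmd"
```

Adding rsync's -n (dry run) option to the printed command shows what would be transferred without copying anything, which is a cheap way to confirm the destination path first.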