Content Staging
"Staging" content is the process of organizing files for deposit. Depending on the volume and complexity of your data, and the specifics of your workflow, the staging process could consist of:
- Copying folders and files to a shared mount using a file manager (such as the Mac's Finder or Windows' Explorer)
- Uploading folders and files to the Globus file transfer service
- Copying folders and files to a shared mount using SFTP or command-line tools
There is one prerequisite to staging your content: you must have a digital repository identifier for each item you are depositing or updating. If you do not yet have these identifiers, you must create them through a process called registration.
No matter how you transfer files into the SDR, when staging your files you should always follow the same organizational structure:
- Create a single folder to contain all items in your deposit
- Within that folder, create an individual folder for each item
- Each folder should contain the file or files to be deposited for that item
- Create a manifest file that lists all of the items and folders to be deposited
  - This manifest file must be a CSV and must be named manifest.csv
- If necessary, create a second manifest file called a "file manifest" that lists all files in the batch of items being deposited
  - This file must be a CSV and must be named file_manifest.csv
  - While not required for every Preassembly deposit, the file manifest is required when:
    - Depositing media items
    - Applying custom settings to specific files in a deposit
    - Updating existing items
Instructions for creating the manifest file(s) are given below.
In this example, each of the folders with names like jh486mk1405 represents a single SDR item. Each item consists of a single image: the TIFF file inside that item's folder. Note that the manifest.csv is placed "alongside" the item folders within the larger deposit folder.
[folder containing the entire batch]/
├── jh486mk1405
│ └── ErbWE.tif
├── manifest.csv
├── mw438sy2326
│ └── AkremiF.tif
├── sv928qy8859
│ └── BerrySmithJ.tif
├── vb063xr4527
│ └── RoachJM.tif
├── vz805mb3344
│ └── ShillinglawDT.tif
├── ww805gw8199
│ └── RobertsCE.tif
└── yg789dz9935
  └── RobertsCET.tif
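A layout like the one above can be spot-checked from the command line before a job is submitted. The following is a minimal sketch, not part of Preassembly itself: the function name empty_item_folders is hypothetical, and it assumes the batch folder is reachable as a normal directory. It prints the name of every item folder that contains no files.

```shell
# Hypothetical helper: list item folders in a batch that contain no files.
# Usage: empty_item_folders <batch_dir>
empty_item_folders() (
  for dir in "$1"/*/; do
    # Skip the unexpanded pattern when the batch has no sub-folders.
    [ -d "$dir" ] || continue
    # find prints nothing when the folder holds no regular files.
    if [ -z "$(find "$dir" -type f | head -n 1)" ]; then
      basename "$dir"
    fi
  done
)
```

Running empty_item_folders on the batch folder should print nothing when every item folder has at least one file in it.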
Some users have found the following advice helpful for automating the organization of files into folders using the Bash/Linux command line.
- To prepare content from the command line on the mounted drive, create two files in the folder that contains the content folder, either by uploading them or by using a command-line text editor such as nano. The first file should be called druids.txt and contain a list of the druids being prepared, one per line. The second should be called filenames.txt and contain one druid-filename pair per line, separated by a tab.
  - Both druids.txt and filenames.txt should have Unix (\n) style line endings. If the file was created in a Windows program that uses \r\n, use the dos2unix command to convert the line endings to \n. For a file from a Mac program that uses \r, use the mac2unix command from the same package instead:
    dos2unix /path/to/file.txt
- To create the sub-folders (the -p option keeps mkdir from failing if a folder already exists):
  while read -r druid; do mkdir -p content/"$druid"; done <druids.txt
- To move the files into the corresponding druid directories, substitute the path to the folder containing the files in the following:
  while IFS=$'\t' read -r druid filename; do mv /{stagingMount}/{projectFolder}/{fileFolder}/"$filename" content/"$druid"; done <filenames.txt
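Before running the move loop above, it can be worth confirming that filenames.txt is well-formed and that every file it names is actually present. The sketch below is one possible way to do that. The function name preflight_check and its two arguments are hypothetical, and it assumes the tab-separated druid-filename format described above.

```shell
# Hypothetical pre-flight check for the druids.txt / filenames.txt workflow.
# Usage: preflight_check <filenames.txt> <source_dir>
# Prints one line per problem found and returns non-zero if there is any.
preflight_check() (
  list="$1"
  src_dir="$2"
  problems=0
  cr=$(printf '\r')
  tab=$(printf '\t')

  # Carriage returns indicate Windows (\r\n) or classic Mac (\r) line endings.
  if grep -q "$cr" "$list"; then
    echo "line endings: $list contains carriage returns"
    problems=1
  fi

  # Each line is "druid<TAB>filename"; check that the source file exists.
  while IFS="$tab" read -r druid filename || [ -n "$druid" ]; do
    [ -n "$druid" ] || continue
    if [ ! -f "$src_dir/$filename" ]; then
      echo "missing: $filename (for $druid)"
      problems=1
    fi
  done < "$list"

  exit "$problems"
)
```

A call such as preflight_check filenames.txt /{stagingMount}/{projectFolder}/{fileFolder} makes no changes to any files, so it is safe to run repeatedly until it reports no problems.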
A manifest.csv file is required for all Preassembly deposits. This file has a dual purpose: it is an inventory of all items to be deposited in a single batch, and it is a device used to match druid identifiers to specific folders.
The manifest.csv file always follows the same structure:
- the column headings are druid and object
- the first column is a list of druids (without the "druid:" prefix)
- the second column is the name of the folder corresponding to the item connected with the druid on the same row
A manifest.csv for the batch of items in the staging example above would look like this:
druid,object
mw438sy2326,mw438sy2326
sv928qy8859,sv928qy8859
jh486mk1405,jh486mk1405
vb063xr4527,vb063xr4527
ww805gw8199,ww805gw8199
yg789dz9935,yg789dz9935
vz805mb3344,vz805mb3344
In this case, the folders are each named for a druid. Please note that the folder name does not necessarily have to be a druid, but it is often easier to process a Preassembly job when each folder is named for a druid, as that leaves no doubt as to which items correspond to which folders.
That said, there may be projects where folders are created for items prior to the creation of unique item identifiers (the druids). In that situation, it may be possible to leave the folder names alone rather than rename them en masse when the druids are created. This can be helpful when working with third-party vendors who may not know the druids for a set of items, for example.
The manifest.csv for a batch of items where folder names follow a non-druid pattern could look like:
druid,object
{druid},Pamphlet 01
{druid},Pamphlet 02
In this example, the folders with the files in them are named "Pamphlet 01" and "Pamphlet 02". The manifest.csv tells Preassembly to look in these folders for the files matching the repository identifiers in the first column.
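Because the second column can hold any folder name, a typo there is easy to miss. The following sketch checks that every folder named in manifest.csv exists next to it. The function name check_manifest_folders is hypothetical, and the parsing assumes a simple two-column CSV with no quoted fields or embedded commas.

```shell
# Hypothetical helper: confirm every folder named in manifest.csv exists.
# Assumes simple CSV: two columns, no quoted fields or embedded commas.
# Usage: check_manifest_folders <batch_dir>
check_manifest_folders() (
  batch_dir="$1"
  # Skip the header row, then split each remaining row on the comma.
  tail -n +2 "$batch_dir/manifest.csv" | {
    missing=0
    while IFS=, read -r druid object || [ -n "$druid" ]; do
      [ -n "$druid" ] || continue
      if [ ! -d "$batch_dir/$object" ]; then
        echo "no folder \"$object\" for $druid"
        missing=1
      fi
    done
    exit "$missing"
  }
)
```

Folder names containing spaces, such as "Pamphlet 01", are handled because each row is split only on the comma.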
As noted above, the file_manifest.csv is an additional manifest file that lists every file in a deposit, in contrast to the manifest.csv, which lists only the folders and items in a deposit. The manifest.csv file is adequate for deposits that make use of Preassembly's default settings for processing simple items, such as images, books, PDF documents, and files. The manifest.csv is not adequate for more complex processing where you need more granular control over how each file is treated. In those cases, a file_manifest.csv must be included in the deposit, as this file manifest contains instructions for how each file should be processed.
Types of deposit where a file manifest is required:
- Media deposits (audio and/or video) or deposits of other complex types of content (such as disk images) (detailed instructions)
- Updates to existing objects where you are adding or modifying specific files (detailed instructions)
- Anything else requiring customized metadata, such as providing captions for images (detailed instructions)
Detailed instructions for how to prepare file manifests for these scenarios are provided on their own wiki pages, linked from the list above or in the documentation sidebar.
- Make sure each druid is listed only once
- Do not leave any lines blank
- Make sure to include the "druid,object" header
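The three checks above can be automated. This is a minimal sketch under stated assumptions: the function name lint_manifest is hypothetical, and it treats the file as a simple two-column CSV without quoted fields.

```shell
# Hypothetical helper that checks the three tips above for a manifest.csv.
# Usage: lint_manifest <path/to/manifest.csv>
lint_manifest() (
  manifest="$1"
  status=0

  # The first line must be the "druid,object" header.
  if [ "$(head -n 1 "$manifest")" != "druid,object" ]; then
    echo "header: first line is not druid,object"
    status=1
  fi

  # No line may be empty or contain only whitespace.
  if grep -q '^[[:space:]]*$' "$manifest"; then
    echo "blank: file contains blank lines"
    status=1
  fi

  # Each druid (first column, header skipped) may appear only once.
  dupes=$(tail -n +2 "$manifest" | cut -d, -f1 |
    grep -v '^[[:space:]]*$' | sort | uniq -d)
  if [ -n "$dupes" ]; then
    # Unquoted on purpose: prints one line per duplicated druid.
    printf 'duplicate: %s\n' $dupes
    status=1
  fi

  exit "$status"
)
```

The function prints nothing and returns zero when the manifest passes all three checks, so it can be used in a condition such as: lint_manifest content/manifest.csv && echo "manifest looks fine".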
Some users have found the following advice helpful for automating the creation of the manifest.csv using the Bash/Linux command line.
It is also possible to generate the manifest from a list of druids via the command line. With a file called druids.txt containing one druid per line in the current folder:
sed 's/\(^.*$\)/\1,\1/' <druids.txt >>content/manifest.csv
This will generate a file called manifest.csv within the content folder, but without the druid,object header, which needs to be added to the file manually. Because >> appends to the file, delete any existing content/manifest.csv before running the command again.
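As a variation on the command above, the header can be written in the same step so that no manual edit is needed. The sketch below is an assumption-laden alternative, not the documented method: the function name make_manifest is hypothetical, it assumes each item folder is named for its druid, and it overwrites the output file rather than appending to it.

```shell
# Hypothetical helper: write manifest.csv, header included, from druids.txt.
# Assumes each item folder is named for its druid.
# Usage: make_manifest <druids.txt> <output manifest.csv>
make_manifest() {
  {
    echo "druid,object"
    # Drop blank lines, then repeat each druid as "druid,druid".
    sed -e '/^[[:space:]]*$/d' -e 's/^\(.*\)$/\1,\1/' "$1"
  } > "$2"
}
```

For example, make_manifest druids.txt content/manifest.csv replaces any existing manifest in the content folder with a freshly generated one.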
Preassembly has been set up to "get" files from certain shared storage locations, known as "staging locations". Once files are placed on a staging location, Preassembly then accesses those files and copies them into the SDR. This makes it possible to process batches of items by placing the items in a single staging location rather than having to upload them individually through the browser.
Preassembly is integrated with multiple servers in the library, which means that you have a choice as to where to stage your content:
- Departmental shared mount
  - Staff in certain departments have shared file storage mounts that are connected to Preassembly. These mounts are generally accessible from staff computers (personal or shared workstations) in those departments
  - The full list of these storage mounts can be found on Consul at: content mount paths
  - If you are a member of one of these departments and are unsure of how to use these mounts, please contact the Repository Manager
- Globus
  - All staff can use Globus, which is a file transfer service managed by the library and the university
  - Globus supports both browser-based uploads and file syncing through a client application
  - Globus also integrates with Stanford Google Drive accounts, making it possible to deposit content from Google Drive without having to download it first
- File storage on the Preassembly server itself
  - Access to this server is via the command line or SFTP only
The choice of a staging location ultimately comes down to a number of factors:
- If you are a member of a department with access to a shared mount, you likely already have access to your department's staging location without needing to install or configure any additional software
- If you are not a member of a department with access to a shared mount, you must use Globus or the Preassembly server
- If you do not want to use SFTP or the command-line, you should use Globus
- If you have content on Google Drive, you should use Globus
- If no other option works best for you, you can still use the Preassembly server