This repository is a sandbox in which to prototype tools for cleanup, transformation, and validation of data curated by editors of the Digital Index of Middle English Verse (DIMEV). Files are for testing only: researchers interested in Middle English verse should consult dimev.net. Commentary is welcome.
The repository also hosts source files for an experimental new DIMEV website, built with Jekyll and hosted by GitHub Pages. All this is very much work in progress. An inspiration is Andrew Dunning's prototype for a digital edition of Richard Sharpe, A Handlist of Latin Writers of Great Britain and Ireland Before 1540.
artefacts/
Warnings, reports, and CSV artefacts of the scripts in scripts/. Transformed source data are written instead to docs/ for use by the Jekyll website builder.
docs/
Source files and templates for a website. The contents of docs/_items/ are written by scripts/transform-Records.py.
schemas/
JSON schemas for validation of transformed source files.
scripts/
Python scripts for review and transformation of the files in the dimev repository. For details see comments at the head of each file. Scripts presume that the dimev repository has been cloned to a directory sibling to this one.
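The sibling-directory assumption for the dimev repository can be illustrated with a short path helper. This is only a sketch, not the method the scripts actually use: the function name `sibling_repo` is hypothetical, and it assumes the root of this repository can be recognised by the presence of a `.git` entry.

```python
from pathlib import Path


def sibling_repo(start: Path, name: str = "dimev") -> Path:
    """Return the expected path of a repository cloned beside this one.

    Hypothetical helper: walks upward from `start` until it finds a
    directory containing `.git`, then returns `<parent of that root>/<name>`.
    It does not check that the sibling directory actually exists.
    """
    here = start.resolve()
    for candidate in (here, *here.parents):
        if (candidate / ".git").exists():
            return candidate.parent / name
    raise FileNotFoundError(f"no repository root found at or above {start}")
```

A script living in scripts/ might call `sibling_repo(Path(__file__).parent)` and then open files beneath the returned directory, failing early with a clear message if the clone is missing.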
The following is a summary of plans for DIMEV data. A fuller treatment is provided in the Technical Introduction.
Records.xml will be atomized (one file per <record>) to make effective use of git distributed version control. Data will be parsed to identify irregularities, remediated (manually where necessary), and written to a new consistent structure. For instance, any field that may be an array must be an array (even if an array of one). After migration, subsequent updates to any file must validate against a schema. Early prototypes of data files are in docs/_items/. An early prototype of the schema is schemas/records.json. Cross references (i.e., those <record> items without an @xml:id) will be handled differently, tbd.
Manuscripts.xml and MSSIndex.xml will be de-duplicated. Data will be atomized (one file per <item>), parsed, remediated, and written to a new consistent structure. For an early partial prototype, see the output of scripts/transform-Manuscripts.py.
Inscriptions.xml and PrintedBooks.xml will be handled similarly. After migration, subsequent updates to any file must validate against a schema.
Bibliography.xml. Data will be parsed and remediated (as above), written to a standard bibliographic data format, and imported to Zotero for distribution and curation on that platform. For a prototype of this conversion, see artefacts/bibliography.yaml; the schema is schemas/csl-data.json. To import tags we must target a format other than CSL JSON, per this discussion. Tags will be used to link bibliographic items to their objects, as in the Bodleian Library's bibliographical references for Western manuscripts. Links to on-line facsimiles of manuscripts will be handled differently, probably as a field within the data structure for manuscripts.
Glossary.xml: tbd.
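The plan for Records.xml (one file per <record>, array-valued fields always written as arrays, cross references set aside) can be sketched as follows. This is an illustration under stated assumptions, not the logic of scripts/transform-Records.py: the child element names `title` and `author`, the `ARRAY_FIELDS` set, the JSON output format, and the file-naming scheme are all hypothetical.

```python
import json
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:id attribute under the XML namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical: fields that may repeat, so are always written as arrays.
ARRAY_FIELDS = {"author"}


def record_to_dict(record):
    """Flatten one <record> into a dict, forcing ARRAY_FIELDS to lists."""
    data = {"id": record.get(XML_ID)}
    for child in record:
        text = (child.text or "").strip()
        if child.tag in ARRAY_FIELDS:
            # An array even when there is only one value.
            data.setdefault(child.tag, []).append(text)
        else:
            data[child.tag] = text
    return data


def atomize(xml_text):
    """Split a records document into per-record JSON texts.

    Returns (files, cross_refs): `files` maps a hypothetical file name to
    JSON text; `cross_refs` collects <record> elements lacking an xml:id,
    which are left for separate handling.
    """
    root = ET.fromstring(xml_text)
    files, cross_refs = {}, []
    for record in root.iter("record"):
        if record.get(XML_ID) is None:
            cross_refs.append(record)
            continue
        data = record_to_dict(record)
        files[f"{data['id']}.json"] = json.dumps(data, ensure_ascii=False, indent=2)
    return files, cross_refs
```

Returning the file contents instead of writing them keeps the sketch free of side effects; a real script would write each entry to disk and then validate it against the relevant schema in schemas/.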