Python script to turn a transcript into a complete TEI document (work in progress)
Two sample files are provided:
- transcript of Syriac translation of Cyprian's letter to Quintus from ms. Paris, Bibliothèque nationale de France, syr. 62
- output.xml
An easy tool for scholars to transcribe a manuscript and turn the transcription into a ready-made XML/TEI document. The idea is loosely based on https://github.com/PatristicTextArchive/transcription_dsl ; see also dx.doi.org/10.1515/zac-2020-0019 (thanks!).
- combination with Collatex (https://wiki.tei-c.org/index.php/CollateX)
- reading in key metadata
- converting line breaks, page breaks, punctuation (needs refinement), rubrics, paragraphs, marginal notes
The idea is to type/transcribe a manuscript in, e.g., Microsoft Word or LibreOffice. This will likely provide better support for special fonts and a more readable layout (individual line spacing etc.). At the moment, the script is designed with a focus on Syriac, specifically regarding punctuation. A more generic approach will be implemented in the future.
For all explanations below, cf. the sample file (Cyprian-to-Quintus_transcript-from-ms-Paris-syr-62.txt), if necessary.
At the beginning of the document, the following metadata must be provided, each on a single line:
- Author, wrapped in <a></a>
- Title, wrapped in <t></t>
- Siglum, wrapped in <s></s>
-- In the sample file, the siglum chosen is rather verbose. It is recommended not to use spaces, as the siglum will be used as an xml id.
- Transcriber, wrapped in <tr></tr>
-- Provide the full name of the person who transcribed the manuscript, e.g. "Mary S. Black".
- Initials of transcriber, wrapped in <tri></tri>
-- Provide the initials of that person, e.g. "msb". This will also be turned into an xml id.
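To illustrate the header format, here is a minimal sketch of how the five metadata lines could be read. It is not the script's actual code: the names `HEADER_TAGS`, `parse_header` and `to_xml_id` are hypothetical, and the id clean-up is a simplified assumption rather than a full check against the XML naming rules.

```python
import re

# Hypothetical mapping from the header tags described above to readable keys.
HEADER_TAGS = {
    "a": "author",
    "t": "title",
    "s": "siglum",
    "tr": "transcriber",
    "tri": "initials",
}


def parse_header(text):
    """Return (metadata, body) for a transcript that starts with the header lines."""
    lines = text.splitlines()
    metadata = {}
    consumed = 0
    for line in lines:
        # A header line consists of exactly one tag pair, e.g. <a>Cyprian</a>.
        match = re.fullmatch(r"\s*<(\w+)>(.*?)</\1>\s*", line)
        if not match or match.group(1) not in HEADER_TAGS:
            break  # first line that is not metadata: the transcription starts here
        metadata[HEADER_TAGS[match.group(1)]] = match.group(2).strip()
        consumed += 1
    missing = [key for key in HEADER_TAGS.values() if key not in metadata]
    if missing:
        raise ValueError("missing header fields: " + ", ".join(missing))
    return metadata, "\n".join(lines[consumed:])


def to_xml_id(value):
    """Simplified assumption: replace whitespace and avoid a leading non-letter."""
    cleaned = re.sub(r"\s+", "_", value.strip())
    if not re.match(r"[A-Za-z_]", cleaned):
        cleaned = "_" + cleaned
    return cleaned
```

With such a helper, a siglum typed as "Paris syr 62" would become the id "Paris_syr_62", which shows why a siglum without spaces is the safer choice in the first place.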
The transcription follows on a new line right after the metadata. Currently, the following features can be transcribed:
- Rubrics, wrapped in <r></r>
- Paging, transcribed as, e.g., |fol. 34v|
-- For the vertical bar (|), also known as pipe, see https://www.computerhope.com/jargon/p/pipe.htm#pipe. The expression fol. must be followed by a space.
- Line breaks, transcribed simply as a new line.
-- This helps in proofreading against the manuscript.
- Marginal notes, wrapped in <mx></m>, where x is either t, r, l, b, or a, which stand for top, right, left, bottom, or above, respectively. Note that x is not repeated in the closing tag.
- End of paragraph, transcribed as //
-- This will be converted to </p><p>. Take care to nest elements properly (see https://www.w3schools.com/xml/xml_syntax.asp under "XML Elements Must be Properly Nested"). For example, in the sample .txt file provided, the last // must follow the closing rubric tag </r>. Otherwise, the rubric would begin within a paragraph but not end there.
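The conversion rules above, and the nesting problem in particular, can be illustrated with a short sketch. It is not the script's actual implementation. Only the conversion of // to </p><p> is taken from the description; the target elements chosen here (`<hi rend="rubric">`, `<pb n="..."/>`, `<lb/>`, `<note place="...">`) and the function names `convert_body` and `is_well_formed` are assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Letters used in the opening tag of a marginal note and their meaning.
PLACES = {"t": "top", "r": "right", "l": "left", "b": "bottom", "a": "above"}


def convert_body(body):
    """Convert the transcription markup into TEI-like elements (assumed mapping)."""
    text = body.strip()
    # Marginal notes: <mt>...</m>, <mr>...</m> etc.; the letter is not repeated on closing.
    text = re.sub(
        r"<m([trlba])>",
        lambda m: '<note place="%s">' % PLACES[m.group(1)],
        text,
    )
    text = text.replace("</m>", "</note>")
    # Rubrics.
    text = text.replace("<r>", '<hi rend="rubric">').replace("</r>", "</hi>")
    # Page breaks such as |fol. 34v|; "fol." must be followed by a space.
    text = re.sub(r"\|fol\. ([^|\n]+)\|", r'<pb n="\1"/>', text)
    # End of paragraph.
    text = text.replace("//", "</p><p>")
    # Every new line of the transcript marks the beginning of a manuscript line.
    text = text.replace("\n", "\n<lb/>")
    result = "<p>" + text + "</p>"
    # Drop the empty paragraph left behind by a final //.
    return re.sub(r"<p>\s*</p>", "", result)


def is_well_formed(fragment):
    """Check that the converted fragment parses as XML once wrapped in a root element."""
    try:
        ET.fromstring("<body>" + fragment + "</body>")
        return True
    except ET.ParseError:
        return False
```

Under these assumptions, a transcript ending in `<r>...</r>//` passes the well-formedness check, whereas `<r>...//</r>` fails it, because the paragraph would be closed while the rubric is still open.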