Skip to content

Latest commit

 

History

History
38 lines (30 loc) · 2.9 KB

README.md

File metadata and controls

38 lines (30 loc) · 2.9 KB

convert_transcript_to_tei

python script to turn transcript into a complete TEI document - work in progress

Two sample files are provided:

  • transcript of Syriac translation of Cyprian's letter to Quintus from ms. Paris, Bibliothèque nationale de France, syr. 62
  • output.xml

Basic idea

Easy instrument for scholars to make a transcription of a manuscript and turn it to a ready xml/TEI document. The idea is loosely based on https://github.com/PatristicTextArchive/transcription_dsl , see also: dx.doi.org/10.1515/zac-2020-0019 (thanks!).

Planned

Current features

  • reading in key metadata
  • converting line breaks, page breaks, punctuation (needs refinement), rubrics, paragraphs, marginal notes

Basic Documentation

The idea is to type/transcribe a manuscript in, i.e., Microsoft Word or LibreOffice. This will likely provide better support for special fonts and a more readable layout (individual line spacing etc.). At the moment, the script is designed with a focus on Syriac, specifically regarding punctuation. A more generic approach will be implemented in the future.

For all explanations below, cf. the sample file (Cyprian-to-Quintus_transcript-from-ms-Paris-syr-62.txt), if necessary.

(1) Metadata

At the beginning of the document, the following metadata must be provided, each in a single line:

  • Author, wrapped in <a></a>
  • Title, wrapped in <t></t>
  • Siglum, wrapped in <s></s> -- In the sample file, the siglum chosen is rather verbose. It's recommended not to use spaces, as the siglum will be used as an xml id.
  • Transcriber, wrapped in <tr></tr> -- Provide full name of person who transcribed the manuscript, i.e. "Mary S. Black".
  • Initials of transcriber, wrapped in <tri></tri> -- Provide initials of that person, i.e. "msb". This will also be turned into an xml id.

(2) Transcription

The transcription follows in a new line right after the metadata. Currently, the following features can be transcribed:

  • Rubrics, wrapped in <r></r>
  • Paging, transcribed as, e.g. |fol. 34v| -- For the vertical bar (|), also known as pipe, see https://www.computerhope.com/jargon/p/pipe.htm#pipe. The expression fol. must be followed by a space.
  • Linebreaks, transcribed simply as a new line. -- This helps in proof reading against the manuscript.
  • Marginal notes, wrapped in <mx></m>, where x is either t, r, l, b, or a, which stand for top, right, left, bottom or above, respectively. Note that x is not repeated in the closing tag.
  • End of paragraph, transcribed as // -- This will be converted to </p><p>. Be aware to properly nest elements (see https://www.w3schools.com/xml/xml_syntax.asp under "XML Elements Must be Properly Nested"). For example, in the sample .txt-file provided, the last // must follow the closing rubric tag </r>. Otherwise, the rubric would begin within a paragraph but not end there.