Improve the readability of the Peydurma, the reasoning going something along the lines of: "It is great, but it is not friendly for a regular reader. Plus, a lot of the notes are minor spelling mistakes/divergences that can very easily be dealt with, thus greatly reducing the noise and improving readability."
Text (docx) + notes (xlsx) file pairs for every text in the Nalanda corpus.
The regular user-friendly version of each text in two formats (docx and layered text)
- input all the Peydurma notes for the whole collection in xlsx files
- get a clean Derge version of each text in the collection
- mark it with the note numbers so as to have a "copy" of the Peydurma files
- reinsert the marks in the text
- check for evident spelling mistakes and select the right spelling (noise reduction)
- apply heuristics to discard unnecessary notes, ideally keeping only the notes that affect the understanding of a given passage (noise reduction)
- go from syllable-based notes to word-based notes to improve readability (and modify the Peydurma enough to avoid copyright problems; leverages pytib)
- improve the note format to increase readability
- provide a layered version of the final document for an online platform where users/scholars will be able to give feedback and choose between notes
Where: Files are a bit scattered all over the place here and here.
The Esukhia team in Dharamsala has produced the following for every text:
- text.docx contains:
  - the raw text
  - page numbers in the Derge edition
  - note information from the Peydurma
- text.xlsx contains:
  - the note marks corresponding to what is found in text.docx
  - the note content: what every edition says
  - notes added by the Esukhia staff (to be detected and ignored)
  - formatting (color/background/size), which has simply been ignored
Issues:
- note number discrepancies: the sizes of the docx and xlsx files don't correspond, ...
  - strategy: see where the scripts fail, then manually detect and correct the files
- the notes added by Esukhia staff have to be identified and ignored
  - strategy implemented by Rabten: put all of them in column F, since there can't possibly be any note that far
- files were produced as the work went along, so the naming conventions are not consistent, and my attempt to create a unified format met opposition from the Tibetans.
As shown in this drawing and in lines 1 to 10 of this table, the docx and xlsx files have to be converted to txt and csv files using this script. The output files are then copied into canon-notes/1-a-reinsert_notes/input.
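For reference, here is a minimal sketch of what such a conversion could look like, assuming python-docx and openpyxl; the actual conversion script may differ:

```python
# Sketch of a docx/xlsx -> txt/csv dump (assumed libraries, not the real script).
import csv
from pathlib import Path

from docx import Document           # pip install python-docx
from openpyxl import load_workbook  # pip install openpyxl

def docx_to_txt(docx_path, txt_path):
    """Dump the paragraphs of a .docx file into a plain .txt file."""
    doc = Document(docx_path)
    text = "\n".join(p.text for p in doc.paragraphs)
    Path(txt_path).write_text(text, encoding="utf-8")

def xlsx_to_csv(xlsx_path, csv_path):
    """Dump the first sheet of a .xlsx file into a .csv file."""
    wb = load_workbook(xlsx_path, read_only=True)
    ws = wb.active
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for row in ws.iter_rows(values_only=True):
            writer.writerow(["" if cell is None else cell for cell in row])
```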
Issues:
- csv_contains_pardrang.py: detects and deletes an additional column that would otherwise prevent the correct execution of reinsertion.py.
- reinsertion.py: loops over every pair of txt+csv files and attempts to reconstruct the passage in each of the existing editions for a given note. The aim is ultimately to reconstruct digital versions of all the editions from the txt file using the notes.
output/comparison_xls/ was the first attempt at reinsertion. It has better formatting than its yaml counterparts in conc_yaml/, but it is not used in the workflow.
output/unified_structure/ contains a version of every text segmented into syllables.
Every syllable covered by a note is a yaml dict with the editions as keys and a list of syllables as values. It is used as the single source of truth for the whole workflow from that point onwards.
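To illustrate the idea (not the exact on-disk schema), a noted syllable could look like this, and any edition can be rebuilt from the structure; the edition names and the Tibetan syllables below are made up for the example:

```python
# Illustration of the unified_structure idea: a text is a list of syllables,
# and a syllable covered by a note becomes a dict mapping each edition to its
# list of syllables.
import yaml  # pip install pyyaml

sample = """
- བཀྲ
- ཤིས
- derge: [བདེ]
  peking: [བདེའི]
- ལེགས
"""

def reconstruct(edition, syllables):
    """Rebuild one edition's text from the mixed list of strings and note dicts."""
    out = []
    for syl in syllables:
        if isinstance(syl, dict):
            out.extend(syl.get(edition, []))  # nothing if the edition has no reading
        else:
            out.append(syl)
    return "་".join(out) + "་"

syllables = yaml.safe_load(sample)
print(reconstruct("peking", syllables))
```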
The rationale is to automatise as much as possible the process of categorising notes and enriching them with the relevant information, so that the human reviewer of each note has the maximum information at hand without having to resort to external sources such as dictionaries, verb lexicons, etc.
This implies that the categorisation is then modified/enhanced by humans before moving on to the next step, but this has never happened.
First, segment.py is run to produce a segmented version of the different editions, which will then be individually checked for spelling mistakes, etc. (the whole categorisation script).
I hoped that for each file the segmentation would be reviewed by a human, to reduce bad segmentations as much as possible and thus reduce false positives and inaccuracies in the categorisation (which is based on words rather than syllables). I have processed a few files that way, but it implies a lot of manual work.
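For readers unfamiliar with word segmentation, here is a toy longest-match illustration of going from syllables to words; this is not pytib's API, and the word list is a placeholder:

```python
# Greedy longest-match segmentation over a list of syllables (illustration only).
WORDLIST = {"བཀྲ་ཤིས", "བདེ་ལེགས"}  # hypothetical entries

def segment(syllables, wordlist=WORDLIST, max_len=4):
    words, i = [], 0
    while i < len(syllables):
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = "་".join(syllables[i:i + n])
            if n == 1 or candidate in wordlist:  # single syllables always match
                words.append(candidate)
                i += n
                break
    return words

print(segment(["བཀྲ", "ཤིས", "བདེ", "ལེགས"]))
# ['བཀྲ་ཤིས', 'བདེ་ལེགས']
```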
categorisation.py includes many different things, among them the computation of n-gram frequencies in the Kangyur for all the segmentation mistakes detected by pytib.
The rationale is explained here.
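As a rough sketch of the n-gram frequency part, assuming the Kangyur is available as a directory of plain-text files (paths and n are placeholders):

```python
# Count how often a syllable sequence occurs across a reference corpus.
import re
from collections import Counter
from pathlib import Path

def ngram_frequencies(corpus_dir, n=2):
    counts = Counter()
    for path in Path(corpus_dir).glob("*.txt"):
        text = path.read_text(encoding="utf-8")
        syllables = [s for s in re.split(r"[་།\s]+", text) if s]
        for i in range(len(syllables) - n + 1):
            counts["་".join(syllables[i:i + n])] += 1
    return counts

# A reading that never or almost never occurs in the corpus is a hint that it
# is a spelling mistake rather than a genuine variant.
```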
parse_json.py creates the "revision interface" that presents the alternatives of a given note with left and right context (word-based, not syllable-based as in the docx and xlsx or in the Peydurma). It also includes some information gathered during the automatic categorisation and provides a column where the reviewer puts their final decision for a given note.
Notes in a text are grouped by type so the reviewing process is more effective and comparing similar notes or adjusting decisions is easier.
parse_json.py also produces an updated version of the "unified_structure" from 1-a-reinsert_notes/output/unified_structure.
An example is here (or any other tab of the same document; I processed them manually). The choices are encoded as the letters of DUCK in the "new" column.
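A hedged sketch of what one row of that revision interface could look like; the column names and context window are illustrative, not the actual parse_json.py output format:

```python
def review_row(note_id, left_words, right_words, readings, window=5):
    """readings: dict mapping an edition name to its reading for this note."""
    return {
        "note": note_id,
        "left": " ".join(left_words[-window:]),
        **readings,                 # e.g. {"derge": "...", "peking": "..."}
        "right": " ".join(right_words[:window]),
        "new": "",                  # to be filled in by the reviewer (D/U/C/K)
    }
```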
An unimplemented attempt to automatise the DUCKing process based on heuristics is described here. The most difficult choices would then be left to scholars, while the easiest ones could be applied automatically, thus achieving the aim of reducing the noise of unnecessary notes.
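Purely as an illustration of that unimplemented idea, a heuristic could decide the trivial cases and leave the rest blank for a scholar; the variant table and the rules below are made up, and only K = "keep note" is documented here:

```python
# Hypothetical auto-DUCKing sketch: decide only the obviously safe cases.
TRIVIAL_VARIANTS = {frozenset({"བརྟན", "བསྟན"})}  # made-up pair of easily confused spellings

def auto_decision(readings, default="K"):
    """readings: dict of edition -> reading for one note; returns a DUCK letter or None."""
    distinct = set(readings.values())
    if len(distinct) <= 1:
        return default              # no real divergence: safe to decide automatically
    if frozenset(distinct) in TRIVIAL_VARIANTS:
        return default              # known minor spelling divergence
    return None                     # genuinely ambiguous: leave for a scholar
```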
create_ducked.py takes as input the output of parse_json.py in output/antconc_format/ and applies a K(eep note) default choice, mimicking what I have done manually in the Google Spreadsheet, so that the workflow can continue.
0-1_apply_note_choices.py: applies the note choices in the DUCKed file onto the updated_structure to have the final edition.
0-2_rawify.py: creates a version of the text without any formatting, segmented on shad and spaces (for later comparison)
1-1_unmark.py: creates a version of the text without note marks, so as not to hinder segmentation and spell checking in the subsequent step (this and 0-2_rawify are sketched after this list)
1-2_segment_final.py: uses pytib to find any newly introduced spelling errors/inaccuracies and proposes pre-formatted notes in case we want to correct the spelling.
1-3_copy_post_seg.py: glueware to ensure manually modified files do not overwrite the ones generated automatically.
2-1_reinsert_a.py: gets the line breaks ("a"s) from the Derge Kangyur (eKangyur) and applies them to our final edition.
3-1_reinsert_page_nums.py: takes the data from resources/དཀར་ཆག་ཀུན་གསལ་མེ་ལོང་། - format example.csv (which contains the pagination information compiled from various sources by NT) and infers the exact page start location, number and side in the Derge edition (wanted by the Geshe at Namgyal Tratsang). This implies that the title of each work corresponds to the name of our file, so it requires some manual work.
3-3_copy_final.py: glueware that puts the result of the whole workflow in output/3-3-final/ (any file here, for example)
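As an aside, here are minimal sketches of 1-1_unmark.py and 0-2_rawify.py; the note-mark syntax ("(12)") and the exact output format are assumptions, and the real scripts may differ:

```python
import re

def unmark(text):
    """Strip inline note marks such as (1), (23) so they do not disturb segmentation."""
    return re.sub(r"\(\d+\)", "", text)

def rawify(text):
    """Drop formatting and segment on shad (།) and spaces, one chunk per line."""
    return "\n".join(c for c in re.split(r"[།\s]+", text) if c)
```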
Files here provide statistics oriented towards estimating the risk of copyright problems (the amount of similar notes), as well as a visual representation of the text where syllables are converted to dots and the modified syllables bear the letter of the DUCKing decision that was made.
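A possible sketch of that dot visualisation, assuming the input is a list of (syllable, decision) pairs (the format is an assumption for the example):

```python
# Every syllable becomes ".", except syllables touched by a note, which show
# the DUCK letter chosen for them.
def dot_view(syllables, width=80):
    """syllables: list of (syllable, decision_or_None) tuples."""
    dots = "".join(decision if decision else "." for _, decision in syllables)
    return "\n".join(dots[i:i + width] for i in range(0, len(dots), width))

print(dot_view([("བཀྲ", None), ("ཤིས", "K"), ("བདེ", None), ("ལེགས", None)]))
# .K..
```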
Contains a .sh script to convert the final files into docx files with page and side marks, and with the notes reformatted and DUCKed.
Any question or remark is welcome!
The latest branch of canon-notes includes all the data for the whole Nalanda corpus, while the nagarjuna branch focuses on this specific author. The latter was meant as a working basis with Rabten, but the Indian internet connection forced me to create a "code" branch that is stripped of all data files.