Darwin Score is a method to evaluate text representing a natural history bio-collection object. The scores calculated in this process give a rough evaluation of the quality of the text, whether it was generated by human transcription or OCR. Initially, this will focus on labels and annotations applied to herbarium collections. The process will check the text for expected words, including taxonomic names, collector names, annotator names, location names, and common abbreviations (all stored in purpose-built dictionaries), and for patterns such as dates, numbers, and geocoordinates. The text is given a score based on the number of matches found.
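The dictionary-matching idea described above can be sketched roughly as follows. This is only an illustration, not the code in darwinscore.py: the tiny `EXPECTED_WORDS` set, the `tokenize` and `score_words` functions, and the choice to return a (matches, total) pair are all assumptions made for the example.

```python
import re

# Hypothetical miniature dictionary; the real dictionaries are much larger
# and are split by category (taxonomic names, collectors, locations, ...).
EXPECTED_WORDS = {"quercus", "alba", "herbarium", "county", "texas", "coll", "det"}


def tokenize(text):
    """Split text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def score_words(text, dictionary=EXPECTED_WORDS):
    """Return (matches, total): tokens found in the dictionary, and all tokens."""
    tokens = tokenize(text)
    matches = sum(1 for token in tokens if token in dictionary)
    return matches, len(tokens)


clean = "Quercus alba Texas, Brazos County Coll. J. Smith"
noisy = "Qvercvs a1ba Tcxas, Brazos Connty Co11. J. Smith"

print(score_words(clean))
print(score_words(noisy))
```

Returning the token total alongside the match count is a design choice for this sketch: it lets a caller compare a short label with a long one, which a raw match count alone would not allow.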
The included dictionaries have been developed by testing Darwin Score on a set of data from OCRed and transcribed herbarium labels. These dictionaries are sure to be incomplete, and you will have to modify them to include the words that you want to match when running Darwin Score. I'll be refining and adding some of the scripts I've written that help identify new words for inclusion in the dictionaries.
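One simple way to find candidate words for the dictionaries is to count the tokens that fail to match across a batch of labels and review the most frequent ones by hand. The sketch below assumes that approach; `candidate_words` and its `min_count` parameter are hypothetical and are not the scripts mentioned above.

```python
import re
from collections import Counter


def candidate_words(texts, dictionary, min_count=2):
    """Return (word, count) pairs for unmatched tokens seen at least min_count times."""
    counts = Counter()
    for text in texts:
        for token in re.findall(r"[a-z]+", text.lower()):
            if token not in dictionary:
                counts[token] += 1
    # most_common() sorts by descending count, so frequent candidates come first.
    return [(word, n) for word, n in counts.most_common() if n >= min_count]


labels = [
    "Quercus alba Brazos County",
    "Quercus stellata Brazos County",
    "Qvercus alba",
]
known = {"quercus", "alba", "county"}

print(candidate_words(labels, known))
```

Words that recur across many labels are more likely to be real vocabulary than one-off OCR errors, which is why the sketch filters on a minimum count.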
Darwin Score currently has only two very simple regex patterns: one matches dates and the other matches numbers. Both need improvement, and other patterns need to be added to improve the accuracy of Darwin Score. In particular, there are many date formats that won't match the included pattern, and no geolocation patterns have been added yet. Eventually I hope to build a robust pattern repository at https://github.com/jbest/regex-repo and use some or all of the patterns submitted there.
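To illustrate what date and number patterns might look like, here is a minimal sketch. These are not the patterns shipped with Darwin Score: `MONTH`, `DATE_PATTERNS`, `NUMBER_PATTERN`, and `count_pattern_matches` are assumptions for the example, and they cover only three date formats.

```python
import re

# Month names or abbreviations, with an optional trailing period.
MONTH = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?"

# Hypothetical date patterns; many real label formats are not covered.
DATE_PATTERNS = [
    re.compile(r"\b\d{4}-\d{1,2}-\d{1,2}\b"),                              # 1990-03-04
    re.compile(r"\b\d{1,2}\s+" + MONTH + r"\s+\d{4}\b", re.IGNORECASE),    # 12 June 1987
    re.compile(r"\b" + MONTH + r"\s+\d{1,2},?\s+\d{4}\b", re.IGNORECASE),  # June 12, 1987
]

# Integers and simple decimals.
NUMBER_PATTERN = re.compile(r"\b\d+(?:\.\d+)?\b")


def count_pattern_matches(text):
    """Return (dates, numbers) found in text."""
    dates = 0
    remainder = text
    for pattern in DATE_PATTERNS:
        # Blank out each date so its digits are not counted again as numbers.
        remainder, found = pattern.subn(" ", remainder)
        dates += found
    numbers = len(NUMBER_PATTERN.findall(remainder))
    return dates, numbers


label = "Collected 12 June 1987, elev. 450 m, No. 2231; det. 1990-03-04"
print(count_pattern_matches(label))
```

Removing matched dates before counting numbers is one way to avoid scoring the same characters twice; whether double counting matters depends on how the final score is weighted.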
python darwinscore.py
To use with your own input files, create an input file containing the paths of the files you wish to score, then provide the path of that file in the darwinscore.py script. Yes, this is kludgy. Yes, this will change.
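Reading such an input file could look something like the sketch below. It assumes one path per line, with blank lines and lines starting with `#` ignored; that layout, and the `read_path_list` and `load_texts` helpers, are assumptions for illustration and may not match what darwinscore.py expects.

```python
from pathlib import Path


def read_path_list(list_file):
    """Return the non-blank, non-comment lines of list_file as Path objects."""
    paths = []
    for line in Path(list_file).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            paths.append(Path(line))
    return paths


def load_texts(list_file):
    """Yield (path, text) for each listed file that exists; skip missing files."""
    for path in read_path_list(list_file):
        if path.is_file():
            # errors="replace" keeps badly encoded OCR output from raising.
            yield path, path.read_text(encoding="utf-8", errors="replace")
```

Each (path, text) pair could then be passed to whatever scoring function is in use, for example `for path, text in load_texts("inputs.txt"): ...`, where `inputs.txt` is a hypothetical file name.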
Please let me know your thoughts about how you would like to use this to score your own data, and how the scores should be output or stored.
I'll be adding some of the tools I used to generate and maintain the dictionaries. See the TODO list for details.