This repository contains the code necessary to reproduce the experiments from the paper
Neural Word Search in Historical Manuscript Collections,
Tomas Wilkinson, Jonas Lindström, Anders Brun
The paper addresses the problem of segmentation-free word spotting in historical manuscripts, where a computer detects words in a collection of manuscript pages and allows a user to search within them.
We provide:
- Trained models
- Instructions for training a model and evaluating on the IAM, Washington, Botany, and Konzilsprotokolle datasets
If you find this code useful in your research, please cite:
```
@article{wilkinson2018neural,
  title={Neural Word Search in Historical Manuscript Collections},
  author={Wilkinson, Tomas and Lindstr{\"o}m, Jonas and Brun, Anders},
  journal={arXiv preprint arXiv:1812.02771},
  year={2018}
}
```
The models are implemented in PyTorch using Python 2.7. To install PyTorch, it is easiest to follow the instructions on their website. To use this repository with the least amount of work, we recommend the Anaconda distribution.
Once PyTorch is installed, you will need a few additional dependencies, which can be installed with:
```
pip install easydict
conda install opencv
```
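To confirm the setup, a quick import check can be run (a minimal sketch; the exact PyTorch install command for your CUDA version is on the PyTorch website):
```
# sanity check: all three dependencies should import without errors
python -c "import torch, cv2, easydict; print(torch.__version__)"
```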
You can download models trained on each dataset by following this link and downloading the zip file corresponding to a particular dataset.
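As a minimal sketch, assuming you have downloaded one of the zip files (the filename below is a placeholder for whichever dataset you chose), unpack it into the models directory:
```
mkdir -p models
# "washington_models.zip" is a placeholder name for the downloaded zip
unzip washington_models.zip -d models/
```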
To evaluate a model, run
```
python test.py -weights models/model_name
```
Or for a Ctrl-F-Mini model, run
```
python test_dtp.py -weights models/model_name
```
There are a few flags relevant to testing in train_opts.py, such as evaluating with 4 folds for the Washington benchmarks.
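The flag names themselves are defined in train_opts.py; one quick way to locate the relevant ones is to search that file, for example:
```
# list flag definitions mentioning folds (e.g., the 4-fold Washington evaluation)
grep -n "fold" train_opts.py
```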
To evaluate on the Botany or Konzilsprotokolle datasets, run the provided evaluation toolkit (download instructions below):
```
python botany_konz_eval/hkws16_competition.py -weights models/model_name
```
If you evaluate the Ctrl-F-Mini, use the dtp_only flag.
If you want to reproduce the results from the paper exactly, download the exact data used here, unzip it into data/reproduce/, and use the reproduce_paper flag.
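As a sketch, assuming the downloaded archive is called reproduce_data.zip (a placeholder name) and that reproduce_paper takes a value like the other flags in this README:
```
mkdir -p data/reproduce
unzip reproduce_data.zip -d data/reproduce/   # placeholder filename
python test.py -weights models/model_name -reproduce_paper 1
```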
To download the IAM dataset, go to this website and register. Then download and unpack the data by running the following commands, substituting your own username and password:
```
mkdir -p data/iam
cd data/iam
wget --user user --password pass http://www.fki.inf.unibe.ch/DBs/iamDB/data/ascii/words.txt
wget --user user --password pass http://www.fki.inf.unibe.ch/DBs/iamDB/data/forms/formsA-D.tgz
wget --user user --password pass http://www.fki.inf.unibe.ch/DBs/iamDB/data/forms/formsE-H.tgz
wget --user user --password pass http://www.fki.inf.unibe.ch/DBs/iamDB/data/forms/formsI-Z.tgz
wget http://www.fki.inf.unibe.ch/DBs/iamDB/tasks/largeWriterIndependentTextLineRecognitionTask.zip
mkdir forms
tar -C forms -xzf formsA-D.tgz
tar -C forms -xzf formsE-H.tgz
tar -C forms -xzf formsI-Z.tgz
unzip largeWriterIndependentTextLineRecognitionTask.zip
cd ../../
```
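As an optional sanity check, you can count the extracted form images; if all three archives unpacked cleanly, the three together should yield the full set of 1,539 IAM forms:
```
# count the IAM form images extracted above
ls data/iam/forms | wc -l
```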
Similarly, for the Washington dataset run
```
mkdir -p data/washington/
cd data/washington
wget http://ciir.cs.umass.edu/downloads/gw/gw_20p_wannot.tgz
tar -xzf gw_20p_wannot.tgz
cd ../../
```
For the Botany or Konzilsprotokolle datasets, run
```
mkdir -p data/botany/
cd data/botany
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/data/Botany_Train_I_PageImages.zip
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/data/Botany_Train_I_WordImages.zip
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/data/Botany_Train_I_XML.zip
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/data/Botany_Train_II_PageImages.zip
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/data/Botany_Train_II_XML.zip
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/data/Botany_Train_III_PageImages.zip
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/data/Botany_Train_III_XML.zip
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/data/Botany_Test_PageImages.zip
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/data/Botany_Test_WordImages.zip
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/data/Botany_Test_QryImages.zip
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/data/Botany_Test_QryStrings.zip
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/data/Botany_Test_GT.zip
cd ../../
```
or
```
mkdir -p data/konzilsprotokolle/
cd data/konzilsprotokolle
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/data/Konzilsprotokolle_Train_I_PageImages.zip
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/data/Konzilsprotokolle_Train_I_WordImages.zip
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/data/Konzilsprotokolle_Train_I_XML.zip
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/data/Konzilsprotokolle_Train_II_PageImages.zip
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/data/Konzilsprotokolle_Train_II_XML.zip
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/data/Konzilsprotokolle_Train_III_PageImages.zip
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/data/Konzilsprotokolle_Train_III_XML.zip
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/data/Konzilsprotokolle_Test_PageImages.zip
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/data/Konzilsprotokolle_Test_WordImages.zip
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/data/Konzilsprotokolle_Test_QryImages.zip
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/data/Konzilsprotokolle_Test_QryStrings.zip
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/data/Konzilsprotokolle_Test_GT.zip
cd ../../
```
Also, download and unzip the evaluation software:
```
cd botany_konz_eval/
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/software/icfhr16kws_evaluation_toolkit.zip
unzip icfhr16kws_evaluation_toolkit.zip
cd ..
```
Next, you can either pre-augment the datasets and save them to H5 data files, which is quicker if you train on a dataset multiple times, or skip this step and go directly to training with on-the-fly augmentation, which will be slower when training multiple models (unless you have many CPU cores). For now, training Ctrl-F-Mini with on-the-fly augmentation is not recommended, as extracting DTP proposals every training iteration is quite slow.
To pre-augment a dataset, run
```
python preprocess_h5.py -dataset washington -augment 1 -suffix augmented
```
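The same script should work for the other datasets by changing the -dataset argument, assuming the dataset names match those used elsewhere in this README, e.g.:
```
# pre-augment IAM instead of Washington
python preprocess_h5.py -dataset iam -augment 1 -suffix augmented
```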
You are now ready to train a model from scratch. However, in the paper we used models pre-trained on the IIIT-HWS 10K dataset for initialization. You can download models pretrained on the IIIT-HWS dataset here and unzip them in the models directory, but you may also train a model yourself by running
```
mkdir -p data/iiit_hws
cd data/iiit_hws
wget http://ocr.iiit.ac.in/data/dataset/iiit-hws/iiit-hws.tar.gz
wget http://ocr.iiit.ac.in/data/dataset/iiit-hws/groundtruth.tar.gz
tar -xzf groundtruth.tar.gz
tar -xzf iiit-hws.tar.gz
cd ../../
python train.py -embedding dct -dataset iiit_hws
```
Since this dataset consists only of segmented word images, we can only do full-page augmentation with it. As such, only on-the-fly augmentation is currently supported for this dataset.
To train a model, run
```
mkdir -p checkpoints/ctrlfnet
mkdir -p checkpoints/ctrlfnet_mini
python train.py -dataset iam -save_id test -weights models/ctrlfnet_dct_iiit_hws.pt
```
To train a model with a preprocessed H5 dataset, add the h5 flag:
```
python train.py -dataset iam -save_id test -h5 1
python train.py -dataset washington -save_id test -h5 1
```
To train a Ctrl-F-Mini model, make sure to use the train_dtp.py file, for example:
```
python train_dtp.py -dataset washington -save_id test -h5 1 -weights models/ctrlfnet_mini_dct_iiit_hws.pt
```
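After training, the resulting checkpoint can be evaluated with the corresponding test script. The path below is hypothetical; the actual filename depends on -save_id and the naming scheme in the code:
```
# evaluate a trained Ctrl-F-Mini checkpoint (path is a placeholder)
python test_dtp.py -weights checkpoints/ctrlfnet_mini/washington_test.pt
```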