neural-word-search

This repository contains the code necessary to reproduce the experiments from the paper

Neural Word Search in Historical Manuscript Collections,

Tomas Wilkinson, Jonas Lindström, Anders Brun

The paper addresses the problem of Segmentation-free Word Spotting in Historical Manuscripts, where a computer detects words on a collection of manuscript pages and allows a user to search within them.

We provide:

If you find this code useful in your research, please cite:

@article{wilkinson2018neural,
  title={Neural Word Search in Historical Manuscript Collections},
  author={Wilkinson, Tomas and Lindstr{\"o}m, Jonas and Brun, Anders},
  journal={arXiv preprint arXiv:1812.02771},
  year={2018}
}

Installation

The models are implemented in PyTorch using Python 2.7. To install PyTorch, follow the instructions on its website. The Anaconda distribution is recommended, as it requires the least setup to use this repository.

Once PyTorch is installed, a few additional dependencies are required. Install them with the commands

pip install easydict 
conda install opencv
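A quick way to confirm that the environment is complete is to try importing each dependency. The sketch below is not part of the repository: the helper name `missing_modules` is hypothetical, and it assumes the usual import names `torch` (PyTorch), `easydict`, and `cv2` (OpenCV).

```python
import importlib


def missing_modules(names):
    """Return the module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            # Covers modules that are absent from the environment.
            missing.append(name)
    return missing
```

Calling `missing_modules(["torch", "easydict", "cv2"])` should return an empty list once everything is installed; any name it returns still needs to be installed.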

Trained models

You can download models trained on each dataset by following this link and downloading the zip file corresponding to a particular dataset.

Evaluating a model

To evaluate a model you can run

python test.py -weights models/model_name 

Or for a Ctrl-F-Mini model, run

python test_dtp.py -weights models/model_name 

Several flags relevant to testing are defined in train_opts.py, for example one for evaluating with 4 folds on the Washington benchmarks.

To evaluate on the Botany or Konzilsprotokolle datasets, run the provided evaluation toolkit (download instructions below)

python botany_konz_eval/hkws16_competition.py -weights models/model_name 

If you evaluate a Ctrl-F-Mini model, use the dtp_only flag.

To reproduce the results from the paper exactly, download the exact data used here, unzip it in data/reproduce/, and use the reproduce_paper flag.

Training a model

To download the IAM dataset, go to this website and register. Then download and unpack the data by running these commands, substituting your own username and password

mkdir -p data/iam
cd data/iam
wget --user user --password pass http://www.fki.inf.unibe.ch/DBs/iamDB/data/ascii/words.txt
wget --user user --password pass http://www.fki.inf.unibe.ch/DBs/iamDB/data/forms/formsA-D.tgz
wget --user user --password pass http://www.fki.inf.unibe.ch/DBs/iamDB/data/forms/formsE-H.tgz
wget --user user --password pass http://www.fki.inf.unibe.ch/DBs/iamDB/data/forms/formsI-Z.tgz
wget http://www.fki.inf.unibe.ch/DBs/iamDB/tasks/largeWriterIndependentTextLineRecognitionTask.zip
mkdir forms
tar -C forms -xzf formsA-D.tgz
tar -C forms -xzf formsE-H.tgz
tar -C forms -xzf formsI-Z.tgz
unzip largeWriterIndependentTextLineRecognitionTask.zip
cd ../../
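Because the IAM downloads need valid credentials, it can be worth checking that every file actually arrived before moving on. The following is a minimal sketch, not part of the repository: `missing_paths` and `EXPECTED_IAM_PATHS` are hypothetical names, and the list contains only the files and the `forms` directory that the commands above create directly under `data/iam`. It checks existence only, so it will not notice an archive that was downloaded but is incomplete.

```python
import os


# Paths relative to data/iam that the download commands create directly.
EXPECTED_IAM_PATHS = [
    "words.txt",
    "formsA-D.tgz",
    "formsE-H.tgz",
    "formsI-Z.tgz",
    "largeWriterIndependentTextLineRecognitionTask.zip",
    "forms",
]


def missing_paths(root, relative_paths):
    """Return the entries of `relative_paths` that do not exist under `root`."""
    return [p for p in relative_paths
            if not os.path.exists(os.path.join(root, p))]
```

Run from the repository root, `missing_paths("data/iam", EXPECTED_IAM_PATHS)` returns an empty list when nothing is missing.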

Similarly, for the Washington dataset run

mkdir -p data/washington/
cd data/washington
wget http://ciir.cs.umass.edu/downloads/gw/gw_20p_wannot.tgz
tar -xzf gw_20p_wannot.tgz
cd ../../

For the Botany or Konzilsprotokolle datasets, run

mkdir -p data/botany/
cd data/botany
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/data/Botany_Train_I_PageImages.zip
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/data/Botany_Train_I_WordImages.zip
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/data/Botany_Train_I_XML.zip
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/data/Botany_Train_II_PageImages.zip
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/data/Botany_Train_II_XML.zip
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/data/Botany_Train_III_PageImages.zip
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/data/Botany_Train_III_XML.zip
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/data/Botany_Test_PageImages.zip
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/data/Botany_Test_WordImages.zip
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/data/Botany_Test_QryImages.zip
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/data/Botany_Test_QryStrings.zip
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/data/Botany_Test_GT.zip
cd ../../

or

mkdir -p data/konzilsprotokolle/
cd data/konzilsprotokolle
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/data/Konzilsprotokolle_Train_I_PageImages.zip
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/data/Konzilsprotokolle_Train_I_WordImages.zip
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/data/Konzilsprotokolle_Train_I_XML.zip
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/data/Konzilsprotokolle_Train_II_PageImages.zip
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/data/Konzilsprotokolle_Train_II_XML.zip
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/data/Konzilsprotokolle_Train_III_PageImages.zip
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/data/Konzilsprotokolle_Train_III_XML.zip
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/data/Konzilsprotokolle_Test_PageImages.zip
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/data/Konzilsprotokolle_Test_WordImages.zip
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/data/Konzilsprotokolle_Test_QryImages.zip
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/data/Konzilsprotokolle_Test_QryStrings.zip
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/data/Konzilsprotokolle_Test_GT.zip
cd ../../
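The Botany and Konzilsprotokolle download lists above differ only in the dataset name, so the archive URLs can also be generated rather than typed out. This is a small sketch under that observation, not code from the repository: `archive_urls`, `BASE_URL`, and `PARTS` are hypothetical names, and the suffixes are copied from the wget commands above.

```python
BASE_URL = "https://www.prhlt.upv.es/contests/icfhr2016-kws/data/"

# Archive suffixes shared by both datasets, in the order listed above.
PARTS = [
    "Train_I_PageImages",
    "Train_I_WordImages",
    "Train_I_XML",
    "Train_II_PageImages",
    "Train_II_XML",
    "Train_III_PageImages",
    "Train_III_XML",
    "Test_PageImages",
    "Test_WordImages",
    "Test_QryImages",
    "Test_QryStrings",
    "Test_GT",
]


def archive_urls(dataset):
    """Return the archive URLs for `dataset` ("Botany" or "Konzilsprotokolle")."""
    return [BASE_URL + "{}_{}.zip".format(dataset, part) for part in PARTS]
```

Each returned URL can then be passed to wget with the same `--no-check-certificate` option used above.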

Also download and unzip the evaluation software

cd botany_konz_eval/
wget --no-check-certificate https://www.prhlt.upv.es/contests/icfhr2016-kws/software/icfhr16kws_evaluation_toolkit.zip
unzip icfhr16kws_evaluation_toolkit.zip
cd ..

Next, you can either pre-augment the datasets and save them to H5 data files, or skip this step and go directly to training with on-the-fly augmentation. Pre-augmenting is quicker if you train multiple times on a dataset; on-the-fly augmentation is slower in that case unless you have many CPU cores. Training Ctrl-F-Mini with on-the-fly augmentation is currently not recommended, because extracting DTP proposals at every training iteration is quite slow.

Run

python preprocess_h5.py -dataset washington -augment 1 -suffix augmented

You are now ready to train a model from scratch. However, we initialized our models from models pre-trained on the IIIT-HWS 10K dataset. You can download these pre-trained models here and unzip them in the models directory, or you can train one yourself by running

mkdir -p data/iiit_hws
cd data/iiit_hws
wget http://ocr.iiit.ac.in/data/dataset/iiit-hws/iiit-hws.tar.gz
wget http://ocr.iiit.ac.in/data/dataset/iiit-hws/groundtruth.tar.gz
tar -xzf groundtruth.tar.gz
tar -xzf iiit-hws.tar.gz
cd ../../
python train.py -embedding dct -dataset iiit_hws 

Since this dataset consists only of segmented word images, we can only do full-page augmentation with it. As such, only on-the-fly augmentation is currently supported for it.

To train a model run

mkdir -p checkpoints/ctrlfnet
mkdir -p checkpoints/ctrlfnet_mini
python train.py -dataset iam -save_id test -weights models/ctrlfnet_dct_iiit_hws.pt

To train a model with a preprocessed H5 dataset, add the h5 flag

python train.py -dataset iam -save_id test -h5 1
python train.py -dataset washington -save_id test -h5 1

To train a Ctrl-F-Mini model make sure to use the train_dtp.py file, for example:

python train_dtp.py -dataset washington -save_id test -h5 1 -weights models/ctrlfnet_mini_dct_iiit_hws.pt
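The training commands above differ in the script (train.py or train_dtp.py for Ctrl-F-Mini) and in the optional h5 and weights flags. A minimal sketch of a helper that assembles them is shown below. `build_train_command` and its parameter names are hypothetical, and it uses only the flags that appear in the commands above.

```python
def build_train_command(dataset, save_id, mini=False, use_h5=False, weights=None):
    """Return the argument list for a training run.

    mini=True selects train_dtp.py (Ctrl-F-Mini); otherwise train.py is used.
    """
    script = "train_dtp.py" if mini else "train.py"
    cmd = ["python", script, "-dataset", dataset, "-save_id", save_id]
    if use_h5:
        # Train from a preprocessed H5 dataset.
        cmd += ["-h5", "1"]
    if weights is not None:
        # Initialize from a pretrained checkpoint.
        cmd += ["-weights", weights]
    return cmd
```

The returned list can be passed to `subprocess.check_call` from the repository root, or joined with spaces to print the equivalent shell command.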