This repository contains the code necessary to reproduce the experiments from the paper
Neural Word Search in Historical Manuscript Collections,
Tomas Wilkinson, Jonas Lindström, Anders Brun
The paper addresses the problem of Segmentation-free Word Spotting in Historical Manuscripts, where a computer detects words in a collection of manuscript pages and allows a user to search within them.
We provide:
- Trained models
- Instructions for training a model and evaluating on the IAM and Washington datasets
If you find this code useful in your research, please cite:
@article{wilkinson2018neural,
title={Neural Word Search in Historical Manuscript Collections},
author={Wilkinson, Tomas and Lindstr{\"o}m, Jonas and Brun, Anders},
journal={arXiv preprint arXiv:1812.02771},
year={2018}
}
The models are implemented in PyTorch using Python 2.7. To install PyTorch, it's easiest to follow the instructions on their website. To use this repository with the least amount of work, it's recommended to use the Anaconda distribution.
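At the time of Python 2.7 support, the conda command suggested on the PyTorch website took roughly the following form; treat it as an illustration and use whatever the website currently recommends
conda install pytorch torchvision -c pytorch  # check pytorch.org for the exact command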
Once PyTorch is installed, you will need a few additional dependencies, which are installed with the commands
pip install easydict
conda install opencv
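As a quick sanity check (an optional step, not part of the original instructions), you can verify that all three dependencies import cleanly
python -c "import torch, cv2, easydict; print(torch.__version__)"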
You can download models trained on each dataset by following this link and downloading the zip file corresponding to a particular dataset.
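For example, assuming the Washington archive is named washington_models.zip (the actual file name may differ), you would unpack it into the models directory that the commands below expect
mkdir -p models
unzip washington_models.zip -d models/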
To evaluate a model you can run
python test.py -weights models/model_name
Or for a Ctrl-F-Mini model, run
python test_dtp.py -weights models/model_name
There are a few flags relevant to testing in train_opts.py, such as evaluating over 4 folds for the Washington benchmarks.
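For instance, a 4-fold Washington evaluation might hypothetically be invoked as follows; the flag name below is an assumption, so check train_opts.py for the option actually defined there
python test.py -weights models/model_name -folds 4  # '-folds' is assumed, see train_opts.py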
To download the IAM dataset, go to this website and register. Then download and unpack the data by running these commands, substituting your own username and password
mkdir -p data/iam
cd data/iam
wget --user user --password pass http://www.fki.inf.unibe.ch/DBs/iamDB/data/ascii/words.txt
wget --user user --password pass http://www.fki.inf.unibe.ch/DBs/iamDB/data/forms/formsA-D.tgz
wget --user user --password pass http://www.fki.inf.unibe.ch/DBs/iamDB/data/forms/formsE-H.tgz
wget --user user --password pass http://www.fki.inf.unibe.ch/DBs/iamDB/data/forms/formsI-Z.tgz
wget http://www.fki.inf.unibe.ch/DBs/iamDB/tasks/largeWriterIndependentTextLineRecognitionTask.zip
mkdir forms
tar -C forms -xzf formsA-D.tgz
tar -C forms -xzf formsE-H.tgz
tar -C forms -xzf formsI-Z.tgz
unzip largeWriterIndependentTextLineRecognitionTask.zip
cd ../../
Similarly, for the Washington dataset run
mkdir -p data/washington/
cd data/washington
wget http://ciir.cs.umass.edu/downloads/gw/gw_20p_wannot.tgz
tar -xzf gw_20p_wannot.tgz
cd ../../
Next, you can either pre-augment the datasets and save them to H5 data files, which is quicker if you train on a dataset multiple times, or you can skip this step and go directly to training with on-the-fly augmentation, which will be slower if you train multiple models (unless you have many CPU cores). As of now, training Ctrl-F-Mini with on-the-fly augmentation is not recommended, as extracting DTP proposals every training iteration is quite slow.
Run
python preprocess_h5.py -dataset washington -augment 1
python preprocess_h5.py -dataset iam -augment 1
You are now ready to train a model from scratch. However, for initialization we used models pre-trained on the IIIT-HWS 10K dataset. You can download models pretrained on the IIIT-HWS dataset here and unzip them in the models directory, but you may also train such a model yourself by running
mkdir -p data/iiit_hws
cd data/iiit_hws
wget http://ocr.iiit.ac.in/data/dataset/iiit-hws/iiit-hws.tar.gz
wget http://ocr.iiit.ac.in/data/dataset/iiit-hws/groundtruth.tar.gz
tar -xzf groundtruth.tar.gz
tar -xzf iiit-hws.tar.gz
cd ../../
python train.py -embedding dct -dataset iiit_hws
Since this dataset consists only of segmented word images, we can only do full-page augmentation with it. As such, we currently only support on-the-fly augmentation for this dataset.
To train a model, run
mkdir -p checkpoints/ctrlfnet
mkdir -p checkpoints/ctrlfnet_mini
python train.py -dataset iam -save_id test -weights models/ctrlfnet_dct_iiit_hws.pt
To train a model with a preprocessed H5 dataset, add the -h5 flag
python train.py -dataset iam -save_id test -h5 1
python train.py -dataset washington -save_id test -h5 1
To train a Ctrl-F-Mini model, make sure to use the train_dtp.py file, for example:
python train_dtp.py -dataset washington -save_id test -h5 1 -weights models/ctrlfnet_mini_dct_iiit_hws.pt