TIES (Table Information Extraction System) was my undergraduate thesis. This project takes its name from that work and continues it as version 2.0.
This repository contains the source code for the arXiv paper 1905.13391 (https://arxiv.org/abs/1905.13391). The paper has been accepted into ICDAR 2019. To cite the paper, use:
@article{rethinkingGraphs,
author = {Qasim, Shah Rukh and Mahmood, Hassan and Shafait, Faisal},
title = {Rethinking Table Parsing using Graph Neural Networks},
journal = {Accepted into ICDAR 2019},
volume = {abs/1905.13391},
year = {2019},
url = {https://arxiv.org/abs/1905.13391},
archivePrefix = {arXiv},
eprint = {1905.13391},
}
We are still working to improve a few technical details for your convenience, and we'll remove this note once we are done. Expect this to be done by June 15, 2019. We are also working to improve the dataset format so that it is easier to understand.
A partial dataset, which was used for testing, can be found here. We are uploading the rest of the dataset. The current format of the dataset is tfrecords.
In the meantime, if you want to generate the dataset, head on to the following repository:
github.com/hassan-mahmood/Structural_Analysis
The project is divided into language parts, python and cpp, for Python and C++ respectively. There is nothing in the cpp folder as of now.
The python directory is the path from which scripts are meant to be run; alternatively, it can be added to the $PYTHONPATH environment variable. It contains the following directories:
- bin: contains the scripts which are to be run from the terminal. Within bin, there are multiple folders, one for each class of executable program:
  - iterate: for running training or inference.
  - analyse: for analysing inference output.
  - checks: was used for testing various files during development. You can safely ignore it.
- iterators: provides functionality to iterate through the datasets while you are training or testing.
- layers: contains basic layers for graph networks.
- models: contains the main model and network segments. Most of the functionality can be found in basic_model.py. Start tracing from there.
- ops: contains basic modified operations. These contain the advanced graph operations code.
- readers: is for readers, entities responsible for reading the data from tfrecords. The data format can be changed here.
- libs: contains all other helper and library functions.
Within the context of this repository, iterate refers to training, testing, or anything else that is done iteratively, mostly on the GPU. So if there is an iterator somewhere, it probably refers to an entity which handles training, testing, etc.
- Prepare the dataset. For this, you are required to divide the dataset into three different sections: test, train and validation. The test set will be used to run the analysis after training is done. Backpropagation will be run on the train set. The validation set is used to produce plots for tensorboard to monitor the performance of the network.
- The dataset files have to be in tfrecords format. Make a new file called train_files.txt. It should contain the full paths of all the training tfrecords files. For example:
  /home/shahrukhqasim/dataset/train_1.tfrecord
  /home/shahrukhqasim/dataset/train_2.tfrecord
  /home/shahrukhqasim/dataset/train_3.tfrecord
- Similarly, prepare validation_files.txt and test_files.txt. The contents of these three files should not overlap.
- Make a config file according to the format given in configs/config.ini.example. This file determines all the settings, dataset locations and results generation paths. The example config file contains documentation for your ease. If you are unclear about a setting, send an email to me or open an issue in this repository.
- Each config file will contain multiple configurations. It is recommended to use a separate configuration for each model. So, for instance, you make different configurations for DGCNN, GravNet and Convolutional networks.
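The three file lists described in the steps above can be generated rather than written by hand. The following is a minimal sketch, not part of this repository: the function name write_split_lists, the 80/10/10 split, the .tfrecord file pattern and the one-path-per-line layout are all assumptions made for illustration.

```python
import random
from pathlib import Path


def write_split_lists(dataset_dir, out_dir, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split the .tfrecord files in dataset_dir into three disjoint lists.

    Writes train_files.txt, validation_files.txt and test_files.txt into
    out_dir, one absolute path per line (assumed layout), and returns the
    lists keyed by file name. The default 80/10/10 split is an assumption.
    """
    files = sorted(str(p.resolve()) for p in Path(dataset_dir).glob("*.tfrecord"))
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(files)

    n_train = int(len(files) * fractions[0])
    n_val = int(len(files) * fractions[1])
    # Slicing one shuffled list guarantees that the three sets do not overlap.
    splits = {
        "train_files.txt": files[:n_train],
        "validation_files.txt": files[n_train:n_train + n_val],
        "test_files.txt": files[n_train + n_val:],
    }

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, paths in splits.items():
        (out / name).write_text("".join(p + "\n" for p in paths))
    return splits
```

Because every path comes from a single shuffled list that is cut into three slices, the requirement that the lists do not overlap holds by construction.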
To run the training, you need to issue the following command:
$ python bin/iterate/table_adjacency_parsing path/to/the/config/file config
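The last argument of the command above appears to name one of the configurations inside the config file. As a rough sketch of how such a selection can work, assuming the file is standard INI that Python's configparser can read and that each configuration is one section (both are assumptions, and load_configuration is a hypothetical helper, not a function from this repository):

```python
import configparser


def load_configuration(config_path, config_name):
    """Return the settings of one named configuration as a plain dict."""
    parser = configparser.ConfigParser()
    # ConfigParser.read silently skips missing files and returns the list
    # of files it actually parsed, so an empty result means nothing was read.
    if not parser.read(config_path):
        raise FileNotFoundError(config_path)
    if not parser.has_section(config_name):
        raise KeyError("no configuration named %r; available: %s"
                       % (config_name, parser.sections()))
    return dict(parser[config_name])
```

A check like this can be useful before starting a long training run, since a misspelled configuration name is reported together with the names that do exist in the file.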
While the training is running, you can monitor it using tensorboard. The paths are set in the config file as described in the previous step. Use the following command to run tensorboard:
$ tensorboard --logdir=/media/all/shahrukhqasim/Tables/TrainOut/betaout/summary
You can monitor the performance after that in your browser. The port number will be displayed when you run the above command.
You first need to run inference, which will generate bin files in numpy pickle format.
$ python bin/iterate/table_adjacency_parsing path/to/the/config/file config --test True
TODO: Analysis code and further documentation are coming.
Python 3.5+ is needed. We recommend using virtualenv but anaconda should also work fine.
The required packages are listed in requirements.txt. They can be installed by:
$ pip install -r requirements.txt
In addition to this, you need to download another repository from here:
github.com/jkiesele/caloGraphNN
Let's say you clone it into /home/shahrukhqasim/caloGraphNN. You need to add this path to the $PYTHONPATH environment variable.
$ export PYTHONPATH=$PYTHONPATH:/home/shahrukhqasim/caloGraphNN
In addition to this, you should run all the commands from inside the python directory. The python directory should also be present in the $PYTHONPATH environment variable.
$ export PYTHONPATH=$PYTHONPATH:/home/shahrukhqasim/TIES-2.0/python
You can also add "." (the current directory) to the $PYTHONPATH if you know you will always run the commands from inside the python directory.
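Once the exports are in place, a quick way to confirm that they took effect is to ask Python whether it can locate the modules without importing them. This is a small sketch using the standard library only; missing_modules is a hypothetical helper, and the module names passed to it are up to you.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that Python cannot locate."""
    # find_spec searches sys.path (which includes the $PYTHONPATH entries)
    # and returns None when a top-level module is not found.
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]
```

For example, calling missing_modules(["caloGraphNN", "models", "libs"]) from inside the python directory should return an empty list if both paths are set correctly. The names used here are assumptions based on the repository and directory names mentioned above, so adjust them to whatever your checkout actually contains.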
It is advised that you make an sh file with these export commands and a command which activates the virtual environment. I use the following sourcing file (ties.sh):
source ~/Envs/h3/bin/activate
cd /Users/shahrukhqasim/Workspace/TIES-2.0/python
export PYTHONPATH=$PYTHONPATH:/Users/shahrukhqasim/Workspace/caloGraphNN:/Users/shahrukhqasim/Workspace/TIES-2.0
I source it every time I want to run training or inference using:
$ source ties.sh
- Training data uploaded
- Trained models