codefordharma/templeKB

 
 


templeKB

This package contains the temple corpus and the corpus creation and curation platform described in the paper 'A Seed Corpus of Hindu Temples in India' (LREC 2020). Please cite the paper if you use this software.

Folder structure

. : Platform
corpus : temple corpus
data : Wikipedia pages and scraped web pages
models : CQ and QA pretrained model files
output : preprocessing and other intermediate outputs

Requirements:

Python 3.7
Transformer model 'bert-large-uncased-whole-word-masking-finetuned-squad-pytorch_model.bin' from https://huggingface.co/transformers/pretrained_models.html
BERT pretrained model 'wwm_uncased_L-24_H-1024_A-16'
SQuAD dataset

Paths to set in KGconfig.py:

wiki_corpus_path
web_scraped_temple_text_path
bert_path
bert_for_qa
squad_path
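As an illustration, the five settings above could be filled in along the following lines. This is only a sketch: the variable names come from the list above, but the subfolder names under data and models, the ROOT variable, and the missing_paths helper are assumptions made for the example and are not taken from the repository. Adjust every value to wherever the files actually live on your machine.

```python
from pathlib import Path

# Hypothetical layout: the subfolder names below are assumptions,
# chosen only to match the "data" and "models" folders listed above.
ROOT = Path(".")

wiki_corpus_path = str(ROOT / "data" / "wiki")
web_scraped_temple_text_path = str(ROOT / "data" / "web")
bert_path = str(ROOT / "models" / "wwm_uncased_L-24_H-1024_A-16")
bert_for_qa = str(
    ROOT / "models"
    / "bert-large-uncased-whole-word-masking-finetuned-squad-pytorch_model.bin"
)
squad_path = str(ROOT / "data" / "squad")

# Collect the settings so they can be checked in one pass.
CONFIG = {
    "wiki_corpus_path": wiki_corpus_path,
    "web_scraped_temple_text_path": web_scraped_temple_text_path,
    "bert_path": bert_path,
    "bert_for_qa": bert_for_qa,
    "squad_path": squad_path,
}


def missing_paths(paths):
    """Return the names of settings whose path does not exist on disk."""
    return [name for name, value in paths.items() if not Path(value).exists()]


if __name__ == "__main__":
    # Report unresolved settings before starting a long corpus-creation run.
    for name in missing_paths(CONFIG):
        print(f"{name} does not exist: {CONFIG[name]}")
```

Checking the paths up front is optional, but it surfaces a misplaced model file or dataset before the scripts start loading large BERT checkpoints.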

Create Corpus

Web Scrape

    python Scrapper.py --url

Create corpus

    python templeQA_1.py

Cite

@inproceedings{radhakrishnan-2020-seed,
    title = "A Seed Corpus of {H}indu Temples in {I}ndia",
    author = "Radhakrishnan, Priya",
    booktitle = "Proceedings of The 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://www.aclweb.org/anthology/2020.lrec-1.32",
    pages = "254--258",
    abstract = "Temples are an integral part of culture and heritage of India and are centers of religious practice for practicing Hindus. A scientific study of temples can reveal valuable insights into Indian culture and heritage. However to the best of our knowledge, learning resources that aid such a study are either not publicly available or non-existent. In this endeavour we present our initial efforts to create a corpus of Hindu temples in India. In this paper, we present a simple, re-usable platform that creates temple corpus from web text on temples. Curation is improved using classifiers trained on textual data in Wikipedia articles on Hindu temples. The training data is verified by human volunteers. The temple corpus consists of 4933 high accuracy facts about 573 temples. We make the corpus and the platform freely available. We also test the re-usability of the platform by creating a corpus of museums in India. We believe the temple corpus will aid scientific study of temples and the platform will aid in construction of similar corpuses. We believe both these will significantly contribute in promoting research on culture and heritage of a region.",
    language = "English",
    ISBN = "979-10-95546-34-4",
}
