LC-QuAD-augmentation-toolkit

An augmentation toolkit built with the help of DeepPavlov's Wikidata tools

Brief introduction

LC-QuAD 2.0 is a large question answering dataset with 30,000 pairs of questions and their corresponding SPARQL queries. Its target knowledge bases are Wikidata and DBpedia.
When building a knowledge base query system, the core component, which translates natural language into SPARQL queries, can be constructed with the help of this dataset. If you are interested in this topic, I recommend reading a brief introduction on how to construct a Knowledge Base Question Answering (KBQA) system with the help of DeepPavlov in your own language domain, i.e. in a non-English setting.

This project takes one record from the LC-QuAD dataset as input (i.e. one English question and its corresponding SPARQL query) and outputs a set of similar, automatically constructed sentence/SPARQL-query pairs in English.
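As a rough illustration of what one output pair looks like, the sketch below swaps an entity label in the sentence and the matching Wikidata ID in the query, using values taken from Example 1 in the Toolkit Usage section. The function name `swap_entity` is hypothetical and this is not the toolkit's implementation; the toolkit itself finds its candidates with the help of DeepPavlov's Wikidata tools.

```python
import re


def swap_entity(sentence, query, old_label, old_qid, new_label, new_qid):
    """Return a new (sentence, query) pair with one entity replaced.

    Hypothetical helper for illustration only.
    """
    # Match the whole prefixed ID so that wd:Q42 does not match inside wd:Q423.
    qid_pattern = re.compile(rf"\bwd:{re.escape(old_qid)}\b")
    if old_label not in sentence or not qid_pattern.search(query):
        raise ValueError("label or entity ID not found in the input pair")
    new_sentence = sentence.replace(old_label, new_label)
    new_query = qid_pattern.sub(f"wd:{new_qid}", query)
    return new_sentence, new_query


en_sent = "What is ChemSpider ID of tungsten carbide ?"
sparql_query = "select distinct ?answer where { wd:Q423265 wdt:P661 ?answer}"

aug_sent, aug_query = swap_entity(
    en_sent, sparql_query,
    old_label="tungsten carbide", old_qid="Q423265",
    new_label="tungsten trioxide", new_qid="Q417406",
)
print(aug_sent)
print(aug_query)
```

The resulting pair matches the first row of the Example 1 output, which is why a label/ID swap is a reasonable mental model for the simplest augmentations.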

With this large set of pairs and a translation toolkit, you can construct a KBQA system on top of any knowledge base engine that maps sentences to queries (as DeepPavlov's module does).

Installation

Refer to INSTALL.sh to install the environment, and make sure that you can run the KBQA of the original DeepPavlov project. The Wikidata knowledge base HDT file can be obtained from me or from rdfhdt.

The following files should be located in the root path of this project after cloning the repository.

lcquad_2_0.json
test.json
train.json

lcquad_in_deeppavlov_template_abstract.pkl
pid_tuple_on_s_dict.pkl
property_info_df.pkl

pid_tuple_on_s_dict.db
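
A quick way to confirm that the files listed above are in place is a small check script. The sketch below is not part of the repository; the name `missing_files` is hypothetical, and the file names are copied from the list above.

```python
from pathlib import Path

# File names copied from the list above.
REQUIRED_FILES = [
    "lcquad_2_0.json",
    "test.json",
    "train.json",
    "lcquad_in_deeppavlov_template_abstract.pkl",
    "pid_tuple_on_s_dict.pkl",
    "property_info_df.pkl",
    "pid_tuple_on_s_dict.db",
]


def missing_files(root="."):
    """Return the required file names that are not present under `root`."""
    root = Path(root)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_files(".")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All required files are present.")
```

Run it from the project root; an empty result means the clone is complete.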

The following three files are too big to upload.

You can download them from Baidu Yun Drive at the following link and place them in the project root path: https://pan.baidu.com/s/1e66Lt6nisM3583dbIGsO5w?pwd=ntwz
Remember to use cat to merge wikidata.hdt.aa, wikidata.hdt.ab and wikidata.hdt.ac into wikidata.hdt before using it.

multi_lang_kb_dict.db
kbqa-explore/wikidata.hdt
kbqa-explore/linker_entities.pkl
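
If cat is not available (for example on Windows), the merge step can be done in Python instead. This is a minimal sketch, assuming the three parts were downloaded into the current directory; the helper name `merge_parts` is hypothetical, and the destination path follows the file list above.

```python
import shutil
from pathlib import Path


def merge_parts(part_paths, dest_path):
    """Concatenate the parts, in the given order, into dest_path (like `cat`)."""
    dest_path = Path(dest_path)
    dest_path.parent.mkdir(parents=True, exist_ok=True)
    with open(dest_path, "wb") as dest:
        for part in part_paths:
            with open(part, "rb") as src:
                # Stream in chunks so a large part is never loaded into memory at once.
                shutil.copyfileobj(src, dest)
    return dest_path


if __name__ == "__main__":
    # Assumption: the three parts sit in the current directory.
    parts = ["wikidata.hdt.aa", "wikidata.hdt.ab", "wikidata.hdt.ac"]
    if all(Path(p).is_file() for p in parts):
        merge_parts(parts, "kbqa-explore/wikidata.hdt")
    else:
        print("Download the three wikidata.hdt.* parts first.")
```

The order of the parts matters: they must be concatenated as .aa, .ab, .ac.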

Toolkit Usage

Once the environment is installed, take a look at the snippet in single_step.py.
It takes en_sent and sparql_query as input parameters and returns its output as a pandas DataFrame. The examples below sample only 5 rows of the output (the full output may contain anywhere from 3 to 1000 rows).

Example 1:

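import numpy as np
# aug_one_query is provided by this project; see single_step.py for how it is brought into scope.

en_sent = "What is ChemSpider ID of tungsten carbide ?"
sparql_query = "select distinct ?answer where { wd:Q423265 wdt:P661 ?answer}"

np.random.seed(0)  # fix NumPy's global seed so that df.sample below is reproducible
df = aug_one_query(en_sent, sparql_query, aug_times=1000)
# Keep 5 random rows, ordered by descending fuzz score.
df = df.sample(n=5).sort_values(by="fuzz", ascending=False)
df.apply(lambda x: x.to_dict(), axis=1).values.tolist()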

This will output:

[{'aug_en_sent': 'What is ChemSpider ID of tungsten trioxide ?',
  'aug_sparql_query': 'select distinct ?answer where { wd:Q417406 wdt:P661 ?answer}',
  'fuzz': 91.95402298850574},
 {'aug_en_sent': 'What is ChemSpider ID of hafnium(IV) carbide ?',
  'aug_sparql_query': 'select distinct ?answer where { wd:Q418001 wdt:P661 ?answer}',
  'fuzz': 80.89887640449437},
 {'aug_en_sent': 'What is ChemSpider ID of Carbonization ?',
  'aug_sparql_query': 'select distinct ?answer where { wd:Q2630655 wdt:P661 ?answer}',
  'fuzz': 74.69879518072288},
 {'aug_en_sent': 'What is identifier in a free chemical database, owned by the Royal Society of Chemistry of tungsten trioxide ?',
  'aug_sparql_query': 'select distinct ?answer where { wd:Q417406 wdt:P661 ?answer}',
  'fuzz': 44.44444444444444},
 {'aug_en_sent': 'What is identifier in a free chemical database, owned by the Royal Society of Chemistry of tantalum hafnium carbide ?',
  'aug_sparql_query': 'select distinct ?answer where { wd:Q424268 wdt:P661 ?answer}',
  'fuzz': 41.25}]
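
The `fuzz` values in this output are consistent with a normalized Indel similarity between the original sentence and the augmented one, i.e. 100 * 2 * LCS / (len(a) + len(b)), which is what `rapidfuzz.fuzz.ratio` computes. That the toolkit uses exactly this metric is an assumption inferred from the numbers, not something stated in this README, and the helper names below are hypothetical. A dependency-free sketch:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_ratio(a, b):
    """Similarity in [0, 100]: 100 * 2 * LCS(a, b) / (len(a) + len(b))."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0
    return 100.0 * 2 * lcs_length(a, b) / total


original = "What is ChemSpider ID of tungsten carbide ?"
augmented = "What is ChemSpider ID of tungsten trioxide ?"
print(round(indel_ratio(original, augmented), 2))  # 91.95, as in the first row above
```

Under this reading, a higher `fuzz` means the augmented sentence stays closer to the original wording, which explains why rows that paraphrase the property name score lowest.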

Example 2:

en_sent = "Name the women's association football team who play the least in tournaments."
sparql_query = 'select ?ent where { ?ent wdt:P31 wd:Q1478437 . ?ent wdt:P2257 ?obj . ?ent wdt:P2094 wd:Q606060. } ORDER BY ASC(?obj)LIMIT 5 '

np.random.seed(0)
df = aug_one_query(en_sent, sparql_query, aug_times=1000)
df = df.sample(n = 5).sort_values(by = "fuzz", ascending = False)
df.apply(lambda x: x.to_dict(), axis = 1).values.tolist()

This will output:

[{'aug_en_sent': 'Name the VfL Bochum team who compclass the least in tournaments.',
  'aug_sparql_query': 'select ?ent where { ?ent wdt:P31 wd:Q1478437 . ?ent wdt:P2257 ?obj . ?ent wdt:P2094 wd:Q105861. } ORDER BY ASC(?obj)LIMIT 5 ',
  'fuzz': 72.34042553191489},
 {'aug_en_sent': 'Name the VfL Wolfsburg team who competition class the least in tournaments.',
  'aug_sparql_query': 'select ?ent where { ?ent wdt:P31 wd:Q1478437 . ?ent wdt:P2257 ?obj . ?ent wdt:P2094 wd:Q101859. } ORDER BY ASC(?obj)LIMIT 5 ',
  'fuzz': 68.42105263157895},
 {'aug_en_sent': 'Name the Irapuato FC team who class for competition the least in tournaments.',
  'aug_sparql_query': 'select ?ent where { ?ent wdt:P31 wd:Q1478437 . ?ent wdt:P2257 ?obj . ?ent wdt:P2094 wd:Q1023193. } ORDER BY ASC(?obj)LIMIT 5 ',
  'fuzz': 67.53246753246754},
 {'aug_en_sent': 'Name the 1994 FIFA World Cup team who competition class the least in tournaments.',
  'aug_sparql_query': 'select ?ent where { ?ent wdt:P31 wd:Q1478437 . ?ent wdt:P2257 ?obj . ?ent wdt:P2094 wd:Q101751. } ORDER BY ASC(?obj)LIMIT 5 ',
  'fuzz': 65.82278481012658},
 {'aug_en_sent': 'Name the VfL Wolfsburg team who official classification by a regulating body under which the subject qualifies for inclusion the least in tournaments.',
  'aug_sparql_query': 'select ?ent where { ?ent wdt:P31 wd:Q1478437 . ?ent wdt:P2257 ?obj . ?ent wdt:P2094 wd:Q101859. } ORDER BY ASC(?obj)LIMIT 5 ',
  'fuzz': 51.98237885462555}]
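
To see which part of a query an augmentation touched, it can help to compare the Wikidata IDs of the original and augmented queries. The sketch below is an illustration only; `wikidata_ids` and `changed_ids` are hypothetical helpers, and the regular expression assumes the `wd:`/`wdt:` prefixes used in the examples above.

```python
import re

# Matches prefixed IDs such as wd:Q1478437 or wdt:P31.
ID_PATTERN = re.compile(r"\b(wdt?):([PQ]\d+)\b")


def wikidata_ids(query):
    """Return the set of prefixed Wikidata IDs found in a SPARQL query string."""
    return {f"{prefix}:{ident}" for prefix, ident in ID_PATTERN.findall(query)}


def changed_ids(original_query, augmented_query):
    """Return (IDs only in the original, IDs only in the augmented query)."""
    before = wikidata_ids(original_query)
    after = wikidata_ids(augmented_query)
    return sorted(before - after), sorted(after - before)


original_query = (
    "select ?ent where { ?ent wdt:P31 wd:Q1478437 . ?ent wdt:P2257 ?obj . "
    "?ent wdt:P2094 wd:Q606060. } ORDER BY ASC(?obj)LIMIT 5 "
)
augmented_query = (
    "select ?ent where { ?ent wdt:P31 wd:Q1478437 . ?ent wdt:P2257 ?obj . "
    "?ent wdt:P2094 wd:Q105861. } ORDER BY ASC(?obj)LIMIT 5 "
)
print(changed_ids(original_query, augmented_query))
```

For the five sampled rows of Example 2, only the object of wdt:P2094 differs from the original query; the properties and the wd:Q1478437 class stay fixed.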

Recommended reading:

API Documentation

This describes the function definitions in detail.

DeepPavlov-Chinese-KBQA

This is a demo of how to construct a KBQA system for a non-English language (taking Chinese as an example) with the help of DeepPavlov.

Contact

svjack - svjackbt@gmail.com - ehangzhou@outlook.com

Project Link: https://github.com/svjack/LC-QuAD-augmentation-toolkit
