Finding answer entities for questions with the help of deep learning in Python (BERT), a textual graph, and Wikipedia.
- Creating the textual graph.
- For each question input from DBpedia-Entity v2, its related graph is extracted.
- The extracted graph is pruned.
- The candidate answers to the question from DBpedia-Entity v2 are ranked.
For more details, refer to our paper, Learning to Rank Knowledge Subgraph Nodes for Entity Retrieval.
from_entity (head): the title of a Wikipedia article (page), or an entity mention that co-occurs with another entity mention in the same sentence of any wiki page.
sentence: the sentence in the article that connects the head and tail entities (the connector).
to_entities (tail): the set of entities that appear in the sentence, or an entity mention that co-occurs with another entity mention in the same sentence of any wiki page.
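For illustration, one record of the textual graph can be represented as below; the field names follow the description above, and the literal values are just an example (the real records are produced in datasetMaking.ipynb).

```python
# A minimal sketch of one textual-graph triple, following the schema above.
triple = {
    'from_entity': 'Berlin',                          # head: page title or a co-occurring mention
    'sentence': 'Berlin is the capital of Germany.',  # the connector sentence
    'to_entities': ['Germany'],                       # tail: entities found in that sentence
}
```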
For this step we used the latest Wikipedia dump, which contains about 5,400,000 articles.
We downloaded enwiki-latest-pages-articles.xml.bz2 from this link.
We then extracted the title, interlinks, and sections of each article and wrote them out, one line per article, with the help of gensim.
The result of the gensim processing is named enwiki-latest.json.gz, which we will use in datasetMaking.ipynb.
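For reference, a minimal way to produce such a file is gensim's segment_wiki script, which writes one gzip-compressed JSON record per article with its title, section titles/texts, and interlinks; the exact parameters we used are not shown here.

```python
from gensim.scripts.segment_wiki import segment_and_write_all_articles

# Parse the compressed dump and write one JSON record per article.
# The .gz suffix on the output path makes it gzip-compressed.
segment_and_write_all_articles(
    'enwiki-latest-pages-articles.xml.bz2',
    'enwiki-latest.json.gz',
    workers=4,
    include_interlinks=True,  # keep the interlinks we need for the graph
)

# Records can then be read back line by line:
import gzip, json
with gzip.open('enwiki-latest.json.gz', 'rt', encoding='utf-8') as f:
    article = json.loads(next(f))  # e.g. article['title'], article['interlinks']
```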
To build our textual graph we also need to know all candidate entities for each mention, so that we can find entities that are repeated in an article but have no link (Wikipedia avoids repeating links). We took this set from the resources of the Reimplementation of Tagme; specifically, we used mention_overall_dict.pickle.
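The snippet below shows how such a dictionary can be loaded and queried; the exact value type stored for each mention (e.g. a set of entity titles or an entity-to-score dict) is an assumption here.

```python
import pickle

# Load the mention -> candidate-entities mapping taken from the
# Reimplementation of Tagme resources.
with open('mention_overall_dict.pickle', 'rb') as f:
    mention_overall_dict = pickle.load(f)

# Look up all candidate entities for a surface-form mention.
# NOTE: the value type (set of titles vs. scored dict) is an assumption.
print(mention_overall_dict.get('berlin'))
```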
Here is a sample of the data retrieved for the Berlin page in Wikipedia.
Direct triples are triples whose head entity is the page title and whose tail entity is found in a sentence on that page.
Undirect triples are triples whose head and tail entities are both found in the same sentence of a page. For each such pair we create two triples, one for each head/tail ordering, as shown in the sketch below. (Not implemented in the datasetMaking.ipynb sample code.)
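The following sketch shows how direct and undirect triples could be built for a single article; the helper names (page_title, sentences, entities_in) are illustrative assumptions, not the repository's actual functions.

```python
# A minimal sketch of triple construction for one article.
def build_triples(page_title, sentences, entities_in):
    direct, undirect = [], []
    for sentence in sentences:
        entities = entities_in(sentence)  # entity mentions found in this sentence
        # Direct triples: page title as head, each sentence entity as tail.
        for tail in entities:
            direct.append((page_title, sentence, tail))
        # Undirect triples: two triples per co-occurring pair, swapping head and tail.
        for i, head in enumerate(entities):
            for tail in entities[i + 1:]:
                undirect.append((head, sentence, tail))
                undirect.append((tail, sentence, head))
    return direct, undirect
```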
CER is a Contextual Entity Retrieval method that models the contextual relationships between entities and effectively limits the extensive search space without compromising performance. In this method, one model is trained to prune a subgraph extracted from a textual knowledge graph that represents the relations between entities, and a second deep model is trained to rank the entities in the subgraph by reasoning over the textual content of the nodes, the edges, and the given query.
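The sketch below illustrates the prune-then-rank idea with a BERT cross-encoder scoring query/node-text pairs; the model choice and scoring scheme are illustrative assumptions, not the trained models shipped with this repository.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
pruner = BertForSequenceClassification.from_pretrained('bert-base-uncased')  # stage 1 (untrained head)
ranker = BertForSequenceClassification.from_pretrained('bert-base-uncased')  # stage 2 (untrained head)

def score(model, query, node_text):
    # Cross-encode the query with a node's textual content.
    inputs = tokenizer(query, node_text, return_tensors='pt', truncation=True)
    with torch.no_grad():
        return model(**inputs).logits[0, 1].item()

def retrieve(query, subgraph_node_texts, keep=50):
    # Stage 1: prune the extracted subgraph to the most promising nodes.
    pruned = sorted(subgraph_node_texts, key=lambda n: score(pruner, query, n), reverse=True)[:keep]
    # Stage 2: rank the surviving candidate entities against the query.
    return sorted(pruned, key=lambda n: score(ranker, query, n), reverse=True)
```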
We have provided our code in the src directory; it can be used to reproduce the results. The extracted data from the contextual Wikipedia knowledge graph for DBpedia-Entity v2 can be found here.
```bash
cd src/data/
bash split.sh
cd ..
bash run.sh
```