SPARQL/TüNDRA-project

Introduction

The aim of this project is to compare the efficiency of the TüNDRA web tool (.conllu) and the SPARQL query language (.ttl) for retrieving data from linguistic treebanks. This repository contains a shortened version of the research; for the detailed version, please see our paper "A Comparative Approach to Query an English Dependency Treebank. SPARQL vs. TüNDRA Web Tool".

Material

The same treebank, UD English-ParTUT, was used to illustrate the differences between the queries. According to the Universal Dependencies website, this treebank is a conversion of a multilingual parallel treebank developed at the University of Turin, and it consists of different kinds of texts such as legal texts, talks, and articles from Wikipedia, among others. The .conllu version was accessed directly via the TüNDRA web tool. The .ttl version was obtained by downloading the treebank from the Universal Dependencies website in .conllu format and converting it to .ttl with the CoNLL-RDF tool (en_partut.ttl). Some further changes were then made to the .ttl file, as described in the Description section; the final file used for querying is en_partut_adapted.ttl.

Description

.ttl adaptation

To make the outputs of the two applications more comparable, the SPARQL queries should return whole sentences rather than only the word, lemma, etc., since this is how TüNDRA presents its output. The original .ttl file has no direct reference from a word to the sentence it belongs to, so every subject that has nif:Word as an object (i.e. every word) was given an additional predicate-object pair, "conll:SENT *sentence*", whose value is retrieved from the rdfs:comment predicate. Please refer to turtle_changer.py for details and the specific methods used. Importantly, turtle_changer.py can be used for any .ttl file generated by the aforementioned CoNLL-RDF tool.
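The idea behind the adaptation can be sketched in a few lines of Python. This is not the code from turtle_changer.py, only a minimal illustration under stated assumptions: triples are held as plain (subject, predicate, object) string tuples rather than parsed from Turtle, the sentence text is the value of rdfs:comment on the nif:Sentence node, and each word reaches its sentence node by following conll:HEAD links (the root word is assumed to point at the sentence). The function name add_sentence_text is hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]


def add_sentence_text(triples: Iterable[Triple]) -> List[Triple]:
    """Return the input triples plus one (word, conll:SENT, text) triple per word."""
    data = list(triples)
    sentences: Set[str] = {s for s, p, o in data
                           if p == "rdf:type" and o == "nif:Sentence"}
    words: List[str] = [s for s, p, o in data
                        if p == "rdf:type" and o == "nif:Word"]
    # Assumption: rdfs:comment on a sentence node holds the sentence text.
    text: Dict[str, str] = {s: o for s, p, o in data if p == "rdfs:comment"}
    head: Dict[str, str] = {s: o for s, p, o in data if p == "conll:HEAD"}

    def sentence_of(word: str) -> Optional[str]:
        # Assumption: following conll:HEAD upwards ends at the sentence node.
        node, seen = word, set()
        while node not in sentences:
            if node in seen or node not in head:
                return None  # cycle or missing head: no sentence found
            seen.add(node)
            node = head[node]
        return node

    added: List[Triple] = []
    for word in words:
        sentence = sentence_of(word)
        if sentence is not None and sentence in text:
            added.append((word, "conll:SENT", text[sentence]))
    return data + added


# Tiny hand-made example (not taken from en_partut.ttl).
example = [
    ("s1_0", "rdf:type", "nif:Sentence"),
    ("s1_0", "rdfs:comment", "We see ."),
    ("s1_1", "rdf:type", "nif:Word"),
    ("s1_1", "conll:HEAD", "s1_2"),
    ("s1_2", "rdf:type", "nif:Word"),
    ("s1_2", "conll:HEAD", "s1_0"),
]
for triple in add_sentence_text(example):
    print(triple)
```

With the sentence text attached to every word, a query that matches a word can return ?sent directly, without a second lookup of the sentence node.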

Queries comparison

Below are pairs of queries that search for the same information. All screenshots of TüNDRA web tool examples are taken from the TüNDRA Tutorial and UD English ParTUT. Apache Jena Fuseki was used for the SPARQL queries. Templates for the SPARQL queries, which allow changing the exact instance searched for and copying the queries, can be found in sparql_queries.py.

lemma_TüNDRA	lemma_SPARQL

regex_TüNDRA	regex_SPARQL
[word = /.*able/]	PREFIX conll: <http://ufal.mff.cuni.cz/conll2009-st/task-description.html#> SELECT ?sent WHERE { ?s conll:WORD ?word; conll:SENT ?sent . FILTER regex(?word, ".*able$") }
word1_or_word2_TüNDRA	word1_or_word2_SPARQL

pos_and_lemma_TüNDRA	pos_and_lemma_SPARQL
[pos = "NOUN" & lemma = /un.*/]	PREFIX conll: <http://ufal.mff.cuni.cz/conll2009-st/task-description.html#> SELECT ?sent WHERE { ?s conll:POS_COARSE "NOUN"; conll:LEMMA ?lemma; conll:SENT ?sent . FILTER regex(?lemma, "^un.*") }
adj_word1_and_word2_TüNDRA	adj_word1_and_word2_SPARQL

words_atadistance_2_TüNDRA	words_atadistance_2_SPARQL
[word = "the"] .2 [word = "world"]	PREFIX conll: <http://ufal.mff.cuni.cz/conll2009-st/task-description.html#> PREFIX nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> SELECT ?sent WHERE { ?s a nif:Sentence; conll:SENT ?sent FILTER (regex(?sent, "\\bthe \\w+ world\\b")) }
words_atadistance_2or3_TüNDRA	words_atadistance_2or3_SPARQL

words_at_any_distance_TüNDRA	words_at_any_distance_SPARQL
[word = "he"] .* [word = "to"]	PREFIX conll: <http://ufal.mff.cuni.cz/conll2009-st/task-description.html#> PREFIX nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> SELECT ?sent WHERE { ?s a nif:Sentence; conll:SENT ?sent FILTER (regex(?sent, "\\bhe\\b.*\\bto\\b")) }
adj_pos1_and_pos2_TüNDRA	adj_pos1_and_pos2_SPARQL

word1_headOf_word2_TüNDRA	word1_headOf_word2_SPARQL
[word = "see"] > [word = "we"]	PREFIX conll: <http://ufal.mff.cuni.cz/conll2009-st/task-description.html#> SELECT ?sent WHERE { ?s conll:WORD "see" . ?s1 conll:HEAD ?s; conll:WORD "we"; conll:SENT ?sent }
pos1_headOf_word2_edge_TüNDRA	pos1_headOf_word2_edge_SPARQL

pos1_headOf_not_pos2_TüNDRA	pos1_headOf_not_pos2_SPARQL
[pos="PROPN"] !> [pos="ADP"]	PREFIX conll: <http://ufal.mff.cuni.cz/conll2009-st/task-description.html#> SELECT ?sent WHERE { ?s conll:POS_COARSE "PROPN"; conll:SENT ?sent . ?s1 conll:HEAD ?s; conll:POS_COARSE ?pos; FILTER (?pos != "ADP") }
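
A query template of the kind used in the table can be sketched as follows. This is an illustration, not the content of sparql_queries.py: the function names, the dataset name "partut", and the local endpoint URL are assumptions. The URL follows the standard SPARQL protocol, in which the query is passed as the `query` parameter of a GET request; nothing is actually sent here.

```python
from urllib.parse import urlencode

CONLL_PREFIX = (
    "PREFIX conll: "
    "<http://ufal.mff.cuni.cz/conll2009-st/task-description.html#>"
)


def escape_literal(value: str) -> str:
    """Escape backslashes and double quotes for use inside a SPARQL string."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def word_regex_query(pattern: str) -> str:
    """Build a query returning the sentences that contain a word matching pattern."""
    return (
        f"{CONLL_PREFIX}\n"
        "SELECT ?sent WHERE {\n"
        "  ?s conll:WORD ?word; conll:SENT ?sent .\n"
        f'  FILTER regex(?word, "{escape_literal(pattern)}")\n'
        "}"
    )


def query_url(endpoint: str, query: str) -> str:
    """Build the GET URL for a SPARQL endpoint (the request itself is not sent)."""
    return endpoint + "?" + urlencode({"query": query})


query = word_regex_query(".*able$")
print(query)
# Hypothetical local Fuseki dataset; adjust host, port, and dataset name.
print(query_url("http://localhost:3030/partut/sparql", query))
```

Escaping the pattern in one place means a regular expression written with single backslashes in Python ends up with the doubled backslashes that the SPARQL string literals in the table use.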

Limitations

All in all, the queries above gave a good result: all the (major) TüNDRA queries can be reproduced in SPARQL. TüNDRA syntax is arguably more straightforward and easier to learn than SPARQL syntax, although users who are already proficient in SPARQL may find the opposite.
However, the major disadvantage of SPARQL queries is the limited range of information that can be retrieved. Unlike SPARQL, TüNDRA was created specifically for querying linguistic dependency and constituency treebanks. Apart from displaying sentences, it also displays the corresponding tree; a table view of each word of the sentence, whose attributes can be added to or removed from the table "in a click"; statistics of the attributes; and another table showing the results in context. The output of "Statistics" and "Table View" can also be controlled via the query itself, which is likewise not available in SPARQL.
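
Part of what the "Statistics" view offers can be approximated outside the query by post-processing the results on the client side. The sketch below counts the values of one variable in a result set given in the standard SPARQL JSON results format. The function name attribute_counts and the sample data are made up for illustration and are not taken from the treebank.

```python
from collections import Counter
from typing import Any, Dict


def attribute_counts(results: Dict[str, Any], var: str) -> Counter:
    """Count how often each value of `var` occurs in SPARQL JSON results."""
    bindings = results["results"]["bindings"]
    # A variable can be unbound in a row, so skip rows that lack it.
    return Counter(row[var]["value"] for row in bindings if var in row)


# Hand-made sample in the SPARQL JSON results layout.
sample = {
    "head": {"vars": ["pos"]},
    "results": {
        "bindings": [
            {"pos": {"type": "literal", "value": "NOUN"}},
            {"pos": {"type": "literal", "value": "VERB"}},
            {"pos": {"type": "literal", "value": "NOUN"}},
        ]
    },
}

for value, count in attribute_counts(sample, "pos").most_common():
    print(value, count)
```

This covers only frequency counts; the tree display and the interactive table view remain specific to TüNDRA.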

Output

The conclusion is that the efficiency of SPARQL for this task can still be improved. For example, a natural further development would be the creation of additional web tools for SPARQL that visualise the output or even build the corresponding linguistic trees. Nevertheless, the current research showed that SPARQL can already be used to search successfully for the desired data.

References

Martens, S. (2013). TüNDRA: A Web Application for Treebank Search and Visualization. In: Proceedings of The Twelfth Workshop on Treebanks and Linguistic Theories (TLT12), Sofia, pp. 133-144. URL: http://bultreebank.org/TLT12/TLT12Proceedings.pdf
Chiarcos, C., Fäth, C. (2017). CoNLL-RDF: Linked Corpora Done in an NLP-Friendly Way. In: Gracia, J., Bond, F., McCrae, J., Buitelaar, P., Chiarcos, C., Hellmann, S. (eds) Language, Data, and Knowledge. LDK 2017, pp. 74-88.
Apache Jena Fuseki. URL: https://jena.apache.org/documentation/fuseki2/
