The goal of this project is to compare the content/knowledge of different Wikipedia projects. In particular, we are interested in the different language editions of Wikipedia and in Wikidata.
For example, looking at the University of Amsterdam:
| UvA (Dutch) | UvA (English) | UvA (Wikidata) |
| --- | --- | --- |
you see different content. The goal of this project is to create quantitative measures of these differences.
This is useful in the context of projects we work on at indelab.org, which focus on adding knowledge to knowledge bases such as Wikidata.
See for example:
- Prompting as Probing: Using Language Models for Knowledge Base Construction. Dimitrios Alivanistos, Selene Báez Santamaría, Michael Cochez, Jan-Christoph Kalo, Emile van Krieken, Thiviyan Thanapalasingam. (GitHub)
- Inductive Entity Representations from Text via Link Prediction. Daniel Daza, Michael Cochez, and Paul Groth. In The Web Conference 2021. (GitHub)
The results below are for Dutch universities as defined by the following SPARQL query, executed over Wikidata:
```sparql
SELECT ?item
WHERE {
  ?item wdt:P31 wd:Q3918 .
  ?item wdt:P17 wd:Q55 .
  ?nlSite schema:isPartOf <https://nl.wikipedia.org/> .
  ?enSite schema:isPartOf <https://en.wikipedia.org/> .
  ?nlSite schema:about ?item .
  ?enSite schema:about ?item .
}
```
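As an illustration, the query can be run against the public Wikidata SPARQL endpoint, for instance with the `requests` library (a minimal sketch; variable names and the User-Agent string are ours):

```python
import requests

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"
QUERY = """SELECT ?item WHERE {
  ?item wdt:P31 wd:Q3918 .
  ?item wdt:P17 wd:Q55 .
  ?nlSite schema:isPartOf <https://nl.wikipedia.org/> .
  ?enSite schema:isPartOf <https://en.wikipedia.org/> .
  ?nlSite schema:about ?item .
  ?enSite schema:about ?item .
}"""

# Ask the endpoint for JSON results; a descriptive User-Agent is recommended
# by the Wikidata query service.
response = requests.get(
    WDQS_ENDPOINT,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "wiki-comparison-example/0.1"},
)
response.raise_for_status()

# Extract the QIDs from the returned item URIs.
qids = [
    binding["item"]["value"].rsplit("/", 1)[-1]
    for binding in response.json()["results"]["bindings"]
]
```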
This retrieves all entities of type (wdt:P31) university (wd:Q3918) that have the Netherlands (wd:Q55) as their country (wdt:P17), and that have both a Dutch and an English Wikipedia article. We then use Pywikibot to retrieve the Wikipedia pages from the Dutch and English Wikipedias, as well as the corresponding items from Wikidata. We provide a handy DuckDB file containing this downloaded information.
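A rough sketch of this download step, assuming a working Pywikibot configuration; the DuckDB file name and table layout below are ours, not necessarily the layout of the provided file:

```python
import duckdb
import pywikibot

nl_wiki = pywikibot.Site("nl", "wikipedia")
en_wiki = pywikibot.Site("en", "wikipedia")
wikidata = pywikibot.Site("wikidata", "wikidata")

con = duckdb.connect("universities.duckdb")  # hypothetical file name
con.execute(
    "CREATE TABLE IF NOT EXISTS pages (qid TEXT, nl_text TEXT, en_text TEXT, n_claims INTEGER)"
)

for qid in qids:  # qids: the list of QIDs returned by the SPARQL query above
    item = pywikibot.ItemPage(wikidata, qid)
    item.get()  # loads labels, claims and sitelinks
    # The query guarantees that both a Dutch and an English sitelink exist.
    nl_page = pywikibot.Page(nl_wiki, item.getSitelink(nl_wiki))
    en_page = pywikibot.Page(en_wiki, item.getSitelink(en_wiki))
    n_claims = sum(len(statements) for statements in item.claims.values())
    con.execute(
        "INSERT INTO pages VALUES (?, ?, ?, ?)",
        [qid, nl_page.text, en_page.text, n_claims],
    )
```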
The word count distribution of the Dutch pages translated to English is more skewed than that of the English pages. Concerning the ratios, the number of pages with a higher word count in the English version (ratio > 1) is slightly larger than the number of pages with more words in the translated Dutch version.
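The counts and ratios can be computed along these lines (a sketch; it assumes the hypothetical `pages` table from above and leaves out the translation of the Dutch text to English):

```python
import duckdb

con = duckdb.connect("universities.duckdb")
rows = con.execute("SELECT qid, nl_text, en_text FROM pages").fetchall()

def word_count(text: str) -> int:
    # Crude whitespace tokenisation; the actual analysis may tokenise differently.
    return len(text.split())

# ratio > 1 means the English article contains more words than the Dutch one.
ratios = {qid: word_count(en) / max(word_count(nl), 1) for qid, nl, en in rows}
```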
The Wikipedia page texts are compressed using the gzip algorithm. The size of the resulting file is an approximation of the algorithmic information content (Kolmogorov complexity).
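A sketch of this estimate; `en_text` and `nl_text` are placeholders for the page texts loaded from the DuckDB file:

```python
import gzip

def gzip_size(text: str) -> int:
    # Size in bytes of the gzip-compressed UTF-8 text, used as a rough
    # proxy for its algorithmic information content.
    return len(gzip.compress(text.encode("utf-8")))

# ratio > 1: the English page compresses to more bytes,
# i.e. it roughly carries more information than the (translated) Dutch page.
ratio = gzip_size(en_text) / gzip_size(nl_text)
```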
The distribution of the gzip size of the pages is pretty similar between English and translated Dutch pages. A one-to-one comparison (ratio) shows that most of the English pages contain more information than the translated Dutch pages.
Here we used the pretrained small spaCy language models for Dutch and English to perform named entity recognition.
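A minimal sketch using the small pretrained spaCy pipelines (`en_core_web_sm` and `nl_core_news_sm`, which have to be downloaded separately); `en_text` is again a placeholder for a page text:

```python
import spacy

# Install the models first, e.g.:
#   python -m spacy download en_core_web_sm
#   python -m spacy download nl_core_news_sm
nlp_en = spacy.load("en_core_web_sm")
nlp_nl = spacy.load("nl_core_news_sm")

doc = nlp_en(en_text)
entities = [(ent.text, ent.label_) for ent in doc.ents]  # e.g. ("Amsterdam", "GPE")
```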
For all sentences on all wiki pages we compute vector embeddings. These are 300-dimensional; to visualise them, we project a sample onto the first three principal components.
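A sketch of this step and of the clustering described next, assuming a spaCy model with 300-dimensional word vectors (e.g. `en_core_web_md`) and scikit-learn; the DBSCAN parameters are placeholders:

```python
import numpy as np
import spacy
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA

nlp = spacy.load("en_core_web_md")  # a pipeline that provides 300-dim word vectors

doc = nlp(en_text)
# One 300-dimensional vector per sentence (the average of its token vectors).
vectors = np.array([sent.vector for sent in doc.sents])

# Project onto the first three principal components for visualisation.
projected = PCA(n_components=3).fit_transform(vectors)

# Cluster the sentence vectors; the number of clusters is taken as a rough
# estimate of the number of topics. Label -1 marks noise points.
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(vectors)
n_topics = len(set(labels)) - (1 if -1 in labels else 0)
```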
With DBSCAN we determine the number of clusters, i.e. topics, in these texts. The most frequent words in the two languages compare as follows:
ENGLISH
- Total words: 2801
- Top words:
- university 112
- amsterdam 73
- student 69
- faculty 66
- science 45
- research 37
- uva 29
- academic 23
- dutch 20
- netherlands 19
DUTCH
- Total words: 735
- Top words:
- university 34
- institute 34
- amsterdam 34
- uva 19
- student 18
- science 12
- faculty 11
- research 10
- study 9
- center 9
POPULAR IN BOTH LANGUAGES:
amsterdam center faculty institute law library locate research school science student study time university uva
Popular in EN, but NOT popular in NL:
academic area campus city cultural degree dentistry department doctoral dutch european former house humanity include medicine million minister museum nobel offer one prize ranked three winner within world
Popular in NL, but NOT popular in EN:
association auc collaboration hva language municipality special
(The overlap between the two sets of popular words is measured with the Jaccard metric.)
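The overlap can be computed along these lines (a sketch; the sets below are truncated to a few of the words listed above):

```python
def jaccard(a: set, b: set) -> float:
    # Jaccard similarity: size of the intersection divided by size of the union.
    return len(a & b) / len(a | b)

top_en = {"university", "amsterdam", "student", "faculty", "science", "research"}
top_nl = {"university", "institute", "amsterdam", "uva", "student", "science"}

popular_in_both = top_en & top_nl
only_en = top_en - top_nl
only_nl = top_nl - top_en
similarity = jaccard(top_en, top_nl)
```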
Here we compared the number of claims in Wikidata to the number of entries in the infobox of each university's Wikipedia page (excluding universities without an infobox).
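A sketch of this comparison, counting Wikidata statements via Pywikibot and infobox parameters with the mwparserfromhell library; matching the infobox by template name is a simplification:

```python
import mwparserfromhell
import pywikibot

wikidata = pywikibot.Site("wikidata", "wikidata")

def count_claims(qid: str) -> int:
    # Total number of statements over all properties of the Wikidata item.
    item = pywikibot.ItemPage(wikidata, qid)
    item.get()
    return sum(len(statements) for statements in item.claims.values())

def count_infobox_entries(wikitext: str) -> int:
    # Number of filled-in parameters of the first template whose name contains "infobox".
    for template in mwparserfromhell.parse(wikitext).filter_templates():
        if "infobox" in str(template.name).lower():
            return sum(1 for param in template.params if str(param.value).strip())
    return 0  # no infobox found; such pages are excluded from the comparison
```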
For both languages, we retrieved descriptive statistics, including word counts and the number of sentences.
Using spaCy, we also determined the complexity of the texts. The LIX index takes into account the number of complex words (more than four syllables) in the texts.
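A sketch of such a readability measure, using spaCy for sentence splitting and a crude vowel-group heuristic for syllable counting; note that the standard LIX formula counts words longer than six characters, so this sketch follows the syllable-based description above instead:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # use nl_core_news_sm for the Dutch texts

def syllables(word: str) -> int:
    # Very rough heuristic: count groups of consecutive vowels.
    vowels = "aeiouy"
    groups, prev_vowel = 0, False
    for ch in word.lower():
        is_vowel = ch in vowels
        if is_vowel and not prev_vowel:
            groups += 1
        prev_vowel = is_vowel
    return max(groups, 1)

def lix_like(text: str) -> float:
    doc = nlp(text)
    words = [t.text for t in doc if t.is_alpha]
    n_sentences = max(sum(1 for _ in doc.sents), 1)
    # "Complex" words are taken to be those with more than four syllables.
    n_complex = sum(1 for w in words if syllables(w) > 4)
    return len(words) / n_sentences + 100 * n_complex / max(len(words), 1)
```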