Ground truth data for bibliographic reference clustering

This repository contains a small dataset of ground truth data to evaluate reference clustering algorithms. It consists of pairs of references (ref, seed_ref), with a ground truth value (same-bibl-entity) indicating whether the two references in the pair are referring to the same bibliographic entity or not (1 = true, 0 = false, 0.5 = partly).

These reference pairs were sampled from all clusters of LB references, trying to represent clusters of different sizes. The cluster size can be derived from values in the column cluster_file as it is coded in the integer preceding the _ character.

The dataset is contained in the file data/dataset.tsv, obtained by concatenating (and filtering columns) the .tsv files in data/partials/*.tsv (see notebook).

Some stats:

total number of reference pairs: 3638
number of positive ref. pairs (same-bibl-entity == 1): 749
number of negative ref. pairs (same-bibl-entity == 0): 2886
number of partly positive ref. pairs (same-bibl-entity == 0.5): 3

Credits

The dataset was manually corrected by Silvia Ferronato in the context of the SNF-funded LinkedBooks project. It was created for the evaluation of the clustering algorithm by Tao Sun, carried out in the context of his semester project student at DHLAB EPFL (see project report).

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
data		data
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
concatenate-csv-partials.ipynb		concatenate-csv-partials.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Ground truth data for bibliographic reference clustering

Credits

About

Releases 1

Languages

License

ScholarIndex/GT-reference-clustering

Folders and files

Latest commit

History

Repository files navigation

Ground truth data for bibliographic reference clustering

Credits

About

Topics

Resources

License

Stars

Watchers

Forks

Releases 1

Languages