🚀 Rendered Interactive Graph 🚀
Graphing the dependencies (and dependents) of research software and their contributors.
- Install Just
- Create a new Conda / Mamba Env (Python 3.11 preferred)
- Install lib:
just install
(or pip install .)
There is a lot of pre-processed data stored on a GCS bucket. To download it, run:
rs-graph-data download
This will always download the latest version of the data (as pushed by @evamaxfield).
From the 2023-12-12 upload of the full dataset, we created a 50-example practice dataset to annotate first. In a single round of practice annotation, we achieved a Fleiss Kappa of 0.90 (almost perfect agreement).
Annotation criteria for entity matches are:
- dev_details name, username, or email is the same or incredibly similar to the author_details name. Specifically:
  - if the provided name in the dev_details is the same name as the author name from the author_details (or incredibly similar)

  or:
  - if no name is provided in the dev_details but the GitHub username or the email within dev_details looks similar to the author name from author_details
- there is no co-author (name) or co-contributor (username) that looks like a potentially better match.
- if there is uncertainty, err on the side of False
We then annotated a dataset of 3000 dev-author pairs following the same criteria as used in practice. After all 3000 dev-author pairs were annotated, we resolved any differences between annotations to form a single unified set of annotations for all 3000 pairs.
Inter-rater Reliability (Fleiss Kappa): 0.90 (Almost perfect agreement)
To reproduce this result run: rs-graph-modeling calculate-irr-for-dev-author-em-annotation
Inter-rater Reliability (Fleiss Kappa): 0.90 (Almost perfect agreement)
To reproduce this result run: rs-graph-modeling calculate-irr-for-dev-author-em-annotation --use-full-dataset
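For readers unfamiliar with the statistic, here is a minimal, standard-library sketch of how Fleiss Kappa is computed from a table of per-item category counts. It is not the code behind the rs-graph-modeling command. The function name, the toy counts, and the three-annotator setup are assumptions made for illustration, since the number of annotators is not stated here.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss Kappa for a table where counts[i][j] is the number of
    annotators who assigned item i to category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must be rated by the same number of annotators")
    n_categories = len(counts[0])

    # Observed agreement: per-item proportion of agreeing annotator pairs, averaged.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items

    # Chance agreement: sum of squared marginal category proportions.
    p_e = sum(
        (sum(row[j] for row in counts) / (n_items * n_raters)) ** 2
        for j in range(n_categories)
    )
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical toy data: 4 dev-author pairs, 3 annotators, columns = (False, True).
toy_counts = [[3, 0], [0, 3], [3, 0], [2, 1]]
print(round(fleiss_kappa(toy_counts), 3))
```

Note that the statistic is undefined when every annotation falls in a single category (chance agreement equals 1), so a production implementation would need to handle that case explicitly.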
dev_details | author_details | match |
---|---|---|
username: ideas-man; name: None; email: None; repos: https://github.com/DLR-RM/BlenderProc; co_contributors: themasterlink, cornerfarmer, MartinSmeyer, mayman99, wboerdijk, joe3141, MarkusKnauer, Sebastian-Jung, 5trobl, thodan, probabilisticrobotics, mansoorcheema, DavidRisch, apenzko, abahnasy, maximilianmuehlbauer, moizsajid, Victorlouisdg, hansaskov, wangg12, MarwinNumbers, harinandan1995, neixlo, zzilch, cuteday, andrewyguo, jascase901, HectorAnadon, beekama; | name: Klaus H. Strobl; repos: https://github.com/DLR-RM/BlenderProc; co_authors: Martin Sundermeyer, Rudolph Alexander Triebel, Dominik Winkelbauer, Wout Boerdijk, Matthias Humt, Markus C. Knauer, Maximilian Denninger; | FALSE |
This row is labeled False because there is no provided name in dev_details and the GitHub username in dev_details does not look similar to the author name in author_details.
dev_details | author_details | match |
---|---|---|
username: mattwthompson; name: Matt Thompson; email: redacted@redacted.org; repos: https://github.com/shirtsgroup/physical_validation; co_contributors: ptmerz, dependabot[bot], mrshirts, pre-commit-ci[bot], wehs7661, cwalker7, MatKie, tlfobe, lgtm-com[bot]; | name: S. T. Boothroyd; repos: https://github.com/shirtsgroup/physical_validation; co_authors: Wei-tse Hsu, Michael R. Shirts, Chris C. Walker, Matthew W Thompson, Pascal Timothée Merz; | FALSE |
This row is labeled False because the provided name in dev_details is different from the name provided in the author_details. Additionally, within the author details we see a member of co-authors with the name Matthew W Thompson which seems like a better match for the current dev_details under consideration.
dev_details | author_details | match |
---|---|---|
username: sepandhaghighi; name: Sepand Haghighi; email: None; repos: https://github.com/sepandhaghighi/pyrgg, https://github.com/sepandhaghighi/pycm, https://github.com/ECSIM/opem; co_contributors: sadrasabouri, ivanovmg, ahmadsalimi, dependabot-preview[bot], dependabot[bot], codacy-badger, sadrasabouri, alirezazolanvari, pyup-bot, dependabot-preview[bot], dependabot[bot], GeetDsa, negarzabetian, robinsonkwame, lewiuberg, mahi97, MasoomehJasemi, soheeyang, cclauss, the-lay, pyup-bot, mahi97, sadrasabouri, dependabot-preview[bot], nnadeau, kasraaskari, engnadeau, dependabot[bot]; | name: Sepand Haghighi; repos: https://github.com/sepandhaghighi/pyrgg, https://github.com/sepandhaghighi/pycm; co_authors: Masoomeh Jasemi, Alireza Zolanvari, Shaahin Hessabi; | TRUE |
This row is labeled True because the provided name in dev_details is the same as the name in author_details. Additionally, within the author details we see no co-authors with a name that looks like a better match for the current dev_details under consideration.
Match -- GitHub Username or Email Similar to Author Name, No More Likely Match in Co-Contributors / Co-Authors
dev_details | author_details | match |
---|---|---|
username: kdpeterson51; name: None; email: None; repos: https://github.com/kdpeterson51/mbir; co_contributors: arcaldwell49, iriberri; | name: Kyle Bradley Peterson; repos: https://github.com/kdpeterson51/mbir; co_authors: Aaron Richard Caldwell; | TRUE |
This row is labeled True because the provided GitHub username in dev_details looks similar to the author name in author_details. Additionally, within the author details we see no co-authors with a name that looks like a better match for the current dev_details under consideration.
In total, we tested 224 model-feature combinations: 7 models, 4 feature combinations ("dev username - author name", "dev username and dev name - author name", "dev username and dev email - author name", and "dev username, dev name, and dev email - author name"), 4 negative example sizes (to test for issues with extreme label imbalance), and 2 model types (Logistic Regression via Semantic Embeddings or Fine-tuning the Base Models).
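The count of 224 follows from the product 7 x 4 x 4 x 2. The sketch below enumerates such a grid to make the arithmetic concrete. The base model names and the negative example sizes are placeholders (the actual values are not listed here), and the "fine-tune" label is an assumed name for the second model type.

```python
from itertools import product

# Placeholder identifiers: the 7 actual base models are not named in this section.
base_models = [f"base-model-{i}" for i in range(1, 8)]

# The four feature combinations, each compared against the author name.
feature_sets = [
    ("dev username",),
    ("dev username", "dev name"),
    ("dev username", "dev email"),
    ("dev username", "dev name", "dev email"),
]

# Placeholder labels: the 4 actual negative example sizes are not stated here.
negative_example_sizes = ["size-1", "size-2", "size-3", "size-4"]

# "semantic-logit" = logistic regression over semantic embeddings;
# "fine-tune" is an assumed label for fine-tuning the base model.
model_types = ["semantic-logit", "fine-tune"]

grid = list(product(base_models, feature_sets, negative_example_sizes, model_types))
print(len(grid))  # 7 * 4 * 4 * 2 = 224
```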
We used 60%, 20%, 20% (train, epoch eval, and test) splits for the data.
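As an illustration of the proportions only, a 60/20/20 split can be sketched as below. This is not the project's splitting code. The function name, the fixed seed, and the use of 3000 integer IDs as stand-ins for the annotated pairs are assumptions, and any stratification by label is left out.

```python
import random


def split_60_20_20(items, seed=0):
    """Shuffle items and return (train, epoch_eval, test) lists in a 60/20/20 ratio."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = len(shuffled) * 60 // 100    # integer arithmetic avoids float rounding
    n_eval = len(shuffled) * 20 // 100
    train = shuffled[:n_train]
    epoch_eval = shuffled[n_train:n_train + n_eval]
    test = shuffled[n_train + n_eval:]     # the remainder goes to the test split
    return train, epoch_eval, test


# Hypothetical stand-in for the annotated pairs: integer IDs 0..2999.
train, epoch_eval, test = split_60_20_20(range(3000))
print(len(train), len(epoch_eval), len(test))
```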
After all model training and evaluation had completed, we found these configurations to be the top 10:
TODO: include table
We observe that logistic regression from semantic embeddings produced by deberta-v3 performed best (noting minimal difference between "dev username and dev name" and "dev username, dev name, and dev email").
For our final model, we trained these two models again, individually, and found that the model with "dev username, dev name, and dev email" produced just barely better results than the model without dev email.
Thus, the final model (semantic-logit, "dev username, dev name, and dev email" embedded with deberta v3) is made available.
Evaluation Results for Final Model:
- Precision: 0.93
- Recall: 0.96
- F1 (binary): 0.95
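For reference, the sketch below shows how these three metrics relate to confusion-matrix counts, assuming "F1 (binary)" means the F1 score of the positive (match) class, as with scikit-learn's average="binary" setting. The function name and the counts are hypothetical and are not the final model's actual test-set counts.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Metrics for the positive (match = True) class from confusion counts."""
    precision = tp / (tp + fp)  # share of predicted matches that are correct
    recall = tp / (tp + fn)     # share of true matches that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical counts for illustration only.
precision, recall, f1 = precision_recall_f1(tp=6, fp=2, fn=4)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```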