Switch build pipeline to new Python library #3996
Comments
It's also successfully building already (although the BibTeX part is not ported yet):
Progress update: I now checked Since a naive diff between files was getting too tedious, I defined my own diff script that specifically ignores certain expected differences. At the same time, it serves as documentation for what those expected differences are. The "big" files, namely the people and paper YAML files, are still left to do, but I hope I can finish this work before the end of the year.
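To illustrate the idea of a diff that ignores expected differences, here is a minimal sketch. It is not the actual diff script from this issue: the helper names `drop_nulls` and `diff_records` are hypothetical, and the only expected difference it encodes is the omission of null-valued keys (such as `url: null`), assuming the YAML files have already been loaded into plain Python dictionaries.

```python
def drop_nulls(value):
    """Recursively remove mapping entries whose value is None (hypothetical helper)."""
    if isinstance(value, dict):
        return {k: drop_nulls(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [drop_nulls(v) for v in value]
    return value


def diff_records(old, new):
    """Return the sorted keys whose values differ after normalization."""
    old, new = drop_nulls(old), drop_nulls(new)
    return sorted(k for k in old.keys() | new.keys() if old.get(k) != new.get(k))


# Made-up example records: the null "url" is an expected difference and is
# ignored, while the changed "year" is reported.
old = {"title": "A Paper", "url": None, "year": "2020"}
new = {"title": "A Paper", "year": "2021"}
print(diff_records(old, new))  # ['year']
```

Encoding each expected difference as a normalization step keeps the list of tolerated changes explicit, which is what lets such a script double as documentation.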
Following your comment from #4147, I agree it would be awesome to get this finished off. There have been at least a handful of changes to the original implementation and I fear the longer we let this sit the more we will diverge. I wonder if we should adopt a sink-or-swim approach: we delete the original implementation, and then we are simply forced to update new scripts in a lazy fashion as we need them. I have a few comments along that route:
In the branches, I already added a deprecation notice to the old library, so that all scripts that import it will show it. That could be an alternative to sink-or-swim, in that it will give us a reminder each time a script still uses legacy code. Also, regarding the build pipeline, doing a careful diff between the YAML files that are generated with the old code vs. the new code has already revealed numerous small oversights and bugs, so while it has been really tedious and time-consuming, I think it was helpful to ensure the new library is doing the Right Thing™️. There are quite a few complexities and edge cases in our data and the way we interpret it ...
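For reference, a deprecation notice of this kind can be emitted with Python's standard `warnings` module. The sketch below is only an assumption about how such a notice might look, not the code from the branches; `warn_legacy` and the module name `legacy_anthology` are hypothetical.

```python
import warnings


def warn_legacy(module_name):
    """Emit a deprecation notice naming the legacy module (hypothetical helper)."""
    warnings.warn(
        f"{module_name} is deprecated; port this script to the new library.",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not this helper
    )


# Demonstration: record the warning so that its message can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_legacy("legacy_anthology")

print(caught[0].message)
```

Calling such a helper at the top of the legacy package would surface the reminder in every script that still imports it, subject to the active warning filters.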
I think I will be ready to create a PR soon, though I doubt that the diffs will be useful.
Yes, you can look at it on the https://github.com/acl-org/acl-anthology/tree/build-pipeline-with-new-library branch; it should be 95% complete/correct. I am working on the remaining 5% :)
Do you mean you experienced long import times with the new module? That would be weird, as it’s designed from the ground up around lazy-loading, i.e. will only load information at the moment you need it, not during import, and even calling Caching is another thing that I wanted to look into but haven’t yet, but it should already be much faster as it is now. (EDIT: Caching should be particularly helpful for scripts that do something with the author index, as that requires loading all XML files currently.)
Oh, but I should add that instantiating it without pointing it to an existing Anthology data directory will clone the Git repo, which will take some time of course. :)
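The lazy-loading pattern mentioned above can be sketched generically as follows. This is not the new library's implementation or API; `LazyCollection` and its loader are hypothetical stand-ins that only show data being read on first access rather than at construction time.

```python
from functools import cached_property


class LazyCollection:
    """Hypothetical stand-in for a data object that defers loading."""

    def __init__(self, loader):
        self._loader = loader  # nothing is loaded at construction time
        self.load_count = 0

    @cached_property
    def items(self):
        # Runs only on first access; the result is then cached on the instance.
        self.load_count += 1
        return self._loader()


collection = LazyCollection(lambda: ["paper-1", "paper-2"])
assert collection.load_count == 0  # constructing did not load anything
collection.items  # first access triggers the load
collection.items  # second access reuses the cached result
print(collection.load_count)  # 1
```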
A while ago, we merged the new Python library into this repo, but the build pipeline still uses the legacy code.
Porting the build pipeline to the new library now happens on the https://github.com/acl-org/acl-anthology/tree/build-pipeline-with-new-library branch.
Roadmap
data/yaml/papers/* files with new library
data/yaml/volumes.yaml with new library
data/yaml/people/* files with new library
data/yaml/venues.yaml with new library
data/yaml/events.yaml with new library
data/yaml/sigs.yaml with new library
papers/*.yaml for functional equivalence
volumes.yaml for functional equivalence
people/*.yaml for functional equivalence
name_variants.yaml in all cases, and more consistently adding name variants in different scripts to the displayed canonical name
venues.yaml for functional equivalence
events.yaml for functional equivalence
sigs.yaml for functional equivalence
url: null is omitted.
Custom diff script to help with this: https://gist.github.com/mbollmann/827a079023ebdd18b4d06c28566fac0d
Performance optimizations