Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch build pipeline to new Python library #3996

Open
14 of 15 tasks
mbollmann opened this issue Nov 2, 2024 · 5 comments
Open
14 of 15 tasks

Switch build pipeline to new Python library #3996

mbollmann opened this issue Nov 2, 2024 · 5 comments
Assignees
Milestone

Comments

@mbollmann
Copy link
Member

mbollmann commented Nov 2, 2024

A while ago, we merged the new Python library into this repo, but the build pipeline still uses the legacy code.

Porting the build pipeline to the new library now happens on the https://github.com/acl-org/acl-anthology/tree/build-pipeline-with-new-library branch.

Roadmap

  • Create data/yaml/papers/* files with new library
  • Create data/yaml/volumes.yaml with new library
  • Create data/yaml/people/* files with new library
  • Create data/yaml/venues.yaml with new library
  • Create data/yaml/events.yaml with new library
  • Create data/yaml/sigs.yaml with new library
  • Create BibTeX with new library
  • Inspect diffs of generated YAML files between old and new libraries for any functional changes
    • Check papers/*.yaml for functional equivalence
      • Tons of diffs.; see diff script.
    • Check volumes.yaml for functional equivalence
      • More events are listed than before. This seems more correct than before; e.g., W99-09 does not list an event "ws-1999", even though there is an XML file that explicitly defines and links it. The new pipeline explicitly lists it as an event under W99-09, which seems more correct.
      • Some unused keys are removed.
      • The order in which SIGs are listed is now alphabetical, which differs from how it was before.
    • Check people/*.yaml for functional equivalence
      • Lots of differences due to slight changes in name scoring, forcing the canonical name if set in name_variants.yaml in all cases, and more consistently adding name variants in different scripts to the displayed canonical name
    • Check venues.yaml for functional equivalence
      • Sorting of volumes behaves slightly differently, but more consistently (volumes are always sorted alphabetically by their collection ID, but order within a collection – i.e. XML file – is preserved).
    • Check events.yaml for functional equivalence
      • Sorting algorithm for associated volumes behaves slightly differently, but usually more correctly (Findings wasn't always moved above workshops before with joint events; main volumes of EMNLP 2018 were not moved to the top).
    • Check sigs.yaml for functional equivalence
      • Years in keys are not being quoted; url: null is omitted.

Custom diff script to help with this: https://gist.github.com/mbollmann/827a079023ebdd18b4d06c28566fac0d

Performance optimizations

  • Generating bibliography strings was so far done with citeproc-py. Replacing this with a custom Python function speeds up the generation of bibliography strings from four minutes to a few seconds in my local testing.
  • Switch from YAML to JSON for build pipeline #4153 (moved to separate issue now, since changing the data format complicates checking for functional equivalence)
@mbollmann mbollmann self-assigned this Nov 2, 2024
@mbollmann
Copy link
Member Author

It's also successfully building already (although the BibTeX part is not ported yet):
https://preview.aclanthology.org/build-pipeline-with-new-library/

@mbollmann
Copy link
Member Author

mbollmann commented Dec 14, 2024

Progress update: I now checked volumes.yaml for functional equivalence. All these checks did already uncover various minor bugs and oddities.

Since a naive diff between files was getting too tedious, I defined my own diff script that specifically ignores certain expected differences. At the same time, it serves as documentation for what those expected differences are.

The "big" files, namely the people and paper YAML files, are still left to do, but I hope I can finish this work before the end of the year.

@mjpost
Copy link
Member

mjpost commented Dec 22, 2024

Following your comment from #4147, I agree it would be awesome to get this finished off. There have been at least a handful of changes to the original implementation and I fear the longer we let this sit the more we will diverge.

I wonder if we should adopt a sink-or-swim approach: we delete the original implementation, and then we are simply forced to update new scripts in a lazy fashion as we need them.

I have a few comments along that route:

  • It would be helpful to create the PR so we could easily see and comment on the diff. I confess I don't really know what's changed in terms of the API.
  • If you ported create_hugo_yaml.py, that would be a big help to me as a reference for porting ingest.py and ingest_aclpub2.py. The other important ones will be related to adding DOIs and handling corrections, but I don't anticipate these would be difficult
  • I did install the pypi module and loaded it, and my first comment is that it would be really nice to address a weakness with the current module, which is that the import takes a long time. I wonder whether there is something we could do with lazy evaluation (ideally) or a timestamped cache that would help out with this

@mjpost mjpost added this to the 2024Q4 milestone Dec 22, 2024
@mjpost mjpost pinned this issue Dec 22, 2024
@mbollmann
Copy link
Member Author

mbollmann commented Dec 23, 2024

I wonder if we should adopt a sink-or-swim approach: we delete the original implementation, and then we are simply forced to update new scripts in a lazy fashion as we need them.

In the branches, I already added a deprecation notice to the old library, so that all scripts that import it will show it. That could be an alternative to sink-or-swim, in that it will give us a reminder each time a script still uses legacy code.

Also, regarding the build pipeline, doing a careful diff between the YAML files that are generated with the old code vs. the new code has already revealed numerous small oversights and bugs, so while it has been really tedious and time-consuming, I think it was helpful to ensure the new library is doing the Right Thing ™️ . There’s quite a bit of complexity and edge cases in our data and the way we interpret it ...

I have a few comments along that route:

  • It would be helpful to create the PR so we could easily see and comment on the diff. I confess I don't really know what's changed in terms of the API.

I think I will be ready to create a PR soon, though I doubt that the diffs will be useful. create_hugo_yaml.py is basically entirely rewritten, and that’s the main thing that changes for this switch.

  • If you ported create_hugo_yaml.py, that would be a big help to me as a reference for porting ingest.py and ingest_aclpub2.py. The other important ones will be related to adding DOIs and handling corrections, but I don't anticipate these would be difficult

Yes, you can look at it on the https://github.com/acl-org/acl-anthology/tree/build-pipeline-with-new-library branch, it should be 95% complete/correct. I am working on the remaining 5% :)

  • I did install the pypi module and loaded it, and my first comment is that it would be really nice to address a weakness with the current module, which is that the import takes a long time. I wonder whether there is something we could do with lazy evaluation (ideally) or a timestamped cache that would help out with this

Do you mean you experienced long import times with the new module? That would be weird, as it’s designed from the ground up around lazy-loading, i.e. will only load information at the moment you need it, not during import, and even calling .load_all() to load and parse all data files (which I use in the new create_hugo_yaml.py script) takes around 4 seconds on my home computer (10 seconds on my laptop).

Caching is another thing that I wanted to look into but haven’t yet, but it should already be much faster as it is now. (EDIT: Caching should be particularly helpful for scripts that do something with the author index, as that requires loading all XML files currently.)

@mbollmann
Copy link
Member Author

Do you mean you experienced long import times with the new module? That would be weird, as it’s entirely designed to only load information at the moment you need it,

Oh, but I should add that instantiating it without pointing it to an existing Anthology data directory will clone the Git repo, which will take some time of course. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

When branches are created from issues, their pull requests are automatically linked.

2 participants