Switch build pipeline to new Python library #3996

mbollmann · 2024-11-02T18:02:34Z

A while ago, we merged the new Python library into this repo, but the build pipeline still uses the legacy code.

Porting the build pipeline to the new library now happens on the https://github.com/acl-org/acl-anthology/tree/build-pipeline-with-new-library branch.

Roadmap

Custom diff script to help with this: https://gist.github.com/mbollmann/827a079023ebdd18b4d06c28566fac0d

Performance optimizations

So far, bibliography strings were generated with citeproc-py. Replacing this with a custom Python function cuts the time to generate them from four minutes to a few seconds in my local testing.
Switch from YAML to JSON for build pipeline #4153 (moved to separate issue now, since changing the data format complicates checking for functional equivalence)
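To illustrate the bibliography-string optimization above, here is a minimal sketch of what a hand-written formatter could look like. It is not the function used in the branch: the `Entry` class, its fields, and the citation layout are all hypothetical and only show that plain string operations can stand in for a general-purpose CSL processor.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical, simplified paper record (not the library's data model)."""

    authors: list[str]
    title: str
    year: str
    booktitle: str = ""
    pages: str = ""


def join_names(names: list[str]) -> str:
    """Join names as 'A', 'A and B', or 'A, B, and C'."""
    if len(names) <= 1:
        return "".join(names)
    if len(names) == 2:
        return f"{names[0]} and {names[1]}"
    return ", ".join(names[:-1]) + f", and {names[-1]}"


def format_citation(entry: Entry) -> str:
    """Build a citation string using only plain string operations."""
    parts = []
    if entry.authors:
        parts.append(f"{join_names(entry.authors)}.")
    parts.append(f"{entry.year}.")
    # Avoid a doubled period if the title already ends with one.
    parts.append(f"{entry.title.rstrip('.')}.")
    if entry.booktitle:
        venue = f"In {entry.booktitle}"
        if entry.pages:
            venue += f", pages {entry.pages}"
        parts.append(f"{venue}.")
    return " ".join(parts)
```

A formatter like this hard-codes a single citation style, which is presumably where the speed-up comes from: nothing has to interpret a style definition at run time.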


mbollmann · 2024-11-03T18:57:48Z

It's also successfully building already (although the BibTeX part is not ported yet):
https://preview.aclanthology.org/build-pipeline-with-new-library/

mbollmann · 2024-12-14T16:28:24Z

Progress update: I have now checked volumes.yaml for functional equivalence. These checks have already uncovered various minor bugs and oddities.

Since a naive diff between files was getting too tedious, I wrote my own diff script that specifically ignores certain expected differences. At the same time, it serves as documentation of what those expected differences are.
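The idea of a diff that skips expected differences can be sketched roughly as follows. This is not the script from the gist: the function name, the path format, and the keys in `IGNORED_KEYS` are hypothetical placeholders, and the input is assumed to be already-parsed YAML (nested dicts, lists, and scalars).

```python
from typing import Any

# Hypothetical examples of keys whose values are expected to differ
# between the old and the new pipeline output.
IGNORED_KEYS = {"citation", "bibkey_legacy"}


def diff(old: Any, new: Any, path: str = "") -> list[str]:
    """Return human-readable differences between two parsed YAML values."""
    if isinstance(old, dict) and isinstance(new, dict):
        out = []
        for key in sorted(set(old) | set(new), key=str):
            if key in IGNORED_KEYS:
                continue  # expected difference, documented by the set above
            sub = f"{path}/{key}"
            if key not in new:
                out.append(f"{sub}: missing in new")
            elif key not in old:
                out.append(f"{sub}: missing in old")
            else:
                out.extend(diff(old[key], new[key], sub))
        return out
    if isinstance(old, list) and isinstance(new, list) and len(old) == len(new):
        out = []
        for i, (a, b) in enumerate(zip(old, new)):
            out.extend(diff(a, b, f"{path}[{i}]"))
        return out
    # Scalars, mismatched types, or lists of different length.
    if old != new:
        return [f"{path or '/'}: {old!r} != {new!r}"]
    return []
```

Keeping the ignore list as data rather than scattering special cases through the comparison is what lets the script double as documentation of the expected differences.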

The "big" files, namely the people and paper YAML files, are still left to do, but I hope I can finish this work before the end of the year.

mjpost · 2024-12-22T14:56:23Z

Following your comment from #4147, I agree it would be awesome to get this finished off. There have been at least a handful of changes to the original implementation and I fear the longer we let this sit the more we will diverge.

I wonder if we should adopt a sink-or-swim approach: we delete the original implementation, and then we are simply forced to update new scripts in a lazy fashion as we need them.

I have a few comments along that route:

It would be helpful to create the PR so we could easily see and comment on the diff. I confess I don't really know what's changed in terms of the API.
If you ported create_hugo_yaml.py, that would be a big help to me as a reference for porting ingest.py and ingest_aclpub2.py. The other important ones will be related to adding DOIs and handling corrections, but I don't anticipate these would be difficult.
I did install the PyPI module and loaded it, and my first comment is that it would be really nice to address a weakness of the current module, which is that the import takes a long time. I wonder whether there is something we could do with lazy evaluation (ideally) or a timestamped cache that would help with this.

mbollmann · 2024-12-23T14:10:19Z

> I wonder if we should adopt a sink-or-swim approach: we delete the original implementation, and then we are simply forced to update new scripts in a lazy fashion as we need them.

I have already added a deprecation notice to the old library in the branches, so every script that imports it will display the notice. That could be an alternative to sink-or-swim, in that it gives us a reminder each time a script still uses legacy code.
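For readers unfamiliar with the mechanism, a deprecation notice of this kind can be sketched with the standard `warnings` module. The function name and message below are hypothetical and not taken from the branch; in a real package the call would sit at the top level of the legacy module so that it runs on first import.

```python
import warnings


def emit_legacy_notice() -> None:
    """Warn that the legacy library is deprecated.

    In a real package, this call would be placed at module level in the
    legacy library's __init__.py instead of inside a function.
    """
    warnings.warn(
        "The legacy library is deprecated; "
        "please port this script to the new library.",
        FutureWarning,  # shown by default, unlike DeprecationWarning
        stacklevel=2,  # attribute the warning to the caller
    )
```

`FutureWarning` is chosen here because Python's default filters hide `DeprecationWarning` unless it is triggered by code in `__main__`, whereas a reminder like this is only useful if it is always visible.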

Also, regarding the build pipeline, doing a careful diff between the YAML files generated with the old code and with the new code has already revealed numerous small oversights and bugs. So while it has been really tedious and time-consuming, I think it was helpful to ensure the new library is doing the Right Thing™. There are quite a lot of complexities and edge cases in our data and the way we interpret it ...

> I have a few comments along that route:
>
> It would be helpful to create the PR so we could easily see and comment on the diff. I confess I don't really know what's changed in terms of the API.

I think I will be ready to create a PR soon, though I doubt that the diffs will be useful. create_hugo_yaml.py is basically entirely rewritten, and that’s the main thing that changes for this switch.

> If you ported create_hugo_yaml.py, that would be a big help to me as a reference for porting ingest.py and ingest_aclpub2.py. The other important ones will be related to adding DOIs and handling corrections, but I don't anticipate these would be difficult

Yes, you can look at it on the https://github.com/acl-org/acl-anthology/tree/build-pipeline-with-new-library branch; it should be 95% complete/correct. I am working on the remaining 5% :)

> I did install the pypi module and loaded it, and my first comment is that it would be really nice to address a weakness with the current module, which is that the import takes a long time. I wonder whether there is something we could do with lazy evaluation (ideally) or a timestamped cache that would help out with this

Do you mean you experienced long import times with the new module? That would be odd, since it is designed from the ground up around lazy loading: it loads information only at the moment you need it, not during import. Even calling .load_all() to load and parse all data files (which I use in the new create_hugo_yaml.py script) takes around 4 seconds on my home computer (10 seconds on my laptop).
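As a generic illustration of the lazy-loading pattern described here (not the new library's actual implementation), the sketch below defers parsing until data is first accessed. `LazyCollection` and `LazyAnthology` are hypothetical classes; only the method name `load_all` is borrowed from the discussion above.

```python
from functools import cached_property


class LazyCollection:
    """Hypothetical stand-in for a data file that is parsed on first access."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.load_count = 0  # for illustration: how often parsing ran

    @cached_property
    def data(self) -> dict[str, str]:
        # Stand-in for expensive XML parsing; runs at most once per instance.
        self.load_count += 1
        return {"id": self.name}


class LazyAnthology:
    """Hypothetical container; constructing it parses nothing."""

    def __init__(self, names: list[str]) -> None:
        self.collections = {n: LazyCollection(n) for n in names}

    def load_all(self) -> None:
        """Eagerly parse everything, for scripts that need all the data."""
        for collection in self.collections.values():
            _ = collection.data
```

With this structure, importing the module and constructing the container cost almost nothing; the parsing cost is paid per file on first access, or all at once when a script explicitly asks for it.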

Caching is another thing I wanted to look into but haven't yet; even without it, the new module should already be much faster. (EDIT: Caching should be particularly helpful for scripts that do something with the author index, as that currently requires loading all XML files.)
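One way the timestamped cache suggested earlier in the thread could work is sketched below. Nothing like this exists in the library as far as this thread states; `load_with_cache`, its parameters, and the JSON cache format are assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Callable


def load_with_cache(
    source: Path, cache: Path, parse: Callable[[Path], dict]
) -> dict:
    """Return parse(source), reusing a JSON cache if it is not older than source."""
    if cache.exists() and cache.stat().st_mtime_ns >= source.stat().st_mtime_ns:
        # Cache was written after the last change to the source file.
        return json.loads(cache.read_text(encoding="utf-8"))
    result = parse(source)
    cache.write_text(json.dumps(result), encoding="utf-8")
    return result
```

For something like the author index, which depends on many XML files, the same check would have to compare the cache against the newest modification time among all inputs rather than a single file.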

mbollmann · 2024-12-23T14:27:56Z

> Do you mean you experienced long import times with the new module? That would be weird, as it’s entirely designed to only load information at the moment you need it,

Oh, but I should add that instantiating it without pointing it to an existing Anthology data directory will clone the Git repo, which will take some time of course. :)

mbollmann added the enhancement label Nov 2, 2024

mbollmann self-assigned this Nov 2, 2024

mbollmann mentioned this issue Nov 11, 2024

Missing middle names in ingestion materials #3985

Open

mbollmann mentioned this issue Dec 21, 2024

AutoPR for paper information correction #4147

Open

5 tasks

mjpost added this to the 2024Q4 milestone Dec 22, 2024

mjpost pinned this issue Dec 22, 2024

mbollmann mentioned this issue Dec 23, 2024

Inconsistency with determining canonical name #4186

Open
