Switch build pipeline to new Python library #3996

Open
9 of 16 tasks
mbollmann opened this issue Nov 2, 2024 · 2 comments

mbollmann commented Nov 2, 2024

A while ago, we merged the new Python library into this repo, but the build pipeline still uses the legacy code.

The build pipeline is now being ported to the new library on the https://github.com/acl-org/acl-anthology/tree/build-pipeline-with-new-library branch.

Roadmap

  • Create data/yaml/papers/* files with new library
  • Create data/yaml/volumes.yaml with new library
  • Create data/yaml/people/* files with new library
  • Create data/yaml/venues.yaml with new library
  • Create data/yaml/events.yaml with new library
  • Create data/yaml/sigs.yaml with new library
  • Create BibTeX with new library
  • Inspect diffs of generated YAML files between old and new libraries for any functional changes
    • Check papers/*.yaml for functional equivalence
    • Check volumes.yaml for functional equivalence
    • Check people/*.yaml for functional equivalence
    • Check venues.yaml for functional equivalence
    • Check events.yaml for functional equivalence
      • The sorting algorithm for associated volumes behaves slightly differently, but usually more correctly: with joint events, Findings wasn't always moved above workshops before, and the main volumes of EMNLP 2018 were not moved to the top.
    • Check sigs.yaml for functional equivalence
      • Years in keys are not being quoted; url: null is omitted.
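
One way to separate cosmetic differences (such as unquoted year keys or an omitted url: null) from functional ones is to normalize both parsed files before comparing them. The following is only a minimal sketch under stated assumptions: the function names and the example data shape are hypothetical and not part of the library, and loading the YAML files (for instance with PyYAML's yaml.safe_load) is left out so the example stays self-contained.

```python
def normalize(value):
    """Recursively normalize parsed data so cosmetic differences vanish.

    - dict keys are converted to strings (2018 and "2018" compare equal)
    - entries whose value is None are dropped (url: null == url omitted)
    """
    if isinstance(value, dict):
        return {
            str(key): normalize(item)
            for key, item in value.items()
            if item is not None
        }
    if isinstance(value, list):
        return [normalize(item) for item in value]
    return value


def diff_paths(old, new, path=""):
    """Yield the paths at which two normalized structures differ."""
    if isinstance(old, dict) and isinstance(new, dict):
        for key in sorted(old.keys() | new.keys()):
            sub = f"{path}/{key}"
            if key not in old or key not in new:
                yield sub
            else:
                yield from diff_paths(old[key], new[key], sub)
    elif isinstance(old, list) and isinstance(new, list) and len(old) == len(new):
        for index, (a, b) in enumerate(zip(old, new)):
            yield from diff_paths(a, b, f"{path}[{index}]")
    elif old != new:
        yield path or "/"


def functional_diff(old, new):
    """Return all paths that still differ after normalization."""
    return list(diff_paths(normalize(old), normalize(new)))


# Hypothetical data, shaped loosely like a per-year mapping.
legacy = {"sigs": {"2018": [{"name": "X", "url": None}]}}
ported = {"sigs": {2018: [{"name": "X"}]}}
print(functional_diff(legacy, ported))
```

An empty result means the two files differ only in ways the normalization treats as cosmetic; any remaining path points at a difference worth inspecting by hand.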

Performance optimizations

  • Generating bibliography strings has so far been done with citeproc-py. Replacing it with a custom Python function cuts the generation of bibliography strings from four minutes to a few seconds in my local testing.
  • YAML serialization (even with CDumper) is significantly slower than JSON serialization with msgspec in my testing (by a factor of at least 20); since Hugo also supports JSON for data files, we should probably switch the build pipeline to write JSON files instead.
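
Writing a data file as JSON instead of YAML could look roughly like the sketch below. It is an illustration, not the pipeline's actual code: the function name write_json_data and the file name venues.json are hypothetical, and the fallback to the standard-library json module is only there so the example runs without msgspec installed.

```python
import json
from pathlib import Path

try:
    import msgspec  # optional third-party dependency
except ImportError:
    msgspec = None


def write_json_data(path, data):
    """Serialize `data` to `path` as UTF-8 JSON; return the bytes written."""
    if msgspec is not None:
        # msgspec.json.encode returns bytes
        payload = msgspec.json.encode(data)
    else:
        # Standard-library fallback (assumption: only used when msgspec is absent)
        payload = json.dumps(data, ensure_ascii=False).encode("utf-8")
    return Path(path).write_bytes(payload)


if __name__ == "__main__":
    # Hypothetical example data and file name.
    write_json_data("venues.json", {"acl": {"name": "ACL", "years": ["2018", "2019"]}})
```

Since Hugo reads JSON data files as well as YAML ones, the templates would not need to change as long as the file contents stay structurally the same.
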
@mbollmann mbollmann self-assigned this Nov 2, 2024
mbollmann commented

Pending my checks for functional equivalence, the new library is already much faster at generating all the Hugo data files (about 74 secs instead of 176 secs of wall-clock time, roughly 2.4 times faster):

on master [?]
~/r/acl-anthology/bin $ time python create_hugo_yaml.py -c

________________________________________________________
Executed in  175.53 secs    fish           external
   usr time  173.80 secs  290.00 micros  173.80 secs
   sys time    1.25 secs   76.00 micros    1.25 secs


on build-pipeline-with-new-library [?]
~/r/acl-anthology/bin $ time python create_hugo_yaml.py -c

________________________________________________________
Executed in   73.86 secs    fish           external
   usr time   74.15 secs    0.00 micros   74.15 secs
   sys time    1.67 secs  935.00 micros    1.67 secs

mbollmann commented

The site also already builds successfully (although the BibTeX part is not ported yet):
https://preview.aclanthology.org/build-pipeline-with-new-library/
