PyPi package #913

mbollmann · 2020-07-09T17:10:50Z

There was some discussion on whether we should turn our anthology library into a PyPI package. This would make it easier for people to use our Python interface to the Anthology, e.g., to build external tools or run analyses. It might even encourage people to contribute and add functionality to the library itself.

Requirements to achieve this (off the top of my head):

A mechanism to download/update the Anthology XML data from within the Python package.
Many Python packages download external data as part of their functionality (e.g., NLTK, torchtext), and I've personally used GitPython to do exactly this with the ACL Anthology for my recent Anthology analysis paper. I believe this is completely solvable.
Proper documentation. If we want to promote our Python API in this way, we should have at least succinct, user-friendly documentation that gets people started on how to use it. I believe that might be a good thing to have anyway, to help future volunteers for the Anthology who might work on the Python API. I'd also be happy to help prepare it.
Faster loading, as discussed in "Faster loading of Anthology class" (#835), could be a major factor for usability. I have more ideas in this direction that I want to look into at some point, but maybe it's more of a "nice-to-have" than an actual blocker?
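
For illustration, the download/update mechanism from the first requirement could look roughly like the sketch below. It uses GitPython's `Repo.clone_from` and `remotes.origin.pull()`. The cache location, the function names, and the assumption that the XML files sit under `data/xml` in the checkout are mine and are not taken from any existing implementation.

```python
from pathlib import Path

# The cache directory and all function names here are hypothetical.
ANTHOLOGY_REPO_URL = "https://github.com/acl-org/acl-anthology.git"
DEFAULT_CACHE_DIR = Path.home() / ".cache" / "anthology-data"


def needs_clone(repo_dir: Path) -> bool:
    """Return True if there is no Git checkout at repo_dir yet."""
    return not (repo_dir / ".git").exists()


def sync_anthology_data(repo_dir: Path = DEFAULT_CACHE_DIR) -> Path:
    """Clone the Anthology repo on first use, pull on later calls.

    Returns the directory assumed to hold the XML files.
    """
    # Imported lazily so that importing this module does not require
    # GitPython to be installed.
    from git import Repo

    if needs_clone(repo_dir):
        repo_dir.parent.mkdir(parents=True, exist_ok=True)
        # depth=1 is forwarded to `git clone --depth 1` (skip the history).
        Repo.clone_from(ANTHOLOGY_REPO_URL, str(repo_dir), depth=1)
    else:
        Repo(str(repo_dir)).remotes.origin.pull()
    # Assumption: the XML data lives under data/xml in the checkout.
    return repo_dir / "data" / "xml"
```

A caller would run `sync_anthology_data()` once before loading anything. A real package would probably also want an option to skip the network call when working offline.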

Most importantly, I think it would be great to gauge the community's interest in this. If you'd be interested in and see value in working with Anthology data through a pip-installable library, give a thumbs up here!


akoehn · 2020-07-09T17:52:47Z

I think most of the work lies in proper versioning and releases. Right now code and data are automatically synchronized because they are in the same repository, but we cannot guarantee that an old version of the library works with new data (and we should not try to change that). We also currently have no versioning at all.

Extracting the code into its own git repo and embedding it here creates a lot of overhead (speaking from experience with these setups in an academic setting) and I don't know how we would have version numbers & releases while keeping the code in here.

mbollmann · 2020-07-09T19:05:40Z

Great points, @akoehn.

Versioning would indeed require more thought. We could have a file in data/ indicating the minimum version of the library needed to work with it, so the library could warn its users when it's outdated and no longer compatible with the latest XML. But it'd certainly be more work.
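
As a rough sketch of that idea: the library could read a marker file from the data directory and warn if its own version is too old. The file name `MIN_LIBRARY_VERSION`, the version constant, and the function names are all made up for illustration, and the version parsing only handles plain dotted numbers such as `1.2.0`.

```python
import warnings
from pathlib import Path

# Hypothetical names: neither the file name nor the version constant
# comes from the actual repository.
LIBRARY_VERSION = "0.3.0"
MIN_VERSION_FILE = "MIN_LIBRARY_VERSION"


def parse_version(text: str) -> tuple[int, ...]:
    """Turn a plain dotted version such as '1.2.0' into (1, 2, 0)."""
    return tuple(int(part) for part in text.strip().split("."))


def check_compatibility(data_dir: Path, library_version: str = LIBRARY_VERSION) -> bool:
    """Warn and return False if the data requires a newer library."""
    marker = data_dir / MIN_VERSION_FILE
    if not marker.exists():
        return True  # no requirement recorded; assume compatible
    required = marker.read_text(encoding="utf-8").strip()
    if parse_version(library_version) < parse_version(required):
        warnings.warn(
            f"Anthology data requires library version >= {required}, "
            f"but version {library_version} is installed; please upgrade.",
            stacklevel=2,
        )
        return False
    return True
```

Comparing integer tuples rather than raw strings matters here, since a string comparison would rank "0.10.0" below "0.3.0".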

Conversely, though, you could say that the lack of versioning currently makes it less attractive for people to build on our API, since it could change at any moment without clear documentation. That's why I'm wondering how many people would even be interested in this, to see if it makes sense to think about this.

> Extracting the code into its own git repo and embedding it here creates a lot of overhead

Are you thinking of the git submodule approach here? I don't see a lot of problems with just adding our package to this repo's requirements.txt instead, but maybe I haven't fully thought this through.

> I don't know how we would have version numbers & releases while keeping the code in here.

I'm not sure what problems you foresee here; version numbers for the Python package could be kept in a subdirectory where the package lives (say lib/), and releases to PyPI could be triggered manually by us when appropriate.
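
To make this concrete, one common pattern is to keep a single `__version__` string in the package's `__init__.py` and have the release tooling read it from there. The sketch below assumes such a string exists; the helper name and the `lib/anthology` layout in the usage note are hypothetical.

```python
import re
from pathlib import Path


def read_package_version(package_dir: Path) -> str:
    """Extract the value of __version__ from <package_dir>/__init__.py."""
    source = (package_dir / "__init__.py").read_text(encoding="utf-8")
    # Matches a line such as: __version__ = "1.2.3"
    match = re.search(
        r'^__version__\s*=\s*["\']([^"\']+)["\']', source, re.MULTILINE
    )
    if match is None:
        raise RuntimeError(f"No __version__ found in {package_dir}")
    return match.group(1)
```

A manual release step could then call `read_package_version(Path("lib/anthology"))` to tag and publish that version, without the rest of the repository needing version numbers of its own.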

akoehn · 2020-07-10T09:33:08Z

> Are you thinking of the git submodule approach here?

No, I meant another repo. The thing is that fixing a bug is straightforward now. With a separate repository, you would need to check out acl-anthology and the anthology code, make changes to the code, publish it locally (or otherwise make sure it is used by acl-anthology), test whether your fix worked, and repeat.

The easiest way would probably be to generate a PyPI package from the current setup, where the core anthology code base lives together with the library part in one repository and we don't have to think about versioning all the time.

mbollmann · 2023-08-21T11:00:59Z

There's a first usable version of a PyPI library now: https://pypi.org/project/acl-anthology-py/

I'm currently developing this in a separate repo, but I've thought about the versioning issues and think it should probably be moved into this repo, as keeping it in sync with the data format here (XML schema etc.) does seem like a headache otherwise. I don't see a big problem with having version numbers & releases within this repo, though.

Over the coming weeks, I'll prepare a feature branch here that merges in this library, so that we can continue the discussion here.

mbollmann self-assigned this Aug 8, 2021

mbollmann mentioned this issue Dec 17, 2022

Feature: create a proper package #2301

Closed

mjpost added the enhancement, help wanted, and good first project labels Dec 17, 2022

mjpost added this to the 2023Q1 milestone Dec 17, 2022

mjpost pinned this issue Dec 26, 2022

mjpost modified the milestones: 2023Q1, 2023Q3 Jul 13, 2023

mbollmann added the triaged label and removed the help wanted and good first project labels Oct 21, 2023

mjpost modified the milestones: 2023Q3, 2023Q4, 2024Q1 Jan 23, 2024

mjpost modified the milestones: 2024Q1, 2024Q2 May 11, 2024

mjpost modified the milestones: 2024Q2, 2024Q4 Nov 7, 2024
