Missing middle names in ingestion materials #3985

Open
mbollmann opened this issue Oct 30, 2024 · 7 comments

@mbollmann
Member

For quite some time now, it has been a recurring issue that authors’ middle names do not appear in the ingestion material for the Anthology. This creates a lot of irritation among authors, who keep having to file corrections despite their names being entered fully and correctly on OpenReview, and a lot of work for the Anthology. I could easily find dozens of issues related to missing middle names within a few minutes of searching:

#3218
#3353
#3408
#3467
#3492
#3532
#3559
#3577
#3588
#3898
#3735
#3984

I remember hearing at some point that there was a conscious decision to automatically remove middle names from author names when preparing ingestion materials (@mjpost?), but I couldn’t quickly find the source of that information or any other discussion around this, so I’m opening this issue instead. I’m not even sure whether the cause lies in our own ingestion scripts or in ACLPUB2 (in which case the issue would be better opened there).

In any case, if there is a step in the process that automatically filters out middle names, I would strongly question that decision and suggest that it be changed.

mjpost pinned this issue Nov 9, 2024
mjpost added this to the 2024Q4 milestone Nov 9, 2024
@mjpost
Member

mjpost commented Nov 9, 2024

Thanks for posting this. This is a recurring issue which is cropping up again with EMNLP ingestion. I'll post a summary of the history of this issue later today.

@mjpost
Member

mjpost commented Nov 9, 2024

One question I have for you, @mbollmann: does the new Python module make it easy to get a list of full names, so that we could match against them? I am likely going to write something to handle this later today, and it would be nice to use the new module instead of the old code. I'll look to the module first but I'll have to prioritize speed, which favors the code base I currently understand.

@mbollmann
Member Author

What exactly are you trying to achieve? I suspect there’s a better way to do this than starting from a list of full names. FWIW, you could have a look at the documentation (e.g. https://acl-anthology-py.readthedocs.io/en/stable/guide/accessing-authors/) and see if that helps; if not and you decide to fall back on the old codebase, I’m also happy to take a look at the script afterwards to suggest how I’d port it. But I’ll be away from a computer for the next ~24 hours, so I can’t try things out in the meantime.

@mbollmann
Member Author

mbollmann commented Nov 9, 2024

> In any case, if there is a step in the process that automatically filters out middle names, I would strongly question that decision and suggest that it be changed.

For the record, my argument for changing that decision is that currently, it seems that authors have literally no control over getting their names right in the metadata; conversely, if we had the issue of middle names appearing in the metadata that authors don’t want, they have the control to change that themselves in their OpenReview profiles. Therefore I think the latter is the much preferable solution.

@mjpost
Member

mjpost commented Nov 9, 2024

Here's the promised background:

  • Most of the problems stem from the transition to OpenReview (OR); under Softconf, the problem was more-or-less resolved, since it had two fields (first and last) corresponding to the Anthology representation, and through years of use people understood how they were used.
  • OpenReview instead has just a single name field, which requires parsing into our two components.
  • The name parsing is not done in the Anthology ingestion scripts, but in the aclpub2 packaging. So this issue was upstream of us, until recently.
  • aclpub2 parses the entire string found in the OR export into three fields: first_name, middle_name, and last_name.
  • Initially after the OR/aclpub2 transition, aclpub2 combined its first_name and middle_name fields into the Anthology <first> field. This led to a large number of corrections to remove middle names, which were present in the OR metadata but did not match the PDF or the user's wishes. I can't find the exact commit, but at the time, this led to a decision to throw out most or all middle names, apart perhaps from some heuristics.
  • Earlier this year, we got a change implemented where aclpub2 exports the full name using a "name" field, while also continuing to parse the name into three fields. The Anthology ingestion continues to use just the first and last name fields.
  • I agree with your point above that it is better to include the information from OR, since users can change that, rather than make changes that they have no control over. The decision earlier to remove middle names, therefore, was the wrong one.
  • My thinking is that the Anthology should do the name parsing. Using the name field, as well as metadata—including ambiguous information such as affiliation but also unambiguous information such as ORCID—we should match names to the preferred one listed in the Anthology, which would give us the correct first name / last name split.
  • I also wonder if we could use LLMs here to resolve this: grab a screenshot of the top half of the first page of the PDF, and pass that with the list of metadata-provided names, and ask it to spit out a list of (last name, first name) pairs from the PDF.
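
To make the matching idea concrete, here is a minimal sketch of how a full "name" string could be split by checking it against name variants already on record. It is only an illustration: the function `split_full_name` and the `known_variants` list are hypothetical, not part of the Anthology ingestion scripts or of aclpub2, and a real version would first narrow down candidates using metadata such as ORCID.

```python
def split_full_name(full_name, known_variants):
    """Return the (first, last) variant that reproduces full_name, or None.

    known_variants is a hypothetical list of (first, last) pairs already
    on record for candidate authors; it is not a real Anthology API.
    """
    normalized = " ".join(full_name.split())  # collapse stray whitespace
    for first, last in known_variants:
        if " ".join(f"{first} {last}".split()) == normalized:
            return first, last
    return None  # no recorded variant matches; fall back to other parsing


# Hypothetical variants on record for one person
variants = [("William", "Campbell"), ("William M.", "Campbell")]
print(split_full_name("William M. Campbell", variants))
```

If no recorded variant reproduces the full name, the sketch returns None, which is where some other parsing strategy would have to take over.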

Here is an example from the aclpub2 export, a papers.yaml file:

  - dblp_id: https://dblp.org/pid/96/4410
    emails: wcampbell@ll.mit.edu
    first_name: William
    last_name: Campbell
    middle_name: M.
    name: William M. Campbell
    username: ~William_M._Campbell1

We can see the middle name here has been dropped, likely heuristically. We also have information that could help resolve this user. Note that this user currently has two author pages, but that the version with the middle initial is correct. There is also a third variant here.
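
For illustration, the difference between using only first_name and last_name and folding middle_name into the Anthology <first> field can be sketched as follows. The function `anthology_name` and its `keep_middle` flag are hypothetical names made up for this example, not the actual ingestion code; the record reuses the field names from the papers.yaml entry above.

```python
def anthology_name(author, keep_middle=True):
    """Map an aclpub2-style author record to a (first, last) pair.

    Hypothetical helper for illustration only; the real ingestion
    script may handle these fields differently.
    """
    first = author.get("first_name") or ""
    middle = author.get("middle_name") or ""
    if keep_middle and middle:
        # Fold the middle name into the <first> field
        first = f"{first} {middle}".strip()
    return first, author.get("last_name") or ""


author = {
    "first_name": "William",
    "middle_name": "M.",
    "last_name": "Campbell",
    "name": "William M. Campbell",
}
print(anthology_name(author, keep_middle=False))  # first and last name fields only
print(anthology_name(author))                     # middle name folded into first
```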

@mjpost
Member

mjpost commented Nov 10, 2024

Note that in #4024 I restored the use of the middle name that is parsed out from aclpub2. This affected 888 name instances in EMNLP 2024 and workshops, and 743 individuals. This provides a sense of the magnitude of the decision here. I do agree, though, that using the name provided is the best approach, since it provides authors with full control over how their name presents.

@mbollmann
Member Author

Thanks for the background, Matt! I think it’s good to have this documented in one place.

  • My thinking is that the Anthology should do the name parsing. Using the name field, as well as metadata—including ambiguous information such as affiliation but also unambiguous information such as ORCID—we should match names to the preferred one listed in the Anthology, which would give us the correct first name / last name split.

That would happen in the ingestion script, I assume? Maybe I can take a look at that one first after #3996 is finished.

  • I also wonder if we could use LLMs here to resolve this: grab a screenshot of the top half of the first page of the PDF, and pass that with the list of metadata-provided names, and ask it to spit out a list of (last name, first name) pairs from the PDF.

Maybe I’m a bit old-school here, but I would probably work with GROBID instead of going for LLMs. 😄
