Author name URLs #623
Since ACL 2020 intends to share information with future conferences, it may be desirable to commit to making all current author pages stable (so you will always be matt-post and future Matt Posts will have to get a suffix). To handle ambiguous names, it may help to distinguish between names and people. So, if there's only one Matt Post, then
|
I also wonder if, rather than creating a new system of IDs on top of ORCID, START, DBLP, Semantic Scholar, and Google Scholar, should we adopt one of those existing systems of IDs? |
In principle I like the idea of adopting ORCID since uniquely identifying people is its entire purpose. START usernames are not as clean (I've seen people with multiple START IDs), and the others are automatically mined and therefore subject to error. But what about authors who don't have an ORCID? I suspect in any event we'll need a mixture of external and internal IDs. |
This is not an easy problem. We should definitely not roll our own ID schema, and relying on Semantic Scholar seems too brittle (how long will they and their IDs be around?). ORCID seems to be the best option because it is the only ID explicitly made for this job, and I am very much in favor of future conferences collecting ORCIDs for submissions. However, it is not easy to 1) find the ORCIDs for already existing papers and 2) deal with people without an ORCID. The proposal by @davidweichiang seems sensible to me (the existing author keeps the URL on clashes). The /names/ URL would then be linked to from the /people/ pages of people who share a name with other authors, similar to disambiguation pages on Wikipedia? And whoever has access to people organizing conferences: please lobby for ORCID, it will make our lives easier in the long run :-) |
Hi all:
I'd also support ORCID. I had brought this up to TACL and CL before and
then I understood that MIT Press was pursuing this anyways, so the editors
on both CL and TACL stopped worrying about it.
I agree with Arne, not to create our own. This is exactly why ORCID was
created in the same guise as DOIs, and it will survive any one party's
potential demise (the verdict is not so clear with Semantic Scholar, IMHO).
I also agree with Nathan in that we definitely need at least an internal
system.
I think we should use ORCID as a primary vehicle (and redirect folks to
those IDs where possible) but also retain our own author URLs for cases
where there are multiple namesakes (matt-post, matt-post-2). When and if
an author mints an ORCID and reveals it to us, we permanently forward the
existing namesake page to the ORCID (so matt-post gets redirected to
0000-0002-1297-6794 and we don't re-use matt-post again; the next matt-post
is matt-post-3).
Cheers,
Min
…--
Min-Yen KAN (Dr) :: Associate Professor :: National University of Singapore
:: NUS School of Computing, AS6 05-12, 13 Computing Drive
Singapore 117417 :: +65 6516 1885(DID) :: +65 6779 4580 (Fax) ::
kanmy@comp.nus.edu.sg (E) :: www.comp.nus.edu.sg/~kanmy (W)
|
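A minimal sketch of the forwarding scheme described above, for illustration only. The `SlugRegistry` class and its method names are hypothetical and not part of the Anthology code; the sketch assumes the first holder of a name gets the bare slug and later namesakes get numeric suffixes starting at 2.

```python
from typing import Dict


class SlugRegistry:
    """Hypothetical registry of author slugs that never reuses a slug."""

    def __init__(self) -> None:
        # base name -> number of slugs issued so far for that name
        self._issued: Dict[str, int] = {}
        # slug -> ORCID that the slug permanently forwards to
        self._redirects: Dict[str, str] = {}

    def mint(self, base: str) -> str:
        """Issue the next unused slug for a base name (matt-post, matt-post-2, ...)."""
        count = self._issued.get(base, 0) + 1
        self._issued[base] = count
        return base if count == 1 else f"{base}-{count}"

    def attach_orcid(self, slug: str, orcid: str) -> None:
        """Forward an existing slug to an ORCID; the slug itself stays retired."""
        self._redirects[slug] = orcid

    def resolve(self, slug: str) -> str:
        """Return the ORCID a slug forwards to, or the slug itself if it has none."""
        return self._redirects.get(slug, slug)
```

Because the counter only ever increases, attaching an ORCID to matt-post does not free that slug: the next namesake after matt-post-2 would still receive matt-post-3.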
It seems ORCID is the way to go, when we have it. It's too bad that the ACL email that went out recently collected pretty much everything except ORCIDs. |
It’s not too late per se. I think we could encourage Rich Gerber at START
to add a field to the global profile to collect ORCID. It’d just not be
mandatory at this point.
- M
|
It would be good to have that in START, but without it being mandatory, I don't think anyone will fill it out. Though I bet we can triangulate them with all the other information we're getting. |
Very true. If one of us writes automatic triangulation software, we could
have it validate the result by sending an email to the START user.
Cheers,
Min
|
A good idea. Though I suspect that many people haven't even registered for an ORCID. I'm fairly trendy with these things and only recently did so myself. |
Semantic Scholar is the main source of information in the ACL form, and that one asks for ORCID on sign-up, so that could be a place to start. |
@mbollmann I don't understand your suggestion (in fact, I am not even sure whether you made one). In my opinion, we (that is, probably @mjpost) should lobby for adding ORCID fields (ideally required) to future submission processes. Without that step, there will always be additional manual matching work, and I don't think that is sustainable in the long run. If conferences do not introduce ORCIDs, there is little point in introducing them here. The point of ORCIDs is that they are clean; performing error-prone matching on our end all the time defeats their purpose. |
I was just pitching the idea that we could seed an initial ORCID database for the Anthology via Semantic Scholar. This does not include manual or error-prone matching because:
Now, I don't know how many people will have both claimed their SS page and added their ORCID, but it could be a start. Of course asking for them directly as part of the submission process should be the way to go in the future, I totally agree with you here, @akoehn. |
Yes, this is true. But in our database, the ORCID is not a property of
the author (as the whole point is that we do not have author entities in
our data) but of each individual paper. We would therefore only obtain
ORCIDs for ACL 2020 papers, and that with a lot of trouble: we would
need to obtain the START username for every author, obtain the
START user -> Semantic Scholar mapping, and then query Semantic Scholar.
We would then need to back-propagate this information for every accepted
paper. This seems to be quite a bit of work given that we probably will
only be able to map a small subset of ACL 2020 submissions that way (not
everyone has an ORCID, not everyone has a Semantic Scholar page, not every
page is claimed, etc.).
|
If the community wants to adopt ORCID then probably the best way is to make it a required part of the START global profile, in time for the ACL 2020 camera-ready deadline (IMO it would be too sudden a change to require it for the submission deadline).
We effectively have (imperfect) author entities through a combination of the author name strings in paper entries and the name_variants database. I assume we'd need to (a) propagate ORCIDs backwards or (b) go with a hybrid strategy that clusters by ORCID where available and continues to use the name_variants system for compatibility with legacy data (or future data from non-ACL/non-START events). If we want to be conservative about propagating ORCIDs backward, I suppose it might be possible to obtain START usernames on papers at least for recent major conferences, since START usernames are a more unique set of identifiers than the name strings (though some authors have multiple START profiles). Then these could be mapped to ORCIDs with growing coverage as more people update their global profiles for ACL 2020 and future venues. We could also email authors on an ad hoc basis to confirm that the Anthology isn't conflating them with other authors. This would allow cleaner back-propagation of ORCIDs. |
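To make option (b) concrete, here is a rough sketch of a hybrid clustering pass. It is an illustration under stated assumptions, not the Anthology's implementation: `Authorship`, `cluster_authorships`, and the `canonical` dictionary (a stand-in for the name_variants database) are hypothetical names, and the paper IDs in the usage note are placeholders.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional


class Authorship(NamedTuple):
    paper_id: str
    name: str             # author name string as it appears on the paper
    orcid: Optional[str]  # None for legacy papers that carry no ORCID


def cluster_authorships(
    records: List[Authorship], canonical: Dict[str, str]
) -> Dict[str, List[str]]:
    """Group paper IDs by ORCID where present, otherwise by canonical name."""
    clusters: Dict[str, List[str]] = defaultdict(list)
    for rec in records:
        if rec.orcid is not None:
            key = f"orcid:{rec.orcid}"
        else:
            # Fallback: map a name variant to its canonical form, if one is known.
            key = f"name:{canonical.get(rec.name, rec.name)}"
        clusters[key].append(rec.paper_id)
    return dict(clusters)
```

For example, with records `("P1", "Matt Post", "0000-0002-1297-6794")`, `("P2", "Matthew Post", None)` and `("P3", "Matt Post", None)` and the variant mapping `{"Matthew Post": "Matt Post"}`, P1 lands in an ORCID cluster while P2 and P3 share a name cluster. The sketch deliberately does not merge the two kinds of cluster; that step is where back-propagation or manual confirmation would come in.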
I concur that ORCID is the way to go. I would be in favor of making ORCID mandatory in START. |
Since ACL is a time for planning, I want to revisit this thread. Can we push for mandatory ORCIDs in START, maybe in time for EMNLP camera-ready? (@mjpost, would this require discussion among the ACL Exec?) Note that START in general (at least for workshops; I don't know about EMNLP) allows listing unregistered users as authors. So I think the policy should be that camera-ready submissions have ORCIDs for ALL authors, and if it is a registered user it would be loaded automatically from the START global profile. |
Good idea. A few thoughts:
I like how author pages are guessable. One idea is to use a single guessable name ID page, eg |
The simplest step forward might be to say that ORCIDs are attached as an extra field to papers, not Anthology author records directly, though of course any paper with an ORCID would allow us to unambiguously match against existing authors with the same ORCID on other papers (or to infer it's a new author if all existing authors by that name have papers with other ORCIDs). Then we could allow manual disambiguation of past authorship by adding the ORCID for the paper. (Maybe there should be a UI for authors to do this themselves: manually verify their past papers. But if not, it can be done directly in XML.) Thus any explicit ORCID in the XML would be trustworthy. Papers for which we don't have ORCIDs would continue to be assigned to semiautomatic author pages under the current system. Perhaps the verified/unverified distinction should be exposed to the user. |
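The matching rule in the parenthetical above could look roughly like this. It is a sketch only: `match_author` and the shape of the `known` mapping are assumptions made for illustration, not existing Anthology structures.

```python
from typing import Dict, Optional, Set


def match_author(
    name: str,
    orcid: str,
    known: Dict[str, Dict[str, Set[str]]],
) -> Optional[str]:
    """Return the author ID under `name` whose papers carry this ORCID.

    `known` maps name -> author ID -> set of ORCIDs seen on that author's papers.
    Returns None when a new author is implied, i.e. every existing author by
    that name already has papers with other ORCIDs (or no such author exists).
    Raises LookupError when some namesake has no ORCID yet, since that case
    cannot be decided automatically.
    """
    candidates = known.get(name, {})
    for author_id, orcids in candidates.items():
        if orcid in orcids:
            return author_id
    if all(orcids for orcids in candidates.values()):
        return None
    raise LookupError(f"cannot decide: {name!r} has authors without ORCIDs")
```

Raising instead of guessing keeps the unverified cases visible, in line with the idea that an explicit ORCID in the XML should always be trustworthy.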
How about this:
This way we can gradually add ORCIDs for people already in the database, for the vast majority of cases where the name is unambiguous. There will be a few cases where, as ORCIDs come in, we realize that existing names refer to more than one person. At that point, we will have to retroactively disambiguate by hand. As far as author URLs, I would say stick with |
Would this mean overloading the |
I was imagining that we would replace the current IDs with ORCIDs when we find out the ORCIDs. |
Would these be manually reviewed? Just want to be sure new sources of noise are distinguished from authoritative pieces of metadata. |
Yes. |
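As a cheap automatic check before any manual review, an ORCID's final character can be verified, since ORCID iDs end in an ISO 7064 MOD 11-2 check character. The sketch below uses hypothetical function names and only catches typos; it does not confirm that an iD is registered or belongs to the right person.

```python
import re

# Four hyphen-separated groups of four; the last character may be X.
ORCID_PATTERN = re.compile(r"[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]")


def orcid_check_char(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check the format and the check character of a hyphenated ORCID iD."""
    if not ORCID_PATTERN.fullmatch(orcid):
        return False
    digits = orcid.replace("-", "")
    return orcid_check_char(digits[:15]) == digits[15]
```

The ORCID quoted earlier in this thread, 0000-0002-1297-6794, passes this check; changing its last digit makes it fail.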
I like this, but what about the minor change of using One other thing this addresses: for authors we disambiguate manually, we can keep their ID that we choose for them. Should we ever get an ORCID for them, we can easily create a link to that as their canonical author page, so as to create link permanence. |
Issue #623
Now generates author pages with urls in form name/id
for most people this looks like:
people/d/david-chiang/david-chiang/
Matt Post has an ORCID in name_variants.yaml, so his page is:
people/m/matt-post/0000-0002-1297-6794/
and then there is:
people/y/yang-liu/yang-liu-edinburgh/
people/y/yang-liu/yang-liu-ict/
people/y/yang-liu/yang-liu-icsi/
people/y/yang-liu/yang-liu-umich/
I don't know how to make the old URLs people/m/matt-post/ resolve.
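The name/id layout in the examples above can be expressed as a small helper. This is a sketch inferred from those example paths only; `author_page_path` is a hypothetical name, not the function used in the actual commit.

```python
def author_page_path(name_slug: str, author_id: str) -> str:
    """Build a path of the form people/<first letter>/<name slug>/<id>/."""
    if not name_slug:
        raise ValueError("name_slug must be non-empty")
    return f"people/{name_slug[0]}/{name_slug}/{author_id}/"
```

Here `author_id` is the name slug itself for unambiguous authors, an ORCID when one is known, or a hand-chosen ID such as yang-liu-ict.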
I can't find where I commented on this, but now that ACL 2020 is collecting links to ORCID, Semantic Scholar, and Anthology pages, I'm reminded that we don't have stable author page names. For example, if another Matt Post comes along, we have to fork the current pages.
I like Semantic Scholar's approach, where for example my page is:
I don't know how the integer is selected, but we could use a similar system, say starting at 1, and moving up from there. When there is no ambiguity, the base page would redirect to /1. When there is ambiguity, the base page would then be used to hold variants with no assigned ID.
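The base-page behaviour proposed here might be sketched as follows. The function name, the return convention, and the exact URL shape (`/<name>/<integer>`) are assumptions for illustration; the proposal itself only specifies the `/1` suffix.

```python
from typing import Dict, List, Tuple


def base_page_action(name_slug: str, ids: Dict[str, List[int]]) -> Tuple[str, str]:
    """Decide what the base page for a name does under the integer-ID proposal.

    `ids` maps a name slug to the integer IDs assigned under it so far.
    """
    assigned = ids.get(name_slug, [])
    if len(assigned) == 1:
        # No ambiguity: forward to the single numbered page.
        return ("redirect", f"/{name_slug}/{assigned[0]}")
    # Ambiguity (or nothing assigned yet): the base page itself is served
    # and holds the variants that have no assigned ID.
    return ("unassigned", f"/{name_slug}")
```

With a single Matt Post the base page forwards to `/matt-post/1`; once a second one is assigned ID 2, the base page stops redirecting and the numbered pages remain stable.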