Improve identification of committers' organizations #24
Comments
Including your real email address in public Git commits is a great way to invite spam. 🙁 Instead, the best practice is to commit using an anonymized address like |
This overhead seems worthwhile and a valuable service to the community. 👍👍 Perhaps GitHub would be willing to publish the aggregated data directly, if asked nicely.
Why? An annual ranking would be sufficient for most purposes. Certainly the ranking is not going to change substantially from day to day. |
Regarding the volume of GitHub API calls - we use Conditional Requests and cache responses for so many things on GitHub. As long as someone's profile doesn't change, you aren't charged an API call, for example. Might help make it more of a reality... At our company, we ask employees to try and maintain a professional profile, and we internally allow them to choose to tell us who they are...
Hard problem to do at scale, for sure. Happy to help brainstorm. |
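For illustration, the conditional-request idea from the comment above could look roughly like the sketch below. It assumes each profile response carries an `ETag` header that can be sent back as `If-None-Match`, and that an unchanged profile answers with `304 Not Modified`. `ProfileCache` and `http_fetch` are hypothetical names made up for this sketch; they are not part of OSCI.

```python
import urllib.error
import urllib.request


class ProfileCache:
    """Keeps the last body and ETag per URL and revalidates with If-None-Match."""

    def __init__(self, fetch):
        # fetch is any callable(url, headers) -> (status, response_headers, body)
        self._fetch = fetch
        self._entries = {}  # url -> (etag, body)

    def get(self, url):
        headers = {}
        cached = self._entries.get(url)
        if cached:
            headers["If-None-Match"] = cached[0]
        status, resp_headers, body = self._fetch(url, headers)
        if status == 304 and cached:
            # Unchanged on the server: reuse the cached body.
            return cached[1]
        etag = resp_headers.get("ETag")
        if etag:
            self._entries[url] = (etag, body)
        return body


def http_fetch(url, headers):
    """Plain urllib transport; urllib reports 304 as an HTTPError, so map it back."""
    request = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(request) as response:
            return response.status, response.headers, response.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return 304, err.headers, b""
        raise
```

A caller would build `ProfileCache(http_fetch)` and call `get("https://api.github.com/users/<login>")` for each login; authentication headers are left out of this sketch.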
Hey Jeff, Thanks for your comments! After much analysis, we concluded it was best to use the email address of the commit author to identify the organization to which they belong. Otherwise we would lose almost 80% of contributions, because a lot of engineers don't note their companies in their profiles. |
I can confirm that we take the same approach at @Avanade - and whilst I love the OSCI tool, I worry that the contribution data could become inaccurate over time. Take a recent update to the companies list as an example - Release v2022.09.0 (#144) · epam/OSCI@cbf6b35 (github.com). If we assume that James, Mohit, Guilherme and Justin work at Credera, Infosys, Farfetch, and eBay, then only Mohit's contribution would have been associated with Infosys, as the others are contributing with personal email addresses or user.noreply emails. We ask employees to use one GitHub account, and then to complete the Organization field or be invited to the Avanade GitHub org. Employees log into both GitHub and our own corporate system. That way, if someone moves to another organization, they can keep their private commit history, which we feel provides them with a good "CV", and we'd want to support people wherever they choose to go in their career. This is a GitHub native feature too, if you use something like GitHub Enterprise - https://docs.github.com/en/enterprise-server@3.7/admin/user-management/managing-users-in-your-enterprise/viewing-people-in-your-enterprise |
The current OSCI implementation uses the email domain of the committer to identify their organization. Many developers do not use their company email address on GitHub, or do not make their email address public. However, many of these people do include their organizational information in their GitHub user profiles.
We would like to improve the identification of committers' organizations using the data in their user profiles.
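As a rough illustration of why email-domain matching misses these committers, here is a minimal sketch. It is not OSCI's actual code: `DOMAIN_TO_COMPANY` and `company_from_email` are hypothetical names, and the single mapping entry is a placeholder.

```python
# Hypothetical lookup table; a real one would hold the companies list.
DOMAIN_TO_COMPANY = {"example-corp.com": "Example Corp"}

# GitHub's anonymized commit addresses use this domain.
NOREPLY_DOMAIN = "users.noreply.github.com"


def company_from_email(email, domain_to_company=DOMAIN_TO_COMPANY):
    """Return the company for a commit email, or None when the domain says nothing."""
    _, separator, domain = email.strip().lower().rpartition("@")
    if not separator or domain == NOREPLY_DOMAIN:
        # No domain at all, or an anonymized address: no organization signal.
        return None
    # Personal mail providers are simply absent from the table, so they yield None.
    return domain_to_company.get(domain)
```

Any commit made with a noreply or personal address falls through to `None`, which is the gap that profile data is meant to close.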
We have already run an experiment to do this, but with minimal success. It is described below.
The basic matching algorithm works like this:
If the basic algorithm did not produce a match, an extended algorithm was proposed:
Result of experiment:
We managed to match a company from the profile for only 38% of the users examined. The remaining profiles did not have a clear match. Relaxing the match rules adds only another 5%.
It is also worth noting that in every case where we matched a company from a user's profile, it was the same company we had already obtained from their email.
Finally, this method of identifying a company carries a large overhead. Implementing it would require downloading profile information for every user who made push events in 2020. As of June 2020 there are 5M such users, and loading their profiles would take about 42 days under GitHub API usage limits. We would also have to load new profiles every day, and that download may in turn not fit into the daily usage limits.
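The 42-day figure can be checked with a quick back-of-the-envelope calculation. The sketch below assumes one API request per profile and a limit of 5,000 requests per hour (the standard authenticated REST API limit); the issue itself only says "GitHub API usage limits", so both numbers are assumptions.

```python
PROFILES = 5_000_000       # users with push events in 2020, from the issue text
REQUESTS_PER_HOUR = 5_000  # assumed authenticated rate limit


def days_to_fetch(profiles, requests_per_hour=REQUESTS_PER_HOUR, requests_per_profile=1):
    """Days of continuous fetching needed at the given hourly rate limit."""
    hours = profiles * requests_per_profile / requests_per_hour
    return hours / 24


def daily_budget(requests_per_hour=REQUESTS_PER_HOUR):
    """Maximum number of requests available in one day."""
    return requests_per_hour * 24


print(round(days_to_fetch(PROFILES), 1))  # 41.7
print(daily_budget())                     # 120000
```

Under these assumptions a single token can load at most 120,000 profiles a day, so the daily batch of new profiles has to stay below that to keep up.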