-
Notifications
You must be signed in to change notification settings - Fork 45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Identify GCI users in repository information #114
Conversation
86560eb
to
b26f70e
Compare
lib/queries/index.js
Outdated
@@ -0,0 +1,30 @@ | |||
module.exports.REPO_INFO_QUERY = ` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
store in a repo_info.graphql
?
the variables are part of the language.
.concat(repo.watchers.map(u => u.login.toLowerCase())) | ||
.concat(repo.stargazers.map(u => u.login.toLowerCase())) | ||
.concat( | ||
repo.forks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so I could create account https://github.com/WinningIt , fork a Ubuntu repo, and my fake account gets linked?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose you could if you wanted to be malicious, but with the current implementation you could create that same fake account and make a BS issue on any Ubuntu repository and the account would be linked. Or even just do a PR review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An issue is a very noticable action. It is a new thing, and the malicious person is the creator.
A review is not quite as noticeable, but it would need to be done on an obscure open PR in order to not be noticed.
Either would likely be seen as intentional confusion, maybe suitable for GitHub to investigate as 'abuse'.
stars/watches/forks are almost invisible. And it is very hard to file 'abuse' with GitHub for doing those actions.
Your issue is a specific attempt to do #8 .
The biggest problem is that so far this algorithm has a lower confidence level than the previous algorithms, but it is being used first. This could be used to override the existing algorithms, which are more reliable.
Once you have a potential match, which wasnt found using the existing more reliable algorithms, you need to look at the match profiles to determine how you can increase the confidence level of your match. The more effort you require of the abuser, the more likely their abuse can only be viewed as intentional or at least highly suspicious.
Probably also a good idea to annotate each match with the matching method used.
API hits isn't relevant now, as the hits can grow over time. You can include probable matches in the yaml which are not included in the rendered page, as they are needing more analysis deferred until a subsequent build has extra API calls to use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about running the same checks we do on the user found this way as we do for users found by removing spaces in the display name? i.e. we make sure the user has at least a commit, issue, PR or review on a repository of the org
That way we can have the same confidence in this prediction as we do for all the others
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem with ^ is that it cuts the number of matches in half and gets rid of some valid matches, but it does result in a much more accurate result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the total number of matches increased?
If so, loosing some valid matches is ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it only increases by 1 if we verify using our current method, but will increase by 3 if we allow contributions to any of the GCI orgs to count (not just the one they are a leader for).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1 is enough to meet the requirements of the task
lib/scrape.js
Outdated
displayName, | ||
shortName | ||
) | ||
if (userFromRepo) return userFromRepo |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so I could create account https://github.com/WinningIt , fork a Ubuntu repo, and my fake account gets linked?
Add a commit to forcibly clear the cache of unlinked usernames so we can see it working. We'll no doubt need to redo this occasionally. |
lib/queries/index.js
Outdated
|
||
module.exports.REPO_INFO_QUERY = fs | ||
.readFileSync(`${__dirname}/repo_info.graphql`) | ||
.toString() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll be doing this very often, might as well just create a function to load .graphql
Probably put that in utils.js
file or there might be an existing npm packages to do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. It'll be like 3 lines of code to do it ourselves so may as well keep it light weight.
@@ -0,0 +1,3 @@ | |||
const { loadQuery } = require('../utils') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GCI Leaders now using Babel, isn't it? Use import
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only frontend code (/static
) uses Babel at the moment, so this isn't yet possible.
lib/queries/index.js
Outdated
@@ -0,0 +1,3 @@ | |||
const { loadQuery } = require('../utils') | |||
|
|||
module.exports.REPO_INFO_QUERY = loadQuery('repo_info') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to use
export const REPO_INFO_QUERY = // ...
or
export {
REPO_INFO_QUERY: // ...
}
I think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 I'll make a PR after this to setup Babel for the backend and start using ES2017 syntax.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright.
lib/scrape.js
Outdated
const GCI_API_BASE = 'https://codein.withgoogle.com/api' | ||
|
||
let COMPETITION_OPEN |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Group that let
and const
. See https://github.com/airbnb/javascript#variables--const-let-group.
lib/scrape.js
Outdated
}) | ||
}) | ||
|
||
logins = logins.filter((item, pos, self) => self.indexOf(item) == pos) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to use ===
instead of ==
. See https://github.com/airbnb/javascript#comparison--eqeqeq.
if ( | ||
!( | ||
existingUser && | ||
existingUser.github_updated && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with Airbnb, but this is a styling thing in our ESLint config that we should probably change in another PR.
@@ -0,0 +1,4 @@ | |||
const fs = require('fs') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use import
instead?
lib/utils.js
Outdated
@@ -0,0 +1,4 @@ | |||
const fs = require('fs') | |||
|
|||
module.exports.loadQuery = name => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use export const loadQuery
instead.
43cc2d4
to
abef9af
Compare
9c4c252
to
3377081
Compare
lib/scrape.js
Outdated
const GCI_API_BASE = 'https://codein.withgoogle.com/api' | ||
|
||
const MIN_SEARCH_SCORE = 10 | ||
|
||
// The time to cache GitHub usernames for in milliseconds | ||
const GITHUB_CACHE_TIME = 2 * 24 * 60 * 60 * 1000 | ||
const BUST_GITHUB_CACHE = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand correctly, this will keep busting the cache until someone sets this to false.
Each cache bust should be a once-off operation which automatically reverts to non-cache busting on the next build.
There are ways to use of elements of the merge that can be used to detect whether the bust was complete.
Or store the cache bust timestamp in the deploy, and dont let it occur again within a short period of time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would using Travis to check the commit name for a specific keyword work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually netlify doesn't expose the commit message, so we're going to have to check with git log -1 --pretty=%B
. Another alternative is an env variable so that the cache busting can't be abused by someone other than an admin?
bad5c3b
to
a36b823
Compare
|
As we increase our API usage, these functions might occasionally fail for rate limiting purposes. We don't want the entire script to crash due to this rate limiting, so catch them with an empty value.
This begins to identify users in several meta repository stats, including stargazers, watchers and forks. bust-cache Closes coala#100
@gitmate-bot fastforward |
Hey! I'm GitMate.io! This pull request is being fastforwarded automatically. Please DO NOT push while fastforward is in progress or your changes would be lost permanently |
Automated fastforward failed! Please fastforward your pull request manually via the command line. Reason:
|
@gitmate-bot fastforward |
Hey! I'm GitMate.io! This pull request is being fastforwarded automatically. Please DO NOT push while fastforward is in progress or your changes would be lost permanently |
Automated fastforward with GitMate.io was successful! 🎉 |
This begins to identify users in several meta repository stats,
including stargazers, watchers and forks.
Closes #100
A few notes:
The number of users might not be higher in the build until the cache expires. This is a separate issue: we should bypass caching stuff if there is nogithub_account
property on the user and jump straight to the checking stuff.RepositoryOwner
) without making another request (to get theUser
from theRepositoryOwner
). That would involve several hundred more requests and probably being rate limited by GitHub, unless someone can figure out a way to do it all in one query – I wasn't able to after a few hours of trial and error.findGitHubUserInOrg
for every user, we only have to search pre-fetched information for a large number of users.package-lock.json
changes that get fixed in Create RSS feed of data changes #95 – this definitely should not have 900+ deletions.