Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Identify GCI users in repository information #114

Merged
merged 3 commits into from
Jan 2, 2018

Conversation

andrewda
Copy link
Member

@andrewda andrewda commented Dec 23, 2017

This begins to identify users in several meta repository stats,
including stargazers, watchers and forks.

Closes #100


A few notes:

  • (fixed in ed4f74e) The number of users might not be higher in the build until the cache expires. This is a separate issue: we should bypass caching stuff if there is no github_account property on the user and jump straight to the checking stuff.
  • We don't check the time of the watch/star events in order to see if it happened during GCI (as @jayvdb wanted in Find users in repository information #100 (comment)) because that information isn't returned to us through the API.
  • As far as I can tell, we are unable to access the name of the fork owner (GraphQL type: RepositoryOwner) without making another request (to get the User from the RepositoryOwner). That would involve several hundred more requests and probably being rate limited by GitHub, unless someone can figure out a way to do it all in one query – I wasn't able to after a few hours of trial and error.
  • This vastly reduces the overall number of API calls we have to make. Instead of calling findGitHubUserInOrg for every user, we only have to search pre-fetched information for a large number of users.
  • This sets the base for Use GraphQL API #111, meaning it'll be a low difficulty issue for anyone with GraphQL experience, and a medium difficulty issue for anyone without experience.
  • There are a lot of package-lock.json changes that get fixed in Create RSS feed of data changes #95 – this definitely should not have 900+ deletions.

@andrewda andrewda force-pushed the repo-info branch 2 times, most recently from 86560eb to b26f70e Compare December 23, 2017 22:45
@jayvdb jayvdb requested review from wisn and blazeu December 24, 2017 01:51
@@ -0,0 +1,30 @@
module.exports.REPO_INFO_QUERY = `
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

store in a repo_info.graphql ?
the variables are part of the language.

.concat(repo.watchers.map(u => u.login.toLowerCase()))
.concat(repo.stargazers.map(u => u.login.toLowerCase()))
.concat(
repo.forks
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so I could create account https://github.com/WinningIt , fork a Ubuntu repo, and my fake account gets linked?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose you could if you wanted to be malicious, but with the current implementation you could create that same fake account and make a BS issue on any Ubuntu repository and the account would be linked. Or even just do a PR review.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An issue is a very noticable action. It is a new thing, and the malicious person is the creator.
A review is not quite as noticeable, but it would need to be done on an obscure open PR in order to not be noticed.
Either would likely be seen as intentional confusion, maybe suitable for GitHub to investigate as 'abuse'.

stars/watches/forks are almost invisible. And it is very hard to file 'abuse' with GitHub for doing those actions.

Your issue is a specific attempt to do #8 .

The biggest problem is that so far this algorithm has a lower confidence level than the previous algorithms, but it is being used first. This could be used to override the existing algorithms, which are more reliable.

Once you have a potential match, which wasnt found using the existing more reliable algorithms, you need to look at the match profiles to determine how you can increase the confidence level of your match. The more effort you require of the abuser, the more likely their abuse can only be viewed as intentional or at least highly suspicious.

Probably also a good idea to annotate each match with the matching method used.

API hits isn't relevant now, as the hits can grow over time. You can include probable matches in the yaml which are not included in the rendered page, as they are needing more analysis deferred until a subsequent build has extra API calls to use.

Copy link
Member Author

@andrewda andrewda Dec 24, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about running the same checks we do on the user found this way as we do for users found by removing spaces in the display name? i.e. we make sure the user has at least a commit, issue, PR or review on a repository of the org

That way we can have the same confidence in this prediction as we do for all the others

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with ^ is that it cuts the number of matches in half and gets rid of some valid matches, but it does result in a much more accurate result.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the total number of matches increased?
If so, loosing some valid matches is ok.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it only increases by 1 if we verify using our current method, but will increase by 3 if we allow contributions to any of the GCI orgs to count (not just the one they are a leader for).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 is enough to meet the requirements of the task

lib/scrape.js Outdated
displayName,
shortName
)
if (userFromRepo) return userFromRepo
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so I could create account https://github.com/WinningIt , fork a Ubuntu repo, and my fake account gets linked?

@jayvdb
Copy link
Member

jayvdb commented Dec 24, 2017

Add a commit to forcibly clear the cache of unlinked usernames so we can see it working. We'll no doubt need to redo this occasionally.


module.exports.REPO_INFO_QUERY = fs
.readFileSync(`${__dirname}/repo_info.graphql`)
.toString()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll be doing this very often, might as well just create a function to load .graphql

Probably put that in utils.js file or there might be an existing npm packages to do it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. It'll be like 3 lines of code to do it ourselves so may as well keep it light weight.

@@ -0,0 +1,3 @@
const { loadQuery } = require('../utils')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GCI Leaders now using Babel, isn't it? Use import instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only frontend code (/static) uses Babel at the moment, so this isn't yet possible.

@@ -0,0 +1,3 @@
const { loadQuery } = require('../utils')

module.exports.REPO_INFO_QUERY = loadQuery('repo_info')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to use

export const REPO_INFO_QUERY = // ...

or

export {
  REPO_INFO_QUERY: // ...
}

I think

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I'll make a PR after this to setup Babel for the backend and start using ES2017 syntax.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright.

lib/scrape.js Outdated
const GCI_API_BASE = 'https://codein.withgoogle.com/api'

let COMPETITION_OPEN
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lib/scrape.js Outdated
})
})

logins = logins.filter((item, pos, self) => self.indexOf(item) == pos)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to use === instead of ==. See https://github.com/airbnb/javascript#comparison--eqeqeq.

if (
!(
existingUser &&
existingUser.github_updated &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Airbnb, but this is a styling thing in our ESLint config that we should probably change in another PR.

@@ -0,0 +1,4 @@
const fs = require('fs')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use import instead?

lib/utils.js Outdated
@@ -0,0 +1,4 @@
const fs = require('fs')

module.exports.loadQuery = name =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use export const loadQuery instead.

@andrewda andrewda force-pushed the repo-info branch 2 times, most recently from 43cc2d4 to abef9af Compare December 24, 2017 06:56
@andrewda andrewda force-pushed the repo-info branch 4 times, most recently from 9c4c252 to 3377081 Compare December 24, 2017 18:28
lib/scrape.js Outdated
const GCI_API_BASE = 'https://codein.withgoogle.com/api'

const MIN_SEARCH_SCORE = 10

// The time to cache GitHub usernames for in milliseconds
const GITHUB_CACHE_TIME = 2 * 24 * 60 * 60 * 1000
const BUST_GITHUB_CACHE = true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, this will keep busting the cache until someone sets this to false.

Each cache bust should be a once-off operation which automatically reverts to non-cache busting on the next build.
There are ways to use of elements of the merge that can be used to detect whether the bust was complete.
Or store the cache bust timestamp in the deploy, and dont let it occur again within a short period of time.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would using Travis to check the commit name for a specific keyword work?

Copy link
Member Author

@andrewda andrewda Dec 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually netlify doesn't expose the commit message, so we're going to have to check with git log -1 --pretty=%B. Another alternative is an env variable so that the cache busting can't be abused by someone other than an admin?

@andrewda andrewda force-pushed the repo-info branch 3 times, most recently from bad5c3b to a36b823 Compare December 25, 2017 08:43
@jayvdb
Copy link
Member

jayvdb commented Dec 25, 2017

repo_info.graphql should be renamed to include github as that is what it is. GitLab doesnt support it yet I believe, due to patent issues , but it could, or other uses of GraphQL could arise, and it doesnt hurt to be explicit in the filename.

As we increase our API usage, these functions might occasionally
fail for rate limiting purposes. We don't want the entire script to
crash due to this rate limiting, so catch them with an empty value.
This begins to identify users in several meta repository stats,
including stargazers, watchers and forks.

bust-cache

Closes coala#100
@jayvdb
Copy link
Member

jayvdb commented Jan 2, 2018

@gitmate-bot fastforward

@gitmate-bot
Copy link
Collaborator

Hey! I'm GitMate.io! This pull request is being fastforwarded automatically. Please DO NOT push while fastforward is in progress or your changes would be lost permanently ⚠️

@gitmate-bot
Copy link
Collaborator

Automated fastforward failed! Please fastforward your pull request manually via the command line.

Reason:

Command: git push upstream HEAD:master

Exit Code: 1

Cause:

remote: error: GH006: Protected branch update failed for refs/heads/master.        
remote: error: 1 review requesting changes by reviewers with write access.        
To <hidden_oauth_token>
 ! [remote rejected] HEAD -> master (protected branch hook declined)
error: failed to push some refs to '<hidden_oauth_token>'


@jayvdb
Copy link
Member

jayvdb commented Jan 2, 2018

ack 6a17aac 97acdb8 b339627

@jayvdb
Copy link
Member

jayvdb commented Jan 2, 2018

@gitmate-bot fastforward

@gitmate-bot
Copy link
Collaborator

Hey! I'm GitMate.io! This pull request is being fastforwarded automatically. Please DO NOT push while fastforward is in progress or your changes would be lost permanently ⚠️

@gitmate-bot gitmate-bot merged commit 97acdb8 into coala:master Jan 2, 2018
@gitmate-bot
Copy link
Collaborator

Automated fastforward with GitMate.io was successful! 🎉

@andrewda andrewda deleted the repo-info branch January 2, 2018 05:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Find users in repository information
5 participants