Harvester

This is a script for collecting public search results for filename or file-extension queries on GitHub.

It's used by GitHub Linguist contributors to gauge real-world usage of languages and file extensions, especially when submitting new languages for registration on GitHub.

Due to weird access restrictions, the script must be run

from a browser context,
from a page hosted on github.com,
whilst signed-in with a registered GitHub account.

Attempting to load search results from an unauthorised or headless context will fail. See for yourself by opening this page while signed in, and compare it with what an incognito browser window sees.

Usage

Copy harvester.js to your clipboard.
Navigate to a GitHub-hosted page in your browser.
Remember, the URL's domain must be github.com due to CORS restrictions.
In your browser's console,
1. Paste the contents of harvester.js
  This defines the commands you'll use in the next step. It might request your permission to display desktop notifications. This is used to notify you when a harvest has finished.
2. Run harvest(" … ") to begin a search.
  To search for entire filenames instead of extensions, prepend the query with filename::
```
harvest("filename:.bashrc");
```
  Arguments are optional if Harvester's running from a search results page. E.g., if you have this page open in your browser, harvest(); is the same as
```
harvest("extension:asy", "NOT SymbolType");
```
  For any other page, a query must be specified.
Wait for it to finish.
This may take a while, depending on how many results there are. You'll see a desktop notification when it finishes.
Run copy(that) in the browser console.
This copies the collected URLs to your clipboard.

Bookmarklet

If you find yourself using this script often, consider adding bookmarklet.js to your browser's toolbar as a bookmarklet.

Note: This won't work on Firefox.

Ideally, this script would load the latest version of harvester.js and attach it to the page. This isn't possible due to CORS restrictions, so the entire script needs to be embedded as a single URL.

JavaScript interface

Running harvester.js adds three properties to global context:

`window.harvest(query[, searchHack])`

The function used for starting a search. It takes two arguments:

query
Your search query: either an extension or a filename.
```
harvest("extension:foo"); // Extension
harvest("filename:foo");  // Filename
```
Because extensions are more frequently searched for than filenames, the "extension:" prefix is optional. Ergo, the first line above can be shortened to just this:
```
harvest("foo"); // Extension
```
This is the format used throughout the rest of this documentation.
searchHack
An optional legitimate search query to include. The default is "NOT nothack" followed by random hex digits:
```
"NOT nothack" + Math.random(1e6).toString(16).replace(/\./, "").toUpperCase();
```
However, sometimes you want to narrow searches down to files which contain a certain substring. In those cases, you include the second parameter:
```
// Match `*.foo` files which contain the word "bar".
harvest("foo", "bar");
```
Sidenote: The "nothack" above is necessitated by the requirement that advanced searches include specific search criteria. This makes site-wide searching of extensions impossible, so this hacky workaround is used instead.

`window.silo`

An Object where successful searches are cached, keyed by query. Its contents look like this:

window.silo = {
	"extension:foo": {
		length: 6528,
		
		"/user/repo/blob/6eb5537/path/file.foo": "https://raw.githubusercontent.com/…",
		… 6527 more results
	},
};

The silo contains a helper method called reap which extracts, sorts, and joins a list of results as a string. It's called internally when accessing window.that to extract a sorted URL list.

The silo exists to provide some way of resuming an interrupted harvest, such as in the case of a lost connection. It isn't some persistent storage mechanism: navigating to another page causes its contents to be lost.

`window.that`

Reference to the results of the last successful harvest. Meant for use with the console's copy command:

copy(that);

Which is essentially a shortcut for

copy(silo.reap("extension:foo"));

Downloading files

The list copied by running copy(that); is a plain-text list of URLs that can be passed to wget(1), curl(1), or a similar utility to download files en masse:

# Using `wget` (recommended)
wget -nv -i /path/to/url.list

# Using `curl`
sed -e "s/'/%27/g" /path/to/url.list | xargs -n1 curl -# -O

Alternatively, to preserve the directory hierarchy while downloading files, you can use the following command:

wget -nv -x -i /path/to/url.list

Helpful scripts

Some useful shell commands to help with reporting in-the-wild usage on GitHub:

Listing unique repositories

This reads from urls.log and writes the output to unique-repos.log.

grep < urls.log -iEoe '^https?://raw\.githubusercontent\.com/([^/]+/){2}' \
| sort | uniq | sed -Ee 's,raw\.(github)usercontent,\1,g' > unique-repos.log

Listing unique users

This reads from unique-repos.log and saves to unique-users.log:

grep < unique-repos.log -oE '^https://github\.com\/[^/]+' \
| sort | uniq > unique-users.log

Tallying results

This prints a summary of how many unique users and repositories were found in total.

wc -l unique-{repos,users}.log | grep -vE '\stotal$' | grep -oE '^\s*[0-9]+' |\
xargs printf '\
Unique repos: %s
Unique users: %s
'

The following utilities are also of interest:

gh-search
Opens the URL of an extension/filename search using the system's default browser.
```
gh-search -e foo; # Search by extension
gh-search -f foo; # Search by filename
```
fixext
Fixes the suffixes added by wget(1) when downloading files with the same name.
```
$ fixext foo *
Renamed: saved.foo.1 -> saved.1.foo
```

Name		Name	Last commit message	Last commit date
Latest commit History 53 Commits
.editorconfig		.editorconfig
.eslintignore		.eslintignore
.eslintrc.json		.eslintrc.json
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE.md		LICENSE.md
Makefile		Makefile
README.md		README.md
bookmarklet.js		bookmarklet.js
harvester.js		harvester.js

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Harvester

Usage

Bookmarklet

JavaScript interface

`window.harvest(query[, searchHack])`

`window.silo`

`window.that`

Downloading files

Helpful scripts

Listing unique repositories

Listing unique users

Tallying results

About

Releases

Packages

Contributors 2

Languages

License

Alhadis/Harvester

Folders and files

Latest commit

History

Repository files navigation

Harvester

Usage

Bookmarklet

JavaScript interface

window.harvest(query[, searchHack])

window.silo

window.that

Downloading files

Helpful scripts

Listing unique repositories

Listing unique users

Tallying results

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

`window.harvest(query[, searchHack])`

`window.silo`

`window.that`

Packages