This is a script for collecting public search results for filename or file-extension queries on GitHub.
It's used by GitHub Linguist contributors to gauge real-world usage of languages and file extensions, especially when submitting new languages for registration on GitHub.
Due to weird access restrictions, the script must be run

- from a browser context,
- from a page hosted on `github.com`,
- whilst signed in with a registered GitHub account.

Attempting to load search results from an unauthorised or headless context will fail. See for yourself by opening this page while signed in, and compare it with what an incognito browser window sees.
1. Copy `harvester.js` to your clipboard.

2. Navigate to a GitHub-hosted page in your browser.
   Remember, the URL's domain must be `github.com` due to CORS restrictions.

3. In your browser's console,

   1. Paste the contents of `harvester.js`.
      This defines the commands you'll use in the next step. It might request your permission to display desktop notifications; these are used to notify you when a harvest has finished.

   2. Run `harvest("…")` to begin a search.
      To search for entire filenames instead of extensions, prepend the query with `filename:`:

      ```js
      harvest("filename:.bashrc");
      ```

      Arguments are optional if Harvester is running from a search-results page. E.g., if you have this page open in your browser, `harvest();` is the same as `harvest("extension:asy", "NOT SymbolType");`. For any other page, a query must be specified.

4. Wait for it to finish.
   This may take a while, depending on how many results there are. You'll see a desktop notification when it finishes.

5. Run `copy(that)` in the browser console.
   This copies the collected URLs to your clipboard.
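Putting those steps together, a typical console session looks something like this sketch (the `foo` extension is only a stand-in for whatever you're actually searching for):

```js
// After pasting harvester.js into the console on a github.com page:
harvest("foo");    // Same as harvest("extension:foo"); collects results until finished

// Once the desktop notification appears:
copy(that);        // The harvested URL list is now on your clipboard
```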
If you find yourself using this script often, consider adding `bookmarklet.js` to your browser's toolbar as a bookmarklet. Note: this won't work on Firefox.

Ideally, the bookmarklet would load the latest version of `harvester.js` and attach it to the page. That isn't possible due to CORS restrictions, so the entire script needs to be embedded as a single URL.
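For the curious, a bookmarklet is just a `javascript:` URL, so `bookmarklet.js` conceptually has the shape sketched below (an illustration only; the actual file embeds the complete script in place of the comment):

```js
// Illustrative shape of a bookmarklet, not the real bookmarklet.js:
javascript:(function () {
	/* … entire contents of harvester.js … */
})();
```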
Running `harvester.js` adds three properties to the global context: `harvest()`, `silo`, and `that`.

`harvest()` is the function used for starting a search. It takes two arguments:

- `query`
  Your search query: either an extension or a filename.

  ```js
  harvest("extension:foo"); // Extension
  harvest("filename:foo");  // Filename
  ```

  Because extensions are searched for more frequently than filenames, the `"extension:"` prefix is optional. Ergo, the first line above can be shortened to just this:

  ```js
  harvest("foo"); // Extension
  ```

  This is the format used throughout the rest of this documentation.
- `searchHack`
  An optional legitimate search query to include. The default is `"NOT nothack"` followed by random hex digits:

  ```js
  "NOT nothack" + Math.random(1e6).toString(16).replace(/\./, "").toUpperCase();
  ```

  However, sometimes you want to narrow searches down to files which contain a certain substring. In those cases, you include the second parameter:

  ```js
  // Match `*.foo` files which contain the word "bar".
  harvest("foo", "bar");
  ```

  Sidenote: the `"nothack"` above is necessitated by the requirement that advanced searches include specific search criteria. This makes site-wide searching of extensions impossible, so this hacky workaround is used instead.
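To recap, these are the call forms described above; the hex digits in the second line are just a made-up example of what the generated default looks like:

```js
harvest("foo");                                 // Extension search, shorthand form
harvest("extension:foo", "NOT nothack0A1B2C3"); // Roughly what that shorthand expands to
harvest("filename:.bashrc");                    // Filename search
harvest("foo", "bar");                          // Only *.foo files containing the word "bar"
```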
`silo` is an `Object` where successful searches are cached, keyed by query. Its contents look like this:

```js
window.silo = {
	"extension:foo": {
		length: 6528,
		"/user/repo/blob/6eb5537/path/file.foo": "https://raw.githubusercontent.com/…",
		// … 6527 more results
	},
};
```
The `silo` contains a helper method called `reap`, which extracts, sorts, and joins a list of results as a string. It's called internally when accessing `window.that` to extract a sorted URL list.

The `silo` exists to provide some way of resuming an interrupted harvest, such as in the case of a lost connection. It isn't some persistent storage mechanism: navigating to another page causes its contents to be lost.
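For example, you can poke at the `silo` from the console to check on a cached search, or pull a URL list for any query it holds (the `extension:foo` key below is hypothetical):

```js
// Number of results cached for a given query:
silo["extension:foo"].length;

// Copy the sorted, joined URL list for that query; this is the same thing
// `that` refers to for the most recent harvest:
copy(silo.reap("extension:foo"));
```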
`that` is a reference to the results of the last successful harvest. It's meant for use with the console's `copy` command:

```js
copy(that);
```

which is essentially a shortcut for

```js
copy(silo.reap("extension:foo"));
```
The list copied by running `copy(that);` is a plain-text list of URLs that can be passed to `wget(1)`, `curl(1)`, or a similar utility to download files en masse:

```sh
# Using `wget` (recommended)
wget -nv -i /path/to/url.list

# Using `curl`
sed -e "s/'/%27/g" /path/to/url.list | xargs -n1 curl -# -O
```

Alternatively, to preserve the directory hierarchy while downloading files, you can use the following command:

```sh
wget -nv -x -i /path/to/url.list
```
Some useful shell commands to help with reporting in-the-wild usage on GitHub:

This reads from `urls.log` and writes the output to `unique-repos.log`:

```sh
grep < urls.log -iEoe '^https?://raw\.githubusercontent\.com/([^/]+/){2}' \
	| sort | uniq | sed -Ee 's,raw\.(github)usercontent,\1,g' > unique-repos.log
```

This reads from `unique-repos.log` and saves to `unique-users.log`:

```sh
grep < unique-repos.log -oE '^https://github\.com\/[^/]+' \
	| sort | uniq > unique-users.log
```

This prints a summary of how many unique users and repositories were found in total:

```sh
wc -l unique-{repos,users}.log | grep -vE '\stotal$' | grep -oE '^\s*[0-9]+' |\
	xargs printf '\
Unique repos: %s
Unique users: %s
'
```
The following utilities are also of interest:

- `gh-search`
  Opens the URL of an extension/filename search using the system's default browser.

  ```sh
  gh-search -e foo; # Search by extension
  gh-search -f foo; # Search by filename
  ```

- `fixext`
  Fixes the suffixes added by `wget(1)` when downloading files with the same name.

  ```console
  $ fixext foo *
  Renamed: saved.foo.1 -> saved.1.foo
  ```