A project repository for the paper: On Measuring Vulnerable JavaScript Functions in the Wild
First part of my project: crawling data from Snyk and from GoogleDB to compile a dataset of vulnerable functions.
Dataset from Snyk
Warning The scripts were developed around January 2021 and use HTML crawling to extract data from Snyk vulnerability entries. Snyk has since slightly changed its interface, so the queries for HTML elements require minor adaptation. Edits are welcome.
- Collect meta information from all Snyk entries for JavaScript (npm). For that, go to “vulnerable dataset/snyk” and execute
node snykPages.js path/to/output/in/json
- Crawl vulnerable and fixed functions from Snyk entries that have a link to GitHub commits. For that, go to “vulnerable dataset/snyk/commits” and run
node fetchCommitFuncs.js path/to/input path/to/output/
where path/to/input is the result of snykPages.js. (A conceptual sketch of fetching a commit’s changed files from GitHub is shown after this list.)
- Crawl vulnerable and fixed functions from Snyk entries that point to GitHub Vulnerable Code. For that, go to “vulnerable dataset/snyk/vulnerable code” and run
node fetchVulnCodeFuncs path/to/input path/to/output
where path/to/input is the result of running snykPages.js.
- From both resulting files, filter out some obviously false results by running
node "vulnerable dataset/filterVulnDS.mjs" path/to/input path/to/filteredOutput
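The commit crawlers work from GitHub commit links. As a rough, conceptual sketch of that step (not the actual fetchCommitFuncs.js, which may extract the functions differently), the files changed by a commit can be obtained from the GitHub REST API; the owner, repository and commit SHA below are placeholders:

```js
// Minimal sketch, assuming Node 18+ (built-in fetch). Unauthenticated requests are rate-limited.
const owner = 'some-owner';          // placeholder
const repo = 'some-repo';            // placeholder
const sha = 'commit-sha-from-entry'; // placeholder

async function fetchChangedJsFiles() {
  // The commits endpoint lists the files touched by a commit, each with a unified diff ("patch").
  const res = await fetch(`https://api.github.com/repos/${owner}/${repo}/commits/${sha}`, {
    headers: { Accept: 'application/vnd.github+json' },
  });
  const commit = await res.json();
  return (commit.files || [])
    .filter((f) => f.filename.endsWith('.js'))
    .map((f) => ({ file: f.filename, patch: f.patch }));
}

fetchChangedJsFiles().then((files) => console.log(files));
```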
Dataset from GoogleDB
Warning The project was discontinued on December 13, 2021. The scripts, however, still work.
- Collect links to vulnerable entries. For that go to “vulnerable dataset/googleDB” and run
node crawlLinks.mjs path/to/output 0
where 0 is the table we are interested in. (This vulnerability registry has two tables on its website: the first one (index = 0) stores vulnerabilities with complete information, including a link to a GitHub commit; the second one (index = 1) stores other vulnerabilities that lack the link and/or the necessary metadata. We only crawl table 0, because we need GitHub commit links.)
- Get the necessary metadata and links from the result of step 1. For that, in the same folder, run
node processLinks.mjs path/to/input path/to/output
where path/to/input is the path to the result of step 1.
- As with Snyk, crawl the GitHub commits. Go to “vulnerable dataset/googleDB/commits” and run
node fetchCommitFuncs.mjs path/to/input path/to/output
where path/to/input is the result of step 2.
- Perform the same filtering as in the last Snyk step (filterVulnDS.mjs).
The next step is to manually analyse all resulting datasets of vulnerable and fixed functions. For that we developed a framework that makes the process easier. To use the framework, go to “tool for manual verification”, make sure you have installed all npm modules (“npm i”), and simply execute “npm start”. This will open a web application that is pretty intuitive. You can upload any of the resulting datasets to the framework and play around with the interface.
Note You might need to work out react compatibility issues.
P.S. You can use this tool with other datasets; the dataset just needs to be in roughly the same format (you can omit fields if you don’t need them).
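For orientation, an entry in such a dataset might look roughly like the sketch below. Only the general shape is implied by the scripts above; the concrete field names here (apart from function, which the Semgrep script below also expects) are illustrative assumptions, and unused fields can simply be omitted:

```json
{
  "function": "function parse(input) { /* vulnerable version */ }",
  "fixedFunction": "function parse(input) { /* patched version */ }",
  "vulnerabilityType": "ReDoS",
  "commitUrl": "https://github.com/owner/repo/commit/abc123"
}
```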
Warning This process relies on the npm package all-the-package-names, version 1.3904.0. Later versions of the package do not include sorting by the "dependent count".
To crawl real-world functions from npm packages, go to “crawl real-world functions/npm” and run
node getModules.js
It will create a JSON file at “./data/npmFunctions.json” (if you want to change the location, you can do it in the code of getModules.js) with functions and their GitHub links.
- Crawl extension IDs from Chrome. Open “inspect => console” in Chrome and insert the script from “crawl real-world functions/ext/scriptForChromeInspectConsole.txt”. In the request on line 19, change the string by inserting your email and your token; on line 14, change your client data. To get your token, go to https://chrome.google.com/webstore/category/extensions, open the developer console on the “Network” tab, find any XHR request, click on it, and open the “Payload” tab. In the “Form Data” field there will be “login” and “t” fields with the necessary information.
- Collect the IDs (if the request doesn’t work, try analysing similar requests in the Network tab and comparing the fields).
- In the folder “crawl real-world functions/ext” run
node crawlZips.mjs path/to/listofIDs
to download the extensions as zip archives.
- In the same folder, execute
node extractZips.mjs
to unpack extensions.
- Lastly, execute “node extractFunctions.mjs” to create a JSON file with all functions and links to their locations.
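As an illustration of the function-extraction technique (this is not the actual extractFunctions.mjs; it assumes the acorn and acorn-walk packages, which the real script may or may not use):

```js
// Minimal sketch: parse one JavaScript file and record every function together with its location.
import fs from 'node:fs';
import * as acorn from 'acorn';
import * as walk from 'acorn-walk';

const file = 'path/to/unpacked/extension/background.js'; // placeholder
const source = fs.readFileSync(file, 'utf8');
const ast = acorn.parse(source, { ecmaVersion: 'latest', locations: true });

const functions = [];
const record = (node) => {
  functions.push({
    function: source.slice(node.start, node.end), // the function's source code
    file,
    line: node.loc.start.line,                    // where it is located
  });
};
walk.simple(ast, {
  FunctionDeclaration: record,
  FunctionExpression: record,
  ArrowFunctionExpression: record,
});

fs.writeFileSync('functions.json', JSON.stringify(functions, null, 2));
```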
To run the Semgrep detection, go to “semgrep/” and run
node "detectRedos&protoSemgrep.mjs" path/to/input path/to/output
This script works with input data in JSON format, where each entry is an object with a mandatory field “function” holding the code, plus any optional fields. It returns a modified version of the input file in which objects are flagged with “protoPollution”: true and/or “redos”: true (if the pattern is detected), together with a “matches” field giving the concrete location of the flagged vulnerability.
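For example, an input entry and a possible flagged output entry might look roughly like this (the package field is just a placeholder for “any other optional field”, and the exact contents of matches depend on the Semgrep rules):

```js
// input entry
{ "function": "function check(s) { return /^(a+)+$/.test(s); }", "package": "example-pkg" }

// possible output entry, if the ReDoS pattern is detected
{
  "function": "function check(s) { return /^(a+)+$/.test(s); }",
  "package": "example-pkg",
  "redos": true,
  "matches": [ /* concrete location(s) of the flagged pattern */ ]
}
```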
To recreate the cryptographic hash and fuzzy hash comparison experiment, go to “CryptoFuzzy/”.
- First, execute
node "tokenize&Cryptohash.mjs" path/to/input path/to/output
to create tokens and calculate a hash for each function. Run this script twice: first on the ground truth (in my case, the manually confirmed dataset of vulnerable functions from Snyk and GoogleDB), and second on the target functions (in my case, the real-world functions). A rough sketch of this tokenize-and-hash step is shown after this list.
- To compare the cryptographic hashes, run
python3 compareCrypto.py path/to/vulnDS path/to/target path/to/output
- To compare fuzzy hashes run
python3 compareFuzzyHash.py path/to/vulnDS path/to/target path/to/output
(Note that the fuzzy hashes themselves are created in this script, unlike the cryptographic hashes.)
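As a rough sketch of the tokenize-and-hash step referenced above (the actual tokenize&Cryptohash.mjs may tokenize differently and use a different hash function; acorn and SHA-256 are assumed here purely for illustration):

```js
// Minimal sketch: tokenize a function's source and hash the token stream.
import crypto from 'node:crypto';
import * as acorn from 'acorn';

function tokenizeAndHash(code) {
  const tokens = [];
  // acorn.tokenizer yields tokens; keep the value for identifiers/literals, otherwise the type label.
  for (const token of acorn.tokenizer(code, { ecmaVersion: 'latest' })) {
    tokens.push(token.value !== undefined ? String(token.value) : token.type.label);
  }
  const hash = crypto.createHash('sha256').update(tokens.join(' ')).digest('hex');
  return { tokens, hash };
}

console.log(tokenizeAndHash('function add(a, b) { return a + b; }'));
```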
This part belongs to the extended version of the paper. The scripts work with the results of the Semgrep detection for npm packages. The main script is located at taintAnalysis/automated-clean.js; follow the instructions in the comments throughout the script to adapt it to your needs.