JavaScript-vulnerability-detection

A project repository for the paper “On Measuring Vulnerable JavaScript Functions in the Wild”.

Creating the vulnerable dataset

First part of my project: crawling data from Snyk and from GoogleDB to compile a dataset of vulnerable functions.

Dataset from Snyk

Warning: The scripts were developed around January 2021 and use HTML crawling to extract data from Snyk vulnerability entries. Snyk has since slightly changed its interface, so the queries for HTML elements require minor adaptation. Edits are welcome.

  1. Collect meta information from all Snyk entries for JavaScript (npm). For that, go to “vulnerable dataset/snyk” and execute
node snykPages.js path/to/output/in/json
  2. Crawl vulnerable and fixed functions from Snyk entries that have a link to GitHub commits. For that, go to “vulnerable dataset/snyk/commits” and run
node fetchCommitFuncs.js path/to/input path/to/output/

where path/to/input is the result of snykPages.js.

  3. Crawl vulnerable and fixed functions from Snyk entries that point to GitHub vulnerable code. For that, go to “vulnerable dataset/snyk/vulnerable code” and run
node fetchVulnCodeFuncs path/to/input path/to/output

where path/to/input is the result of running snykPages.js.

  4. From both resulting files, filter out obviously false results by running
node "vulnerable dataset/filterVulnDS.mjs" path/to/input path/to/filteredOutput
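
The intermediate files are plain JSON. As a rough illustration of what flows between the steps (the field names below are hypothetical, so inspect the actual output of snykPages.js before relying on them), an entry might look like:

```js
// Hypothetical shape of one entry produced by snykPages.js and consumed by the
// fetch scripts — the real field names may differ.
const exampleEntry = {
  title: "Prototype Pollution in some-npm-package", // vulnerability title from the Snyk entry
  url: "https://security.snyk.io/vuln/...",         // the Snyk entry itself
  references: [
    "https://github.com/owner/repo/commit/...",     // GitHub commit / vulnerable-code links
  ],
};
```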

Dataset from GoogleDB

Warning: The GoogleDB project was discontinued on December 13, 2021. The scripts, however, still work.

  1. Collect links to vulnerable entries. For that, go to “vulnerable dataset/googleDB” and run
node crawlLinks.mjs path/to/output 0

where 0 is the index of the table we are interested in. The registry’s website has two tables: the first (index 0) stores vulnerabilities with complete information, including a link to a GitHub commit; the second (index 1) stores vulnerabilities that lack the link and/or the necessary metadata. We only crawl table 0, because we need GitHub commit links.

  2. Get the necessary metadata and links from the result of step 1. For that, in the same folder run
node processLinks.mjs path/to/input path/to/output

where path/to/input is the path to the result of step 1.

  3. Same as with Snyk, crawl GitHub commits. Go to “vulnerable dataset/googleDB/commits” and run
node fetchCommitFuncs.mjs path/to/input path/to/output

where path/to/input is the result of step 2.

  4. Perform the same filtering as for Snyk (filterVulnDS.mjs).

Verifying the vulnerable dataset

The next step is to manually analyse all resulting datasets of vulnerable and fixed functions. For that we developed a framework that makes the process easier. To use the framework, go to “tool for manual verification”, make sure all npm modules are installed (“npm i”), and simply execute “npm start”. This opens a fairly intuitive web application. You can upload any of the resulting datasets to the framework and play around with the interface.

Note: You might need to work out React compatibility issues.

P.S. You can use this tool with other datasets; they just need to be in roughly the same format (you can omit fields you don’t need).
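
For reference, a minimal dataset entry might look like the following sketch (purely illustrative; field names other than the function code itself are hypothetical and optional):

```js
// Hypothetical dataset entry for the verification tool.
const exampleEntry = {
  function: "function merge(target, source) { /* ... */ }", // the vulnerable or fixed function's code
  vulnerability: "Prototype Pollution",                     // label confirmed during manual verification
  link: "https://github.com/owner/repo/commit/...",         // where the function was crawled from
};
```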

Crawling real-world functions

Npm modules

Warning: This process relies on the npm package all-the-package-names, version 1.3904.0. Subsequent versions of the package do not include sorting by the “dependent count”.

Go to “crawl real-world functions/npm” and run

node getModules.js

It will create a JSON file at “./data/npmFunctions.json” (the location can be changed in the code of getModules.js) with functions and their GitHub links.
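
getModules.js performs the full crawl; the sketch below only illustrates the general idea, under the assumptions that all-the-package-names exposes an array of package name strings (sorted by dependent count in v1.3904.0, per the warning above) and that a Node version with a global fetch (18+) is used. Extracting the functions themselves from the repositories is not shown.

```js
// sketch.cjs — rough illustration of the approach, NOT the actual getModules.js.
const fs = require('fs');
const names = require('all-the-package-names'); // assumed: array of package name strings

// Resolve a package's repository URL via the public npm registry metadata.
async function repoUrlFor(pkg) {
  const res = await fetch(`https://registry.npmjs.org/${encodeURIComponent(pkg)}`);
  if (!res.ok) return null;
  const repo = (await res.json()).repository;
  return typeof repo === 'string' ? repo : repo?.url ?? null;
}

(async () => {
  const result = [];
  for (const pkg of names.slice(0, 100)) { // top packages by dependent count (assumed ordering)
    const repo = await repoUrlFor(pkg);
    if (repo) result.push({ package: pkg, repository: repo });
  }
  fs.mkdirSync('./data', { recursive: true });
  fs.writeFileSync('./data/npmFunctions.json', JSON.stringify(result, null, 2));
})();
```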

Extensions

  1. Crawl extension IDs from Chrome. Open “inspect => console” in Chrome and insert the script from “crawl real-world functions/ext/scriptForChromeInspectConsole.txt”. In the request on line 19 of that script, change the string to insert your email and your token; on line 14, change your client data. To get your token, go to https://chrome.google.com/webstore/category/extensions, open the developer console on the “Network” tab, find any XHR request, click on it, and open the “Payload” tab. In the “Form Data” field there will be “login” and “t” fields with the necessary information.

  2. Collect the IDs (if the request doesn’t work, try analysing similar requests in the Network tab and comparing the fields).

  3. In the folder “crawl real-world functions/ext”, run

node crawlZips.mjs path/to/listofIDs

to download the zip archives of the extensions.

  4. In the same folder, execute
node extractZips.mjs

to unpack the extensions.

  5. Lastly, execute “node extractFunctions.mjs” to create a JSON file with all functions and links to their locations.

Semgrep analysis

To run the Semgrep detection, go to “semgrep/” and run

node "detectRedos&protoSemgrep.mjs" path/to/input path/to/output

This script works with input data in JSON format, where each entry is an object with the mandatory field “function” (holding the function’s code) and any other optional fields. It returns a modified version of the input file in which objects are flagged with “protoPollution”: true and/or “redos”: true (if a pattern is detected), plus a “matches” field with the concrete location of the flagged vulnerability.
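
To make the format concrete (the values below are made up, and the optional “source” field is hypothetical), an input entry and a possible flagged output entry look roughly like this; the exact contents of “matches” follow whatever semgrep reports:

```js
// Input entry — only the "function" field is mandatory; anything else is passed through.
const inputEntry = {
  function: "const re = /(a+)+$/; re.test(userInput);",
  source: "example.js",
};

// Possible output entry after detection.
const outputEntry = {
  function: "const re = /(a+)+$/; re.test(userInput);",
  source: "example.js",
  redos: true,                 // flagged because a ReDoS pattern was detected
  matches: [/* semgrep match locations for the flagged pattern */],
};
```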

Crypto hash and fuzzy hash comparison

To recreate the crypto hash and fuzzy hash comparison experiment, go to “CryptoFuzzy/”.

  1. Firstly, execute
node "tokenize&Cryptohash.mjs" path/to/input path/to/output

to create tokens and calculate a hash for each function. Run this script twice: first on the ground truth (in my case, the manually confirmed dataset of vulnerable functions from Snyk and GoogleDB), and then on the target functions (in my case, the real-world functions).
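
As a rough sketch of what tokenizing and hashing can look like (the real tokenize&Cryptohash.mjs may tokenize differently), one could normalize the function source into a token stream and hash it with Node’s built-in crypto module. Hashing tokens rather than raw source makes the later exact-match comparison insensitive to whitespace and comment differences.

```js
// sketch.mjs — NOT the actual tokenize&Cryptohash.mjs: a naive tokenizer plus
// a SHA-256 hash over the token stream.
import { createHash } from 'node:crypto';

function tokenize(code) {
  // Strip comments, then split into identifiers, numbers, and punctuation.
  return code
    .replace(/\/\/.*$/gm, '')
    .replace(/\/\*[\s\S]*?\*\//g, '')
    .match(/[A-Za-z_$][\w$]*|\d+|[^\s\w]/g) ?? [];
}

function cryptoHash(code) {
  return createHash('sha256').update(tokenize(code).join(' ')).digest('hex');
}

// Example: identical token streams yield identical hashes, so formatting and
// comment differences do not affect the comparison.
console.log(cryptoHash('function f(a){return a+1;}') ===
            cryptoHash('function f(a) { return a + 1; } // same tokens'));
```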

  2. To compare the cryptographic hashes, run
python3 compareCrypto.py path/to/vulnDS path/to/target path/to/output
  3. To compare the fuzzy hashes, run
python3 compareFuzzyHash.py path/to/vulnDS path/to/target path/to/output

(Note that the fuzzy hashes themselves are created inside this script, whereas the crypto hashes are computed beforehand in step 1.)
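
Conceptually, the cryptographic-hash comparison is an exact-match lookup between the ground-truth hashes and the target hashes. A minimal JavaScript illustration of that idea (the actual comparison is done by compareCrypto.py, and the “hash” field name is hypothetical):

```js
// Conceptual illustration only — the real comparison lives in compareCrypto.py.
// Flags every target function whose hash exactly matches a hash from the
// verified vulnerable dataset produced in step 1.
function flagExactMatches(vulnDS, targets) {
  const known = new Set(vulnDS.map((entry) => entry.hash));
  return targets.filter((entry) => known.has(entry.hash));
}
```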

Taint analysis for npm packages

This is part of the extended version of my paper. The scripts work with the results of the Semgrep detection for npm packages. The main executable file is located at taintAnalysis/automated-clean.js. Follow the instructions in the comments throughout the script to adapt it to your needs.