Leakage Analysis

A static analysis tool to detect test data leakage in Python notebooks

This tool accompanies the ASE'22 paper "Data Leakage in Notebooks: Static Detection and Better Processes". An online demo is also available. For our evaluation scripts and materials, please refer to this repo.

How to build

Install souffle, the datalog engine we use for our main analysis. Make sure that souffle can be invoked directly from the command line.
Pull and build our customized version of pyright, the type inference engine we use: git submodule update --init --recursive (please refer to the submodule for build instructions).
Install the required Python packages listed in requirements.txt. We use Python 3.8 for our tool; other Python versions might produce a different parsed AST and cause unexpected errors.
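
The two environment requirements above (souffle on the PATH, Python 3.8) can be checked before running the tool. The following is a minimal sketch, not part of this repository; the function name check_environment is hypothetical, and it only uses the standard library.

```python
import shutil
import sys


def check_environment(required=(3, 8)):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    # souffle must be discoverable on PATH so it can be invoked directly.
    if shutil.which("souffle") is None:
        problems.append("souffle not found on PATH")
    # Compare only major.minor; other versions may parse to a different AST.
    if tuple(sys.version_info[:2]) != tuple(required):
        problems.append(
            "running Python %d.%d, but the tool was developed on %d.%d"
            % (sys.version_info[0], sys.version_info[1], required[0], required[1])
        )
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("warning:", problem)
```

A version mismatch is reported as a warning rather than an error here, since the tool may still run on other versions.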

How to use

Run analysis for a single Python file: python3 -m src.main /path/to/file
Run analysis for all Python files in a directory: python3 -m src.run /path/to/dir
More information can be found using the -h flag.

How to build and run Docker image

Pull our customized version of pyright, the type inference engine we use: git submodule update --init --recursive.
Add all Python libraries used by the analyzed code to requirements.txt; they will be installed in the container and used by pyright.
Build Docker image: docker build -t leakage-analysis .
Run Docker image: docker run -v /path/to/dir:/path/to/dir leakage-analysis /path/to/dir/$FILE -o (all notebooks to be analyzed should first be converted to Python files and stored in /path/to/dir).
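
Converting notebooks to Python files can be done with jupyter nbconvert --to script, or with a few lines of standard-library Python. The sketch below is an illustration under assumptions, not part of this repository: the function name notebook_to_python is hypothetical, it assumes nbformat-4 notebooks (a top-level "cells" list), and it comments out any line starting with % or !, which may also hit a continuation line that happens to start with %.

```python
import json
from pathlib import Path


def notebook_to_python(ipynb_path, out_dir):
    """Write the code cells of an nbformat-4 notebook to a .py file in out_dir."""
    ipynb_path = Path(ipynb_path)
    notebook = json.loads(ipynb_path.read_text(encoding="utf-8"))
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        # "source" may be a single string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        # Comment out IPython magics and shell escapes, which are not valid Python.
        lines = [
            "# " + line if line.lstrip().startswith(("%", "!")) else line
            for line in source.splitlines()
        ]
        chunks.append("\n".join(lines))
    out_path = Path(out_dir) / (ipynb_path.stem + ".py")
    out_path.write_text("\n\n".join(chunks) + "\n", encoding="utf-8")
    return out_path
```

For example, notebook_to_python("/path/to/dir/demo.ipynb", "/path/to/dir") writes /path/to/dir/demo.py, which can then be passed to the container.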

How to read output

For an input file test.py, an output HTML file test.html is generated if the -o flag is specified.

test.html shows the analysis results alongside the input code. A summary table of detected leakage issues appears at the top. The interactive buttons can be used to highlight relevant code and navigate between code segments.

Internal Structure

Given a Python file, src/main.py first parses the input into an AST. It then feeds the AST to a GlobalCollector instance (from global_collector.py), which collects the global variables that cannot be renamed in later transformations; these variables are ignored in the later steps.
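
To illustrate this first step, the sketch below parses a snippet with the standard ast module and records names declared with a global statement. It is a hypothetical stand-in: the class name GlobalNameCollector and the choice to look only at global statements are assumptions, and the real GlobalCollector may use different criteria.

```python
import ast


class GlobalNameCollector(ast.NodeVisitor):
    """Hypothetical stand-in: records names declared with a `global` statement."""

    def __init__(self):
        self.globals = set()

    def visit_Global(self, node):
        # ast.Global stores the declared names as a list of strings.
        self.globals.update(node.names)


source = """
counter = 0

def bump():
    global counter
    counter += 1
"""

tree = ast.parse(source)
collector = GlobalNameCollector()
collector.visit(tree)
print(sorted(collector.globals))  # ['counter']
```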

Next, it feeds the AST to a CodeTransformer instance (from irgen.py) that translates the original Python code into a simpler version: it 1) breaks complex statements down into multiple simpler ones, and 2) converts the code to static single assignment (SSA) form.
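
The SSA idea can be illustrated on straight-line code, where every assignment simply gets a fresh versioned name. The sketch below is not the CodeTransformer from irgen.py: the class name StraightLineSSA and the name_N naming scheme are assumptions, and it ignores branches, loops, augmented assignments, and function scopes, all of which a real SSA conversion has to handle.

```python
import ast


class StraightLineSSA(ast.NodeTransformer):
    """Give every plain assignment in straight-line code a fresh name.

    Illustration only: no branches, loops, augmented assignments, or
    function scopes are handled.
    """

    def __init__(self):
        self.version = {}  # variable name -> latest version number

    def visit_Assign(self, node):
        # Rewrite the right-hand side first so it still sees the old versions.
        node.value = self.visit(node.value)
        node.targets = [self.visit(target) for target in node.targets]
        return node

    def visit_Name(self, node):
        if isinstance(node.ctx, ast.Store):
            self.version[node.id] = self.version.get(node.id, -1) + 1
        # Names never assigned in this code (e.g. `load`) are left untouched.
        if node.id in self.version:
            node.id = "%s_%d" % (node.id, self.version[node.id])
        return node


source = """
x = load()
x = x.dropna()
y = x + 1
"""

tree = StraightLineSSA().visit(ast.parse(source))
targets = [stmt.targets[0].id for stmt in tree.body]
print(targets)  # ['x_0', 'x_1', 'y_0']

if hasattr(ast, "unparse"):  # ast.unparse was added in Python 3.9
    print(ast.unparse(tree))
```

After the transformation, the second statement reads x_1 = x_0.dropna(), so each use refers to exactly one definition, which is what makes the later datalog analysis simpler to express.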

Then it runs the type inference engine on the transformed code file. Using the inferred types, FactGenerator (from factgen.py) converts the code file into datalog facts that the final analysis can read.
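
Souffle reads input relations from tab-separated .facts files by default, so fact generation amounts to walking the code and writing one row per fact. The sketch below is a simplified illustration, not the FactGenerator from factgen.py: the relation name AssignCall, its columns, and the function name write_call_facts are hypothetical, and no type information is used.

```python
import ast
import csv
from pathlib import Path


def write_call_facts(source, fact_dir):
    """Emit one (target, callee, line) row per `name = callee(...)` statement."""
    rows = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Assign)
            and len(node.targets) == 1
            and isinstance(node.targets[0], ast.Name)
            and isinstance(node.value, ast.Call)
            and isinstance(node.value.func, ast.Name)
        ):
            rows.append((node.targets[0].id, node.value.func.id, node.lineno))
    rows.sort(key=lambda row: row[2])  # keep rows in source order
    fact_dir = Path(fact_dir)
    fact_dir.mkdir(parents=True, exist_ok=True)
    # A hypothetical Souffle program would declare the matching relation as:
    #   .decl AssignCall(target: symbol, callee: symbol, line: number)
    #   .input AssignCall
    path = fact_dir / "AssignCall.facts"
    with path.open("w", newline="", encoding="utf-8") as handle:
        csv.writer(handle, delimiter="\t", lineterminator="\n").writerows(rows)
    return path
```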

Finally, it runs the datalog analysis (main.dl) on the generated facts and writes the results to the same directory.
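
For orientation, this last step can be sketched as a call to the souffle command line, using its -F (fact directory) and -D (output directory) options. How src/main.py actually invokes souffle is not described here, so the helper names souffle_command and run_souffle and the directory layout are assumptions.

```python
import shutil
import subprocess


def souffle_command(program, fact_dir, output_dir):
    """Build the argument list: read facts from fact_dir, write results to output_dir."""
    return ["souffle", "-F", str(fact_dir), "-D", str(output_dir), str(program)]


def run_souffle(program, fact_dir, output_dir):
    """Run the datalog program; raises if souffle is missing or exits with an error."""
    if shutil.which("souffle") is None:
        raise RuntimeError("souffle not found on PATH")
    subprocess.run(souffle_command(program, fact_dir, output_dir), check=True)
```

For example, run_souffle("src/main.dl", "/path/to/facts", "/path/to/facts") would read the facts from and write the results to the same directory.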

Directory Structure

src
├── factgen.py: convert transformed code to datalog facts
├── global_collector.py: collect global variables
├── __init__.py
├── irgen.py: transform code to simpler SSA form
├── main.dl: main datalog analysis that analyzes leakage
├── main.py: run analysis on a single file
├── render.py: output an HTML file based on analysis results and original code
├── run.py: run analysis on multiple files
└── scope.py: manage variable scopes for renaming purposes
