PyAssessment

An automatic grading tool for Python functions, based on concolic execution.

About This Project

PyAssessment is an automatic grading tool that scores a student implementation based on its semantic similarity to a reference implementation. The tool can be used as a web service.

This code is a modified version of:

PyJudge: An automatic grading tool that takes a reference implementation and a student implementation, and finds input(s) that generate a different output.
PyExZ3: A Dynamic Symbolic Execution Engine for Python.

This repository will be the deliverable of my final project.

Getting Started (Docker)

Make sure you have Docker installed.
Start the server using this command:

docker-compose up --build

Visit http://localhost:5000/. Ignore the IP given in the server log.

Getting Started (Python)

Make sure you have Python 3.10 or later installed (required for the type hints used in the code).
Install the requirements.

pip install -r requirements.txt

On macOS, open setup.sh, change the path to match your local machine, then run:

. grader/setup.sh

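Make sure you are in the repository root, then set the Python path to the current directory. The command below is for the Windows command prompt; on macOS or Linux, use export PYTHONPATH=. instead: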

set PYTHONPATH=.

Start the server using this command:

python web_service/src/main.py

Visit http://localhost:5000/. Ignore the IP given in the server log.

Endpoints

Full documentation can be accessed here (TBD).

Usage

python grade.py <reference_implementation> <student_implementation> [options]

Example Usage

python grade.py test/max_3/max_3.py test/max_3/max_3_1.py

DOT files containing the exploration paths are written to the logs folder as student.dot and reference.dot. They can be viewed with Graphviz or with an online DOT viewer.

The command saves the result to the res folder and prints output similar to the following:

Reference: max_3.max_3
Grading: max_3_1.max_3_1
======
RESULT
======
tested:
{
    (('a', 0), ('b', 0), ('c', 0)) : [0, 0, And(a >= b, a >= c), And(a <= b, b <= a), 'Exploration'],
    (('a', 0), ('b', -1), ('c', 0)) : [0, 0, And(a >= b, a >= c), And(a > b, a <= c, b <= a), 'PathDeviation'],
    (('a', 0), ('b', 2), ('c', 0)) : [2, 2, And(a < b, b >= a, b >= c), And(a <= b, b > a, b > c), 'Exploration'],
    (('a', -1), ('b', 0), ('c', 0)) : [0, 0, And(a < b, b >= a, b >= c), And(a <= b, b > a, b <= c), 'PathDeviation'],
    (('a', 0), ('b', 0), ('c', 1)) : [1, 1, And(a >= b, a < c, b >= a, b < c), And(a <= b, b <= a), 'Exploration'],
    (('a', 0), ('b', 0), ('c', -1)) : [0, -1, And(a >= b, a >= c), And(a <= b, b <= a), 'PathEquivalence'],
    (('a', 1), ('b', 2), ('c', 3)) : [3, 3, And(a < b, b >= a, b < c), And(a <= b, b > a, b <= c), 'Exploration'],
    (('a', 1), ('b', 0), ('c', 2)) : [2, 2, And(a >= b, a < c, b < a), And(a > b, a <= c, b <= a), 'Exploration'],
    (('a', 2), ('b', 0), ('c', 0)) : [2, 2, And(a >= b, a >= c), And(a > b, a > c), 'Exploration'],
    (('a', 0), ('b', 1), ('c', 0)) : [1, 1, And(a < b, b >= a, b >= c), And(a <= b, b > a, b > c), 'Exploration'],
    (('a', 10), ('b', 0), ('c', 12)) : [12, 12, And(a >= b, a < c, b < a), And(a > b, a <= c, b <= a), 'Exploration'],
    (('a', 4), ('b', 5), ('c', 8)) : [8, 8, And(a < b, b >= a, b < c), And(a <= b, b > a, b <= c), 'Exploration'],
}

tested from path dev or path eq:
{
    (('a', 0), ('b', -1), ('c', 0)) : [0, 0, And(a >= b, a >= c), And(a > b, a <= c, b <= a), 'PathDeviation'],
    (('a', -1), ('b', 0), ('c', 0)) : [0, 0, And(a < b, b >= a, b >= c), And(a <= b, b > a, b <= c), 'PathDeviation'],
    (('a', 0), ('b', 0), ('c', -1)) : [0, -1, And(a >= b, a >= c), And(a <= b, b <= a), 'PathEquivalence'],
}

wrong:
{
    (('a', 0), ('b', 0), ('c', -1)) : [0, -1, And(a >= b, a >= c), And(a <= b, b <= a), 'PathEquivalence'],
}

wrong from path dev or path eq:
{
    (('a', 0), ('b', 0), ('c', -1)) : [0, -1, And(a >= b, a >= c), And(a <= b, b <= a), 'PathEquivalence'],
}

grade:
91.66666666666666% (11/12)

path constraints:
{
    (And(a >= b, a >= c), And(a <= b, b <= a)) : 0.5,
    (And(a >= b, a >= c), And(a > b, a <= c, b <= a)) : 1,
    (And(a < b, b >= a, b >= c), And(a <= b, b > a, b > c)) : 1,
    (And(a < b, b >= a, b >= c), And(a <= b, b > a, b <= c)) : 1,
    (And(a >= b, a < c, b >= a, b < c), And(a <= b, b <= a)) : 1,
    (And(a < b, b >= a, b < c), And(a <= b, b > a, b <= c)) : 1,
    (And(a >= b, a < c, b < a), And(a > b, a <= c, b <= a)) : 1,
    (And(a >= b, a >= c), And(a > b, a > c)) : 1,
}

path constraint grade:
93.75% (7.5/8)

feedback:
Please check line(s) 2, 4, 7 in your program.
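
To make the sample output easier to read, the sketch below shows a hypothetical pair of implementations that would be consistent with it. The actual contents of test/max_3/max_3.py and test/max_3/max_3_1.py are not shown in this README, so both function bodies are assumptions reconstructed from the printed path constraints. Each entry in the output appears to read as [reference output, student output, reference path constraint, student path constraint, input source]. The per-path scores are copied from the output rather than computed; the 0.5 seems to reflect that one of the two inputs on that path pair, (0, 0, 0) and (0, 0, -1), produced a wrong result.

```python
def max_3(a, b, c):
    """Hypothetical reference implementation (assumed, not taken from the repository)."""
    if a >= b and a >= c:
        return a
    if b >= a and b >= c:
        return b
    return c


def max_3_1(a, b, c):
    """Hypothetical student implementation with a bug when a == b."""
    if a > b:
        return a if a > c else c
    elif b > a:
        return b if b > c else c
    else:
        return c  # bug: ignores a and b, so (0, 0, -1) yields -1 instead of 0


# The twelve inputs listed under "tested" in the sample output.
tested = [
    (0, 0, 0), (0, -1, 0), (0, 2, 0), (-1, 0, 0), (0, 0, 1), (0, 0, -1),
    (1, 2, 3), (1, 0, 2), (2, 0, 0), (0, 1, 0), (10, 0, 12), (4, 5, 8),
]

# Inputs on which the two implementations disagree.
wrong = [args for args in tested if max_3(*args) != max_3_1(*args)]
grade = (len(tested) - len(wrong)) / len(tested)

# Per-path scores copied from the "path constraints" section of the output.
path_scores = [0.5, 1, 1, 1, 1, 1, 1, 1]
path_grade = sum(path_scores) / len(path_scores)

print(wrong)  # [(0, 0, -1)]
print(f"grade: {grade:.2%} ({len(tested) - len(wrong)}/{len(tested)})")
print(f"path constraint grade: {path_grade:.2%} ({sum(path_scores)}/{len(path_scores)})")
```

Under these assumptions the input-level grade is 11/12 and the path-constraint grade is 7.5/8, matching the figures printed above.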

Options

  -h, --help            show this help message and exit
  -g GRADER, --grader=GRADER
                        Grader to be used. ['random', 'whitebox' (default)]
  -l LOGFILE, --log=LOGFILE
                        Save log output to a file.
  -m MAX_ITERS, --max-iters=MAX_ITERS
                        Run specified number of iterations (0 for unlimited).
                        Should be used for looping or recursive programs.
  -t MAX_TIME, --max-time=MAX_TIME
                        Maximum time for exploration (0 for unlimited). Expect
                        maximum execution time to be around three times the
                        amount.
  -q, --quiet           Quiet mode. Does not print path constraints. Should be
                        activated for looping or recursive programs as
                        printing z3 expressions can be time consuming.

Comparing with Random Input

One of the goals of this approach is to see whether it can cover edge cases that random input generation misses. To check this for a particular problem, grade it with randomly generated inputs and compare the result with the default PyAssessment grader.

python grade.py <reference_implementation> <student_implementation> -g random
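
As an illustration of why random inputs can miss edge cases, the sketch below grades a hypothetical buggy student function against a reference using uniformly random integers. It is not the implementation behind -g random; the function bodies, the input range, and the helper name random_grade are all assumptions made for this example. The bug only triggers when a == b and c is smaller than both, which a wide random range rarely produces.

```python
import random


def max_3(a, b, c):
    """Reference implementation (hypothetical)."""
    return max(a, b, c)


def max_3_student(a, b, c):
    """Student implementation (hypothetical), wrong when a == b and c < a."""
    if a > b:
        return a if a > c else c
    if b > a:
        return b if b > c else c
    return c


def random_grade(reference, student, trials=1000, low=-10**6, high=10**6, seed=0):
    """Return the random inputs on which the two functions disagree."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    failures = []
    for _ in range(trials):
        args = tuple(rng.randint(low, high) for _ in range(3))
        if reference(*args) != student(*args):
            failures.append(args)
    return failures


failures = random_grade(max_3, max_3_student)
print(f"random inputs that exposed the bug: {len(failures)} of 1000")

# A targeted input, like the one in the sample output, exposes it directly.
print(max_3(0, 0, -1), max_3_student(0, 0, -1))
```

Narrowing the range (for example low=-2, high=2) makes collisions between a and b far more likely, which is one way to see how sensitive random grading is to the chosen input distribution.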

Test All in Directory

Define the problem, reference solution, and student solutions in problems.py.
Run python run_tests.py [test_directory]

python run_tests.py test

The message All tests passed! is printed if every test passes. The JSON results of all tests are saved in the res folder.

Generate Report

Make sure that all JSON results are present in the res folder. They will be there if you have already run run_tests.py as described in the previous section.
Run python generate_report.py
The report is written to the res folder as report.csv.

Cleanup

Cleans up the res and logs folders.

python clean.py

Time Limits

For both the web service and grade.py, there is a hard time limit of 10 seconds. This is used to handle the case where the student implementation is not responding due to infinite loops or recursion.
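
One common way to enforce such a limit is to run the student code in a separate process and kill it when the time is up. The sketch below shows that general technique only; it is an assumption, not a description of how PyAssessment implements its limit, and the helper name run_with_time_limit is hypothetical.

```python
import subprocess
import sys


def run_with_time_limit(source, call, limit_seconds=10.0):
    """Evaluate `call` against `source` in a child interpreter.

    Returns (finished, output); `finished` is False when the limit was hit.
    """
    program = f"{source}\nprint({call})\n"
    try:
        completed = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            text=True,
            timeout=limit_seconds,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return False, ""
    return True, completed.stdout.strip()


looping_student = "def f(a):\n    while True:\n        pass\n"
working_student = "def f(a):\n    return a + 1\n"

print(run_with_time_limit(looping_student, "f(1)", limit_seconds=1.0))
print(run_with_time_limit(working_student, "f(1)"))
```

Running the code in a child process means an infinite loop cannot block the grader itself, at the cost of interpreter start-up time for every run.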

How does it do that?

TODO: Will do after finishing the project (obviously).

Limitations

The implementation must be a Python function with a return statement (not a procedure).
The function must use integer(s) as input argument(s).
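
As a quick illustration of these two constraints, the hypothetical functions below show one shape that fits and two that do not. None of them come from the repository.

```python
def absolute_difference(a: int, b: int) -> int:
    """Fits: integer arguments and an explicit return value."""
    if a >= b:
        return a - b
    return b - a


def print_difference(a: int, b: int) -> None:
    """Does not fit: a procedure that prints instead of returning."""
    print(abs(a - b))


def average(values: list[float]) -> float:
    """Does not fit: the argument is not an integer."""
    return sum(values) / len(values)
```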

Literature

TODO: Will put a link to the paper after finishing the project.

TODO

Improve tracing (when path constraints are equal but return value is wrong)
