This repository contains source code that applies computer vision techniques to source code understanding.
Install the dependencies through pip:
pip install -r requirements.txt
The experiments in this repository are conducted on CodeNet, available from [https://github.com/IBM/Project_CodeNet].
This step encodes each source code file as a matrix in which every element holds the codepoint of the character at that position. The character encoding is configurable as ASCII or UTF-8 (only the first 3 bytes).
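The encoding described above can be sketched as follows. This is a minimal illustration of the ASCII case only, not the implementation in data/code2pixel.py: the function name `code_to_matrix`, the fixed `width`, the padding value 0, and the mapping of non-ASCII characters to the padding value are all assumptions made for the example. The UTF-8 variant is not covered.

```python
def code_to_matrix(source, width=80, pad=0):
    """Encode source text as a 2D list of ASCII codepoints (hypothetical helper).

    Each row corresponds to one line of the source, each column to one
    character position. Rows are truncated or padded to `width`.
    """
    lines = source.splitlines() or [""]
    matrix = []
    for line in lines:
        row = []
        for ch in line[:width]:
            cp = ord(ch)
            # Assumption: characters outside the ASCII range map to the pad value.
            row.append(cp if cp < 128 else pad)
        # Pad short lines so every row has the same length.
        row.extend([pad] * (width - len(row)))
        matrix.append(row)
    return matrix


if __name__ == "__main__":
    for row in code_to_matrix("ab\nc", width=3):
        print(row)
```

The nested list can be converted to an array and saved in npy format if NumPy is available.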
Extract ASCII codepoint images
python data/code2pixel.py Project_CodeNet/data $CODEPOINT_IMAGE_DIR --ascii-only --image-format npy --keep-source-structure
See other options
python data/code2pixel.py --help
After the images/matrices are extracted, the dataset can be prepared for experiments.
Prepare the dataset descriptor from the codepoint images extracted above, holding out 7000 datapoints for validation and testing:
python data/prepare_codenet.py Project_CodeNet/metadata/ $CODEPOINT_IMAGE_DIR $DATASET_DESCRIPTOR --val 7000 --test 7000
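Holding out a fixed number of datapoints for validation and testing can be sketched as below. This is an illustration only: the function name `split_holdout`, the seeded shuffle, and the returned dictionary layout are assumptions, and the actual descriptor format written by data/prepare_codenet.py is not shown here.

```python
import random


def split_holdout(items, n_val, n_test, seed=0):
    """Split items into train/val/test with fixed hold-out sizes (hypothetical helper)."""
    items = list(items)
    if n_val + n_test > len(items):
        raise ValueError("hold-out sizes exceed the number of datapoints")
    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(items)
    return {
        "val": items[:n_val],
        "test": items[n_val:n_val + n_test],
        "train": items[n_val + n_test:],
    }


if __name__ == "__main__":
    splits = split_holdout(range(10), n_val=2, n_test=3)
    print({name: len(subset) for name, subset in splits.items()})
```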
Build vocab with minimum frequency of 2, from the train subset:
python3 build_vocab.py $DATASET_DESCRIPTOR Project_CodeNet/data/ --min_freq 2 --subset train --ascii_only
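Filtering a vocabulary by minimum frequency can be sketched as below. Whether build_vocab.py counts characters or tokens is not stated here, so this example assumes a character-level vocabulary. The function name `build_vocab` and the special entries `<pad>` and `<unk>` are also assumptions made for the illustration.

```python
from collections import Counter


def build_vocab(texts, min_freq=2, ascii_only=True, specials=("<pad>", "<unk>")):
    """Map characters seen at least `min_freq` times to integer ids (hypothetical helper)."""
    counts = Counter()
    for text in texts:
        for ch in text:
            # Assumption: with ascii_only, non-ASCII characters are skipped.
            if ascii_only and ord(ch) > 127:
                continue
            counts[ch] += 1
    # Keep only characters that are frequent enough, in a stable sorted order.
    kept = sorted(ch for ch, n in counts.items() if n >= min_freq)
    return {tok: idx for idx, tok in enumerate(list(specials) + kept)}


if __name__ == "__main__":
    print(build_vocab(["aab", "ac"]))
```

Characters below the frequency threshold would be mapped to `<unk>` at lookup time under this scheme.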
The main entry point for training is the train.py script. Model hyperparameters and training recipes can be specified almost entirely in hparams/default.yml or other YAML config files. For example, to run a ConvVit model experiment on the dataset prepared above:
python train.py $EXP_TAG $DATASET_DESCRIPTOR --hparams hparams/default.yml
where $EXP_TAG is the tag name of the experiment entry in the hparams YAML config file, e.g. conv_vit for the ConvVit model.
@article{Shi2022CV4CodeSU,
title={CV4Code: Sourcecode Understanding via Visual Code Representations},
author={Ruibo Shi and Lili Tao and Rohan Saphal and Fran Silavong and Sean J. Moran},
journal={ArXiv},
year={2022},
volume={abs/2205.08585}
}