This repository conatins the code for the paper CoCoMIC: Code Completion By Jointly Modeling In-file and Cross-file Context at LREC-COLING 2024.
You may also want to check out our other paper on CrossCodeEval at NeurIPS 2023.
We propose CoCoMIC, a framework that incorporates cross-file context to learn the in-file and cross-file context jointly on top of pretrained code LMs.
CCFinder parses the project hierarchy and code components to extract project information. CCFinder further builds a project context graph to represent the details of each component and the interactions among them. When an incomplete program requests completion, the tool will retrieve the neighbors of the pinpointed entities from the graph as the cross-file context of the current file.
We propose a novel model architecture built on top of existing code LMs with joint attention to in-file and retrieved cross-file context. First, the model will compress cross-file context and build its representations. Second, when generating code completion, the model will attend to both the compressed cross-file context and the concrete in-file context.
conda create -n cocomic python=3.9.13
conda activate cocomic
pip install torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
pip install -r requirements.txt
mkdir ts_package;
cd ts_package;
# Download the tree-sitter package
git clone https://github.com/tree-sitter/tree-sitter-python.git;
# checkout to 0.20.0 version
git checkout 2b9e9e0d231d5dd9f491d47f704817baee7d5af0;
cd ..;
# Build tree-sitter
python build_ts_lib.py
If the the commands finishes successfully, you should be able to see a ccfinder/cc_extractor/build/
directory now under the package root, which contains a .so
file
Input
The input is a folder containing the source code of a Python project.
Command
export PYTHONPATH=$(pwd)/ccfinder;
python ccfinder/build_crossfile_context.py --input_project <PATH_TO_PROJECT_FOLDER> --output_dir <OUTPUT_FOLDER>
Output
|---<OUTPUT_FOLDER>
|---<PROJECT_NAME>_project_context.json # Hierarchical Project Context
|---<PROJECT_NAME>_project_context.node # List of Project Entities
|---<PROJECT_NAME>_project_context.edge.jsonl # List of Entity Relations
|---<PROJECT_NAME>_project_context.graph # Intermediate Graph Structure
|---<PROJECT_NAME>_project_context.graph.adj.pk # Intermediate Graph Structure in the format of Adjacency Matrix
|---<PROJECT_NAME>_retrieved_nodes.json # Retrieved Project Entities for each file in the project
Input
When you have a list of prompts for a specific project to complete, please format them with the following attribute and save them in a <PROJECT_NAME>_prompts.jsonl
file, where each line corresponds to one sample
{"prompt":"...<IN-FILE CONTEXT>...","groundtruth":"...","metadata":{"file":"<ABSOLUTE_FILE_PATH>"}}
Command
python ccfinder/create_sample_w_cfc.py \
--retrieved_entity_file <PATH_TO>/<PROJECT_NAME>_retrieved_nodes.json \
--prompt_file <PATH_TO>/<PROJECT_NAME>_prompts.jsonl \
--output_file <PATH_TO>/<PROJECT_NAME>_prompts_with_cfc.jsonl
Output
The output will be a .jsonl
file, and each line corresponds to a sample with its cross-file context in the following format
{
"prompt": " ... ",
"groundtruth": " ... ",
"retrieved_nodes": [
" ...<PROJECT_ENTITY>... ",
],
"retrieved_edges": [
["<ENTITY_IDX>", "<RELATION_TYPE>", "<RELATION__TYPE_IDX>", "<ENTITY_IDX>"],
],
"metadata": {
"file":"<ABSOLUTE_FILE_PATH>"
}
}