CoderEval is a pragmatic code generation benchmark for evaluating the performance of generative pre-trained models. Compared with the widely used HumanEval benchmark from OpenAI, CoderEval evaluates models on pragmatic code generation beyond generating standalone functions.
CoderEval currently supports Python and Java, with 230 functions from 43 Python projects and 230 methods from 10 Java projects. For each function/method, we extract the original docstring/comment, the signature, the code implementation, and the corresponding test code (if it exists) to form one function-level code generation task. CoderEval can be viewed as a superset of HumanEval.
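For illustration, a single CoderEval task can be pictured as a record like the one below. This is a minimal sketch: the field names and values are assumptions used for explanation, not the exact schema of the released files.

```python
# Illustrative sketch of one CoderEval task record.
# Field names and values are assumptions, not the exact released schema.
example_task = {
    "project": "some_org/some_repo",               # hypothetical source project
    "name": "parse_host",                          # target function name
    "docstring": "Parse a host string into (hostname, port).",
    "signature": "def parse_host(host_str):",
    "code": "def parse_host(host_str):\n    ...",  # reference implementation
    "level": "self_contained",                     # one of the six runnable-levels
    "test": "assert parse_host('a:80') == ('a', 80)",  # original developer test, if any
}
```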
To make CoderEval pragmatic and diverse, we select code generation tasks from functions that are tested by the original developers of various open-source projects. First, we select candidate projects by crawling the tags of all projects on GitHub and keeping high-star projects under the 14 most frequent tags. Then, we extract all functions in the selected projects and keep only those that meet all of the following conditions: 1) not a test, interface, or deprecated function; 2) has a function-level comment in English; 3) runs successfully on the verification platform and passes the original test cases. After filtering, we rank the projects in descending order by the number of retained functions they contain. Finally, we include 230 functions from 43 Python projects and 230 methods from 10 Java projects, which together form the current version of CoderEval.
To gain insights about the functions that really matter in real open-source projects (as indicated by testing from the original developers), we analyze the dataset in three aspects: 1) natural language description or task specification, 2) contextual dependency, and 3) runnable-level.
To study the effect of different prompts and mitigate the memorization effect of language models, we recruit 13 professional software engineers with at least 3 years of Python/Java experience, show them the code without the original comments, and ask them to provide a human_labeled version of the function description through a two-fold cross-validation process.
One of the major differences between HumanEval and CoderEval is that we take contextual dependency into consideration. When generating function-level code, a single mis-referenced token (e.g., a call to a helper that does not exist) often causes the whole generated function to fail. Therefore, recording the contextual information actually used in the function allows CoderEval to verify a model's awareness of context in a more fine-grained way.
Based on the contextual dependency, we further classify the code generation tasks into six runnable-levels, i.e., the scope in which the function can run successfully: self_contained, slib_runnable, plib_runnable, class_runnable, file_runnable, and project_runnable.
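As a hedged illustration (these toy functions are not taken from the benchmark), the sketch below contrasts three of the levels: the first function depends on nothing beyond the language itself, the second only on the standard library, and the third on a hypothetical module defined elsewhere in its project, so a model must be aware of that project context to generate it correctly.

```python
# Illustrative toy examples (not from CoderEval) of three runnable-levels.

def clamp(value, low, high):
    # self_contained: uses only built-in language features,
    # so it runs anywhere without imports or project context.
    return max(low, min(value, high))


def now_iso():
    # slib_runnable: depends only on the Python standard library.
    import datetime
    return datetime.datetime.utcnow().isoformat()


def load_user(user_id):
    # project_runnable: depends on symbols defined elsewhere in the
    # project ("myproject.db" is a hypothetical module used here for
    # illustration), so it only runs inside the full project environment.
    from myproject.db import get_connection
    conn = get_connection()
    return conn.query("SELECT * FROM users WHERE id = ?", user_id)
```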
Different from HumanEval, we need an evaluation platform that provides a ready runtime environment together with automatic programs to execute and verify the code generated by models. We choose to base it on a Linux Docker image, which provides a virtual and safe sandbox, enables easy duplication, and prevents harmful execution.
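A minimal sketch of how such a sandboxed run might look with the Docker SDK for Python is shown below; the image name, mount path, and command are placeholders for illustration, not the actual CoderEval setup.

```python
# Hedged sketch: run one evaluation job inside an isolated container.
# Image name, paths, and command are hypothetical placeholders.
import docker

client = docker.from_env()

logs = client.containers.run(
    image="codereval-python:latest",          # hypothetical evaluation image
    command="python /workspace/run_task.py",  # hypothetical entry script
    volumes={"/host/generations": {"bind": "/workspace/generations", "mode": "ro"}},
    network_disabled=True,                    # keep the sandbox isolated
    remove=True,                              # clean up after the run
)
print(logs.decode())
```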
To evaluate generated Python code, we need to clone and set up environments for all 43 projects. To avoid Python and library version conflicts, under each repository's root directory, we first use pyenv to set the local Python version to the highest version specified in the CI configuration or documentation, then use venv to create an individual virtual environment for the project. After that, we use pip to install all dependencies and trigger the original tests in the project to verify the runtime environment.

With the environment ready and given a model to evaluate, we write a program to automatically inject the generated code into the project, correctly invoke the function with test input, and compare the actual output with the expected output. To invoke the target function correctly, the program generates a test entry point that sets up the prerequisites (such as initializing the object before calling a member function) and loads the input/output data into memory via deserialization. Given n code snippets generated by the model, it then sequentially replaces the original function with each of them, simulating the scenario where a developer accepts the generated code and tests it in an actual IDE. After each replacement, the entry point code is triggered, and the running output (e.g., return values, exceptions, errors, and console output) is captured. If the function returns without an error or exception, the platform compares the actual output with the expected output; otherwise, the unexpected termination is treated as a test failure. After all code snippets for all tasks are tested, the results are used to compute the final Pass@K metric.
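As a minimal sketch (not the actual platform code), the following shows the replace-run-restore loop for one candidate and the standard unbiased Pass@K estimator introduced with Codex. The timeout, the file-level backup, and the assumption that candidate_code is the already-patched file content (the real platform splices in only the target function) are simplifications for illustration.

```python
# Hedged sketch of the replace-and-test loop and the unbiased pass@k estimator.
import shutil
import subprocess
from math import comb

def run_candidate(project_dir, target_file, candidate_code, entry_point_cmd):
    """Swap one generated snippet into the project, trigger the test entry
    point, and report pass/fail. candidate_code is assumed here to be the
    full patched file content (a simplification)."""
    backup = target_file + ".orig"
    shutil.copyfile(target_file, backup)                # keep the original implementation
    try:
        with open(target_file, "w") as f:
            f.write(candidate_code)                     # simulate the developer accepting the code
        result = subprocess.run(
            entry_point_cmd, cwd=project_dir,
            capture_output=True, timeout=60,
        )
        return result.returncode == 0                   # error/exception => non-zero => failure
    except subprocess.TimeoutExpired:
        return False
    finally:
        shutil.move(backup, target_file)                # restore the project state

def pass_at_k(n, c, k):
    """Unbiased pass@k: expected probability that at least one of k samples
    drawn from n generations (c of which are correct) passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```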
Similar to the evaluation platform for Python, we first clone the involved projects. We ensure that all projects in CoderEval for Java can be executed with Java 1.8, so we do not need to install multiple versions of the Java runtime. Unlike Python, Java programs need to be compiled before execution, so we write a program that compiles the test file (i.e., NewTestFile) in advance and then dynamically replaces and recompiles the file that the method under test belongs to. With the environment ready and given a model to evaluate, the platform uses the javac command to incrementally compile the changed files and the java command to execute the bytecode of NewTestFile. Given n methods generated by the model, the platform automatically replaces the original method with each of them and tries to recompile the residing file via javac. If the compilation fails, the test is treated as a failure. If there is no error or exception after compilation, the platform invokes the java command to execute NewTestFile, and the return value of the command directly indicates the behavioral correctness of the generated code.
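A hedged sketch of this compile-then-run step, driven from Python for consistency with the other examples, might look as follows; the classpath layout, paths, and class name handling are assumptions, not the actual platform configuration.

```python
# Hedged sketch: incremental compilation and execution of NewTestFile.
# Paths, classpath, and file names are hypothetical placeholders.
import subprocess

def compile_and_run(project_dir, changed_files, classpath="target/classes"):
    # Recompile only the changed files (the residing file of the target
    # method plus NewTestFile), mirroring the incremental javac step.
    compile_cmd = ["javac", "-cp", classpath, "-d", classpath] + changed_files
    if subprocess.run(compile_cmd, cwd=project_dir, capture_output=True).returncode != 0:
        return False                          # compilation failure counts as a failed test

    # Execute the compiled test entry; its exit code indicates correctness.
    run_cmd = ["java", "-cp", classpath, "NewTestFile"]
    result = subprocess.run(run_cmd, cwd=project_dir, capture_output=True, timeout=60)
    return result.returncode == 0
```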
We open source CoderEval for Python and CoderEval for Java as JSON files. Users can import the data into MongoDB (e.g., via mongoimport) to browse it, or read the JSON files directly.
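For example, the released files can be inspected with the standard json module; the file name and field names below are assumptions used for illustration rather than the exact released schema.

```python
# Hedged sketch: inspect the released benchmark file directly.
# File name and field names are assumptions, not the exact released schema.
import json

with open("CoderEval4Python.json") as f:
    tasks = json.load(f)

for task in tasks[:3]:
    print(task.get("name"), "-", task.get("level"))

# Alternatively, import into MongoDB for querying, e.g.:
#   mongoimport --db codereval --collection python --file CoderEval4Python.json --jsonArray
```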
PanGu-Coder is a pre-trained language model for text-to-code generation. It is based on the PanGu-$\alpha$ architecture and adopts a two-stage training strategy. In the first stage, PanGu-Coder was trained on Python code via Causal Language Modeling (CLM); all available data were concatenated and split at a given maximum length (or window size). In the second stage, PanGu-Coder was trained on docstring-code pairs. PanGu-Coder-FT is obtained by fine-tuning PanGu-Coder on a combination of competitive programming problems and code from continuous integration tests. Compared with PanGu-Coder, PanGu-Coder-FT shows better performance on HumanEval.
CodeGen is a family of conversational text-to-code large language models trained in three stages, producing three models. The first stage produced CodeGen-NL, trained on a natural language dataset named The Pile. The second stage produced CodeGen-Multi, further trained on a multi-programming-language dataset named BigQuery. The third stage produced CodeGen-Mono, built upon CodeGen-Multi with additional training on Python-only code.
Codex is the first work to use large generative pre-trained models to generate complete functions from natural language. Since Codex is not open-sourced and OpenAI does not disclose the parameter scale of the served model (the Codex paper describes a 300M-parameter version), we use code-davinci-002 in our experiments.
Due to the particularity of the test cases in CoderEval (they are embedded in the complete test code of each project) and the regulations of our company, we cannot publish them for now, but we will provide an API service soon.