GHPR Dataset: A Dataset for Software Defect Prediction

The GHPR dataset is used in our empirical studies and evaluation. We identify 3026 bug fixing based on Pull Requests(PRs) in Github. Each bug fixing is treated as a record in the dataset.

From the view of supervised learning, we can consider the defective and fixed file version in each record as two learning instances. the GHPR dataset totally have 6052 learning instances which contain 3026 defective instances and 3026 non-defective instances.

Data Format

Two publication formats are provided in ghprdata.zip:

ghprdata.csv
ghprdata.sql

The CSV file is simple and well-supported in Python using numpy or pandas. Because the databases support large datasets better than CSV files do, we also provide a SQL file of which the character set is utf-8.

Data Feature

Each record includes the following 16 features.

Feature	Description
PROJECT_NAME	The name of the project
PROJECT_OWNER	The owner of the project
PROJECT_DESCRIPTION	The description of the project provided by the owner
PROJECT_LABEL	The label of the project provided by the owner
PROJECT_LANGUAGE	The programming language of the project
SHA_FIXED	The ID of the version after the defect being fixed
SHA_BUG	The ID of the version before the defect being fixed
DIFF_CODE	The defect-related code we recognize from the content of the fixed file
COMMIT_DESCRIPTION	The description of the commit in the defect-related PR
COMMIT_TIME	The time of the commit in the defect-related PR
OLD_CONTENT	The content of the defect-related file before defects being fixed
NEW_CONTENT	The content of the defect-related file after defects being fixed
OLD_PATH	The old path of the changed files
NEW_PATH	The new path of the changed files
PR_TITLE	The title of the defect related PR
PR_DESCRIPTION	The description of the defect related PR

baseline

We calculate 21 static metrics for the total 6052 instances in GHPR dataset, which is the data used for the baseline approaches. All metrics were calculated by the open-source tool mauricioaniche/ck. The description of the metrics as fellow.

Feature	Description
CBO	Coupling between objects. Counts the number of dependencies a class has.
WMC	Weight Method Class or McCabe's complexity. It counts the number of branch instructions in a class.
DIT	Depth Inheritance Tree. It counts the number of "fathers" a class has. All classes have DIT at least 1 (everyone inherits java.lang.Object).
rfc	Response for a Class. Counts the number of unique method invocations in a class.
lcom	Lack of Cohesion of Methods. Calculates LCOM metric.
totalMethods	Counts the number of methods.
totalFields	Counts the number of fields.
NOSI	Number of static invocations. Counts the number of invocations to static methods.
LOC	Lines of code. It counts the lines of count, ignoring empty lines.
returnQty	Quantity of returns. The number of return instructions.
loopQty	Quantity of loops. The number of loops (i.e., for, while, do while, enhanced for).
comparisonsQty	Quantity of comparisons. The number of comparisons (i.e., == and !=).
tryCatchQty	Quantity of try/catches. The number of try/catches.
parenthesizedExpsQty	Quantity of parenthesized expressions. The number of expressions inside parenthesis.
stringLiteralsQty	String literals. The number of string literals (e.g., "John Doe").
numbersQty	Quantity of Number. The number of numbers (i.e., int, long, double, float) literals.
assignmentsQty	Quantity of Variables. Number of declared variables.
mathOperationsQty	Quantity of Math Operations: The number of math operations (times, divide, remainder, plus, minus, left shit, right shift).
variablesQty	Quantity of Variables. Number of declared variables.
maxNestedBlocks	Max nested blocks. The highest number of blocks nested together.
uniqueWordsQty	Number of unique words. Number of unique words in the source code.

References

Please cite our paper if you use this dataset in your publication: @ARTICLE{***, author={Jiaxi Xu;Fei Wang;Jun Ai}, journal={IEEE Transactions on Reliability}, title={Defect Prediction With Semantics and Context Features of Codes Based on Graph Representation Learning}, year={2021}, }

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
GHPR.zip		GHPR.zip
README.md		README.md
baseline.csv		baseline.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

GHPR Dataset: A Dataset for Software Defect Prediction

Data Format

Data Feature

baseline

References

About

Releases

Packages

Contributors 2

feiwww/GHPR_dataset

Folders and files

Latest commit

History

Repository files navigation

GHPR Dataset: A Dataset for Software Defect Prediction

Data Format

Data Feature

baseline

References

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Packages