Dataset, scripts, and additional material for the paper "Best-Answer Prediction in Technical Q&A Sites"
F. Calefato, F. Lanubile, and N. Novielli (2018) “An Empirical Assessment of Best-Answer Prediction Models in Technical Q&A Sites.” Empirical Software Engineering Journal, DOI: 10.1007/s10664-018-9642-5
@Article{Calefato2018,
author="Calefato, Fabio
and Lanubile, Filippo
and Novielli, Nicole",
title="An empirical assessment of best-answer prediction models in technical Q{\&}A sites",
journal="Empirical Software Engineering",
year="2018",
month="Aug",
day="07",
issn="1573-7616",
doi="10.1007/s10664-018-9642-5",
url="https://doi.org/10.1007/s10664-018-9642-5"
}
Original dumps refer to the data extracted "as is" from the following technical Q&A sites:
- Modern platforms
  - Stack Overflow (SO)
  - Yahoo! Answers (YA) (category: Programming & Design)
  - SAP Community Network (SCN) (topics: Hana, Mobility, NetWeaver, Cloud, OLTP, Streaming, Analytic, PowerBuilder, Cross, 3d, Frontend, ABAP)
- Legacy platforms
  - Docusign
  - Dwolla
The data dumps and the description of their file formats are available here.
The datasets containing the features extracted from the data dump of each Q&A site are available for download here. A description of each feature is also available.
To ensure proper execution, first run the following commands to check for and, if necessary, install all the required R and Python packages.
$ Rscript requirements.R
$ pip install -r requirements.txt
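For reference, the pattern behind such a check-and-install script can be sketched in R as follows. This is a minimal illustration, not the actual content of `requirements.R`; the package names are assumptions based on the tools mentioned in this README.

```r
# Minimal sketch of a check-and-install helper; the actual requirements.R
# may differ. Package names are assumptions based on this README.
required <- c("caret", "Boruta", "pROC")
missing  <- required[!required %in% rownames(installed.packages())]
if (length(missing) > 0) {
  install.packages(missing, repos = "https://cloud.r-project.org")
}
```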
To start the automated parameter tuning via `caret`, run the `run-tuning.sh` script as described below; a minimal `caret` sketch follows the note at the end of this step.
$ run-tuning.sh models_file data_file
- The `models_file` param indicates the file containing a list of models (learners) to be tuned, one per line. See the file `models/models.txt` for an example.
- The `data_file` param indicates the file containing the data to be used for the tuning stage.
- As output, a TXT file will be created under the `output/tuning/` subfolder for each tuned model, containing the best param configuration and execution times.
Note. The tuning step is very time-consuming and will take several hours for each model; the more models in the input file, the longer the script will take to finish.
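The sketch below shows what a per-model tuning stage with `caret` can look like. It is a hedged illustration, not the repository's actual code: the CSV path, the outcome column name `solution`, the learner `rf`, and the 10-fold cross-validation setup are all assumptions.

```r
library(caret)

# Sketch of per-model parameter tuning with caret (assumptions: the CSV path,
# the outcome column 'solution', and 10-fold CV as the resampling scheme).
data <- read.csv("path/to/data_file.csv")
data$solution <- factor(data$solution)  # caret needs a factor outcome; with
                                        # classProbs the levels must be valid R names
ctrl <- trainControl(method = "cv", number = 10,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)  # reports ROC (AUC)
fit <- train(solution ~ ., data = data,
             method = "rf",        # one learner name read from models_file
             metric = "ROC",       # pick the configuration maximizing AUC
             tuneLength = 5,       # size of the grid caret explores
             trControl = ctrl)
fit$bestTune                       # best param configuration, as stored under output/tuning/
```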
To compute the AUC performance with the default parameter settings, run the script below (a minimal sketch of an untuned run follows the note).
$ sh run-default-predictions.sh path/to/input/so-dataset.csv path/to/models/models.txt
- As output, the file `output/untuned/AUC-all-models.txt` will be created with the AUC values.
Note. The prediction step is very time-consuming and will take several hours to complete; the more models in the input file and the larger the chosen dataset, the longer the script will take to finish.
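A minimal sketch of such an untuned run is shown below; the dataset path, the outcome column `solution`, and the 80/20 split are assumptions for illustration, not the scripts' actual settings.

```r
library(caret)
library(pROC)

# Sketch of an untuned run: fit one learner with its default parameters on a
# train/test split and report the AUC on the held-out data.
data <- read.csv("path/to/input/so-dataset.csv")
data$solution <- factor(data$solution)
idx <- createDataPartition(data$solution, p = 0.8, list = FALSE)
fit <- train(solution ~ ., data = data[idx, ],
             method = "rf",                              # one learner from models.txt
             tuneLength = 1,                             # single (default) configuration
             trControl = trainControl(method = "none"))  # no resampling, no tuning
probs <- predict(fit, newdata = data[-idx, ], type = "prob")
auc(roc(data$solution[-idx], probs[[2]]))                # AUC on the held-out split
```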
To cluster models by AUC performance into non-overlapping groups, run the following scripts:
$ python collect-metrics.py --in path/to/metrics/folder.txt --out outfile --ext file_extension --sep field_sep --runs N
- `path/to/metrics/folder.txt` - where the tuning script stored the execution log per model for each run
- `outfile` - the name of the file where to store the following main metrics per model per run:
  - AUC
  - F1
  - G-mean
  - Balance
  - Time taken
- `file_extension` - the extension of the output file, chosen in `{txt, csv, xls}`
- `field_sep` - the character used to separate fields in the output file, either `,` or `;`
- `N` - the number of runs used in the tuning step (e.g., 10, 100)
$ Rscript skesd-test.R metrics_outfiles runsN
- `metrics_outfiles` - the file with the metrics generated by the Python script at the previous step
- `runsN` - the number of runs; must match the same param from the previous step
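A minimal sketch of this clustering step is shown below, assuming the `ScottKnottESD` R package and a metrics file with one column per model and one row per run (both assumptions, since the actual `skesd-test` implementation is not reproduced here).

```r
library(ScottKnottESD)

# Sketch of the Scott-Knott ESD clustering over the collected AUC values.
# Assumption: the metrics file has one column per model and one row per run.
aucs <- read.csv("metrics_outfile.csv")  # placeholder file name
sk <- sk_esd(aucs)                       # cluster models into non-overlapping ranks
sk$groups                                # the rank (group) assigned to each model
```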
The following script performs wrapper-based feature selection using the R package `Boruta`; for the sake of completeness, it also performs Correlation-based Feature Selection (CFS). A minimal `Boruta` sketch follows the parameter list below.
$ Rscript feature-selection.R dataset_file dataset_name featN
- `dataset_file` - the dataset used for feature selection
- `dataset_name` - the name of the dataset, chosen in `{so, docusign, dwolla, scn, yahoo}`; `so` by default
- `featN` - the number of features to select, 10 by default
- As output, the script will generate the file `output/feature-selection/feature-subset.txt` containing:
  - The output of `Boruta`
  - The output of `CFS`, with both Spearman and Pearson correlation values
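The sketch below illustrates the `Boruta` step; the dataset path and the outcome column name `solution` are assumptions, and the actual `feature-selection.R` may differ.

```r
library(Boruta)

# Sketch of wrapper-based feature selection with Boruta (path and outcome
# column name are assumptions).
data <- read.csv("path/to/dataset_file.csv")
data$solution <- factor(data$solution)
set.seed(42)                                     # Boruta is randomized
b <- Boruta(solution ~ ., data = data)
print(b)                                         # confirmed/tentative/rejected features
getSelectedAttributes(b, withTentative = FALSE)  # the selected feature subset
```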
Once the models have been tuned, you can execute the best-answer prediction experiment. Run the `run-predictions.sh` script as described below; a minimal evaluation sketch follows the output list.
$ run-predictions.sh training_file models_file data_file
- The `training_file` param indicates the file containing the dataset for training the learners.
- The `models_file` param indicates the file containing (one per line) a list of models (learners) to be used in the prediction experiment.
- The `data_file` param indicates the file containing the test dataset.
- As output, the following folders and files will be created:
  - `output/cm` - containing a TXT file for each test set and model with the confusion matrix
  - `output/misclassifications` - containing a TXT file for each test set and model listing the cases where wrong predictions (errors) occurred
  - `output/plots` - containing a ROC plot image file for each test set and model specified as input
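The sketch below illustrates the per-model evaluation. It uses caret's bundled `GermanCredit` data as a stand-in for a real training/test pair, so everything except the three outputs it mirrors is an assumption rather than the actual scripts' behavior.

```r
library(caret)
library(pROC)

# Sketch of the per-model evaluation: confusion matrix (-> output/cm),
# misclassified cases (-> output/misclassifications), and a ROC plot
# (-> output/plots). GermanCredit is a stand-in for a real dataset.
data(GermanCredit)
idx  <- createDataPartition(GermanCredit$Class, p = 0.8, list = FALSE)
fit  <- train(Class ~ ., data = GermanCredit[idx, ],
              method = "rf", tuneLength = 1,
              trControl = trainControl(method = "none"))
test  <- GermanCredit[-idx, ]
preds <- predict(fit, newdata = test)
probs <- predict(fit, newdata = test, type = "prob")

confusionMatrix(preds, test$Class)       # -> output/cm
which(preds != test$Class)               # -> output/misclassifications
plot(roc(test$Class, probs[["Good"]]))   # -> output/plots
```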
Note. Before running the prediction experiment, the file `test.R` must be manually edited to customize the `tuneGrid` var (a `data.frame`) containing the best param configuration for each learner model. As of now, the script contains the grids for the 4 models in the file `models/top-cluster.txt`. A hypothetical example of such a grid is shown below.
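This is a sketch, not the repository's actual grids: the learner and the parameter value are invented for illustration, and the real values come from the tuning output.

```r
library(caret)

# Hypothetical tuneGrid for a random forest learner; the value of mtry is
# invented and should instead come from the files under output/tuning/.
rf_grid <- expand.grid(mtry = 8)
# then passed to caret as:
# train(..., method = "rf", tuneGrid = rf_grid,
#       trControl = trainControl(method = "none"))
```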
When executed without running the .sh files (e.g., via RStudio or `Rscript`), these scripts by default open the test file `input/example.csv`, which contains a few hundred lines from the Stack Overflow dataset. This test file is intended to show how the scripts work in general and the output they produce. Beware of the considerably longer execution times when running the scripts with the other input files.