From 5ff15fe5d85a0cd2022ff05544348acc1634a089 Mon Sep 17 00:00:00 2001 From: yueyueL Date: Wed, 13 Dec 2023 22:52:52 +1100 Subject: [PATCH] xai --- docs/venus.md | 2 +- docs/xai_lm4code.md | 210 ++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 211 insertions(+), 1 deletion(-) create mode 100644 docs/xai_lm4code.md diff --git a/docs/venus.md b/docs/venus.md index 1aa604d..9e58b55 100644 --- a/docs/venus.md +++ b/docs/venus.md @@ -1,7 +1,7 @@ --- layout: default title: Relevant Venues -nav_order: 5 +nav_order: 6 --- # LM4Code Venues {: .no_toc } diff --git a/docs/xai_lm4code.md b/docs/xai_lm4code.md new file mode 100644 index 0000000..78c226e --- /dev/null +++ b/docs/xai_lm4code.md @@ -0,0 +1,210 @@ +--- +layout: default +title: XAI for LM4Code +nav_order: 5 +--- +# Explainable AI for LM4Code +{: .no_toc } + +## Table of contents +{: .no_toc .text-delta } + +1. TOC +{:toc} + +--- + + +## Surveys for General XAI +- Interpretable Machine Learning, 2020, [paper](https://christophm.github.io/interpretable-ml-book/) +- Explainable AI (XAI): Core ideas, techniques, and solutions, CSUR 2023, [paper](https://orca.cardiff.ac.uk/id/eprint/152361/1/3561048.pdf) +- Techniques for interpretable machine learning, Communications of the ACM, 2019, [paper](https://dl.acm.org/doi/fullHtml/10.1145/3359786) +- Post-hoc interpretability for neural nlp: A survey, CSUR 2023, [paper](https://dl.acm.org/doi/pdf/10.1145/3546577) +- Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead, Nature Machine Intelligence, 2019, [paper](https://www.nature.com/articles/s42256-019-0048-x) +- Interpretable machine learning: definitions, methods, and applications, 2019, [paper](https://arxiv.org/pdf/1901.04592.pdf?fbclid=IwAR2frcHrhLc4iaH5-TmKKq263NVvAKHtG4uQoiVNDeLAG3QFzdje-yzZjiQ) +- Evaluating explanation methods for deep learning in security, EuroS&P 2020, [paper](https://ieeexplore.ieee.org/abstract/document/9230374/) +- A survey of methods for explaining black box models, ACM Computing Surveys, 2020, [paper](https://dl.acm.org/doi/abs/10.1145/3236009) +- On the explainability of natural language processing deep models, CSUR, 2022, [paper](https://dl.acm.org/doi/abs/10.1145/3529755) +- From Anecdotal Evidence to Quantitative Evaluation Methods: A Systematic Review on Evaluating Explainable AI, CSUR, 2023, [paper](https://dl.acm.org/doi/10.1145/3583558), [repository](https://github.com/utwente-dmb/xai-papers) + +## Surveys for XAI for LM4Code +- A systematic literature review of explainable AI for software engineering, 2023, [paper](https://arxiv.org/abs/2302.06065) +- Explainable ai for se: Challenges and future directions, 2023, [paper](https://ieeexplore.ieee.org/abstract/document/10109341/) + +## Repositories for General XAI +- Awesome-explainable-AI + - [Repository](https://github.com/wangyongjie-ntu/Awesome-explainable-AI) + - A collection of research materials on explainable AI/ML +- awesome-machine-learning-interpretability + - [Repository](https://github.com/jphall663/awesome-machine-learning-interpretability) + - A maintained and curated list of practical and awesome responsible machine learning resources. +- lopusz/awesome-interpretable-machine-learning + - [Repository](https://github.com/lopusz/awesome-interpretable-machine-learning) + - Opinionated list of resources facilitating model interpretability (introspection, simplification, visualization, explanation). 
+- Explainable Artificial Intelligence + - [Repository](https://interpretable-ml-class.github.io/) + - A course on Explainable Artificial Intelligence (XAI) and Interpretable Machine Learning (IML) at Harvard University +- utwente-dmb/xai-papers + - [Repository](https://github.com/utwente-dmb/xai-papers) + - This is an exploration and visualisation website for a categorization of explainable AI papers, hosted on GitHub Pages. The initial set of XAI papers was collected and labelled by Nauta et al. (2023) as part of a large-scale literature review on the evaluation of Explainable AI, published in ACM Computing Surveys. +- samzabdiel/XAI + - [Repository](https://github.com/samzabdiel/XAI) + - Papers and code of Explainable AI, especially with respect to image classification + +## Tools for XAI +- shap + - [Repository](https://github.com/shap/shap) + - SHAP (SHapley Additive exPlanations) is a game theoretic approach to explain the output of any machine learning model. It connects optimal credit allocation with local explanations using the classic Shapley values from game theory and their related extensions. + - Paper: [A Unified Approach to Interpreting Model Predictions](https://arxiv.org/abs/1705.07874) +- marcotcr/lime + - [Repository](https://github.com/marcotcr/lime) + - Lime is about explaining what machine learning classifiers (or models) are doing. At the moment, it supports explaining individual predictions for text classifiers or classifiers that act on tables (numpy arrays of numerical or categorical data) or images, with a package called lime (short for local interpretable model-agnostic explanations). + - Paper: [“Why Should I Trust You?” Explaining the Predictions of Any Classifier](https://arxiv.org/abs/1602.04938) +- marcotcr/anchor + - [Repository](https://github.com/marcotcr/anchor) + - Code for "High-Precision Model-Agnostic Explanations" paper + - Paper: [High-Precision Model-Agnostic Explanations](https://homes.cs.washington.edu/~marcotcr/aaai18.pdf) +- Transformer-Explainability + - [Repository](https://github.com/hila-chefer/Transformer-Explainability) + - Official PyTorch implementation for Transformer Interpretability Beyond Attention Visualization, a novel method to visualize classifications by Transformer based networks. +- Ecco + - [Repository](https://github.com/jalammar/ecco) + - Explain, analyze, and visualize NLP language models. Ecco creates interactive visualizations directly in Jupyter notebooks explaining the behavior of Transformer-based language models (like GPT2, BERT, RoBERTA, T5, and T0). +- Captum + - [Repository](https://github.com/pytorch/captum) + - Captum is a model interpretability and understanding library for PyTorch. Captum means comprehension in Latin and contains general purpose implementations of integrated gradients, saliency maps, smoothgrad, vargrad and others for PyTorch models. It has quick integration for models built with domain-specific libraries such as torchvision, torchtext, and others. +- AIX360 + - [Repository](https://github.com/Trusted-AI/AIX360) + - The AI Explainability 360 toolkit is an open-source library that supports interpretability and explainability of datasets and machine learning models. The AI Explainability 360 Python package includes a comprehensive set of algorithms that cover different dimensions of explanations along with proxy explainability metrics. The AI Explainability 360 toolkit supports tabular, text, images, and time series data.
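+
+The tools above are model-agnostic enough to be pointed at LM4Code-style classifiers directly. As a quick orientation, the sketch below runs SHAP's `KernelExplainer` over a toy TF-IDF + logistic-regression snippet classifier; the snippets, labels, and feature handling are invented for illustration and are not taken from any listed paper or tool documentation.
+
+```python
+# Minimal sketch, assuming a toy "risky vs. benign" code-snippet classifier.
+import shap
+from sklearn.feature_extraction.text import TfidfVectorizer
+from sklearn.linear_model import LogisticRegression
+
+snippets = [
+    "strcpy(buf, user_input);",                    # hypothetical risky call
+    "strncpy(buf, user_input, sizeof(buf) - 1);",  # bounded copy
+    "system(user_cmd);",                           # hypothetical risky call
+    "printf(\"%s\", user_cmd);",                   # benign formatting
+]
+labels = [1, 0, 1, 0]  # toy labels: 1 = risky, 0 = benign
+
+# Fit a simple bag-of-tokens classifier over the snippets.
+vectorizer = TfidfVectorizer(token_pattern=r"\w+")
+X = vectorizer.fit_transform(snippets).toarray()
+clf = LogisticRegression().fit(X, labels)
+
+# KernelExplainer treats the classifier as a black box: it perturbs the
+# TF-IDF features and estimates each token's Shapley contribution.
+explainer = shap.KernelExplainer(clf.predict_proba, X)
+shap_values = explainer.shap_values(X[:1])
+
+print(vectorizer.get_feature_names_out())
+print(shap_values)  # per-token contributions to each class probability
+```
+
+LIME's `LimeTextExplainer` and Captum's `IntegratedGradients` cover the same role for raw-text pipelines and PyTorch models, respectively.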
+ + + +## Papers for XAI4SE +- [DeepGauge: Multi-Granularity Testing Criteria for Deep Learning Systems](https://arxiv.org/pdf/1803.07519) + - ASE 2018 + - Abstract: Deep learning (DL) defines a new data-driven programming paradigm that constructs the internal system logic of a crafted neuron network through a set of training data. We have seen wide adoption of DL in many safety-critical scenarios. However, a plethora of studies have shown that the state-of-the-art DL systems suffer from various vulnerabilities which can lead to severe consequences when applied to real-world applications. Currently, the testing adequacy of a DL system is usually measured by the accuracy of test data. Considering the limitation of accessible high quality test data, good accuracy performance on test data can hardly provide confidence to the testing adequacy and generality of DL systems. Unlike traditional software systems that have clear and controllable logic and functionality, the lack of interpretability in a DL system makes system analysis and defect detection difficult, which could potentially hinder its real-world deployment. In this paper, we propose DeepGauge, a set of multi-granularity testing criteria for DL systems, which aims at rendering a multi-faceted portrayal of the testbed. The in-depth evaluation of our proposed testing criteria is demonstrated on two well-known datasets, five DL systems, and with four state-of-the-art adversarial attack techniques against DL. The potential usefulness of DeepGauge sheds light on the construction of more generic and robust DL systems. +- [Neural-augmented static analysis of Android communication](https://dl.acm.org/doi/pdf/10.1145/3236024.3236066) + - FSE 2018 + - Abstract: We address the problem of discovering communication links between applications in the popular Android mobile operating system, an important problem for security and privacy in Android. Any scalable static analysis in this complex setting is bound to produce an excessive amount of false-positives, rendering it impractical. To improve precision, we propose to augment static analysis with a trained neural-network model that estimates the probability that a communication link truly exists. We describe a neural-network architecture that encodes abstractions of communicating objects in two applications and estimates the probability with which a link indeed exists. At the heart of our architecture are type-directed encoders (TDE), a general framework for elegantly constructing encoders of a compound data type by recursively composing encoders for its constituent types. We evaluate our approach on a large corpus of Android applications, and demonstrate that it achieves very high accuracy. Further, we conduct thorough interpretability studies to understand the internals of the learned neural networks. +- [Operational calibration: debugging confidence errors for DNNs in the field](https://dl.acm.org/doi/pdf/10.1145/3368089.3409696) + - FSE 2019 + - Abstract: Trained DNN models are increasingly adopted as integral parts of software systems, but they often perform deficiently in the field. A particularly damaging problem is that DNN models often give false predictions with high confidence, due to the unavoidable slight divergences between operation data and training data. 
To minimize the loss caused by inaccurate confidence, operational calibration, i.e., calibrating the confidence function of a DNN classifier against its operation domain, becomes a necessary debugging step in the engineering of the whole system. Operational calibration is difficult considering the limited budget of labeling operation data and the weak interpretability of DNN models. We propose a Bayesian approach to operational calibration that gradually corrects the confidence given by the model under calibration with a small number of labeled operation data deliberately selected from a larger set of unlabeled operation data. The approach is made effective and efficient by leveraging the locality of the learned representation of the DNN model and modeling the calibration as Gaussian Process Regression. Comprehensive experiments with various practical datasets and DNN models show that it significantly outperformed alternative methods, and in some difficult tasks it eliminated about 71% to 97% high-confidence (>0.9) errors with only about 10% of the minimal amount of labeled operation data needed for practical learning techniques to barely work +- [From typestate verification to interpretable deep models (invited talk abstract)](https://doi.org/10.1145/3293882.3338992) + - ISSTA 2019 + - Abstract: The paper ``Effective Typestate Verification in the Presence of Aliasing'' was published in the International Symposium on Software Testing and Analysis (ISSTA) 2006 Proceedings, and has now been selected to receive the ISSTA 2019 Retrospective Impact Paper Award. The paper described a scalable framework for verification of typestate properties in real-world Java programs. The paper introduced several techniques that have been used widely in the static analysis of real-world programs. Specifically, it introduced an abstract domain combining access-paths, aliasing information, and typestate that turned out to be simple, powerful, and useful. We review the original paper and show the evolution of the ideas over the years. We show how some of these ideas have evolved into work on machine learning for code completion, and discuss recent general results in machine learning for programming. +- [Towards Robust Production Machine Learning Systems: Managing Dataset Shift](https://doi.org/10.1145/3324884.3415281) + - ASE 2020 + - Abstract: The advances in machine learning (ML) have stimulated the integration of their capabilities into software systems. However, there is a tangible gap between software engineering and machine learning practices, that is delaying the progress of intelligent services development. Software organisations are devoting effort to adjust the software engineering processes and practices to facilitate the integration of machine learning models. Machine learning researchers as well are focusing on improving the interpretability of machine learning models to support overall system robustness. Our research focuses on bridging this gap through a methodology that evaluates the robustness of machine learning-enabled software engineering systems. In particular, this methodology will automate the evaluation of the robustness properties of software systems against dataset shift problems in ML. It will also feature a notification mechanism that facilitates the debugging of ML components. +- [Explainable AI for Software Engineering](https://arxiv.org/pdf/2012.01614) + - ASE 2020 + - Abstract: The success of software engineering projects largely depends on complex decision-making. 
For example, which tasks should a developer do first, who should perform this task, is the software of high quality, is a software system reliable and resilient enough to deploy, etc. However, erroneous decision-making for these complex questions is costly in terms of money and reputation. Thus, Artificial Intelligence/Machine Learning (AI/ML) techniques have been widely used in software engineering for developing software analytics tools and techniques to improve decision-making, developer productivity, and software quality. However, the predictions of such AI/ML models for software engineering are still not practical (i.e., coarse-grained), not explainable, and not actionable. These concerns often hinder the adoption of AI/ML models in software engineering practices. In addition, many recent studies still focus on improving the accuracy, while a few of them focus on improving explainability. Are we moving in the right direction? How can we better improve the SE community (both research and education)?In this tutorial, we first provide a concise yet essential introduction to the most important aspects of Explainable AI and a hands-on tutorial of Explainable AI tools and techniques. Then, we introduce the fundamental knowledge of defect prediction (an example application of AI for Software Engineering). Finally, we demonstrate three successful case studies on how Explainable AI techniques can be used to address the aforementioned challenges by making the predictions of software defect prediction models more practical, explainable, and actionable. The materials are available at https://xai4se.github.io. +- [Real-time incident prediction for online service systems](https://doi.org/10.1145/3368089.3409672) + - FSE 2020 + - Abstract: Incidents in online service systems could dramatically degrade system availability and destroy user experience. To guarantee service quality and reduce economic loss, it is essential to predict the occurrence of incidents in advance so that engineers can take some proactive actions to prevent them. In this work, we propose an effective and interpretable incident prediction approach, called eWarn, which utilizes historical data to forecast whether an incident will happen in the near future based on alert data in real time. More specifically, eWarn first extracts a set of effective features (including textual features and statistical features) to represent omen alert patterns via careful feature engineering. To reduce the influence of noisy alerts (that are not relevant to the occurrence of incidents), eWarn then incorporates the multi-instance learning formulation. Finally, eWarn builds a classification model via machine learning and generates an interpretable report about the prediction result via a state-of-the-art explanation technique (i.e., LIME). In this way, an early warning signal along with its interpretable report can be sent to engineers to facilitate their understanding and handling for the incoming incident. An extensive study on 11 real-world online service systems from a large commercial bank demonstrates the effectiveness of eWarn, outperforming state-of-the-art alert-based incident prediction approaches and the practice of incident prediction with alerts. In particular, we have applied eWarn to two large commercial banks in practice and shared some success stories and lessons learned from real deployment. 
+- [AMS: generating AutoML search spaces from weak specifications](https://dl.acm.org/doi/pdf/10.1145/3368089.3409700) + - FSE 2020 + - Abstract: We consider a usage model for automated machine learning (AutoML) in which users can influence the generated pipeline by providing a weak pipeline specification: an unordered set of API components from which the AutoML system draws the components it places into the generated pipeline. Such specifications allow users to express preferences over the components that appear in the pipeline, for example a desire for interpretable components to appear in the pipeline. We present AMS, an approach to automatically strengthen weak specifications to include unspecified complementary and functionally related API components, populate the space of hyperparameters and their values, and pair this configuration with a search procedure to produce a strong pipeline specification: a full description of the search space for candidate pipelines. ams uses normalized pointwise mutual information on a code corpus to identify complementary components, BM25 as a lexical similarity score over the target API's documentation to identify functionally related components, and frequency distributions in the code corpus to extract key hyperparameters and values. We show that strengthened specifications can produce pipelines that outperform the pipelines generated from the initial weak specification and an expert-annotated variant, while producing pipelines that still reflect the user preferences captured in the original weak specification. +- [TauJud: test augmentation of machine learning in judicial documents](https://doi.org/10.1145/3395363.3404364) + - ISSTA 2020 + - Abstract: The booming of big data makes the adoption of machine learning ubiquitous in the legal field. As we all know, a large amount of test data can better reflect the performance of the model, so the test data must be naturally expanded. In order to solve the high cost problem of labeling data in natural language processing, people in the industry have improved the performance of text classification tasks through simple data amplification techniques. However, the data amplification requirements in the judgment documents are interpretable and logical, as observed from CAIL2018 test data with over 200,000 judicial documents. Therefore, we have designed a test augmentation tool called TauJud specifically for generating more effective test data with uniform distribution over time and location for model evaluation and save time in marking data. The demo can be found at https://github.com/governormars/TauJud.
+- [Vulnerability detection with fine-grained interpretations](https://arxiv.org/pdf/2106.10478) + - FSE 2021 + - Abstract: Despite the successes of machine learning (ML) and deep learning (DL)-based vulnerability detectors (VD), they are limited to providing only the decision on whether a given code is vulnerable or not, without details on what part of the code is relevant to the detected vulnerability.
We present IVDetect, an interpretable vulnerability detector with the philosophy of using Artificial Intelligence (AI) to detect vulnerabilities, while using Intelligence Assistant (IA) to provide VD interpretations in terms of vulnerable statements. For vulnerability detection, we separately consider the vulnerable statements and their surrounding contexts via data and control dependencies. This allows our model better discriminate vulnerable statements than using the mixture of vulnerable code and contextual code as in existing approaches. In addition to the coarse-grained vulnerability detection result, we leverage interpretable AI to provide users with fine-grained interpretations that include the sub-graph in the Program Dependency Graph (PDG) with the crucial statements that are relevant to the detected vulnerability. Our empirical evaluation on vulnerability databases shows that IVDetect outperforms the existing DL-based approaches by 43%–84% and 105%–255% in top-10 nDCG and MAP ranking scores. IVDetect correctly points out the vulnerable statements relevant to the vulnerability via its interpretation in 67% of the cases with a top-5 ranked list. IVDetect improves over the baseline interpretation models by 12.3%–400% and 9%–400% in accuracy. +- [Understanding neural code intelligence through program simplification](https://arxiv.org/pdf/2106.03353) + - FSE 2021 + - Abstract: A wide range of code intelligence (CI) tools, powered by deep neural networks, have been developed recently to improve programming productivity and perform program analysis. To reliably use such tools, developers often need to reason about the behavior of the underlying models and the factors that affect them. This is especially challenging for tools backed by deep neural networks. Various methods have tried to reduce this opacity in the vein of "transparent/interpretable-AI". However, these approaches are often specific to a particular set of network architectures, even requiring access to the network's parameters. This makes them difficult to use for the average programmer, which hinders the reliable adoption of neural CI systems. In this paper, we propose a simple, model-agnostic approach to identify critical input features for models in CI systems, by drawing on software debugging research, specifically delta debugging. Our approach, SIVAND, uses simplification techniques that reduce the size of input programs of a CI model while preserving the predictions of the model. We show that this approach yields remarkably small outputs and is broadly applicable across many model architectures and problem domains. We find that the models in our experiments often rely heavily on just a few syntactic features in input programs. We believe that SIVAND's extracted features may help understand neural CI systems' predictions and learned behavior. +- [Explaining mispredictions of machine learning models using rule induction](https://doi.org/10.1145/3468264.3468614) + - FSE 2021 + - Abstract: While machine learning (ML) models play an increasingly prevalent role in many software engineering tasks, their prediction accuracy is often problematic. When these models do mispredict, it can be very difficult to isolate the cause. In this paper, we propose a technique that aims to facilitate the debugging process of trained statistical models. Given an ML model and a labeled data set, our method produces an interpretable characterization of the data on which the model performs particularly poorly. 
The output of our technique can be useful for understanding limitations of the training data or the model itself; it can also be useful for ensembling if there are multiple models with different strengths. We evaluate our approach through case studies and illustrate how it can be used to improve the accuracy of predictive models used for software engineering tasks within Facebook. +- [Generalizable and interpretable learning for configuration extrapolation](https://dl.acm.org/doi/pdf/10.1145/3468264.3468603) + - FSE 2021 + - Abstract: Modern software applications are increasingly configurable, which puts a burden on users to tune these configurations for their target hardware and workloads. To help users, machine learning techniques can model the complex relationships between software configuration parameters and performance. While powerful, these learners have two major drawbacks: (1) they rarely incorporate prior knowledge and (2) they produce outputs that are not interpretable by users. These limitations make it difficult to (1) leverage information a user has already collected (e.g., tuning for new hardware using the best configurations from old hardware) and (2) gain insights into the learner’s behavior (e.g., understanding why the learner chose different configurations on different hardware or for different workloads). To address these issues, this paper presents two configuration optimization tools, GIL and GIL+, using the proposed generalizable and interpretable learning approaches. To incorporate prior knowledge, the proposed tools (1) start from known configurations, (2) iteratively construct a new linear model, (3) extrapolate better performance configurations from that model, and (4) repeat. Since the base learners are linear models, these tools are inherently interpretable. We enhance this property with a graphical representation of how they arrived at the highest performance configuration. We evaluate GIL and GIL+ by using them to configure Apache Spark workloads on different hardware platforms and find that, compared to prior work, GIL and GIL+ produce comparable, and sometimes even better performance configurations, but with interpretable results. +- [XAI tools in the public sector: a case study on predicting combined sewer overflows](https://doi.org/10.1145/3468264.3468547) + - FSE 2021 + - Abstract: Artificial intelligence and deep learning are becoming increasingly prevalent in contemporary software solutions. Explainable artificial intelligence (XAI) tools attempt to address the black box nature of the deep learning models and make them more understandable to humans. In this work, we apply three state-of-the-art XAI tools in a real-world case study. Our study focuses on predicting combined sewer overflow events for a municipal wastewater treatment organization. Through a data driven inquiry, we collect both qualitative information via stakeholder interviews and quantitative measures. These help us assess the predictive accuracy of the XAI tools, as well as the simplicity, soundness, and insightfulness of the produced explanations. Our results not only show the varying degrees that the XAI tools meet the requirements, but also highlight that domain experts can draw new insights from complex explanations that may differ from their previous expectations. 
+- [White-Box Analysis over Machine Learning: Modeling Performance of Configurable Systems](http://arxiv.org/pdf/2101.05362) + - ICSE 2021 + - Abstract: Performance-influence models can help stakeholders understand how and where configuration options and their interactions influence the performance of a system. With this understanding, stakeholders can debug performance behavior and make deliberate configuration decisions. Current black-box techniques to build such models combine various sampling and learning strategies, resulting in tradeoffs between measurement effort, accuracy, and interpretability. We present Comprex, a white-box approach to build performance-influence models for configurable systems, combining insights of local measurements, dynamic taint analysis to track options in the implementation, compositionality, and compression of the configuration space, without relying on machine learning to extrapolate incomplete samples. Our evaluation on 4 widely-used, open-source projects demonstrates that Comprex builds similarly accurate performance-influence models to the most accurate and expensive black-box approach, but at a reduced cost and with additional benefits from interpretable and local models. +- [NeuronFair: Interpretable White-Box Fairness Testing through Biased Neuron Identification](https://doi.org/10.1145/3510003.3510123) + - ICSE 2022 + - Abstract: Deep neural networks (DNNs) have demonstrated their outperformance in various domains. However, it raises a social concern whether DNNs can produce reliable and fair decisions especially when they are applied to sensitive domains involving valuable resource allocation, such as education, loan, and employment. It is crucial to conduct fairness testing before DNNs are reliably deployed to such sensitive domains, i.e., generating as many instances as possible to uncover fairness violations. However, the existing testing methods are still limited from three aspects: interpretability, performance, and generalizability. To overcome the challenges, we propose NeuronFair, a new DNN fairness testing framework that differs from previous work in several key aspects: (1) interpretable - it quantitatively interprets DNNs' fairness violations for the biased decision; (2) effective - it uses the interpretation results to guide the generation of more diverse instances in less time; (3) generic - it can handle both structured and unstructured data. Extensive evaluations across 7 datasets and the corresponding DNNs demonstrate NeuronFair's superior performance. For instance, on structured datasets, it generates much more instances (~ ×5.84) and saves more time (with an average speedup of 534.56%) compared with the state-of-the-art methods. Besides, the instances of NeuronFair can also be leveraged to improve the fairness of the biased DNNs, which helps build more fair and trustworthy deep learning systems. The code of NeuronFair is open-sourced at https://github.com/haibinzheng/NeuronFair. +- [DeepHyperion: exploring the feature space of deep learning-based systems through illumination search](https://arxiv.org/pdf/2107.06997) + - ISSTA 2021 + - Abstract: Deep Learning (DL) has been successfully applied to a wide range of application domains, including safety-critical ones. Several DL testing approaches have been recently proposed in the literature but none of them aims to assess how different interpretable features of the generated inputs affect the system's behaviour.
In this paper, we resort to Illumination Search to find the highest-performing test cases (i.e., misbehaving and closest to misbehaving), spread across the cells of a map representing the feature space of the system. We introduce a methodology that guides the users of our approach in the tasks of identifying and quantifying the dimensions of the feature space for a given domain. We developed DeepHyperion, a search-based tool for DL systems that illuminates, i.e., explores at large, the feature space, by providing developers with an interpretable feature map where automatically generated inputs are placed along with information about the exposed behaviours. +- [𝜀-weakened robustness of deep neural networks](https://doi.org/10.1145/3533767.3534373) + - ISSTA 2021 + - Abstract: Deep neural networks have been widely adopted for many real-world applications and their reliability has been widely concerned. This paper introduces a notion of ε-weakened robustness (briefly as ε-robustness) for analyzing the reliability and some related quality issues of deep neural networks. Unlike the conventional robustness, which focuses on the “perfect” safe region in the absence of adversarial examples, ε-weakened robustness focuses on the region where the proportion of adversarial examples is bounded by user-specified ε. The smaller the value of ε is, the less vulnerable a neural network is to be fooled by a random perturbation. Under such a robustness definition, we can give conclusive results for the regions where conventional robustness ignores. We propose an efficient testing-based method with user-controllable error bounds to analyze it. The time complexity of our algorithms is polynomial in the dimension and size of the network. So, they are scalable to large networks. One of the important applications of our ε-robustness is to build a robustness enhanced classifier to resist adversarial attack. Based on this theory, we design a robustness enhancement method with good interpretability and rigorous robustness guarantee. The basic idea is to resist perturbation with perturbation. Experimental results show that our robustness enhancement method can significantly improve the ability of deep models to resist adversarial attacks while maintaining the prediction performance on the original clean data. Besides, we also show the other potential value of ε-robustness in neural networks analysis. +- [Reentrancy Vulnerability Detection and Localization: A Deep Learning Based Two-phase Approach](https://dl.acm.org/doi/pdf/10.1145/3551349.3560428) + - ASE 2022 + - Abstract: Smart contracts have been widely and rapidly used to automate financial and business transactions together with blockchains, helping people make agreements while minimizing trusts. With millions of smart contracts deployed on blockchain, various bugs and vulnerabilities in smart contracts have emerged. Following the rapid development of deep learning, many recent studies have used deep learning for vulnerability detection to conduct security checks before deploying smart contracts. These approaches show effective results on detecting whether a smart contract is vulnerable or not whereas their results on locating suspicious statements responsible for the detected vulnerability are still unsatisfactory. To address this problem, we propose a deep learning based two-phase smart contract debugger for reentrancy vulnerability, one of the most severe vulnerabilities, named as ReVulDL: Reentrancy Vulnerability Detection and Localization. 
ReVulDL integrates the vulnerability detection and localization into a unified debugging pipeline. For the detection phase, given a smart contract, ReVulDL uses a graph-based pre-training model to learn the complex relationships in propagation chains for detecting whether the smart contract contains a reentrancy vulnerability. For the localization phase, if a reentrancy vulnerability is detected, ReVulDL utilizes interpretable machine learning to locate the suspicious statements in smart contract to provide interpretations of the detected vulnerability. Our large-scale empirical study on 47,398 smart contracts shows that ReVulDL achieves promising results in detecting reentrancy vulnerabilities (e.g., outperforming 16 state-of-the-art vulnerability detection approaches) and locating vulnerable statements (e.g., 70.38% of the vulnerable statements are ranked within Top-10). +- [ThirdEye: Attention Maps for Safe Autonomous Driving Systems](https://dl.acm.org/doi/pdf/10.1145/3551349.3556968) + - ASE 2022 + - Abstract: Automated online recognition of unexpected conditions is an indispensable component of autonomous vehicles to ensure safety even in unknown and uncertain situations. In this paper we propose a runtime monitoring technique rooted in the attention maps computed by explainable artificial intelligence techniques. Our approach, implemented in a tool called ThirdEye, turns attention maps into confidence scores that are used to discriminate safe from unsafe driving behaviours. The intuition is that uncommon attention maps are associated with unexpected runtime conditions. In our empirical study, we evaluated the effectiveness of different configurations of ThirdEye at predicting simulation-based injected failures induced by both unknown conditions (adverse weather and lighting) and unsafe/uncertain conditions created with mutation testing. Results show that, overall, ThirdEye can predict 98% misbehaviours, up to three seconds in advance, outperforming a state-of-the-art failure predictor for autonomous vehicles. +- [Unveiling Hidden DNN Defects with Decision-Based Metamorphic Testing](https://dl.acm.org/doi/pdf/10.1145/3551349.3561157) + - ASE 2022 + - Abstract: Contemporary DNN testing works are frequently conducted using metamorphic testing (MT). In general, de facto MT frameworks mutate DNN input images using semantics-preserving mutations and determine if DNNs can yield consistent predictions. Nevertheless, we find that DNNs may rely on erroneous decisions (certain components on the DNN inputs) to make predictions, which may still retain the outputs by chance. Such DNN defects would be neglected by existing MT frameworks. Erroneous decisions, however, would likely result in successive mis-predictions over diverse images that may exist in real-life scenarios. This research aims to unveil the pervasiveness of hidden DNN defects caused by incorrect DNN decisions (but retaining consistent DNN predictions). To do so, we tailor and optimize modern eXplainable AI (XAI) techniques to identify visual concepts that represent regions in an input image upon which the DNN makes predictions. Then, we extend existing MT-based DNN testing frameworks to check the consistency of DNN decisions made over a test input and its mutated inputs. Our evaluation shows that existing MT frameworks are oblivious to a considerable number of DNN defects caused by erroneous decisions. We conduct human evaluations to justify the validity of our findings and to elucidate their characteristics. 
Through the lens of DNN decision-based metamorphic relations, we re-examine the effectiveness of metamorphic transformations proposed by existing MT frameworks. We summarize lessons from this study, which can provide insights and guidelines for future DNN testing. +- [A real-world case study for automated ticket team assignment using natural language processing and explainable models](https://dl.acm.org/doi/pdf/10.1145/3551349.3561164) + - ASE 2022 + - Abstract: In the context of software development, managing and organizing agile boards of multi-disciplinary teams distributed around the world is a great challenge, especially regarding the process of assigning tickets to the correct team roles. Incorrectly assigned tickets can result in significant resource waste in any project and directly influence delivery outcomes and project costs. This work proposes a method for ticket analysis and automatic team assignment using Natural Language Processing and explainable Machine Learning models. Results show that the models perform well on a real-world team assignment task and provide insights into their decision. +- [XSA: eXplainable Self-Adaptation](https://dl.acm.org/doi/pdf/10.1145/3551349.3559552) + - ASE 2022 + - Abstract: Self-adaptive systems increasingly rely on machine learning techniques as black-box models to make decisions even when the target world of interest includes uncertainty and unknowns. Because of the lack of transparency, adaptation decisions, as well as their effect on the world, are hard to explain. This often hinders the ability to trace unsuccessful adaptations back to understandable root causes. In this paper, we introduce our vision of explainable self-adaptation. We demonstrate our vision by instantiating our ideas on a running example in the robotics domain and by showing an automated proof-of-concept process providing human-understandable explanations for successful and unsuccessful adaptations in critical scenarios. +- [Actionable and interpretable fault localization for recurring failures in online service systems](https://dl.acm.org/doi/pdf/10.1145/3540250.3549092) + - FSE 2022 + - Abstract: Fault localization is challenging in an online service system due to its monitoring data's large volume and variety and complex dependencies across/within its components (e.g., services or databases). Furthermore, engineers require fault localization solutions to be actionable and interpretable, which existing research approaches cannot satisfy. Therefore, the common industry practice is that, for a specific online service system, its experienced engineers focus on localization for recurring failures based on the knowledge accumulated about the system and historical failures. More specifically, 1) they can identify the underlying root causes and take mitigation actions when pinpointing a group of indicative metrics on the faulty component; 2) their diagnosis knowledge is roughly based on how one failure might affect the components in the whole system. Although the above common practice is actionable and interpretable, it is largely manual, thus slow and sometimes inaccurate. In this paper, we aim to automate this practice through machine learning. That is, we propose an actionable and interpretable fault localization approach, DejaVu, for recurring failures in online service systems. 
For a specific online service system, DejaVu takes historical failures and dependencies in the system as input and trains a localization model offline; for an incoming failure, the trained model online recommends where the failure occurs (i.e., the faulty components) and which kind of failure occurs (i.e., the indicative group of metrics) (thus actionable), which are further interpreted both globally and locally (thus interpretable). Based on the evaluation on 601 failures from three production systems and one open-source benchmark, in less than one second, DejaVu can rank the ground truths at 1.66∼5.03-th among a long candidate list on average, outperforming baselines by 54.52%. +- [Default: Mutual Information-based Crash Triage for Massive Crashes](https://doi.org/10.1145/3510003.3512760) + - ICSE 2022 + - Abstract: With the considerable success achieved by modern fuzzing infrastructures, more crashes are produced than ever before. To dig out the root cause, rapid and faithful crash triage for large numbers of crashes has always been attractive. However, hindered by the practical difficulty of reducing analysis imprecision without compromising efficiency, this goal has not been accomplished. In this paper, we present an end-to-end crash triage solution Default, for accurately and quickly pinpointing unique root cause from large numbers of crashes. In particular, we quantify the “crash relevance” of program entities based on mutual information, which serves as the criterion of unique crash bucketing and allows us to bucket massive crashes without pre-analyzing their root cause. The quantification of “crash relevance” is also used in the shortening of long crashing traces. On this basis, we use the interpretability of neural networks to precisely pinpoint the root cause in the shortened traces by evaluating each basic block's impact on the crash label. Evaluated with 20 programs with 22216 crashes in total, Default demonstrates remarkable accuracy and performance, which is way beyond what the state-of-the-art techniques can achieve: crash de-duplication was achieved at a super-fast processing speed - 0.017 seconds per crashing trace, without missing any unique bugs. After that, it identifies the root cause of 43 unique crashes with no false negatives and an average false positive rate of 9.2%. +- [ExAIS: Executable AI Semantics](https://doi.org/10.1145/3510003.3510112) + - ICSE 2022 + - Abstract: Neural networks can be regarded as a new programming paradigm, i.e., instead of building ever-more complex programs through (often informal) logical reasoning in the programmers' mind, complex ‘AI’ systems are built by optimising generic neural network models with big data. In this new paradigm, AI frameworks such as TensorFlow and PyTorch play a key role, which is as essential as the compiler for traditional programs. It is known that the lack of a proper semantics for programming languages (such as C), i.e., a correctness specification for compilers, has contributed to many problematic program behaviours and security issues. While it is in general hard to have a correctness specification for compilers due to the high complexity of programming languages and their rapid evolution, we have a unique opportunity to do it right this time for neural networks (which have a limited set of functions, and most of them have stable semantics). In this work, we report our effort on providing a correctness specification of neural network frameworks such as TensorFlow.
We specify the semantics of almost all TensorFlow layers in the logical programming language Prolog. We demonstrate the usefulness of the semantics through two applications. One is a fuzzing engine for TensorFlow, which features a strong oracle and a systematic way of generating valid neural networks. The other is a model validation approach which enables consistent bug reporting for TensorFlow models. +- [What Do They Capture? - A Structural Analysis of Pre-Trained Language Models for Source Code](https://doi.org/10.1145/3510003.3510050) + - ICSE 2022 + - Abstract: Recently, many pre-trained language models for source code have been proposed to model the context of code and serve as a basis for downstream code intelligence tasks such as code completion, code search, and code summarization. These models leverage masked pre-training and Transformer and have achieved promising results. However, currently there is still little progress regarding interpretability of existing pre-trained code models. It is not clear why these models work and what feature correlations they can capture. In this paper, we conduct a thorough structural analysis aiming to provide an interpretation of pre-trained language models for source code (e.g., CodeBERT, and GraphCodeBERT) from three distinctive perspectives: (1) attention analysis, (2) probing on the word embedding, and (3) syntax tree induction. Through comprehensive analysis, this paper reveals several insightful findings that may inspire future studies: (1) Attention aligns strongly with the syntax structure of code. (2) Pre-training language models of code can preserve the syntax structure of code in the intermediate representations of each Transformer layer. (3) The pre-trained models of code have the ability of inducing syntax trees of code. Theses findings suggest that it may be helpful to incorporate the syntax structure of code into the process of pre-training for better code representations. +- [Almost correct invariants: synthesizing inductive invariants by fuzzing proofs](https://doi.org/10.1145/3533767.3534381) + - ISSTA 2022 + - Abstract: Real-life programs contain multiple operations whose semantics are unavailable to verification engines, like third-party library calls, inline assembly and SIMD instructions, special compiler-provided primitives, and queries to uninterpretable machine learning models. Even with the exceptional success story of program verification, synthesis of inductive invariants for such "open" programs has remained a challenge. Currently, this problem is handled by manually "closing" the program---by providing hand-written stubs that attempt to capture the behavior of the unmodelled operations; writing stubs is not only difficult and tedious, but the stubs are often incorrect---raising serious questions on the whole endeavor. In this work, we propose Almost Correct Invariants as an automated strategy for synthesizing inductive invariants for such "open" programs. We adopt an active learning strategy where a data-driven learner proposes candidate invariants. In deviation from prior work that attempt to verify invariants, we attempt to falsify the invariants: we reduce the falsification problem to a set of reachability checks on non-deterministic programs; we ride on the success of modern fuzzers to answer these reachability queries. Our tool, Achar, automatically synthesizes inductive invariants that are sufficient to prove the correctness of the target programs. 
We compare Achar with a state-of-the-art invariant synthesis tool that employs theorem proving on formulae built over the program source. Though Achar is without strong soundness guarantees, our experiments show that even when we provide almost no access to the program source, Achar outperforms the state-of-the-art invariant generator that has complete access to the source. We also evaluate Achar on programs that current invariant synthesis engines cannot handle---programs that invoke external library calls, inline assembly, and queries to convolution neural networks; Achar successfully infers the necessary inductive invariants within a reasonable time. +- [LiRTest: augmenting LiDAR point clouds for automated testing of autonomous driving systems](https://doi.org/10.1145/3533767.3534397) + - ISSTA 2022 + - Abstract: With the tremendous advancement of Deep Neural Networks (DNNs), autonomous driving systems (ADS) have achieved significant development and been applied to assist in many safety-critical tasks. However, despite their spectacular progress, several real-world accidents involving autonomous cars even resulted in a fatality. While the high complexity and low interpretability of DNN models, which empowers the perception capability of ADS, make conventional testing techniques inapplicable for the perception of ADS, the existing testing techniques depending on manual data collection and labeling become time-consuming and prohibitively expensive. In this paper, we design and implement LiRTest, the first automated LiDAR-based autonomous vehicles testing tool. LiRTest implements the ADS-specific metamorphic relation and equips affine and weather transformation operators that can reflect the impact of the various environmental factors to implement the relation. We experiment LiRTest with multiple 3D object detection models to evaluate its performance on different tasks. The experiment results show that LiRTest can activate different neurons of the object detection models and effectively detect their erroneous behaviors under various driving conditions. Also, the results confirm that LiRTest can improve the object detection precision by retraining with the generated data. +- [Robin: A Novel Method to Produce Robust Interpreters for Deep Learning-Based Code Classifiers](https://arxiv.org/pdf/2309.10644) + - ASE 2023 + - Abstract: Deep learning has been widely used in source code classification tasks, such as code classification according to their functionalities, code authorship attribution, and vulnerability detection. Unfortunately, the black-box nature of deep learning makes it hard to interpret and understand why a classifier (i.e., classification model) makes a particular prediction on a given example. This lack of interpretability (or explainability) might have hindered their adoption by practitioners because it is not clear when they should or should not trust a classifier's prediction. The lack of interpretability has motivated a number of studies in recent years. However, existing methods are neither robust nor able to cope with out-of-distribution examples. In this paper, we propose a novel method to produce Robust interpreters for a given deep learning-based code classifier; the method is dubbed Robin. The key idea behind Robin is a novel hybrid structure combining an interpreter and two approximators, while leveraging the ideas of adversarial training and data augmentation. 
Experimental results show that, on average, the interpreter produced by Robin achieves 6.11% higher fidelity (evaluated on the classifier), 67.22% higher fidelity (evaluated on the approximator), and 15.87x higher robustness than the three existing interpreters we evaluated. Moreover, the interpreter is 47.31% less affected by out-of-distribution examples than LEMNA.
+- [Generative Type Inference for Python](https://arxiv.org/pdf/2307.09163)
+  - ASE 2023
+  - Abstract: Python is a popular dynamic programming language, evidenced by its ranking as the second most commonly used language on GitHub. However, its dynamic type system can lead to potential type errors, leading researchers to explore automatic type inference approaches for Python programs. Existing type inference approaches can generally be grouped into three categories: rule-based, supervised, and cloze-style. Rule-based approaches can ensure the accuracy of predicted variable types, but they suffer from low coverage caused by dynamic features and external calls. Supervised approaches, while feature-agnostic and able to mitigate the low-coverage problem, require large, high-quality annotated datasets and are limited to pre-defined types. As zero-shot approaches, cloze-style approaches reformulate type inference into a fill-in-the-blank problem by leveraging the general knowledge in powerful pre-trained code models. However, their performance is limited because they ignore the domain knowledge from static typing rules, which reflects the inference logic, and their predictions are not interpretable, hindering developers' understanding and verification of the results. This paper introduces TypeGen, a few-shot generative type inference approach that incorporates static domain knowledge from static analysis. TypeGen creates chain-of-thought (COT) prompts by translating the type inference steps of static analysis into prompts based on type dependency graphs (TDGs), enabling language models to learn how static analysis infers types. By combining COT prompts with code slices and type hints, TypeGen constructs example prompts from human annotations. TypeGen only requires very few annotated examples to teach language models to generate similar COT prompts via in-context learning. Moreover, TypeGen enhances the interpretability of results through an input-explanation-output strategy, which generates both explanations and type predictions in COT prompts. Experiments show that TypeGen outperforms the best baseline, Type4Py, by 10.0% for argument type prediction and 22.5% for return value type prediction in terms of top-1 Exact Match using only five examples. Furthermore, TypeGen achieves substantial improvements of 27% to 84% over the zero-shot performance of large language models with parameter sizes ranging from 1.3B to 175B in terms of top-1 Exact Match.
+- [InfeRE: Step-by-Step Regex Generation via Chain of Inference](https://arxiv.org/pdf/2308.04041)
+  - ASE 2023
+  - Abstract: Automatically generating regular expressions (regexes) from natural language descriptions (NL2RE) has been an emerging research area. Prior studies treat a regex as a linear sequence of tokens and generate the final expression autoregressively in a single pass. They did not take into account the step-by-step internal text-matching processes behind the final results.
This significantly hinders the efficacy and interpretability of regex generation by neural language models. In this paper, we propose a new paradigm called InfeRE, which decomposes the generation of regexes into chains of step-by-step inference. To enhance robustness, we introduce a self-consistency decoding mechanism that ensembles multiple outputs sampled from different models. We evaluate InfeRE on two publicly available datasets, NL-RX-Turk and KB13, and compare the results with state-of-the-art approaches and the popular tree-based generation approach TRANX. Experimental results show that InfeRE substantially outperforms previous baselines, yielding 16.3% and 14.7% improvements in DFA@5 accuracy on the two datasets, respectively.
+- [Scalable Industrial Control System Analysis via XAI-Based Gray-Box Fuzzing](https://doi.org/10.1109/ASE56229.2023.00161)
+  - ASE 2023
+  - Abstract: Conventional approaches to analyzing industrial control systems (ICS) have relied on either white-box analysis or black-box fuzzing. However, white-box methods rely on sophisticated domain expertise, while black-box methods suffer from state explosion and thus scale poorly when analyzing real ICS involving a large number of sensors and actuators. To address these limitations, we propose XAI-based gray-box fuzzing, a novel approach that leverages explainable AI and machine learning modeling of ICS to accurately identify a small set of actuators critical to ICS safety, which results in a significant reduction of the state space without relying on domain expertise. Experimental results show that our method accurately explains the ICS model and speeds up fuzzing by 64x compared to conventional black-box methods.
+- [How Early Participation Determines Long-Term Sustained Activity in GitHub Projects?](https://dl.acm.org/doi/pdf/10.1145/3611643.3616349)
+  - FSE 2023
+  - Abstract: Although the open source model bears many advantages in software development, open source projects are always hard to sustain. Previous research on open source sustainability mainly focuses on projects that have already reached a certain level of maturity (e.g., with communities, releases, and downstream projects). However, limited attention is paid to the development of (sustainable) open source projects in their infancy, and we believe an understanding of early sustainability determinants is crucial for project initiators, incubators, newcomers, and users. In this paper, we aim to explore the relationship between early participation factors and long-term project sustainability. We leverage a novel methodology combining the Blumberg model of performance and machine learning to predict the sustainability of 290,255 GitHub projects. Specifically, we train an XGBoost model based on early participation (the first three months of activity) in these projects and interpret the model using LIME. We quantitatively show that early participants have a positive effect on a project's future sustained activity if they have prior experience in OSS project incubation and demonstrate concentrated focus and steady commitment. Participation from non-code contributors and detailed contribution documentation also promote a project's sustained activity. Compared with individual projects, building a community that consists of more experienced core developers and more active peripheral developers is important for organizational projects.
This study provides unique insights into the incubation and recognition of sustainable open source projects, and our interpretable prediction approach can also offer guidance to open source project initiators and newcomers.
+  - See the LIME-on-XGBoost sketch after this list for a minimal illustration of interpreting such a model.
+- [Silent Vulnerable Dependency Alert Prediction with Vulnerability Key Aspect Explanation](https://arxiv.org/pdf/2302.07445)
+  - ICSE 2023
+  - Abstract: Due to its convenience, open-source software is widely used. For benign reasons, open-source maintainers often fix vulnerabilities silently, leaving their users unaware of the updates and exposed to threats. Previous works all focus on black-box binary detection of silent dependency alerts, which suffers from high false-positive rates, and open-source software users then need to analyze and explain the AI predictions themselves. Explainable AI has emerged as a complement to black-box AI models, providing details in various forms to explain AI decisions. Noticing that there is still no technique that can discover silent dependency alerts in time, in this work we propose a framework that uses an encoder-decoder model with a binary detector to provide explainable silent dependency alert prediction. Our model generates four types of vulnerability key aspects, including vulnerability type, root cause, attack vector, and impact, to enhance the trustworthiness of and users' acceptance of alert predictions. Through experiments with several models and inputs, we confirm that CodeBERT with both commit messages and code changes achieves the best results. Our user study shows that explainable alert predictions can help users find silent dependency alerts more easily than black-box predictions. To the best of our knowledge, this is the first research work on applying Explainable AI to silent dependency alert prediction, which opens the door to related domains.
+- [Leveraging Feature Bias for Scalable Misprediction Explanation of Machine Learning Models](https://doi.org/10.1109/ICSE48619.2023.00135)
+  - ICSE 2023
+  - Abstract: Interpreting and debugging machine learning models is necessary to ensure their robustness, and explaining mispredictions can help significantly in doing so. While recent works on misprediction explanation have proven promising in generating interpretable explanations for mispredictions, the state-of-the-art techniques “blindly” deduce misprediction explanation rules from all data features, which may not scale with the number of features. To alleviate this problem, we propose an efficient misprediction explanation technique named Bias Guided Misprediction Diagnoser (BGMD), which leverages two pieces of prior knowledge about data: a) data often exhibit highly skewed feature distributions, and b) trained models in many cases perform poorly on sub-datasets with under-represented features. Next, we propose a technique named MAPS (Mispredicted Area UPweight Sampling). During model retraining, MAPS increases the weights of sub-datasets that belong to groups prone to misprediction because they contain under-represented features. Thus, MAPS makes the retrained model pay more attention to the under-represented features. Our empirical study shows that BGMD outperforms the state-of-the-art misprediction diagnoser and reduces diagnosis time by 92%. Furthermore, MAPS outperforms two state-of-the-art techniques at fixing the machine learning model's performance on mispredicted data without compromising its performance on all data.
All the research artifacts (i.e., tools, scripts, and data) of this study are available on the accompanying website [1].
+- [CoLeFunDa: Explainable Silent Vulnerability Fix Identification](https://doi.org/10.1109/ICSE48619.2023.00214)
+  - ICSE 2023
+  - Abstract: It is common practice for OSS users to leverage and monitor security advisories to discover newly disclosed OSS vulnerabilities and their corresponding patches for vulnerability remediation. It is also common for vulnerability fixes to be publicly available one week earlier than their disclosure. This time gap gives attackers an opportunity to exploit the vulnerability, so OSS users need to sense the fix as early as possible so that the vulnerability can be remediated before it is exploited. However, it is common for OSS to adopt a vulnerability disclosure policy that causes the majority of vulnerabilities to be fixed silently, meaning the commit with the fix does not indicate any vulnerability information. In this case, even if a fix is identified, it is hard for OSS users to understand the vulnerability and evaluate its potential impact. To improve early sensing of vulnerabilities, the identification of silent fixes and their corresponding explanations (e.g., the corresponding common weakness enumeration (CWE) category and exploitability rating) are equally important. However, it is challenging to identify silent fixes and provide explanations due to the limited and diverse data. To tackle this challenge, we propose CoLeFunDa, a framework consisting of a Contrastive Learner and FunDa, a novel approach for Function change Data augmentation. FunDa first augments the fix data (i.e., code changes) at the function level with unsupervised and supervised strategies. Then the Contrastive Learner leverages contrastive learning to effectively train a function change encoder, FCBERT, from the diverse fix data. Finally, we leverage FCBERT for further fine-tuning on three downstream tasks: silent fix identification, CWE category classification, and exploitability rating classification. Our results show that CoLeFunDa outperforms all the state-of-the-art baselines in all downstream tasks. We also conduct a survey to verify the effectiveness of CoLeFunDa in practical usage. The results show that CoLeFunDa can categorize 62.5% (25 out of 40) of the CVEs into the correct CWE categories within its top 2 recommendations.
+
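+Several entries above (e.g., "What Do They Capture?") analyze pre-trained code models by inspecting their self-attention. As a minimal, illustrative sketch of that kind of attention inspection (not the paper's own tooling), the snippet below loads a pre-trained code model with the Hugging Face `transformers` library and prints, for each token of a small code snippet, the token it attends to most strongly; the checkpoint name `microsoft/codebert-base` is just one reasonable choice.
+
+```python
+# Minimal attention-inspection sketch (illustrative only; not the paper's artifact).
+import torch
+from transformers import AutoModel, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
+model = AutoModel.from_pretrained("microsoft/codebert-base", output_attentions=True)
+
+code = "def add(a, b):\n    return a + b"
+inputs = tokenizer(code, return_tensors="pt")
+
+with torch.no_grad():
+    outputs = model(**inputs)
+
+# outputs.attentions holds one tensor per layer, each shaped (batch, num_heads, seq_len, seq_len).
+tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
+avg_attention = outputs.attentions[-1][0].mean(dim=0)  # last layer, averaged over heads
+
+# For each token, report the token it attends to most strongly.
+for i, tok in enumerate(tokens):
+    j = int(torch.argmax(avg_attention[i]))
+    print(f"{tok:>12} -> {tokens[j]}")
+```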
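+
+The FSE 2023 entry above on sustained activity in GitHub projects trains an XGBoost model on early-participation features and interprets it with LIME. The sketch below illustrates that general pattern only; the feature names and synthetic data are hypothetical, and the snippet is not the authors' pipeline. It assumes the `xgboost`, `lime`, and `numpy` packages.
+
+```python
+# Hypothetical LIME-on-XGBoost sketch; feature names and data are made up for illustration.
+import numpy as np
+from lime.lime_tabular import LimeTabularExplainer
+from xgboost import XGBClassifier
+
+rng = np.random.default_rng(0)
+feature_names = ["early_commits", "core_devs", "doc_ratio", "issue_replies"]
+
+# Synthetic stand-in for early-participation features and a sustainability label.
+X = rng.random((500, len(feature_names)))
+y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.1, 500) > 0.8).astype(int)
+
+model = XGBClassifier(n_estimators=50, max_depth=3)
+model.fit(X, y)
+
+explainer = LimeTabularExplainer(
+    X,
+    feature_names=feature_names,
+    class_names=["inactive", "sustained"],
+    mode="classification",
+)
+
+# Explain one project's prediction as a list of weighted feature conditions.
+explanation = explainer.explain_instance(X[0], model.predict_proba, num_features=4)
+for condition, weight in explanation.as_list():
+    print(f"{condition:<30} {weight:+.3f}")
+```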