Commit 843eae4: fix some typos
olivier-bernard-creatis committed Oct 21, 2024 (1 parent: 88b0e17)
Showing 2 changed files with 8 additions and 8 deletions.

collections/_posts/2024-10-20-tabular-explainability.md

pdf: "https://arxiv.org/pdf/2302.14278"

![](/collections/images/tabular_explainability/tab_exp_1.jpg)

* A standard transformer encoder contains $$N \times h$$ attention matrices, with $$N$$ the number of blocks and $$h$$ the number of heads per block (a minimal sketch is given below)

See [the tutorial on transformers](https://creatis-myriad.github.io/tutorials/2022-06-20-tutorial_transformer.html) for more details.
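
A toy NumPy sketch (not taken from the paper; sizes and weights are arbitrary assumptions) illustrating this count: each of the $$N$$ blocks produces $$h$$ attention matrices, hence $$N \times h$$ matrices in total.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_maps_of_block(tokens, h, rng):
    """Return the h attention matrices (each m x m) of one encoder block."""
    m, d = tokens.shape
    d_head = d // h
    maps = []
    for _ in range(h):
        Wq = rng.normal(size=(d, d_head))
        Wk = rng.normal(size=(d, d_head))
        Q, K = tokens @ Wq, tokens @ Wk
        maps.append(softmax(Q @ K.T / np.sqrt(d_head)))  # one (m x m) attention matrix per head
    return maps

rng = np.random.default_rng(0)
N, h, m, d = 3, 4, 6, 16                       # blocks, heads per block, tokens, model dimension
tokens = rng.normal(size=(m, d))
attention = [attention_maps_of_block(tokens, h, rng) for _ in range(N)]
print(sum(len(block) for block in attention))  # N * h = 12 attention matrices in total
```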


$$\mathcal{L}= - \sum_{i=1}^{n} y_i \log \left( \hat{y}_i \right) \, + \, \lambda \sum_{l=1}^{M} \sum_{j,k=1}^{m} a^{l}_{j,k} \log \left( a^{l}_{j,k} \right)$$

* The first term forces the student prediction $$\hat{y}_i$$ to be close to that of the master $$y_i$$
* The second term forces the entropy of each attention matrix to be low, i.e. it concentrates the information contained in each attention matrix on a few cells, which makes the attention matrices sparse (a minimal sketch of this loss is given below)
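
A minimal sketch (an assumed implementation, not the authors' code) of this loss: the cross-entropy towards the master prediction plus an entropy penalty on every attention matrix. The penalty is written here as $$\lambda$$ times the entropy $$-\sum_{j,k} a_{j,k} \log a_{j,k}$$ of each matrix, so that minimizing the loss drives the entropy down, as described above.

```python
import numpy as np

def distillation_loss(student_probs, master_probs, attention_maps, lam=0.1, eps=1e-12):
    """student_probs, master_probs: class probabilities; attention_maps: list of (m x m) matrices."""
    ce = -np.sum(master_probs * np.log(student_probs + eps))             # first term
    entropy = sum(-np.sum(a * np.log(a + eps)) for a in attention_maps)  # entropy of each attention matrix
    return ce + lam * entropy                                            # low entropy => sparse attention

# toy usage with random row-stochastic attention matrices
rng = np.random.default_rng(0)
maps = [rng.dirichlet(np.ones(5), size=5) for _ in range(6)]  # e.g. N*h = 6 matrices of size 5x5
y_hat = np.array([0.7, 0.2, 0.1])
y = np.array([0.9, 0.05, 0.05])
print(distillation_loss(y_hat, y, maps))
```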

 

* Maps the attention matrices across encoder layers into a directed acyclic graph (DAG)
* The DAG is defined as $$D=(V,A)$$, where $$V$$ and $$A$$ are the set of vertices and arcs that compose the graph $$D$$
* The vertices $$V= \bigcup_{l=0}^{M} \{ v^l_c \}$$ correspond to groups of features, where $$c \in \{1,\cdots,m\}$$
* The arcs $$\left( v^{l-1}_{\hat{c}}, v^{l}_{\tilde{c}}\right) \in A$$ correspond to attention values $$a^l_{\hat{c},\tilde{c}}$$, where $$\hat{c}, \tilde{c} \in \{1,\cdots,m\}$$ (a construction sketch is given after the figure below)

![](/collections/images/tabular_explainability/from_attention_to_graph.jpg)
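
A minimal sketch (an assumed construction, not the authors' code) of this mapping: vertex $$(l, c)$$ stands for the group of features $$c$$ at layer $$l$$, and the arc from $$(l-1, j)$$ to $$(l, k)$$ carries the cost $$-\log a^l_{j,k}$$, so that the shortest path in $$D$$ is the maximum probability path.

```python
import numpy as np

def attention_to_dag(attention_per_layer):
    """attention_per_layer: list of M matrices a^l, each of shape (m, m).
    Returns the arcs of D as a dict: (l-1, j) -> list of ((l, k), cost)."""
    arcs = {}
    for l, a in enumerate(attention_per_layer, start=1):
        m = a.shape[0]
        for j in range(m):
            arcs[(l - 1, j)] = [((l, k), -np.log(a[j, k]))
                                for k in range(m) if a[j, k] > 0]  # only arcs with non-zero attention
    return arcs

# toy usage: M = 2 layers, m = 3 groups of features
rng = np.random.default_rng(0)
maps = [rng.dirichlet(np.ones(3), size=3) for _ in range(2)]
dag = attention_to_dag(maps)
print(dag[(0, 0)])  # arcs leaving vertex v^0_0 with their -log(a) costs
```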

* The maximum probability path $$p$$ is found using Dijkstra’s algorithm and is of the form $$p=\{ v^{0}_{i_0}, v^{1}_{i_1}, \cdots, v^{M}_{i_M} \}$$
* The arc cost is $$- \log\left( a^l_{j,k} \right)$$ for $$a^l_{j,k} > 0$$, yielding path cost $$- \log\left( \prod_{l=1}^{M} a^l_{i_{l-1},i_{l}} \right)$$
* The authors focus on the group of features corresponding to the most relevant input for the final prediction, i.e. the group of features $$c=i_0$$ (a path-search sketch is given below)
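
A minimal sketch (an assumed implementation) of this search with Dijkstra's algorithm; the virtual source and sink connected to all layer-$$0$$ and layer-$$M$$ vertices are an illustration choice, and the shortest path under the $$-\log a^l_{j,k}$$ costs is the maximum probability path.

```python
import heapq
import numpy as np

def max_probability_path(attention_per_layer):
    """Return [(0, i0), (1, i1), ..., (M, iM)], the maximum probability path."""
    M, m = len(attention_per_layer), attention_per_layer[0].shape[0]
    source, sink = (-1, 0), (M + 1, 0)       # virtual end points (illustration choice)

    def neighbours(v):
        if v == source:                       # free entry into every layer-0 vertex
            return [((0, c), 0.0) for c in range(m)]
        l, j = v
        if l == M:                            # free exit from every layer-M vertex
            return [(sink, 0.0)]
        a = attention_per_layer[l]            # attention matrix a^{l+1} in the paper's 1-based indexing
        return [((l + 1, k), -np.log(a[j, k])) for k in range(m) if a[j, k] > 0]

    dist, prev, heap = {source: 0.0}, {}, [(0.0, source)]
    while heap:
        d, v = heapq.heappop(heap)
        if v == sink:
            break
        if d > dist.get(v, np.inf):
            continue
        for w, cost in neighbours(v):
            if d + cost < dist.get(w, np.inf):
                dist[w], prev[w] = d + cost, v
                heapq.heappush(heap, (d + cost, w))

    path, v = [], sink                        # backtrack from the sink to the source
    while prev[v] != source:
        v = prev[v]
        path.append(v)
    return list(reversed(path))

# toy usage: M = 3 layers, m = 4 groups of features
rng = np.random.default_rng(0)
maps = [rng.dirichlet(np.ones(4), size=4) for _ in range(3)]
print(max_probability_path(maps))             # the most relevant group of features is path[0][1] = i0
```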

> Explanations to the student’s predictions are provided by finding the most relevant group of features for the classification
task, i.e. the group $$c=i_0$$ corresponding to the first vertex $$v^0_{i_0}$$ of the maximum probability path $$p$$ in graph $$D$$

* A single group of features does not always provide all the relevant information to make a prediction
* Additional groups of features are ranked iteratively, i.e. in each iteration the starting point $$v^0_{i_0}$$ of the previously found highest probability path is removed from the graph and the next highest probability path in $$D$$ is searched for (a ranking sketch is given below)
* In the experiments, at most the two best groups of features were used to explain the predictions
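
A minimal sketch (an assumed implementation) of this iterative ranking, reusing `max_probability_path` from the sketch above: removing the starting vertex $$v^0_{i_0}$$ is emulated by zeroing the corresponding row of the first attention matrix, which deletes all of its outgoing arcs.

```python
import numpy as np

def rank_feature_groups(attention_per_layer, n_groups=2):
    """Return the indices of the n_groups most relevant groups of features."""
    maps = [a.copy() for a in attention_per_layer]
    ranked = []
    for _ in range(n_groups):
        path = max_probability_path(maps)  # from the Dijkstra sketch above
        i0 = path[0][1]                    # most relevant remaining group of features
        ranked.append(i0)
        maps[0][i0, :] = 0.0               # remove v^0_{i0}: no arc can leave it anymore
    return ranked

# toy usage: rank the two best groups of features, as in the experiments
rng = np.random.default_rng(0)
maps = [rng.dirichlet(np.ones(4), size=4) for _ in range(3)]
print(rank_feature_groups(maps, n_groups=2))
```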

 
Binary file modified collections/images/tabular_explainability/tab_exp_1.jpg
