How attentive are GATs?

Key ideas

Rank of attention in GATv1 is unconditional on query node
Because of this, some graph problems cannot be expressed by GAT

Introduction

Proof GAT doesn't compute a dynamic type of attention. It's only static.
- Static means for any query node, the attention is monotonic with respect to its neighbors.

Preliminaries

Generic GNN layer:
GAT attention scoring:
GAT attention function:
GAT layer:

Static vs dynamic & limited expressivity of GAT

Static attention:
- Family of functions computing scoring for a set of key vectors and query vectors. For every f there's always a highest scoring key.
- This is limiting because regardless of the query, every function has a key that's always selected
- BUT different keys have different relevance to different queries in real problems. How to express this?
Dynamic attention:
- Every key has different scoring based on the query node. Notice dynamic attention families will have strict subsets of static attention.

Need for a new scoring function

There exists a j_max wo that a_2(Wh_j) is maximal for all j in V.
In that case, due to the monotonicity of LeakyReLU, for every query node i, the node j_max is leading to the maximal value of the distribution.
To avoid this, we can apply the "a" learned attention layer after applying the LeakyReLU non linearity.

Evaluation

Certain problems, like the following, cannot be learned with static attention (proven in appendix).