diff --git a/latex/splash2024/splash.pdf b/latex/splash2024/splash.pdf
index bcc0c667..a94ea823 100644
Binary files a/latex/splash2024/splash.pdf and b/latex/splash2024/splash.pdf differ
diff --git a/latex/splash2024/splash.tex b/latex/splash2024/splash.tex
index ba2a7188..7e0ffe49 100644
--- a/latex/splash2024/splash.tex
+++ b/latex/splash2024/splash.tex
@@ -912,7 +912,29 @@
 Then, instead of top-down incremental sampling, we can create a randomized $\varphi'$ from $\varphi$ by sampling integers uniformly without replacement from $\mathbb{Z}_{|T|}$ and decode them into whole parse trees. Assuming $\varphi(T, \cdot)$ is in fact bijective, then letting $\bm\varphi(i) = \bigcup_{j\in[1, i]} \{\varphi'(T, j)\}$ will satisfy Def.~\ref{def:linear-convergence} by construction. If the language being sampled is sufficiently small, we can enumerate every tree, otherwise, sample them uniformly without replacement, or with replacement using a PCFG.
 %This procedure is the basis for our enumerate sampler and the method we use to decode repairs from the intersection grammar.
 
-\subsection{Ranked repair}\label{sec:ranking}
+The previous technique will exhaustively enumerate all parse trees in a given $\mathbb{T}_2$, but in an arbitrary order. While sufficient for decoding small finite languages, it can be improved for large finite languages by decoding repairs in order of likelihood. To decode the top-$k$ maximum-likelihood results without extracting all repairs and reranking, one can instead define an automata algebra over $\mathcal{M}^{|V|}$, propagating a vector of automata, each automaton $A_v$ indexed by a nonterminal $v: V$, as follows:
+
+\begin{align}
+  X \oplus Z &\mapsto \bigcup_{v \in V}\big\{\langle v, A_v \rangle \mid \mathcal{L}(A_v) = \mathcal{L}(X[v]) \cup \mathcal{L}(Z[v])\big\}\\
+  X \otimes Z &\mapsto \big\{\langle v, A_v \rangle \mid \mathcal{L}(A_v) = \mathcal{L}(X[x]) \times \mathcal{L}(Z[z]), (v \rightarrow xz) \in P\big\}
+\end{align}\\
+
+where the unit nonterminals occupying the first upper diagonal are constructed as follows:
+
+\begin{equation}
+\begin{footnotesize}
+\Lambda(s: \underline\Sigma^n) \mapsto \Big\{\langle v, A_v\rangle\mid v: V, \mathcal{L}(A_v) = \{\varepsilon\}\Big\} \otimes \begin{cases}
+  \big\{\langle v, A_v \rangle \mid \mathcal{L}(A_v) = \Sigma \big\} & \text{if $s$ is a hole,} \vspace{5pt}\\
+  \big\{\langle v, A_v \rangle \mid \mathcal{L}(A_v) = \{s\}, (v \rightarrow s)\in P\big\} & \text{otherwise.}
+\end{cases}
+\end{footnotesize}
+\end{equation}
+
+Constructively, let $+, *: \mathcal{M}\times \mathcal{M} \rightarrow \mathcal{M}$ be the automata operators corresponding to language union and concatenation, satisfying $\mathcal{L}(A_1 + A_2) = \mathcal{L}(A_1)\cup\mathcal{L}(A_2)$ and $\mathcal{L}(A_1 * A_2) = \mathcal{L}(A_1)\times\mathcal{L}(A_2)$. These can be implemented using the standard textbook constructions, recalling that NFAs are closed under both operations.
+
+Given a PCFG with known transition probabilities, $\Lambda^* \circ S$ would then yield an equivalent WFSA recognizing $\Sigma^n\cap\mathcal{L}(G)$, which can be determinized into a labeled transition system (LTS) and decoded using $k$-best paths to obtain the top-$k$ maximum-likelihood repairs. While this procedure requires a more complex data structure, it is more sample-efficient than the tree sampler and does not require a separate reranking step.
+
+\clearpage\subsection{Ranked repair}\label{sec:ranking}
 
 Returning to the ranked repair problem (Def.~\ref{def:ranked-repair}), the above procedure returns a set of syntactically consistent repairs, and we need an ordering over them. We note that any metric is sufficient, such as the log-likelihood of the repair under a large language model or the probability under a PCFG. We implement the simplest solution: the likelihood of a low-order Markov chain. This solution is computationally fast, and as we will show, yields competitive results in practice.
 
diff --git a/src/jvmTest/kotlin/ai/hypergraph/kaliningraph/automata/WFSATest.kt b/src/jvmTest/kotlin/ai/hypergraph/kaliningraph/automata/WFSATest.kt
index 5204f264..bc87f975 100644
--- a/src/jvmTest/kotlin/ai/hypergraph/kaliningraph/automata/WFSATest.kt
+++ b/src/jvmTest/kotlin/ai/hypergraph/kaliningraph/automata/WFSATest.kt
@@ -64,12 +64,12 @@ class WFSATest {
       .replace("null", "ε") // null label = ε-transition
 
 /*
-./gradlew jvmTest --tests "ai.hypergraph.kaliningraph.automata.WFSATest.testLBHRepair"
+./gradlew jvmTest --tests "ai.hypergraph.kaliningraph.automata.WFSATest.testPTreeVsWFSA"
 */
   @Test
-  fun testLBHRepair() {
-    val toRepair = "NAME : NEWLINE NAME = STRING NEWLINE NAME = NAME . NAME ( STRING ) NEWLINE"
-    val radius = 1
+  fun testPTreeVsWFSA() {
+    val toRepair = "NAME = NAME ( STRING ) NEWLINE NAME = NAME ( STRING ) NEWLINE NAME = NAME ( STRING NUMBER NAME = [ NAME , NAME , NAME ] NEWLINE"
+    val radius = 2
     val pt = Grammars.seq2parsePythonCFG.makeLevPTree(toRepair, radius, shortS2PParikhMap)
     val repairs = pt.sampleStrWithoutReplacement().distinct().take(100).toSet()
     println("Found ${repairs.size} repairs by enumerating PTree")
@@ -85,7 +85,7 @@ class WFSATest {
             addTransition(s1, s2, a.root, 1.0)
           }
         }
-      )?.also { println("\n" + it.toDot().alsoCopy() + "\n") }
+      )?.also { println("\n" + Operations.determinizeER(it).toDot().alsoCopy() + "\n") }
         .also { println("Total: ${Automata.transitions(it).size} arcs, ${Automata.states(it).size}") }
         .let { Automata.bestStrings(it, 1000).map { it.label.joinToString(" ") }.toSet() }
     }.also {