improved documentation based on JOSS review
FlyingWorkshop committed Dec 12, 2024
1 parent 74fd210 commit 9fab4da
Showing 4 changed files with 25 additions and 21 deletions.
7 changes: 6 additions & 1 deletion docs/src/math/appendix/inverses.md
@@ -13,4 +13,9 @@ f(g(\theta))
\end{aligned}
```

The converse is similar, so $f = g^{-1}$.

## Remark

Since $f$ is the inverse link function $g^{-1}$ and $\mu = g(\theta)$, we also have $f(\mu) = \theta$.

32 changes: 14 additions & 18 deletions docs/src/math/bregman.md
@@ -1,27 +1,25 @@
# Bregman Divergences

Bregman divergences [Bregman](@cite) play a central role in the probabilistic framework of EPCA. They generalize the concept of distance between two points but do not necessarily satisfy the properties of a traditional metric (such as symmetry or the triangle inequality). Instead, Bregman divergences provide a flexible way to measure differences between data points, making them useful in applications like clustering, optimization, and information theory.

Understanding Bregman divergences is essential for EPCA because they link the exponential family of probability distributions to loss functions used in optimization, allowing us to generalize PCA to non-Gaussian data.
The EPCA objective is formulated as a Bregman divergence [Bregman](@cite). Bregman divergences are a measure of difference between two points (often probability distributions); however, they are not proper metrics, because they do not always satisfy symmetry and the triangle inequality.

## Definition

Formally, the Bregman divergence [Bregman](@cite) $B_F$ associated with a function $F(\theta)$ is defined as
Formally, the Bregman divergence $B_F$ associated with a function $F(\theta)$ is defined as

```math
B_F(p \| q) = F(p) - F(q) - \langle f(q), p - q \rangle
```

where
* $F(\mu)$ is a strictly convex and continuously differentiable function,
* $f(\mu) = \nabla_\mu F(\mu)$ is the convex conjugate (defined later) of $F$,
* $f(\mu) = \nabla_\mu F(\mu)$ is the gradient of $F$,
* and $\langle \cdot, \cdot \rangle$ denotes an inner product.

Intuitively, the Bregman divergence expresses the difference at $p$ between $F$ and its first-order Taylor expansion about $q$.
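
The definition can be checked numerically with a minimal sketch. The helper `bregman` and the two generator functions below are illustrative assumptions, not part of `ExpFamilyPCA.jl`; the gradient is evaluated at $q$, matching the Taylor-expansion reading above.

```julia
# Hypothetical helper (not the package API): Bregman divergence induced by a
# strictly convex function F with gradient f, for scalar or vector inputs.
bregman(F, f, p, q) = F(p) - F(q) - sum(f(q) .* (p .- q))

# Assumed example 1: F(x) = ‖x‖² / 2 has gradient f(x) = x, and the induced
# divergence is half the squared Euclidean distance.
F_sq(x) = sum(abs2, x) / 2
f_sq(x) = x

# Assumed example 2: F(x) = x log x - x has gradient f(x) = log x, and the
# induced divergence is the generalized KL divergence p log(p/q) - p + q.
F_pois(x) = x * log(x) - x
f_pois(x) = log(x)

p, q = [1.0, 2.0], [0.0, 4.0]
d_sq = bregman(F_sq, f_sq, p, q)            # equals sum(abs2, p - q) / 2
d_pois = bregman(F_pois, f_pois, 2.0, 3.0)  # equals 2 log(2/3) - 2 + 3
```

Passing both `F` and `f` explicitly keeps the sketch free of any automatic or symbolic differentiation.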

### Properties
### Aside: Properties

Unlike traditional metrics, Bregman divergences are not generally symmetric (i.e., $B_F(p \| q) \neq B_F(q \| p)$) and do not usually satisfy the triangle inequality. However, they are always non-negative ($B_F(p \| q) \geq 0$) and equal $0$ if and only if $p = q$.
Bregman divergences vanish ($B_F(p \| q) = 0$) if and only if their inputs coincide ($p = q$). They are also always non-negative: $B_F(p \| q) \geq 0$ for all $p, q \in \mathrm{domain}(F)$.
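
As a quick sanity check of non-negativity and asymmetry, the sketch below evaluates a scalar divergence on a small grid. The generator $F(x) = x \log x - x$ and the names `B`, `pts`, `all_nonneg`, and `asymmetric` are assumptions chosen for illustration.

```julia
# Assumed generator: F(x) = x log x - x with gradient f(x) = log x (defined for x > 0).
F(x) = x * log(x) - x
f(x) = log(x)

# Scalar Bregman divergence, with the gradient evaluated at the second argument.
B(p, q) = F(p) - F(q) - f(q) * (p - q)

pts = [0.5, 1.0, 2.0, 5.0]

# Non-negativity on every pair of grid points.
all_nonneg = all(B(p, q) >= 0 for p in pts, q in pts)

# Swapping the arguments changes the value, so B is not symmetric.
asymmetric = !isapprox(B(1.0, 2.0), B(2.0, 1.0))
```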

!!! info
While the full EPCA objective is always non-negative, the `EPCA` loss may be negative because `ExpFamilyPCA.jl` uses transformed objectives that are equivalent but not equal to minimizing a sum of Bregman divergences.
@@ -47,45 +45,43 @@ The log-partition function $G(\theta)$ ensures that the probability distribution

### Key Parameters

1. **Natural Parameter** ($\theta$): This parameter controls the distribution’s shape in its canonical form. For example, the natural parameter for the Poisson distribution is $\log \lambda$.
2. **Expectation Parameter** ($\mu$): This is the expected value of the sufficient statistic,[^1] computed as the mean of the data under the distribution. In exponential family distributions, it is related to the natural parameter through the gradient of the log-partition function:
1. The **natural parameter** $\theta$ controls the distribution’s shape in its canonical form. For example, the natural parameter for the Poisson distribution is $\log \lambda$.
2. The **expectation parameter** $\mu$ is the expected value of the sufficient statistic,[^1] computed as the mean of the data under the distribution. In exponential family distributions, it is related to the natural parameter through the **link function** $g$:

```math
\mu = \mathbb{E}_{\theta}[X] = \nabla_\theta G(\theta) = g(\theta)
```
where $\mathbb{E}_\theta$ is the expectation with respect to the distribution $p_\theta$. A derivation is provided in the [appendix](./appendix/expectation.md). Similarly, we also have $\theta = f(\mu)$.
where $\mathbb{E}_\theta$ is the expectation with respect to the distribution $p_\theta$ (see [appendix](./appendix/expectation.md)). Similarly, we also have $\theta = f(\mu)$ (see [appendix](./appendix/inverses.md)).

[^1]: The sufficient statistic for the natural exponential family is simply the identity.
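
The relationships $\mu = \nabla_\theta G(\theta) = g(\theta)$ and $\theta = f(\mu)$ can be illustrated with a small sketch. It assumes the Bernoulli distribution, whose standard closed forms are $G(\theta) = \log(1 + e^\theta)$, $g(\theta) = 1 / (1 + e^{-\theta})$, and $f(\mu) = \log(\mu / (1 - \mu))$; the finite-difference check and all variable names are illustrative only.

```julia
# Assumed example: Bernoulli log-partition G, link g = G', and inverse link f.
G(θ) = log1p(exp(θ))
g(θ) = 1 / (1 + exp(-θ))
f(μ) = log(μ / (1 - μ))

θ = 0.7
h = 1e-6

# Central finite difference of G approximates its derivative at θ.
g_numeric = (G(θ + h) - G(θ - h)) / (2h)

link_matches_gradient = isapprox(g(θ), g_numeric; atol = 1e-8)  # μ = ∇G(θ) = g(θ)
round_trip = isapprox(f(g(θ)), θ)                               # θ = f(μ)
```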

## The Legendre Transform
## Convex Conjugation

To understand the relationship between the natural parameters $\theta$ and the expectation parameters $\mu$, we introduce the concept of convex conjugates and the Legendre transform. For a convex function $F$, its convex conjugate (or dual) $F^*$ is defined as:[^2]
The fact that $f$ and $g$ are inverses follows from the stronger claim that $F$ and $G$ are convex conjugates. For a convex function $F$, its convex conjugate (or dual)[^2] $F^*$ is

```math
F^*(\theta) = \sup_{\mu} [\langle \theta, \mu \rangle - F(\mu)].
```

The convex conjugate is an involution ($F^{**} = F$) meaning the Legendre transform allows us to convert back and forth between the natural and expectation parameter spaces. In the [next section](./objectives.md), we see how `ExpFamilyPCA.jl` exploits the rich mathematical structure of the Legendre transform to discover multiple specifications of the Bregman divergence.
Convex conjugation is also an involution, meaning it inverts itself, so $F^{**} = F$. Conjugation provides a rich structure for converting between natural and expectation parameters and, as we explain in the [next section](./objectives.md), helps induce multiple useful specifications of the EPCA objective.

[^2]: Duality also refers to the concept in convex analysis [convex](@cite).
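
To make the conjugate concrete, the sketch below approximates the supremum by brute force on a grid. It assumes the Poisson pair $F(\mu) = \mu \log \mu - \mu$ and $G(\theta) = e^\theta$; the helper `conjugate` and its `grid` keyword are hypothetical, and a grid search is only a rough stand-in for the exact supremum.

```julia
# Assumed Poisson pair: F on expectation parameters, G on natural parameters.
F(μ) = μ * log(μ) - μ
G(θ) = exp(θ)

# Hypothetical helper: approximate F*(θ) = sup_μ [θμ - F(μ)] by maximizing
# over a fine grid of μ > 0. The maximizer must lie inside the grid.
conjugate(F, θ; grid = range(1e-3, 20.0; length = 200_001)) =
    maximum(θ * μ - F(μ) for μ in grid)

θ = 1.2
F_star = conjugate(F, θ)   # should be close to G(θ) = exp(1.2)
```

The supremum is attained at $\mu = e^\theta$, which is exactly the link function $g(\theta)$ for this pair.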

## Bregman Loss Functions

An important relationship connects exponential family distributions and Bregman divergences: minimizing the negative log-likelihood of an exponential family distribution is equivalent (up to a constant) to minimizing a Bregman divergence between the observed data and the distribution's expectation parameter [azoury, forster](@cite). This connection is fundamental in extending PCA to EPCA.

Consider the negative log-likelihood of an exponential family distribution:
Bregman divergences are crucial to EPCA because minimizing one is equivalent (up to a constant) to maximum likelihood estimation for the exponential family [azoury, forster](@cite). To see this, consider the negative log-likelihood of such a distribution:

```math
-\ell(x; \theta) = G(\theta) - \langle x, \theta \rangle.
```

Our goal is to show that this expression is equivalent (up to a constant) to the Bregman divergence $B_F(x \| \mu)$. First, recall that $G$ is the convex conjugate (dual) of $F$, so:
We want to show that this is equivalent (up to a constant) to the Bregman divergence $B_F(x \| \mu)$. From the previous section, we know $G$ is the convex conjugate of $F$, so with $\mu = g(\theta)$:

```math
G(\theta) = \langle \theta, \mu \rangle - F(\mu).
```

Next, substitute $G$ back into the negative log-likelihood:
Substituting $G$ back into the negative log-likelihood then yields:

```math
\begin{aligned}
Expand Down
4 changes: 2 additions & 2 deletions docs/src/math/objectives.md
@@ -15,7 +15,7 @@ In this section, we demonstrate how the EPCA objective function and the decompre
Recall from the [introduction](./intro.md) that the regularized EPCA objective aims to minimize the following expression:

```math
B_{F}(X \| g(\Theta)) + \epsilon B_{F}(\mu \| g(\Theta)).
B_{F}(X \| g(\Theta)) + \epsilon B_{F}(\mu_0 \| g(\Theta)).
```

where:
@@ -26,7 +26,7 @@ where:
* $g$ is the link function,
* $X$ is the data matrix,
* $\Theta$ is the parameter matrix, and
* $\mu \in \mathrm{range}(g)$ and $\epsilon > 0$ are regularization parameters.
* $\mu_0 \in \mathrm{range}(g)$ and $\epsilon > 0$ are regularization parameters.

Our goal is to show that both $B_F$ and $g$ can be induced from various base components, namely $F$, $G$, $f$ and $g$. This allows for multiple pathways to define and compute the EPCA objective in Julia.
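
As a rough illustration of the regularized objective, the sketch below evaluates it for a Poisson model. It assumes the divergences are summed entrywise over the matrices; `epca_objective`, its keyword defaults, and the helper `B` are hypothetical and do not reflect how `ExpFamilyPCA.jl` computes its loss.

```julia
# Assumed Poisson components: F, its gradient f, and the link g.
F(μ) = μ * log(μ) - μ
f(μ) = log(μ)
g(θ) = exp(θ)

# Scalar Bregman divergence B_F(p ‖ q).
B(p, q) = F(p) - F(q) - f(q) * (p - q)

# Hypothetical helper: entrywise sum of B_F(X ‖ g(Θ)) plus the ϵ-weighted
# regularizer B_F(μ0 ‖ g(Θ)). Entries of X must be positive for this F.
function epca_objective(X, Θ; μ0 = 1.0, ϵ = 1e-2)
    M = g.(Θ)   # expectation parameters implied by Θ
    return sum(B.(X, M)) + ϵ * sum(B.(μ0, M))
end

X = [1.0 2.0; 3.0 4.0]
Θ = log.(X)                  # perfect reconstruction: g(Θ) = X
loss = epca_objective(X, Θ)  # only the regularization term remains
```

With $\Theta = \log X$ the data term vanishes, so the value reduces to $\epsilon \sum_{ij} B_F(\mu_0 \| X_{ij})$.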

Expand Down
3 changes: 3 additions & 0 deletions src/options.jl
@@ -19,6 +19,9 @@ Defines a struct `Options` for configuring various parameters used in optimizati
- `high::Real`: Upper bound for binary search. Default is `1e10`.
- `tol::Real`: Tolerance for stopping binary search. Default is `1e-10`.
- `maxiter::Real`: Maximum iterations for binary search. Default is `1e6`.
!!! info
The `metaprogramming` flag controls whether metaprogramming is used during symbolic differentiation conversion. While conversion between Symbolics.jl atoms and base Julia can occur without it, this approach is slower and requires more calls. Nonetheless, the flag is provided for users who keenly want to avoid metaprogramming in their pipeline.
"""
@kwdef struct Options{T<:Real}
# symbolic calculus
Expand Down
