Replies: 4 comments 3 replies
-
Hi @mathDR! It's a cool idea. Could you sketch out the proposed implementation? Depending on the interface complexity and computational complexity (we probably want to mostly target first-order optimizers), it could be a cool addition to optax.
-
Okay, yeah, it might be good to discuss this here. There are currently two approaches I am considering: a generic solution and a very bespoke, targeted solution.

To formulate the problem (taking liberally from Salimbeni et al., Section 2.1): we have a model $p(y, \theta) = p(y \mid \theta)\, p(\theta)$, where $y$ are the observations and $\theta$ the latent variables. We seek to maximize the evidence lower bound (ELBO) of an approximate posterior $q(\theta; \xi)$ with variational parameters $\xi$. The exponential family is defined as

$$q(\theta; \eta) = \exp\!\left(\eta^\top t(\theta) - a(\eta)\right),$$

where $\eta$ are the natural parameters, $t(\theta)$ the sufficient statistics, and $a(\eta)$ the log-partition function. We will use a smooth, invertible transformation $\xi = \psi(\eta)$ between the natural parameters and the parameterization we actually optimize. The ELBO is thusly defined as

$$\mathcal{L}(\xi) = \mathbb{E}_{q(\theta;\xi)}\!\left[\log p(y \mid \theta)\right] - \mathrm{KL}\!\left[q(\theta;\xi)\,\|\,p(\theta)\right],$$

so we want to minimize $-\mathcal{L}(\xi)$. We will do so by finding a sequence of parameters $\xi_1, \xi_2, \ldots$ where

$$\xi_{t+1} = \xi_t - \gamma_t\, d_t$$

for a step size $\gamma_t$ and a descent direction $d_t$. We can choose $d_t$ in different ways. Ordinary gradient descent takes $d_t = \nabla_\xi(-\mathcal{L}(\xi_t))$, which is equivalent to solving

$$\xi_{t+1} = \arg\min_{\xi} \left[(\xi - \xi_t)^\top \nabla_\xi(-\mathcal{L}(\xi_t)) + \tfrac{1}{2\gamma_t}\,\lVert\xi - \xi_t\rVert^2\right],$$

i.e. distances between parameter settings are measured with the Euclidean norm. For Natural Gradient Descent, since the parameters come from distributions, the Euclidean norm is replaced by the KL divergence between $q(\theta;\xi)$ and $q(\theta;\xi_t)$; its second-order expansion yields the Fisher information matrix

$$F_\xi = \mathbb{E}_{q(\theta;\xi)}\!\left[\nabla_\xi \log q(\theta;\xi)\,\nabla_\xi \log q(\theta;\xi)^\top\right],$$

and the step becomes

$$\xi_{t+1} = \xi_t - \gamma_t\, F_{\xi_t}^{-1}\,\nabla_\xi(-\mathcal{L}(\xi_t)).$$
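To make the bespoke, targeted solution a little more concrete, here is a minimal sketch (the name `natural_gradient_step` and the diagonal-Gaussian assumption are just for illustration, not a proposed optax API) of one natural gradient step for a factorized Gaussian, where the Fisher information matrix is diagonal and its inverse is available in closed form:

```python
import jax.numpy as jnp

# Sketch only: one natural gradient step for a diagonal Gaussian
# q(theta; xi) = N(mean, exp(log_std)**2) with xi = (mean, log_std).
# For this parameterization the Fisher information matrix is diagonal:
#   F_mean    = 1 / sigma^2
#   F_log_std = 2
# so the inverse-Fisher-vector product is an elementwise rescaling.

def natural_gradient_step(params, grads, learning_rate=1e-2):
    sigma2 = jnp.exp(2.0 * params["log_std"])
    # Precondition the Euclidean gradient of -ELBO by the inverse Fisher.
    nat_grad_mean = sigma2 * grads["mean"]
    nat_grad_log_std = 0.5 * grads["log_std"]
    return {
        "mean": params["mean"] - learning_rate * nat_grad_mean,
        "log_std": params["log_std"] - learning_rate * nat_grad_log_std,
    }
```

The generic solution would instead have to compute $F_\xi$ (or a Fisher-vector product) from $\nabla_\xi \log q$ and solve the linear system, which is where the interface and computational-complexity questions come in.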
-
@rdyro They point to some good blog posts by Agustinus Kristiadi. So we could build an `optax` gradient transformation along those lines.
-
I really appreciate the explanations! I think this is a good candidate for optax, especially if the interface can be made concise. I'd say the initial focus could probably be on distributions with a diagonal Fisher information matrix, since we'd maintain linear scaling with the number of parameters. Would this be too restrictive? Later, adding some form of matrix-inverse-vector product as a callback would probably work well. I like your suggestion. I'd be happy to include NGD in optax in some form!
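As a rough sketch of the kind of interface I mean (every name here, e.g. `precondition_by_inverse_fisher`, is hypothetical rather than an agreed design), the callback variant could be an ordinary optax gradient transformation:

```python
import optax

def precondition_by_inverse_fisher(inverse_fisher_vector_product):
    """Hypothetical transformation: replace updates with F(params)^-1 @ updates.

    `inverse_fisher_vector_product(params, updates)` is a user-supplied callback,
    e.g. a closed-form elementwise rescaling for a diagonal Fisher, or a CG
    solve against Fisher-vector products in the general case.
    """

    def init_fn(params):
        del params
        return optax.EmptyState()

    def update_fn(updates, state, params=None):
        if params is None:
            raise ValueError("precondition_by_inverse_fisher requires params.")
        return inverse_fisher_vector_product(params, updates), state

    return optax.GradientTransformation(init_fn, update_fn)


# Usage sketch: chain with a step size like any other optax transform.
# optimizer = optax.chain(
#     precondition_by_inverse_fisher(my_diag_fisher_inverse),  # hypothetical callback
#     optax.sgd(learning_rate=1e-1),
# )
```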
-
I have a question regarding optax scope: when optimizing sparse variational Gaussian processes, typically the dual parameterization of Adam et al. is used.

GPFlow has a class that does this (it extends a `tf_keras` optimizer). It would be great for my workflow to be able to call `optax` to fit these models (sparse variational GPs).

My question is: would an optimizer of this type be in scope for the `optax` project? I am happy to implement a PR if so. Please advise!
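For concreteness, the shape of the workflow I have in mind would be something like the sketch below (all names are hypothetical; I'm only illustrating how the variational/dual parameters could be routed to an NGD-style transform while the hyperparameters stay on Adam, which is the usual GPFlow pattern):

```python
import optax

def natgrad_placeholder(learning_rate):
    # Stand-in for whatever natural-gradient transform optax might grow; plain
    # SGD for now, where the dual-parameter preconditioning would happen.
    return optax.sgd(learning_rate)

# Assumes a params pytree like
# {"variational": {...}, "kernel": {...}, "likelihood": {...}}.
param_labels = {"variational": "natgrad", "kernel": "adam", "likelihood": "adam"}

optimizer = optax.multi_transform(
    {"natgrad": natgrad_placeholder(0.1), "adam": optax.adam(1e-2)},
    param_labels,
)
```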