Evidence lower bound

In variational Bayesian methods, the evidence lower bound (often abbreviated ELBO, also sometimes called the variational lower bound[1] or negative variational free energy) is a useful lower bound on the log-likelihood of some observed data.

Definition

Let $X$ and $Z$ be random variables, jointly distributed with distribution $p_\theta$. For example, $p_\theta(X)$ is the marginal distribution of $X$, and $p_\theta(Z \mid X)$ is the conditional distribution of $Z$ given $X$. Then, for a sample $x \sim p_\theta$ and any distribution $q_\phi$, the ELBO is defined as
$$L(\phi, \theta; x) := \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[ \ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)} \right].$$
The ELBO can equivalently be written as[2]

$$\begin{aligned}
L(\phi, \theta; x) &= \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[ \ln p_\theta(x, z) \right] + H[q_\phi(z \mid x)] \\
&= \ln p_\theta(x) - D_{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big).
\end{aligned}$$

In the first line, $H[q_\phi(z \mid x)]$ is the entropy of $q_\phi$, which relates the ELBO to the Helmholtz free energy.[3] In the second line, $\ln p_\theta(x)$ is called the evidence for $x$, and $D_{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big)$ is the Kullback–Leibler divergence between $q_\phi$ and $p_\theta$. Since the Kullback–Leibler divergence is non-negative, $L(\phi, \theta; x)$ is a lower bound on the evidence (the ELBO inequality):
$$\ln p_\theta(x) \ge \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[ \ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)} \right].$$
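
As a concrete numerical check of these identities, here is a minimal sketch (not from the article) for a toy conjugate Gaussian model with prior $p(z) = \mathcal{N}(0,1)$, likelihood $p_\theta(x \mid z) = \mathcal{N}(z,1)$, and variational family $q_\phi(z \mid x) = \mathcal{N}(m, s^2)$. For this model the evidence $p_\theta(x) = \mathcal{N}(x; 0, 2)$ and the posterior $p_\theta(z \mid x) = \mathcal{N}(x/2, 1/2)$ are available in closed form, so a Monte Carlo estimate of the ELBO can be compared against $\ln p_\theta(x) - D_{KL}(q_\phi \,\|\, p_\theta(\cdot \mid x))$. All function names are illustrative.

```python
import numpy as np

def log_normal_pdf(v, mean, var):
    """Log density of a univariate normal N(mean, var)."""
    return -0.5 * (np.log(2 * np.pi * var) + (v - mean) ** 2 / var)

def elbo_monte_carlo(x, m, s, n_samples=100_000, seed=None):
    """Monte Carlo estimate of the ELBO for the toy model
    p(z) = N(0, 1), p(x|z) = N(z, 1), q(z|x) = N(m, s^2)."""
    rng = np.random.default_rng(seed)
    z = rng.normal(m, s, size=n_samples)                     # z ~ q(.|x)
    log_joint = log_normal_pdf(z, 0.0, 1.0) + log_normal_pdf(x, z, 1.0)
    log_q = log_normal_pdf(z, m, s ** 2)
    return np.mean(log_joint - log_q)

def elbo_closed_form(x, m, s):
    """ln p(x) - KL(q || p(.|x)) for the same toy model, where
    p(x) = N(0, 2) and p(z|x) = N(x/2, 1/2)."""
    log_evidence = log_normal_pdf(x, 0.0, 2.0)
    post_mean, post_var = x / 2.0, 0.5
    kl = (np.log(np.sqrt(post_var) / s)
          + (s ** 2 + (m - post_mean) ** 2) / (2 * post_var) - 0.5)
    return log_evidence - kl

x, m, s = 1.3, 0.4, 0.9
print(elbo_monte_carlo(x, m, s, seed=0))   # agrees up to Monte Carlo error
print(elbo_closed_form(x, m, s))           # exact ELBO, <= ln p(x) = ln N(1.3; 0, 2)
```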

Motivation

Variational Bayesian inference

Suppose we have an observable random variable $X$, and we want to find its true distribution $p^*$. This would allow us to generate data by sampling and estimate probabilities of future events. In general, it is impossible to find $p^*$ exactly, forcing us to search for a good approximation.

That is, we define a sufficiently large parametric family $\{p_\theta\}_{\theta \in \Theta}$ of distributions, then solve $\min_\theta L(p_\theta, p^*)$ for some loss function $L$. One possible way to solve this is by considering a small variation from $p_\theta$ to $p_{\theta + \delta\theta}$ and solving $L(p_\theta, p^*) - L(p_{\theta + \delta\theta}, p^*) = 0$. This is a problem in the calculus of variations, which is why it is called the variational method.

Since there are not many explicitly parametrized distribution families (all the classical families, such as the normal distribution, the Gumbel distribution, etc., are far too simplistic to model the true distribution), we consider implicitly parametrized probability distributions:

  • First, define a simple distribution $p(z)$ over a latent random variable $Z$. Usually a normal distribution or a uniform distribution suffices.
  • Next, define a family of complicated functions $f_\theta$ (such as a deep neural network) parametrized by $\theta$.
  • Finally, define a way to convert any $f_\theta(z)$ into a simple distribution over the observable random variable $X$. For example, let $f_\theta(z) = (f_1(z), f_2(z))$ have two outputs; then we can define the corresponding distribution over $X$ to be the normal distribution $\mathcal{N}(f_1(z), e^{f_2(z)})$.

This defines a family of joint distributions $p_\theta$ over $(X, Z)$. It is very easy to sample $(x, z) \sim p_\theta$: simply sample $z \sim p$, then compute $f_\theta(z)$, and finally sample $x \sim p_\theta(\cdot \mid z)$ using $f_\theta(z)$.
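
To make the sampling procedure concrete, here is a minimal sketch under stated assumptions (not from the article): the "complicated function" $f_\theta$ is stood in for by a tiny fixed random two-layer network, and $f_\theta(z)$ is converted into a distribution as in the example above, $x \mid z \sim \mathcal{N}(f_1(z), e^{f_2(z)})$ with $e^{f_2(z)}$ read as a variance. The names f_theta and sample_joint are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the "complicated function" f_theta: a small two-layer network
# mapping a latent z to two outputs (f1(z), f2(z)).  In practice this would be
# a trained deep neural network; here the weights are simply fixed at random.
W1, b1 = rng.normal(size=(16, 1)), rng.normal(size=16)
W2, b2 = rng.normal(size=(2, 16)), rng.normal(size=2)

def f_theta(z):
    h = np.tanh(W1 @ np.atleast_1d(z) + b1)
    return W2 @ h + b2                       # returns (f1(z), f2(z))

def sample_joint(n):
    """Sample (x, z) ~ p_theta: z from the prior, then x | z ~ N(f1(z), e^{f2(z)})."""
    samples = []
    for _ in range(n):
        z = rng.normal()                               # z ~ p(z) = N(0, 1)
        f1, f2 = f_theta(z)
        x = rng.normal(f1, np.sqrt(np.exp(f2)))        # x | z ~ N(f1(z), e^{f2(z)})
        samples.append((x, z))
    return samples

print(sample_joint(3))
```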

In other words, we have a generative model for both the observable and the latent. Now we consider a distribution $p_\theta$ good if it is a close approximation of $p^*$:
$$p_\theta(X) \approx p^*(X).$$
Since the distribution on the right side is over $X$ only, the distribution on the left side must marginalize the latent variable $Z$ away.
In general, it is impossible to perform the integral $p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz$, forcing us to perform another approximation.

Since $p_\theta(x) = \frac{p_\theta(x \mid z)\, p(z)}{p_\theta(z \mid x)}$, it suffices to find a good approximation of $p_\theta(z \mid x)$. So define another distribution family $q_\phi(z \mid x)$ and use it to approximate $p_\theta(z \mid x)$. This is a discriminative model for the latent.

The entire situation is summarized in the following table:

| $X$: observable | $X, Z$ | $Z$: latent |
| $p^*(x) \approx p_\theta(x) \approx \frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)}$: approximable | $p_\theta(x \mid z)\, p(z)$: easy | $p(z)$: easy |
| $p_\theta(x \mid z)$: easy |  | $p_\theta(z \mid x) \approx q_\phi(z \mid x)$: approximable |

In Bayesian language, $X$ is the observed evidence, and $Z$ is the latent/unobserved. The distribution $p$ over $Z$ is the prior distribution over $Z$, $p_\theta(x \mid z)$ is the likelihood function, and $p_\theta(z \mid x)$ is the posterior distribution over $Z$.

Given an observation $x$, we can infer what $z$ likely gave rise to $x$ by computing $p_\theta(z \mid x)$. The usual Bayesian method is to estimate the integral $p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz$, then compute $p_\theta(z \mid x) = \frac{p_\theta(x \mid z)\, p(z)}{p_\theta(x)}$ by Bayes' rule. This is expensive to perform in general, but if we can simply find a good approximation $q_\phi(z \mid x) \approx p_\theta(z \mid x)$ for most $x, z$, then we can infer $z$ from $x$ cheaply. Thus, the search for a good $q_\phi$ is also called amortized inference.
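
For intuition, here is a minimal sketch of exact inference by Bayes' rule when the latent is discrete, so that the normalizing integral is just a sum; the numbers are made up for illustration. In the continuous setting of this article the sum becomes an intractable integral, which is what motivates approximating $p_\theta(z \mid x)$ with an amortized $q_\phi(z \mid x)$.

```python
import numpy as np

# Toy discrete latent z in {0, 1, 2}; all numbers are made up.
prior = np.array([0.5, 0.3, 0.2])                     # p(z)

def likelihood(x, z):                                 # p(x | z) = N(x; z, 1)
    return np.exp(-0.5 * (x - z) ** 2) / np.sqrt(2 * np.pi)

x = 1.4
joint = np.array([likelihood(x, z) for z in range(3)]) * prior   # p(x | z) p(z)
evidence = joint.sum()                                # p(x) = sum_z p(x | z) p(z)
posterior = joint / evidence                          # p(z | x) by Bayes' rule
print(evidence, posterior)
```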

All in all, we have arrived at a problem of variational Bayesian inference.

Deriving the ELBO

A basic result in variational inference is that minimizing the Kullback–Leibler divergence (KL-divergence) is equivalent to maximizing the log-likelihood:
$$\mathbb{E}_{x \sim p^*(x)}[\ln p_\theta(x)] = -H(p^*) - D_{KL}\big(p^*(x) \,\|\, p_\theta(x)\big),$$
where $H(p^*) = -\mathbb{E}_{x \sim p^*}[\ln p^*(x)]$ is the entropy of the true distribution. So if we can maximize $\mathbb{E}_{x \sim p^*(x)}[\ln p_\theta(x)]$, we can minimize $D_{KL}\big(p^*(x) \,\|\, p_\theta(x)\big)$, and consequently find an accurate approximation $p_\theta \approx p^*$.
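
The identity is a direct expansion of the definition of the KL-divergence:
$$D_{KL}\big(p^*(x) \,\|\, p_\theta(x)\big) = \mathbb{E}_{x \sim p^*}\left[\ln \frac{p^*(x)}{p_\theta(x)}\right] = -H(p^*) - \mathbb{E}_{x \sim p^*(x)}[\ln p_\theta(x)],$$
and rearranging gives the stated formula.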

To maximize $\mathbb{E}_{x \sim p^*(x)}[\ln p_\theta(x)]$, we simply sample many $x_i \sim p^*(x)$, i.e., use importance sampling:
$$N \max_\theta \mathbb{E}_{x \sim p^*(x)}[\ln p_\theta(x)] \approx \max_\theta \sum_i \ln p_\theta(x_i).$$[note 1]

In order to maximize $\sum_i \ln p_\theta(x_i)$, it is necessary to compute $\ln p_\theta(x)$:
$$\ln p_\theta(x) = \ln \int p_\theta(x \mid z)\, p(z)\, dz.$$
This usually has no closed form and must be estimated. The usual way to estimate integrals is Monte Carlo integration with importance sampling:
$$\int p_\theta(x \mid z)\, p(z)\, dz = \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right],$$
where $q_\phi(z \mid x)$ is a sampling distribution over $z$ that we use to perform the Monte Carlo integration.

So we see that if we sample $z \sim q_\phi(\cdot \mid x)$, then $\frac{p_\theta(x, z)}{q_\phi(z \mid x)}$ is an unbiased estimator of $p_\theta(x)$. Unfortunately, this does not give us an unbiased estimator of $\ln p_\theta(x)$, because $\ln$ is nonlinear. Indeed, by Jensen's inequality,
$$\ln p_\theta(x) = \ln \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] \ge \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right].$$
In fact, all the obvious estimators of $\ln p_\theta(x)$ are biased downwards, because no matter how many samples $z_i \sim q_\phi(\cdot \mid x)$ we take, we have by Jensen's inequality:
$$\mathbb{E}_{z_i \sim q_\phi(\cdot \mid x)}\left[\ln\left(\frac{1}{N}\sum_i \frac{p_\theta(x, z_i)}{q_\phi(z_i \mid x)}\right)\right] \le \ln \mathbb{E}_{z_i \sim q_\phi(\cdot \mid x)}\left[\frac{1}{N}\sum_i \frac{p_\theta(x, z_i)}{q_\phi(z_i \mid x)}\right] = \ln p_\theta(x).$$
Subtracting the right side, we see that the problem comes down to a biased estimator of zero:
$$\mathbb{E}_{z_i \sim q_\phi(\cdot \mid x)}\left[\ln\left(\frac{1}{N}\sum_i \frac{p_\theta(z_i \mid x)}{q_\phi(z_i \mid x)}\right)\right] \le 0.$$
By the delta method, we have
$$\mathbb{E}_{z_i \sim q_\phi(\cdot \mid x)}\left[\ln\left(\frac{1}{N}\sum_i \frac{p_\theta(z_i \mid x)}{q_\phi(z_i \mid x)}\right)\right] \approx -\frac{1}{2N}\,\mathbb{V}_{z \sim q_\phi(\cdot \mid x)}\left[\frac{p_\theta(z \mid x)}{q_\phi(z \mid x)}\right] = O(N^{-1}).$$
If we continue with this, we would obtain the importance-weighted autoencoder.[4] But we return to the simplest case with $N = 1$:
$$\ln p_\theta(x) = \ln \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] \ge \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right].$$
The tightness of the inequality has a closed form:
$$\ln p_\theta(x) - \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] = D_{KL}\big(q_\phi(\cdot \mid x) \,\|\, p_\theta(\cdot \mid x)\big) \ge 0.$$
We have thus obtained the ELBO function:
$$L(\phi, \theta; x) := \ln p_\theta(x) - D_{KL}\big(q_\phi(\cdot \mid x) \,\|\, p_\theta(\cdot \mid x)\big).$$
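
The downward bias and its decay with $N$ can be seen numerically. The following is a minimal sketch on an assumed toy model, $p(z) = \mathcal{N}(0,1)$, $p_\theta(x \mid z) = \mathcal{N}(z,1)$, with a deliberately imperfect proposal $q_\phi(z \mid x) = \mathcal{N}(0.3x, 1)$: averaging $\ln\big(\tfrac{1}{N}\sum_i p_\theta(x, z_i)/q_\phi(z_i \mid x)\big)$ over many repeats gives a bound that rises toward $\ln p_\theta(x)$ as $N$ grows, which is the importance-weighted bound of [4]. Function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy model: p(z) = N(0, 1), p(x|z) = N(z, 1), so ln p(x) = ln N(x; 0, 2),
# and an intentionally mismatched proposal q(z|x) = N(0.3 * x, 1).
def log_n(v, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (v - mean) ** 2 / var)

def iwae_bound(x, n_importance, n_outer=20_000):
    """Average of ln( (1/N) sum_i p(x, z_i) / q(z_i|x) ) over many repeats."""
    z = rng.normal(0.3 * x, 1.0, size=(n_outer, n_importance))       # z_i ~ q(.|x)
    log_w = log_n(z, 0.0, 1.0) + log_n(x, z, 1.0) - log_n(z, 0.3 * x, 1.0)
    # log-mean-exp over the N importance samples of each repeat
    m = log_w.max(axis=1, keepdims=True)
    log_avg = m.squeeze(1) + np.log(np.exp(log_w - m).mean(axis=1))
    return log_avg.mean()

x = 1.0
print("ln p(x)      =", log_n(x, 0.0, 2.0))
for n in (1, 4, 16, 64):
    print(f"bound (N={n:2d}) =", iwae_bound(x, n))   # increases toward ln p(x)
```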

Maximizing the ELBO

For fixed $x$, the optimization $\max_{\theta, \phi} L(\phi, \theta; x)$ simultaneously attempts to maximize $\ln p_\theta(x)$ and minimize $D_{KL}\big(q_\phi(\cdot \mid x) \,\|\, p_\theta(\cdot \mid x)\big)$. If the parametrizations of $p_\theta$ and $q_\phi$ are flexible enough, we would obtain some $\hat\phi, \hat\theta$ such that we simultaneously have

$$\ln p_{\hat\theta}(x) \approx \max_\theta \ln p_\theta(x); \qquad q_{\hat\phi}(\cdot \mid x) \approx p_{\hat\theta}(\cdot \mid x).$$
Since
$$\mathbb{E}_{x \sim p^*(x)}[\ln p_\theta(x)] = -H(p^*) - D_{KL}\big(p^*(x) \,\|\, p_\theta(x)\big),$$
we have
$$\mathbb{E}_{x \sim p^*(x)}[\ln p_{\hat\theta}(x)] \approx \max_\theta \big({-H(p^*)} - D_{KL}(p^*(x) \,\|\, p_\theta(x))\big),$$
and so
$$\hat\theta \approx \arg\min_\theta D_{KL}\big(p^*(x) \,\|\, p_\theta(x)\big).$$
In other words, maximizing the ELBO would simultaneously allow us to obtain an accurate generative model $p_{\hat\theta} \approx p^*$ and an accurate discriminative model $q_{\hat\phi}(\cdot \mid x) \approx p_{\hat\theta}(\cdot \mid x)$.[5]
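
As a small worked illustration of this claim, here is a sketch under the same assumed toy Gaussian model as above, where every term of the ELBO has a closed form: a grid search over the variational parameters $(m, s)$ of $q_\phi(z \mid x) = \mathcal{N}(m, s^2)$ approximately recovers the true posterior $\mathcal{N}(x/2, 1/2)$ and drives the ELBO up to $\ln p_\theta(x)$. The names elbo and the grid ranges are illustrative choices.

```python
import numpy as np

# Assumed toy model: p(z) = N(0, 1), p(x|z) = N(z, 1), q(z|x) = N(m, s^2).
# For Gaussians, ELBO = E_q[ln p(z)] + E_q[ln p(x|z)] + H(q) in closed form.
def elbo(x, m, s):
    e_log_prior = -0.5 * np.log(2 * np.pi) - 0.5 * (m ** 2 + s ** 2)
    e_log_lik   = -0.5 * np.log(2 * np.pi) - 0.5 * ((x - m) ** 2 + s ** 2)
    entropy     = 0.5 * np.log(2 * np.pi * np.e * s ** 2)
    return e_log_prior + e_log_lik + entropy

x = 1.3
ms = np.linspace(-2, 2, 401)
ss = np.linspace(0.05, 2, 400)
M, S = np.meshgrid(ms, ss, indexing="ij")
vals = elbo(x, M, S)
i, j = np.unravel_index(vals.argmax(), vals.shape)

log_evidence = -0.5 * (np.log(2 * np.pi * 2) + x ** 2 / 2)   # ln N(x; 0, 2)
print("best (m, s):", ms[i], ss[j])    # ~ (x/2, sqrt(1/2)), the true posterior
print("max ELBO   :", vals[i, j])      # ~ ln p(x), since q can match p(.|x)
print("ln p(x)    :", log_evidence)
```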

Main forms

The ELBO has many equivalent expressions, each with a different emphasis.

$$\mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] = \int q_\phi(z \mid x) \ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\, dz$$

This form shows that if we sample $z \sim q_\phi(\cdot \mid x)$, then $\ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)}$ is an unbiased estimator of the ELBO.

$$\ln p_\theta(x) - D_{KL}\big(q_\phi(\cdot \mid x) \,\|\, p_\theta(\cdot \mid x)\big)$$

This form shows that the ELBO is a lower bound on the evidence $\ln p_\theta(x)$, and that maximizing the ELBO with respect to $\phi$ is equivalent to minimizing $D_{KL}\big(q_\phi(\cdot \mid x) \,\|\, p_\theta(\cdot \mid x)\big)$.

$$\mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\ln p_\theta(x \mid z)\right] - D_{KL}\big(q_\phi(\cdot \mid x) \,\|\, p\big)$$

This form shows that maximizing the ELBO simultaneously attempts to keep $q_\phi(\cdot \mid x)$ close to $p$ and to concentrate $q_\phi(\cdot \mid x)$ on those $z$ that maximize $\ln p_\theta(x \mid z)$. That is, the approximate posterior $q_\phi(\cdot \mid x)$ balances between staying close to the prior $p$ and moving towards the maximum likelihood $\arg\max_z \ln p_\theta(x \mid z)$.

$$H\big(q_\phi(\cdot \mid x)\big) + \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\ln p_\theta(z \mid x)\right] + \ln p_\theta(x)$$

This form shows that maximizing the ELBO simultaneously attempts to keep the entropy of $q_\phi(\cdot \mid x)$ high and to concentrate $q_\phi(\cdot \mid x)$ on those $z$ that maximize $\ln p_\theta(z \mid x)$. That is, the approximate posterior $q_\phi(\cdot \mid x)$ balances between being a uniform distribution and moving towards the maximum a posteriori $\arg\max_z \ln p_\theta(z \mid x)$.
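
Because the four forms are algebraically equal, they can be checked against each other exactly when the latent is discrete and every expectation is a finite sum. The following minimal sketch (an assumed toy model, not from the article) evaluates all four and prints the same number four times:

```python
import numpy as np

# Assumed discrete toy model so every expectation is a finite sum.
prior = np.array([0.5, 0.3, 0.2])                                   # p(z), z in {0, 1, 2}
x = 1.4
lik = np.exp(-0.5 * (x - np.arange(3)) ** 2) / np.sqrt(2 * np.pi)   # p(x|z) = N(x; z, 1)
q = np.array([0.2, 0.6, 0.2])                                       # some q(z|x)

joint = lik * prior                                                 # p(x, z)
evidence = joint.sum()                                              # p(x)
posterior = joint / evidence                                        # p(z|x)

form1 = np.sum(q * np.log(joint / q))                               # E_q[ln p(x,z)/q(z|x)]
form2 = np.log(evidence) - np.sum(q * np.log(q / posterior))        # ln p(x) - KL(q || p(.|x))
form3 = np.sum(q * np.log(lik)) - np.sum(q * np.log(q / prior))     # E_q[ln p(x|z)] - KL(q || p)
form4 = -np.sum(q * np.log(q)) + np.sum(q * np.log(posterior)) + np.log(evidence)

print(form1, form2, form3, form4)    # all four agree
```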

Data-processing inequality

Suppose we take $N$ independent samples from $p^*$ and collect them in the dataset $D = \{x_1, \ldots, x_N\}$; then we have the empirical distribution $q_D(x) = \frac{1}{N} \sum_i \delta_{x_i}$.


Fitting $p_\theta(x)$ to $q_D(x)$ can be done, as usual, by maximizing the log-likelihood $\ln p_\theta(D)$:
$$D_{KL}\big(q_D(x) \,\|\, p_\theta(x)\big) = -\frac{1}{N} \sum_i \ln p_\theta(x_i) - H(q_D) = -\frac{1}{N} \ln p_\theta(D) - H(q_D).$$
Now, by the ELBO inequality, we can bound $\ln p_\theta(D)$, and thus
$$D_{KL}\big(q_D(x) \,\|\, p_\theta(x)\big) \le -\frac{1}{N} L(\phi, \theta; D) - H(q_D).$$
The right-hand side simplifies to a KL-divergence, and so we get
$$D_{KL}\big(q_D(x) \,\|\, p_\theta(x)\big) \le -\frac{1}{N} \sum_i L(\phi, \theta; x_i) - H(q_D) = D_{KL}\big(q_{D, \phi}(x, z) \,\|\, p_\theta(x, z)\big).$$
This result can be interpreted as a special case of the data processing inequality.
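
To see why the right-hand side is a joint KL-divergence, write $q_{D, \phi}(x, z) = q_D(x)\, q_\phi(z \mid x)$ and expand:
$$D_{KL}\big(q_{D, \phi}(x, z) \,\|\, p_\theta(x, z)\big) = \mathbb{E}_{x \sim q_D}\, \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\ln \frac{q_D(x)\, q_\phi(z \mid x)}{p_\theta(x, z)}\right] = -H(q_D) - \frac{1}{N} \sum_i L(\phi, \theta; x_i).$$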

In this interpretation, maximizing $L(\phi, \theta; D) = \sum_i L(\phi, \theta; x_i)$ is minimizing $D_{KL}\big(q_{D, \phi}(x, z) \,\|\, p_\theta(x, z)\big)$, which upper-bounds the real quantity of interest $D_{KL}\big(q_D(x) \,\|\, p_\theta(x)\big)$ via the data-processing inequality. That is, we append a latent space to the observable space, paying the price of a weaker inequality for the sake of more computationally efficient minimization of the KL-divergence.[6]

References

  1. Kingma, Diederik P.; Welling, Max (2014-05-01). "Auto-Encoding Variational Bayes". arXiv:1312.6114 [stat.ML].
  2. Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron (2016). "Chapter 19". Deep Learning. Adaptive Computation and Machine Learning. Cambridge, MA: MIT Press. ISBN 978-0-262-03561-3.
  3. Hinton, Geoffrey E; Zemel, Richard (1993). "Autoencoders, Minimum Description Length and Helmholtz Free Energy". Advances in Neural Information Processing Systems (Morgan-Kaufmann) 6. https://proceedings.neurips.cc/paper/1993/hash/9e3cfc48eccf81a0d57663e129aef3cb-Abstract.html. 
  4. Burda, Yuri; Grosse, Roger; Salakhutdinov, Ruslan (2015-09-01). "Importance Weighted Autoencoders". arXiv:1509.00519 [stat.ML].
  5. Neal, Radford M.; Hinton, Geoffrey E. (1998), "A View of the EM Algorithm that Justifies Incremental, Sparse, and other Variants", Learning in Graphical Models (Dordrecht: Springer Netherlands): pp. 355–368, doi:10.1007/978-94-011-5014-9_12, ISBN 978-94-010-6104-9.
  6. Kingma, Diederik P.; Welling, Max (2019-11-27). "An Introduction to Variational Autoencoders" (in English). Foundations and Trends in Machine Learning 12 (4): Section 2.7. doi:10.1561/2200000056. ISSN 1935-8237. https://www.nowpublishers.com/article/Details/MAL-056. 

Notes

  1. In fact, by Jensen's inequality, $\mathbb{E}_{x_i \sim p^*(x)}\left[\max_\theta \sum_i \ln p_\theta(x_i)\right] \ge \max_\theta \mathbb{E}_{x_i \sim p^*(x)}\left[\sum_i \ln p_\theta(x_i)\right] = N \max_\theta \mathbb{E}_{x \sim p^*(x)}[\ln p_\theta(x)]$. The estimator is biased upwards. This can be seen as overfitting: for a finite set of sampled data $x_i$, there is usually some $\theta$ that fits them better than the entire $p^*$ distribution.