Among deep learning practitioners, Kullback-Leibler divergence (KL divergence) is perhaps best known for its role in training variational autoencoders (VAEs). To learn an informative latent space, we don't just optimize for good reconstruction. Rather, we also impose a prior on the latent distribution, and aim to keep them close – often, by minimizing KL divergence.
In this role, KL divergence acts like a watchdog; it is a constraining, regularizing factor, and if anthropomorphized, would seem stern and severe. If we leave it at that, however, we have seen just one side of its character, and are missing out on its complement, a picture of playfulness, adventure, and curiosity. In this post, we'll take a look at that other side.
While inspired by a series of tweets by Simon DeDeo, enumerating applications of KL divergence in a vast number of disciplines,
we don't aspire to provide a comprehensive write-up here – as mentioned in the initial tweet, the topic could easily fill a whole semester of study.
The much more modest goals of this post, then, are
- to quickly recap the role of KL divergence in training VAEs, and mention similar-in-character applications;
- to illustrate that more playful, adventurous “other side” of its character; and
- in a not-so-entertaining, but – hopefully – useful way, differentiate KL divergence from related concepts such as cross entropy, mutual information, or free energy.
Before that though, we start with a definition and some terminology.
KL divergence in a nutshell
KL divergence is the expected value of the logarithmic difference in probabilities according to two distributions, \(p\) and \(q\). Here it is in its discrete-probabilities variant:
\[\begin{equation}
D_{KL}(p||q) = \sum\limits_{x} p(x) \log\left(\frac{p(x)}{q(x)}\right)
\tag{1}
\end{equation}\]
Notably, it is asymmetric; that is, \(D_{KL}(p||q)\) is not the same as \(D_{KL}(q||p)\). (Which is why it is a divergence, not a distance.) This aspect will play an important role in section 2, devoted to the “other side.”
To stress this asymmetry, KL divergence is sometimes called relative information (as in “information of \(p\) relative to \(q\)”), or information gain. We agree with one of our sources that because of its universality and importance, KL divergence would probably have deserved a more informative name; such as, precisely, information gain. (Which is less ambiguous pronunciation-wise, as well.)
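As a quick illustration of equation (1), and of the asymmetry just mentioned, here is a minimal NumPy sketch; the two distributions are made up for the example, and base-2 logarithms are used so the result is in bits:

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL divergence D_KL(p || q), in bits (log base 2)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    # Terms where p(x) == 0 contribute zero, by the usual convention.
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.3, 0.3, 0.4])

print(kl_divergence(p, q))  # D_KL(p || q) ...
print(kl_divergence(q, p))  # ... is, in general, not equal to D_KL(q || p)
```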
KL divergence, “villain”
In many machine learning algorithms, KL divergence appears in the context of variational inference. Often, for real-world data, exact computation of the posterior distribution is infeasible. Thus, some form of approximation is required. In variational inference, the true posterior \(p^*\) is approximated by a simpler distribution, \(q\), from some tractable family.
To ensure we have a good approximation, we minimize – in theory, at least – the KL divergence of \(q\) relative to \(p^*\), thus replacing inference by optimization.
In practice, again for reasons of intractability, the KL divergence minimized is that of \(q\) relative to an unnormalized distribution \(\widetilde{p}\):
\[\begin{equation}
J(q) = D_{KL}(q||\widetilde{p})
\tag{2}
\end{equation}\]
where \(\widetilde{p}\) is the joint distribution of parameters and data:
\[\begin{equation}
\widetilde{p}(\mathbf{x}) = p(\mathbf{x}, \mathcal{D}) = p^*(\mathbf{x}) \, p(\mathcal{D})
\tag{3}
\end{equation}\]
and \(p^*\) is the true posterior:
\[\begin{equation}
p^*(\mathbf{x}) = p(\mathbf{x}|\mathcal{D})
\tag{4}
\end{equation}\]
Equivalent to that formulation (eq. (2)) – for a derivation see (Murphy 2012) – is this one, which shows the optimization objective to be an upper bound on the negative log-likelihood (NLL):
\[\begin{equation}
J(q) = D_{KL}(q||p^*) - \log p(\mathcal{D})
\tag{5}
\end{equation}\]
Yet another formulation – again, see (Murphy 2012) for details – is the one we actually use when training (e.g.) VAEs. It corresponds to the expected NLL plus the KL divergence between the approximation \(q\) and the imposed prior \(p\):
\[\begin{equation}
J(q) = D_{KL}(q||p) + E_q[-\log p(\mathcal{D}|\mathbf{x})]
\tag{6}
\end{equation}\]
Negated, this formulation is also called the ELBO, for evidence lower bound. In the VAE post cited above, the ELBO was written
\[\begin{equation}
ELBO = E[\log p(x|z)] - KL(q(z)||p(z))
\tag{7}
\end{equation}\]
with \(z\) denoting the latent variables (\(q(z)\) being the approximation, \(p(z)\) the prior, often a multivariate normal).
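To connect this to code: below is a minimal sketch – not the implementation from the VAE post – of the negative ELBO for a VAE with a diagonal-Gaussian approximate posterior and a standard-normal prior. The names `recon_x`, `x`, `mu`, and `logvar` are placeholders, and a Bernoulli likelihood is assumed for the reconstruction term:

```python
import torch
import torch.nn.functional as F

def neg_elbo(recon_x, x, mu, logvar):
    """Negative ELBO: expected NLL (reconstruction) plus KL(q(z|x) || p(z)).

    Assumes q(z|x) = N(mu, diag(exp(logvar))) and p(z) = N(0, I), for which
    the KL term has a well-known closed form.
    """
    # Reconstruction term: expected negative log-likelihood under q,
    # here a Bernoulli likelihood over inputs scaled to [0, 1].
    nll = F.binary_cross_entropy(recon_x, x, reduction="sum")
    # Closed-form KL divergence between N(mu, sigma^2) and N(0, 1), summed over dimensions.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return nll + kl
```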
Beyond VAEs
Generalizing this “conservative” action pattern of KL divergence beyond VAEs, we can say that it expresses the quality of approximations. An important area where approximation takes place is (lossy) compression. KL divergence provides a way to quantify how much information is lost when we compress data.
Summing up, in these and similar applications, KL divergence is “bad” – although we don't want it to be zero (or else, why bother using the algorithm?), we certainly want to keep it low. So now, let's see the other side.
KL divergence, good guy
In a second category of applications, KL divergence is not something to be minimized. In these domains, KL divergence is indicative of surprise, disagreement, exploratory behavior, or learning: This really is the perspective of information gain.
Surprise
One area where surprise, not information per se, governs behavior is perception. For example, eyetracking studies (e.g., (Itti and Baldi 2005)) showed that surprise, as measured by KL divergence, was a better predictor of visual attention than information, measured by entropy. While these studies seem to have popularized the expression “Bayesian surprise,” this compound is – I'd argue – not the most informative one, as neither part adds much information to the other. In Bayesian updating, the magnitude of the difference between prior and posterior reflects the degree of surprise brought about by the data – surprise is an integral part of the concept.
Thus, with KL divergence linked to surprise, and surprise rooted in the fundamental process of Bayesian updating, a process that could be used to describe the course of life itself, KL divergence itself becomes fundamental. We might get tempted to see it everywhere. Accordingly, it has been used in many fields to quantify unidirectional divergence.
For example, (Zanardo 2017) have applied it in trading, measuring how much a person disagrees with the market belief. Higher disagreement then corresponds to higher expected gains from betting against the market.
Closer to the world of deep learning, it is used in intrinsically motivated reinforcement learning (e.g., (Sun, Gomez, and Schmidhuber 2011)), where an optimal policy should maximize the long-term information gain. This is possible because, like entropy, KL divergence is additive.
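Additivity here means that for independent observations, the KL divergence between the joint distributions is the sum of the individual divergences. A quick numerical check – a sketch with made-up distributions, not code from the cited paper:

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence in bits."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p1, q1 = np.array([0.2, 0.8]), np.array([0.5, 0.5])
p2, q2 = np.array([0.7, 0.3]), np.array([0.4, 0.6])

# Joint distributions of two independent variables: outer products, flattened.
p_joint = np.outer(p1, p2).ravel()
q_joint = np.outer(q1, q2).ravel()

print(kl(p_joint, q_joint))      # divergence of the joints ...
print(kl(p1, q1) + kl(p2, q2))   # ... equals the sum of the individual divergences
```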
Although its asymmetry is relevant whether you use KL divergence for regularization (section 1) or surprise (this section), it becomes especially evident when used for learning and surprise.
Asymmetry in action
Looking again at the KL formula
\[\begin{equation}
D_{KL}(p||q) = \sum\limits_{x} p(x) \log\left(\frac{p(x)}{q(x)}\right)
\tag{1}
\end{equation}\]
the roles of \(p\) and \(q\) are fundamentally different. For one, the expectation is computed over the first distribution (\(p\) in (1)). This aspect matters because the “order” (the respective roles) of \(p\) and \(q\) may have to be chosen according to tractability (which distribution can we average over?).
Secondly, the fraction inside the \(\log\) means that if \(q\) is ever zero at a point where \(p\) isn't, the KL divergence will “blow up.” What this implies for distribution estimation in general is nicely detailed in Murphy (2012). In the context of surprise, it means that if I learn something I used to think had probability zero, I will be “infinitely surprised.”
To avoid infinite surprise, we can make sure our prior probability is never zero. But even then, the interesting thing is that how much information we gain in any one instance depends on how much information we had before. Let's see a simple example.
Assume that in my current understanding of the world, black swans probably don't exist, but they might ... maybe 1 percent of swans are black. Put differently, my prior belief that a swan, should I encounter one, is black is \(q = 0.01\).
Now I do in fact encounter one, and it is black.
The information I have gained is:
\[\begin{equation}
l(p,q) = 0 \cdot \log\left(\frac{0}{0.99}\right) + 1 \cdot \log\left(\frac{1}{0.01}\right) = 6.6 \text{ bits}
\tag{8}
\end{equation}\]
Conversely, suppose I had been much more undecided before; say I had thought the odds were 50:50.
On seeing a black swan, I get a lot less information:
\[\begin{equation}
l(p,q) = 0 \cdot \log\left(\frac{0}{0.5}\right) + 1 \cdot \log\left(\frac{1}{0.5}\right) = 1 \text{ bit}
\tag{9}
\end{equation}\]
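For a quick check of equations (8) and (9), here is a small sketch; base-2 logarithms are used so the results are in bits, and the zero-probability term is dropped by the usual convention:

```python
import numpy as np

def information_gain(posterior, prior):
    """KL divergence of the posterior relative to the prior, in bits."""
    p, q = np.asarray(posterior, float), np.asarray(prior, float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

posterior = [0.0, 1.0]  # after the encounter: the swan is black, for sure

print(information_gain(posterior, [0.99, 0.01]))  # confident prior: ~6.64 bits
print(information_gain(posterior, [0.5, 0.5]))    # undecided prior: 1 bit
```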
This view of KL divergence, in terms of surprise and learning, is inspiring – it may lead one to see it in action everywhere. However, we still have the third and final task to handle: quickly compare KL divergence to other concepts in the area.
Entropy
It all starts with entropy, or uncertainty, or information, as formulated by Claude Shannon.
Entropy is the negative expected log probability of a distribution:
\[\begin{equation}
H(X) = -\sum\limits_{i=1}^n p(x_i) \log(p(x_i))
\tag{10}
\end{equation}\]
As nicely described in (DeDeo 2016), this formulation was chosen to satisfy four criteria, one of which is what we commonly picture as its “essence,” and one of which is especially interesting.
As to the former, if there are \(n\) possible states, entropy is maximal when all states are equiprobable. E.g., for a coin flip, uncertainty is highest when the coin bias is 0.5.
The latter has to do with coarse-graining, a change in “resolution” of the state space. Say we have 16 possible states, but we don't really care at that level of detail. We do care about 3 individual states, but all the rest are basically the same to us. Then entropy decomposes additively: total (fine-grained) entropy is the entropy of the coarse-grained space, plus the entropy of the “lumped-together” group, weighted by its probability.
Subjectively, entropy reflects our uncertainty whether an event will happen. Interestingly though, it exists in the physical world as well: For example, when ice melts, it becomes more uncertain where individual particles are. As reported by (DeDeo 2016), the information released when one gram of ice melts amounts to about 100 billion terabytes!
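To make the two properties above concrete – maximal entropy for equiprobable states, and additive decomposition under coarse-graining – here is a small sketch with made-up distributions (six states instead of sixteen, for brevity):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits."""
    p = np.asarray(p, float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Maximality: for a coin flip, entropy peaks at bias 0.5.
for bias in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(bias, entropy([bias, 1 - bias]))

# Coarse-graining: keep two states, lump the remaining four into one group.
p = np.array([0.3, 0.2, 0.125, 0.125, 0.125, 0.125])
coarse = np.array([0.3, 0.2, 0.5])                      # the lumped group has mass 0.5
within = np.array([0.125, 0.125, 0.125, 0.125]) / 0.5   # distribution inside the group

print(entropy(p))                               # fine-grained entropy ...
print(entropy(coarse) + 0.5 * entropy(within))  # ... equals coarse-grained + weighted within-group
```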
As fascinating as it is, information per se may, in many cases, not be the best means of characterizing human behavior. Going back to the eyetracking example, it is completely intuitive that people look at surprising parts of images, not at white noise areas, which are the maximum you could get in terms of entropy.
As a deep learning practitioner, you have probably been waiting for the point at which we would mention cross entropy – the most commonly used loss function in categorization.
Cross entropy
The cross entropy between distributions \(p\) and \(q\) is the entropy of \(p\) plus the KL divergence of \(p\) relative to \(q\). If you have ever implemented your own classification network, you probably recognize the sum on the very right:
\[\begin{equation}
H(p,q) = H(p) + D_{KL}(p||q) = -\sum p \log(q)
\tag{11}
\end{equation}\]
In information-theory speak, \(H(p,q)\) is the expected message length per datum when \(q\) is assumed but \(p\) is true.
Closer to the world of machine learning, for fixed \(p\), minimizing cross entropy is equivalent to minimizing KL divergence.
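A brief numerical sanity check of equation (11), using made-up distributions; with a one-hot target, as in classification, \(H(p)\) is zero and cross entropy coincides with the KL divergence:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, float); p = p[p > 0]
    return -np.sum(p * np.log2(p))

def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def cross_entropy(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return -np.sum(p[mask] * np.log2(q[mask]))

p = np.array([0.1, 0.6, 0.3])  # "true" distribution
q = np.array([0.2, 0.5, 0.3])  # model distribution

print(cross_entropy(p, q))     # equals ...
print(entropy(p) + kl(p, q))   # ... H(p) + D_KL(p || q)

onehot = np.array([0.0, 1.0, 0.0])              # one-hot target: H(p) = 0
print(cross_entropy(onehot, q), kl(onehot, q))  # identical values
```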
Mutual information
Another extremely important quantity, used in many contexts and applications, is mutual information. Again citing DeDeo, “you can think of it as the most general form of correlation coefficient that you can measure.”
With two variables \(X\) and \(Y\), we can ask: How much do we learn about \(X\) when we learn an individual \(y\), \(Y=y\)? Averaged over all \(y\), this is the conditional entropy:
\[\begin{equation}
H(X|Y) = \sum\limits_{i} P(y_i) \, H(X|Y=y_i)
\tag{12}
\end{equation}\]
Now mutual information is entropy minus conditional entropy:
\[\begin{equation}
I(X, Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)
\tag{13}
\end{equation}\]
This quantity – as required for a measure representing something like correlation – is symmetric: If two variables \(X\) and \(Y\) are related, the amount of information \(X\) gives you about \(Y\) equals the amount \(Y\) gives you about \(X\).
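Here is a small sketch that computes mutual information from a hypothetical joint distribution table, following equations (12) and (13), and confirms the symmetry:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, float); p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(joint):
    """I(X, Y) = H(X) - H(X|Y), from a joint probability table (rows: X, columns: Y)."""
    joint = np.asarray(joint, float)
    p_x = joint.sum(axis=1)  # marginal of X
    p_y = joint.sum(axis=0)  # marginal of Y
    # Conditional entropy H(X|Y): entropy of X given Y = y, averaged over y.
    h_x_given_y = sum(p_y[j] * entropy(joint[:, j] / p_y[j]) for j in range(len(p_y)))
    return entropy(p_x) - h_x_given_y

joint = np.array([[0.3, 0.1],
                  [0.2, 0.4]])  # a made-up joint distribution of X and Y

print(mutual_information(joint))    # I(X, Y) ...
print(mutual_information(joint.T))  # ... equals I(Y, X)
```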
KL divergence is part of a family of divergences, called f-divergences, used to measure directed difference between probability distributions. Let's also take a quick look at another information-theoretic measure that, unlike those, is a distance.
Jensen-Shannon distance
In math, a distance, or metric, besides being non-negative, has to satisfy two other criteria: It must be symmetric, and it must obey the triangle inequality.
Both criteria are met by the Jensen-Shannon distance, the square root of the Jensen-Shannon divergence. With \(m\) a mixture distribution:
\[\begin{equation}
m_i = \frac{1}{2}(p_i + q_i)
\tag{14}
\end{equation}\]
the Jensen-Shannon divergence is an average of KL divergences, one of \(p\) relative to \(m\), the other of \(q\) relative to \(m\):
\[\begin{equation}
JSD = \frac{1}{2}(KL(p||m) + KL(q||m))
\tag{15}
\end{equation}\]
This would be an ideal candidate to use were we interested in the (undirected) distance between, not the directed surprise caused by, distributions.
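As a quick sketch, we can compute the Jensen-Shannon distance by hand from equations (14) and (15) and compare against SciPy's `jensenshannon`, which returns the square root of the divergence, i.e. the distance:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.4, 0.4, 0.2])
m = 0.5 * (p + q)  # the mixture distribution of equation (14)

js_divergence = 0.5 * (kl(p, m) + kl(q, m))
print(np.sqrt(js_divergence))       # Jensen-Shannon distance, by hand ...
print(jensenshannon(p, q, base=2))  # ... and via SciPy; symmetric in p and q
```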
Finally, let's wrap up with one last term, restricting ourselves to a quick glimpse at something whole books could be written about.
(Variational) Free Energy
Reading papers on variational inference, you are quite likely to hear people talking not “just” about KL divergence and/or the ELBO (which, as soon as you know what it stands for, is just what it is), but also about something mysteriously called free energy (or: variational free energy, in that context).
For practical purposes, it suffices to know that variational free energy is the negative of the ELBO, that is, it corresponds to equation (2). But for those interested, there is free energy as a central concept in thermodynamics.
In this post, we are mainly interested in how concepts relate to KL divergence, and for this, we follow the characterization John Baez gives in his aforementioned talk.
Free energy, that is, energy in useful form, is the expected energy minus temperature times entropy:
\[\begin{equation}
F = \langle E \rangle - T \, H
\tag{16}
\end{equation}\]
Then, the extra free energy of a system \(Q\) – compared to a system in equilibrium \(P\) – is proportional to their KL divergence, that is, the information of \(Q\) relative to \(P\):
\[\begin{equation}
F(Q) - F(P) = k \, T \, KL(q||p)
\tag{17}
\end{equation}\]
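As a numerical illustration of equation (17) – a sketch with made-up energy levels, setting \(k = 1\) and measuring entropy in nats (natural logarithms):

```python
import numpy as np

k, T = 1.0, 2.0                # Boltzmann constant and temperature, in made-up units
E = np.array([0.0, 1.0, 3.0])  # energy levels of a three-state system

# Equilibrium (Boltzmann) distribution P, and some arbitrary non-equilibrium Q.
p = np.exp(-E / (k * T)); p /= p.sum()
q = np.array([0.5, 0.3, 0.2])

def free_energy(dist):
    """F = <E> - T * H, with entropy in nats."""
    return np.sum(dist * E) - T * (-np.sum(dist * np.log(dist)))

kl_nats = np.sum(q * np.log(q / p))

print(free_energy(q) - free_energy(p))  # extra free energy of Q ...
print(k * T * kl_nats)                  # ... equals k T times KL(q || p)
```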
Speaking of free energy, there is also the – not uncontroversial – free energy principle posited in neuroscience. But at some point, we have to stop, and we do it here.
Conclusion
Wrapping up, this post has tried to do three things: Having in mind a reader with a background mainly in deep learning, start with the “habitual” use in training variational autoencoders; then show the – probably less familiar – “other side”; and finally, provide a synopsis of related terms and their applications.
If you are interested in digging deeper into the many various applications, in a range of different fields, there is no better place to start than the Twitter thread, mentioned above, that gave rise to this post. Thanks for reading!
DeDeo, Simon. 2016. “Information Theory for Intelligent People.”
Murphy, Kevin. 2012. Machine Learning: A Probabilistic Perspective. MIT Press.
Zanardo, Enrico. 2017. “How to Measure Disagreement?”