Posit AI Blog: Getting into the flow: Bijectors in TensorFlow Probability



As of today, deep learning’s greatest successes have taken place in the realm of supervised learning, requiring lots and lots of annotated training data. However, data does not (normally) come with annotations or labels. Also, unsupervised learning is attractive because of the analogy to human cognition.

On this blog so far, we have seen two major architectures for unsupervised learning: variational autoencoders and generative adversarial networks. Lesser known, but appealing for conceptual as well as for performance reasons, are normalizing flows (Jimenez Rezende and Mohamed 2015). In this and the next post, we’ll introduce flows, focusing on how to implement them using TensorFlow Probability (TFP).

In contrast to previous posts involving TFP that accessed its functionality using low-level $-syntax, we now make use of tfprobability, an R wrapper in the style of keras, tensorflow and tfdatasets. A note regarding this package: It is still under heavy development and the API may change. As of this writing, wrappers do not yet exist for all TFP modules, but all TFP functionality is available using $-syntax if need be.

Density estimation and sampling

Back to unsupervised learning, and specifically thinking of variational autoencoders, what are the main things they give us? One thing that’s seldom missing from papers on generative methods is pictures of super-real-looking faces (or bedrooms, or animals …). So evidently sampling (or: generation) is an important part. If we can sample from a model and obtain real-seeming entities, this means the model has learned something about how things are distributed in the world: it has learned a distribution.
In the case of variational autoencoders, there is more: The entities are supposed to be determined by a set of distinct, disentangled (hopefully!) latent factors. But this is not the assumption in the case of normalizing flows, so we are not going to elaborate on this here.

As a recap, how do we sample from a VAE? We draw from \(z\), the latent variable, and run the decoder network on it. The result should (we hope) look like it comes from the empirical data distribution. It should not, however, look exactly like any of the items used to train the VAE, or else we have not learned anything useful.

The second thing we may get from a VAE is an assessment of the plausibility of individual data, to be used, for example, in anomaly detection. Here “plausibility” is vague on purpose: With a VAE, we don’t have a means to compute an actual density under the posterior.

What if we want, or need, both: generation of samples as well as density estimation? This is where normalizing flows come in.

Normalizing flows

A flow is a sequence of differentiable, invertible mappings from data to a “nice” distribution, something we can easily sample from and use to calculate a density. Let’s take as an example the canonical way to generate samples from some distribution, the exponential, say.

We start by asking our random number generator for some number between 0 and 1:

u <- runif(1)

This number we treat as coming from a cumulative distribution function (CDF), from an exponential CDF, to be precise. Now that we have a value from the CDF, all we need to do is map that “back” to a value. That mapping CDF -> value we’re looking for is just the inverse of the CDF of an exponential distribution, the CDF being

\[F(x) = 1 - e^{-\lambda x}\]

The inverse then is

\[
F^{-1}(u) = -\frac{1}{\lambda} \ln (1 - u)
\]

which means we may get our exponential sample doing

lambda <- 0.5 # pick some lambda
x <- -1/lambda * log(1-u)
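
As a quick sanity check of this recipe, the following base R sketch draws many uniform numbers at once and compares the transformed values with R’s own exponential quantile function. The seed, the sample size `n` and the name `x_many` are illustrative choices, not part of the example above.

```r
set.seed(42)          # arbitrary seed, only for reproducibility
lambda <- 0.5
n <- 10000            # illustrative sample size

u <- runif(n)
x_many <- -1 / lambda * log(1 - u)

# the hand-written inverse CDF agrees with R's built-in quantile function
all.equal(x_many, qexp(u, rate = lambda))

# the sample mean should be close to the theoretical mean 1 / lambda
mean(x_many)
```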

We see the CDF is actually a flow (or a building block thereof, if we picture most flows as comprising several transformations), since

  • It maps data to a uniform distribution between 0 and 1, allowing us to assess data likelihood.
  • Conversely, it maps a probability to an actual value, thus allowing us to generate samples.
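
Both directions can be tried out in base R, where `pexp()` plays the role of the forward mapping and `qexp()` that of the inverse. This is only a small sketch; the seed and the five “data” points are arbitrary.

```r
set.seed(1)                       # arbitrary seed
lambda <- 0.5
x <- rexp(5, rate = lambda)       # some "data" from an exponential

# forward direction: data -> values in (0, 1)
u <- pexp(x, rate = lambda)

# inverse direction: values in (0, 1) -> data
x_back <- qexp(u, rate = lambda)

all.equal(x, x_back)              # the round trip recovers the data
```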

From this example, we see why a flow should be invertible, but we don’t yet see why it should be differentiable. This will become clear shortly, but first let’s take a look at how flows are available in tfprobability.

Bijectors

TFP comes with a treasure trove of transformations, called bijectors, ranging from simple computations like exponentiation to more complex ones like the discrete cosine transform.

To get started, let’s use tfprobability to generate samples from the normal distribution.
There is a bijector tfb_normal_cdf() that takes input data to the interval \([0,1]\). Its inverse transform then yields a random variable with the standard normal distribution.
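
To illustrate the idea without TensorFlow, here is a base R sketch in which `qnorm()` stands in for the inverse transform of the normal CDF. It is not the tfprobability call itself; the seed and sample size are arbitrary.

```r
set.seed(123)                 # arbitrary seed
u <- runif(1000)              # uniform draws on (0, 1)
x <- qnorm(u)                 # inverse of the standard normal CDF

# mean and standard deviation should be close to 0 and 1
c(mean = mean(x), sd = sd(x))
```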

Conversely, we can use this bijector to determine the (log) probability of a sample from the normal distribution. We’ll check against a straightforward use of tfd_normal in the distributions module:

x <- 2.01
d_n <- tfd_normal(loc = 0, scale = 1) 

d_n %>% tfd_log_prob(x) %>% as.numeric() # -2.938989

To obtain that same log probability from the bijector, we add two components:

  • Firstly, we run the sample through the forward transformation and compute the log probability under the uniform distribution.
  • Secondly, as we’re using the uniform distribution to determine the probability of a normal sample, we need to track how probability changes under this transformation. This is done by calling tfb_forward_log_det_jacobian (to be further elaborated on below).

b <- tfb_normal_cdf()
d_u <- tfd_uniform()

l <- d_u %>% tfd_log_prob(b %>% tfb_forward(x))
j <- b %>% tfb_forward_log_det_jacobian(x, event_ndims = 0)

(l + j) %>% as.numeric() # -2.938989
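
The same two-component sum can be reproduced in base R. The sketch below assumes only that the forward transform is the standard normal CDF, whose derivative is the standard normal density, so the log determinant of the Jacobian is just the log density.

```r
x <- 2.01

# forward transform: the standard normal CDF
u <- pnorm(x)

# log probability under the uniform distribution on [0, 1]
l <- dunif(u, log = TRUE)

# log |d/dx pnorm(x)| is the log of the standard normal density
j <- log(dnorm(x))

l + j                    # sum of both components
dnorm(x, log = TRUE)     # direct evaluation, for comparison
```

The uniform part contributes nothing here (its log density is 0 on the unit interval), so the whole log probability comes from the Jacobian term.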

Why does this work? Let’s get some background.

Probability mass is conserved

Flows are based on the principle that under transformation, probability mass is conserved. Say we have a flow from \(x\) to \(z\):
\[z = f(x)\]

Suppose we sample from \(z\) and then compute the inverse transform to obtain \(x\). We know the probability of \(z\). What is the probability that \(x\), the transformed sample, lies between \(x_0\) and \(x_0 + dx\)?

This probability is \(p(x) dx\), the density times the length of the interval. This has to equal the probability that \(z\) lies between \(f(x)\) and \(f(x + dx)\). That new interval has length \(f'(x) dx\), so:

\[p(x) dx = p(z) f'(x) dx\]

Or equivalently

\[p(x) = p(z) \frac{dz}{dx}\]

Thus, the sample probability \(p(x)\) is determined by the base probability \(p(z)\) of the transformed distribution, multiplied by how much the flow stretches space.

The same goes in higher dimensions: Again, the flow is about the change in probability volume between the \(z\) and \(x\) spaces:

\[p(x) = p(z) \frac{vol(dz)}{vol(dx)}\]

In higher dimensions, the Jacobian replaces the derivative. Then, the change in volume is captured by the absolute value of its determinant:

\[p(\mathbf{x}) = p(f(\mathbf{x})) \bigg|\det\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}}\bigg|\]

In practice, we work with log probabilities, so

\[\log p(\mathbf{x}) = \log p(f(\mathbf{x})) + \log \bigg|\det\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}}\bigg|\]
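
A small numerical check of the log formula, written in base R: we take a linear flow \(f(\mathbf{x}) = A\mathbf{x}\) into a two-dimensional standard normal, so the Jacobian is simply \(A\). The matrix `A`, the point `x` and the helper functions `log_std_normal()` and `log_mvn()` are made up for this sketch.

```r
# log density of a standard normal in d dimensions
log_std_normal <- function(z) -0.5 * sum(z^2) - length(z) / 2 * log(2 * pi)

# log density of N(0, Sigma), written out by hand
log_mvn <- function(x, Sigma) {
  d <- length(x)
  quad <- drop(t(x) %*% solve(Sigma) %*% x)
  -0.5 * quad - 0.5 * log(det(Sigma)) - d / 2 * log(2 * pi)
}

A <- matrix(c(2, 0.5, 0, 1.5), nrow = 2)  # arbitrary invertible matrix
x <- c(0.3, -1.2)                          # arbitrary point

# change of variables: log p(x) = log p(f(x)) + log |det J|, with J = A
via_flow <- log_std_normal(drop(A %*% x)) + log(abs(det(A)))

# direct evaluation: x = A^{-1} z has covariance A^{-1} A^{-T}
A_inv <- solve(A)
direct <- log_mvn(x, A_inv %*% t(A_inv))

all.equal(via_flow, direct)
```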

Let’s see this with another bijector example, tfb_affine_scalar. Below, we construct a mini-flow that maps a few arbitrarily chosen \(x\) values to double their value (scale = 2):

x <- c(0, 0.5, 1)
b <- tfb_affine_scalar(shift = 0, scale = 2)

To compare densities under the flow, we choose the normal distribution, and look at the log densities:

d_n <- tfd_normal(loc = 0, scale = 1)
d_n %>% tfd_log_prob(x) %>% as.numeric() # -0.9189385 -1.0439385 -1.4189385

Now apply the flow and compute the new log densities as a sum of the log densities of the corresponding \(x\) values and the log determinant of the Jacobian:

z <- b %>% tfb_forward(x)

# outer parentheses make sure as.numeric() is applied to the sum, not just the second term
((d_n %>% tfd_log_prob(b %>% tfb_inverse(z))) +
  (b %>% tfb_inverse_log_det_jacobian(z, event_ndims = 0))) %>%
  as.numeric() # -1.6120857 -1.7370857 -2.1120858
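
For comparison, the same numbers can be obtained in base R. The sketch assumes nothing beyond the formula above: the inverse transform divides by the scale, and the inverse log determinant of the Jacobian is log(1/scale).

```r
x <- c(0, 0.5, 1)
scale <- 2
z <- scale * x                           # forward transform

# log p(z) = log p_x(inverse(z)) + inverse log det Jacobian
log_dens_z <- dnorm(z / scale, log = TRUE) + log(1 / scale)
log_dens_z

# the transformed variable is simply N(0, scale^2)
all.equal(log_dens_z, dnorm(z, mean = 0, sd = scale, log = TRUE))
```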

We see that as the values get stretched in space (we multiply by 2), the individual log densities go down.
We can verify that the cumulative probability stays the same using tfd_transformed_distribution():

d_t <- tfd_transformed_distribution(distribution = d_n, bijector = b)
d_n %>% tfd_cdf(x) %>% as.numeric()  # 0.5000000 0.6914625 0.8413447

d_t %>% tfd_cdf(z) %>% as.numeric()  # 0.5000000 0.6914625 0.8413447

So far, the flows we saw were static. How does this fit into the framework of neural networks?

Training a flow

Given that flows are bidirectional, there are two ways to think about them. Above, we have mostly stressed the inverse mapping: We want a simple distribution we can sample from, and which we can use to compute a density. In that line, flows are sometimes called “mappings from data to noise”, noise mostly being an isotropic Gaussian. However, in practice, we don’t have that “noise” yet, we just have data.
So in practice, we have to learn a flow that does such a mapping. We do this by using bijectors with trainable parameters.
We’ll see a very simple example here, and leave “real world flows” to the next post.

The example is based on part 1 of Eric Jang’s introduction to normalizing flows. The main difference (apart from simplification to show the basic pattern) is that we’re using eager execution.

We start from a two-dimensional, isotropic Gaussian, and we want to model data that’s also normal, but with a mean of 1 and a standard deviation of 2 (in both dimensions).

library(tensorflow)
library(tfprobability)

tfe_enable_eager_execution(device_policy = "silent")

library(tfdatasets)

# where we start from
base_dist <- tfd_multivariate_normal_diag(loc = c(0, 0))

# where we want to go
target_dist <- tfd_multivariate_normal_diag(loc = c(1, 1), scale_identity_multiplier = 2)

# create training data from the target distribution
target_samples <- target_dist %>% tfd_sample(1000) %>% tf$cast(tf$float32)

batch_size <- 100
dataset <- tensor_slices_dataset(target_samples) %>%
  dataset_shuffle(buffer_size = dim(target_samples)[1]) %>%
  dataset_batch(batch_size)

Now we’ll build a tiny neural network, consisting of an affine transformation and a nonlinearity.
For the former, we can make use of tfb_affine, the multi-dimensional relative of tfb_affine_scalar.
As to nonlinearities, currently TFP comes with tfb_sigmoid and tfb_tanh, but we can build our own parameterized ReLU using tfb_inline:

# alpha is a learnable parameter
bijector_leaky_relu <- function(alpha) {
  
  tfb_inline(
    # forward transform leaves positive values untouched and scales negative ones by alpha
    forward_fn = function(x)
      tf$where(tf$greater_equal(x, 0), x, alpha * x),
    # inverse transform leaves positive values untouched and scales negative ones by 1/alpha
    inverse_fn = function(y)
      tf$where(tf$greater_equal(y, 0), y, 1/alpha * y),
    # log volume change is 0 for positive values and log(1/alpha) for negative ones
    inverse_log_det_jacobian_fn = function(y) {
      I <- tf$ones_like(y)
      J_inv <- tf$where(tf$greater_equal(y, 0), I, 1/alpha * I)
      log_abs_det_J_inv <- tf$log(tf$abs(J_inv))
      tf$reduce_sum(log_abs_det_J_inv, axis = 1L)
    },
    forward_min_event_ndims = 1
  )
}
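
To see what the three functions compute, here is a base R mirror of the same logic operating on a plain matrix (one event per row). The names `leaky_forward()`, `leaky_inverse()` and `leaky_ildj()`, as well as the fixed value of `alpha`, are illustrative only; in the flow, `alpha` is learned.

```r
alpha <- 0.3   # illustrative value; in the post this is a learnable parameter

leaky_forward <- function(x) ifelse(x >= 0, x, alpha * x)
leaky_inverse <- function(y) ifelse(y >= 0, y, y / alpha)

# per-row inverse log det Jacobian: sum over the event dimension
leaky_ildj <- function(y) {
  j_inv <- ifelse(y >= 0, 1, 1 / alpha)
  rowSums(log(abs(j_inv)))
}

x <- matrix(c(1.5, -2, -0.5, 3, -1, -4), ncol = 2, byrow = TRUE)
y <- leaky_forward(x)

all.equal(leaky_inverse(y), x)   # round trip recovers the input
leaky_ildj(y)                    # log(1/alpha) per negative entry in each row
```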

Define the learnable variables for the affine and the PReLU layers:

d <- 2 # dimensionality
r <- 2 # rank of update

# shift of affine bijector
shift <- tf$get_variable("shift", d)
# scale of affine bijector
L <- tf$get_variable('L', c(d * (d + 1) / 2))
# rank-r update
V <- tf$get_variable("V", c(d, r))

# scaling factor of parameterized relu
alpha <- tf$abs(tf$get_variable('alpha', list())) + 0.01

With eager execution, the variables have to be used inside the loss function, so that is where we define the bijectors. Our little flow now is a tfb_chain of bijectors, and we wrap it in a TransformedDistribution (tfd_transformed_distribution) that links source and target distributions.

loss <- function() {
  
 affine <- tfb_affine(
        scale_tril = tfb_fill_triangular() %>% tfb_forward(L),
        scale_perturb_factor = V,
        shift = shift
      )
 lrelu <- bijector_leaky_relu(alpha = alpha)  
 
 flow <- list(lrelu, affine) %>% tfb_chain()
 
 dist <- tfd_transformed_distribution(distribution = base_dist,
                          bijector = flow)
  
 # negative mean log likelihood of the current batch
 l <- -tf$reduce_mean(dist$log_prob(batch))
 # keep track of progress
 print(round(as.numeric(l), 2))
 l
}

Now we can actually run the training!

optimizer <- tf$train$AdamOptimizer(1e-4)

n_epochs <- 100
for (i in 1:n_epochs) {
  iter <- make_iterator_one_shot(dataset)
  until_out_of_range({
    batch <- iterator_get_next(iter)
    optimizer$minimize(loss)
  })
}
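
The pattern of minimizing a negative log likelihood under a transformed distribution can also be tried without TensorFlow. The base R sketch below fits a one-dimensional affine flow (a shift and a scale) to normal data. It is a deliberately reduced stand-in for the flow above: it uses `optim()` with BFGS instead of Adam, has no nonlinearity, and the seed, the data and the function name `neg_log_lik()` are made up for illustration.

```r
set.seed(7)                                   # arbitrary seed
target <- rnorm(1000, mean = 1, sd = 2)       # stand-in training data

# flow: x = shift + exp(log_scale) * z, with z ~ N(0, 1)
neg_log_lik <- function(par) {
  shift <- par[1]
  log_scale <- par[2]
  z <- (target - shift) / exp(log_scale)      # inverse transform
  # log p(x) = log p(z) + inverse log det Jacobian (= -log_scale)
  -mean(dnorm(z, log = TRUE) - log_scale)
}

fit <- optim(c(0, 0), neg_log_lik, method = "BFGS")

# learned parameters; they should be close to the mean and spread of the data
c(shift = fit$par[1], scale = exp(fit$par[2]))
```

Parameterizing the scale through its logarithm keeps it positive during optimization, which serves a similar purpose to the `tf$abs(...) + 0.01` construction used for `alpha` above.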

Results will differ depending on random initialization, but you should see steady (if slow) progress. Using bijectors, we have actually defined and trained a little neural network.

Outlook

Undoubtedly, this flow is too simple to model complex data, but it’s instructive to have seen the basic principles before delving into more complex flows. In the next post, we’ll check out autoregressive flows, again using TFP and tfprobability.

Jimenez Rezende, Danilo, and Shakir Mohamed. 2015. “Variational Inference with Normalizing Flows.” arXiv e-Prints, May, arXiv:1505.05770. https://arxiv.org/abs/1505.05770.
