With the abundance of great libraries for statistical computing in R, why would you be interested in TensorFlow Probability (TFP, for short)? Well – let's take a look at a list of its components:
- Distributions and bijectors (bijectors are reversible, composable maps)
- Probabilistic modeling (Edward2 and probabilistic network layers)
- Probabilistic inference (via MCMC or variational inference)
Now imagine all of this working seamlessly with the TensorFlow framework – core, Keras, contributed modules – and also running distributed and on GPU. The field of possible applications is vast – far too diverse to cover as a whole in an introductory blog post.
Instead, our aim here is to provide a first introduction to TFP, focusing on direct applicability to, and interoperability with, deep learning.
We'll quickly show how to get started with one of the basic building blocks: distributions. Then, we'll build a variational autoencoder similar to the one in Representation learning with MMD-VAE. This time though, we'll use TFP to sample from the prior and approximate posterior distributions.
We regard this post as a "proof of concept" for using TFP with Keras – from R – and plan to follow up with more elaborate examples from the area of semi-supervised representation learning.
To install TFP together with TensorFlow, simply append tensorflow-probability to the default list of extra packages:
library(tensorflow)
install_tensorflow(
extra_packages = c("keras", "tensorflow-hub", "tensorflow-probability"),
version = "1.12"
)
Now to use TFP, all we need to do is import it and create some useful handles.
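These are the handles we'll use throughout the post (the same imports appear again below, after switching to eager execution):
tfp <- import("tensorflow_probability")
tfd <- tfp$distributions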
And here we go, sampling from a standard normal distribution.
n <- tfd$Normal(loc = 0, scale = 1)
n$sample(6L)
tf.Tensor(
"Normal_1/sample/Reshape:0", shape=(6,), dtype=float32
)
Now that's nice, but it's 2019: we don't want to have to create a session to evaluate those tensors anymore. In the variational autoencoder example below, we will see how TFP and TF eager execution are a perfect match, so why not start using eager execution now.
To use eager execution, we have to execute the following lines in a fresh (R) session:
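As a sketch – assuming the TF 1.x-era API this post builds on, so names may differ in later versions – those lines would look something like this:
# eager-execution setup (TF 1.x-era API)
library(keras)
use_implementation("tensorflow")
library(tensorflow)
tfe_enable_eager_execution(device_policy = "silent")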
… and import TFP, same as above.
tfp <- import("tensorflow_probability")
tfd <- tfp$distributions
Now let's quickly take a look at TFP distributions.
Using distributions
Here's that standard normal again.
n <- tfd$Normal(loc = 0, scale = 1)
Things we commonly do with a distribution include sampling:
# just as in low-level tensorflow, we need to append L to indicate integer arguments
n$sample(6L)
tf.Tensor(
[-0.34403768 -0.14122334 -1.3832929 1.618252 1.364448 -1.1299014 ],
shape=(6,),
dtype=float32
)
As well as computing the log probability. Here we do that simultaneously for three values.
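A call along the following lines (the values -1, 0 and 1 match the output shown) computes all three log probabilities at once:
n$log_prob(c(-1, 0, 1))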
tf.Tensor(
[-1.4189385 -0.9189385 -1.4189385], shape=(3,), dtype=float32
)
We can do the same things with lots of other distributions, e.g., the Bernoulli:
b <- tfd$Bernoulli(0.9)
b$sample(10L)
tf.Tensor(
[1 1 1 0 1 1 0 1 0 1], shape=(10,), dtype=int32
)
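Again we can ask for log probabilities. A call producing values like those below might look like this (the 0.9 above was passed positionally, i.e., as logits rather than probs):
# log probabilities of four values
b$log_prob(c(0, 1, 0, 0))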
tf.Tensor(
[-1.2411538 -0.3411539 -1.2411538 -1.2411538], shape=(4,), dtype=float32
)
Note that in the last chunk, we are asking for the log probabilities of four independent draws.
Batch shapes and event shapes
In TFP, we can do the following.
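One way to get there (the loc and scale values here are purely illustrative):
ns <- tfd$Normal(loc = c(1, 10, -200), scale = c(0.1, 4, 1))
ns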
tfp.distributions.Normal(
"Normal/", batch_shape=(3,), event_shape=(), dtype=float32
)
Contrary to what it might look like, this is not a multivariate normal. As indicated by batch_shape=(3,), this is a "batch" of independent univariate distributions. The fact that these are univariate shows up in event_shape=(): each of them lives in one-dimensional event space.
What if instead we create a single, two-dimensional multivariate normal?
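For example (the parameter values are again just illustrative):
mvn <- tfd$MultivariateNormalDiag(loc = c(0, 10), scale_diag = c(1, 4))
mvn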
tfp.distributions.MultivariateNormalDiag(
"MultivariateNormalDiag/", batch_shape=(), event_shape=(2,), dtype=float32
)
Now we see batch_shape=(), event_shape=(2,), as expected.
Of course, we can combine both, creating batches of multivariate distributions:
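One way of doing that is the following (a sketch – the concrete loc and scale_diag values don't matter, only their shapes do):
mvn_batch <- tfd$MultivariateNormalDiag(
  loc = matrix(c(0, 0, 10, 10, -10, -10), ncol = 2, byrow = TRUE),
  scale_diag = matrix(rep(1, 6), ncol = 2)
)
mvn_batch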
This example defines a batch of three two-dimensional multivariate normal distributions.
Converting between batch shapes and event shapes
Strange as it may sound, situations arise where we want to transform distribution shapes between these kinds – in fact, we'll see such a case very soon.
tfd$Independent is used to convert dimensions in batch_shape into dimensions in event_shape.
Here is a batch of three independent Bernoulli distributions.
bs <- tfd$Bernoulli(probs=c(.3,.5,.7))
bs
tfp.distributions.Bernoulli(
"Bernoulli/", batch_shape=(3,), event_shape=(), dtype=int32
)
We can convert this to a virtual "three-dimensional" Bernoulli like this:
b <- tfd$Independent(bs, reinterpreted_batch_ndims = 1L)
b
tfp.distributions.Independent(
"IndependentBernoulli/", batch_shape=(), event_shape=(3,), dtype=int32
)
Here, reinterpreted_batch_ndims tells TFP how many of the batch dimensions are to be used for the event space, counting from the right of the shape list.
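A quick way to see the difference is to compare log probabilities (a sketch – the point here is the output shapes, not the particular values):
# batch of three univariate Bernoullis: one log probability per batch member, shape (3,)
bs$log_prob(c(1, 1, 0))
# "three-dimensional" Bernoulli: a single, scalar log probability
b$log_prob(c(1, 1, 0))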
With this basic understanding of TFP distributions, we're ready to see them used in a VAE.
We'll take the (not so) deep convolutional architecture from Representation learning with MMD-VAE and use distributions for sampling and computing probabilities. Optionally, our new VAE will be able to learn the prior distribution.
Concretely, the following exposition will have three parts.
First, we present common code applicable to both a VAE with a static prior and one that learns the parameters of the prior distribution.
Then, we have the training loop for the first (static-prior) VAE. Finally, we discuss the training loop and additional model involved in the second (prior-learning) VAE.
Presenting both versions one after the other leads to some code duplication, but avoids scattering confusing if-else branches throughout the code.
The second VAE is available as part of the Keras examples, so you don't have to copy out code snippets. The code there also contains additional functionality not discussed or replicated here, such as saving model weights.
So, let's start with the common part.
At the risk of repeating ourselves, here again are the preparatory steps (including a few additional library loads).
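As a sketch – assuming the eager setup shown earlier, plus tfdatasets and glue, both of which are used below – these preparatory steps could look like this:
library(keras)
use_implementation("tensorflow")
library(tensorflow)
tfe_enable_eager_execution(device_policy = "silent")

library(tfdatasets)
library(glue)

tfp <- import("tensorflow_probability")
tfd <- tfp$distributions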
Dataset
For a change from MNIST and Fashion-MNIST, we'll use the brand-new Kuzushiji-MNIST (Clanuwat et al. 2018).
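Below, train_images is assumed to have been loaded and preprocessed along the following lines (the file name and exact preprocessing steps are assumptions on our part):
# assumption: kmnist-train-imgs.npz has been downloaded from the Kuzushiji-MNIST repository
np <- import("numpy")
kuzushiji <- np$load("kmnist-train-imgs.npz")
train_images <- kuzushiji$get("arr_0")

# add a channel dimension and scale to the unit interval
train_images <- train_images %>%
  k_expand_dims() %>%
  k_cast(dtype = "float32")
train_images <- train_images / 255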
As in that other post, we stream the data via tfdatasets:
buffer_size <- 60000
batch_size <- 256
batches_per_epoch <- buffer_size / batch_size
train_dataset <- tensor_slices_dataset(train_images) %>%
dataset_shuffle(buffer_size) %>%
dataset_batch(batch_size)
Now let's see what changes in the encoder and decoder models.
Encoder
The encoder differs from what we had without TFP in that it does not return the approximate posterior means and variances directly as tensors. Instead, it returns a batch of multivariate normal distributions:
# you might want to change this depending on the dataset
latent_dim <- 2
encoder_model <- function(name = NULL) {
keras_model_custom(name = name, function(self) {
self$conv1 <-
layer_conv_2d(
filters = 32,
kernel_size = 3,
strides = 2,
activation = "relu"
)
self$conv2 <-
layer_conv_2d(
filters = 64,
kernel_size = 3,
strides = 2,
activation = "relu"
)
self$flatten <- layer_flatten()
self$dense <- layer_dense(units = 2 * latent_dim)
function (x, mask = NULL) {
x <- x %>%
self$conv1() %>%
self$conv2() %>%
self$flatten() %>%
self$dense()
tfd$MultivariateNormalDiag(
loc = x[, 1:latent_dim],
scale_diag = tf$nn$softplus(x[, (latent_dim + 1):(2 * latent_dim)] + 1e-5)
)
}
})
}
Let's try this out.
encoder <- encoder_model()
iter <- make_iterator_one_shot(train_dataset)
x <- iterator_get_next(iter)
approx_posterior <- encoder(x)
approx_posterior
tfp.distributions.MultivariateNormalDiag(
"MultivariateNormalDiag/", batch_shape=(256,), event_shape=(2,), dtype=float32
)
approx_posterior$sample()
tf.Tensor(
[[ 5.77791929e-01 -1.64988488e-02]
[ 7.93901443e-01 -1.00042784e+00]
[-1.56279251e-01 -4.06365871e-01]
...
...
[-6.47531569e-01 2.10889503e-02]], shape=(256, 2), dtype=float32)
We don't know about you, but we still enjoy the ease of inspecting values with eager execution – a lot.
Now, on to the decoder, which returns a distribution instead of a tensor as well.
Decoder
In the decoder, we see why transformations between batch shape and event shape are useful.
The output of self$deconv3 is four-dimensional. What we need is an on-off probability for every pixel.
Formerly, this was accomplished by feeding the tensor into a dense layer and applying a sigmoid activation.
Here, we use tfd$Independent to effectively transform the tensor into a probability distribution over three-dimensional images (width, height, channel(s)).
decoder_model <- function(name = NULL) {
keras_model_custom(name = name, function(self) {
self$dense <- layer_dense(units = 7 * 7 * 32, activation = "relu")
self$reshape <- layer_reshape(target_shape = c(7, 7, 32))
self$deconv1 <-
layer_conv_2d_transpose(
filters = 64,
kernel_size = 3,
strides = 2,
padding = "same",
activation = "relu"
)
self$deconv2 <-
layer_conv_2d_transpose(
filters = 32,
kernel_size = 3,
strides = 2,
padding = "same",
activation = "relu"
)
self$deconv3 <-
layer_conv_2d_transpose(
filters = 1,
kernel_size = 3,
strides = 1,
padding = "same"
)
function (x, mask = NULL) {
x <- x %>%
self$dense() %>%
self$reshape() %>%
self$deconv1() %>%
self$deconv2() %>%
self$deconv3()
tfd$Independent(tfd$Bernoulli(logits = x),
reinterpreted_batch_ndims = 3L)
}
})
}
Let's try this out, too.
decoder <- decoder_model()
approx_posterior_sample <- approx_posterior$sample()
decoder_likelihood <- decoder(approx_posterior_sample)
decoder_likelihood
tfp.distributions.Independent(
"IndependentBernoulli/", batch_shape=(256,), event_shape=(28, 28, 1), dtype=int32
)
This distribution will be used to generate the "reconstructions," as well as to determine the log-likelihood of the original samples.
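For illustration (a sketch, not code from the training script below): mean() of this Independent-Bernoulli distribution yields the per-pixel "on" probabilities, which can serve as reconstructed images, while log_prob() scores the original inputs.
# per-pixel probabilities, usable as reconstructions; shape (256, 28, 28, 1)
reconstructions <- decoder_likelihood$mean()
# log-likelihood of the original batch under the decoder distribution; one value per image
decoder_likelihood$log_prob(x)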
KL loss and optimizer
Both VAEs discussed below will need an optimizer …
optimizer <- tf$train$AdamOptimizer(1e-4)
… and both will delegate to compute_kl_loss to compute the KL part of the loss.
This helper function simply subtracts the log probability of the samples under the prior from their log likelihood under the approximate posterior.
compute_kl_loss <- function(
latent_prior,
approx_posterior,
approx_posterior_sample) {
kl_div <- approx_posterior$log_prob(approx_posterior_sample) -
latent_prior$log_prob(approx_posterior_sample)
avg_kl_div <- tf$reduce_mean(kl_div)
avg_kl_div
}
Now that we've looked at the common parts, let's first discuss how to train a VAE with a static prior.
In this VAE, we use TFP to create the usual isotropic Gaussian prior.
We then sample directly from this distribution in the training loop.
latent_prior <- tfd$MultivariateNormalDiag(
loc = tf$zeros(list(latent_dim)),
scale_identity_multiplier = 1
)
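Just to illustrate (this is not part of the loss computation shown below), sampling from this prior works like with any other TFP distribution:
# e.g., draw two latent vectors from the prior; shape (2, latent_dim)
latent_prior$sample(2L)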
And here is the complete training loop. We'll point out the important TFP-related steps below.
for (epoch in seq_len(num_epochs)) {
iter <- make_iterator_one_shot(train_dataset)
total_loss <- 0
total_loss_nll <- 0
total_loss_kl <- 0
until_out_of_range({
x <- iterator_get_next(iter)
with(tf$GradientTape(persistent = TRUE) %as% tape, {
approx_posterior <- encoder(x)
approx_posterior_sample <- approx_posterior$sample()
decoder_likelihood <- decoder(approx_posterior_sample)
nll <- -decoder_likelihood$log_prob(x)
avg_nll <- tf$reduce_mean(nll)
kl_loss <- compute_kl_loss(
latent_prior,
approx_posterior,
approx_posterior_sample
)
loss <- kl_loss + avg_nll
})
total_loss <- total_loss + loss
total_loss_nll <- total_loss_nll + avg_nll
total_loss_kl <- total_loss_kl + kl_loss
encoder_gradients <- tape$gradient(loss, encoder$variables)
decoder_gradients <- tape$gradient(loss, decoder$variables)
optimizer$apply_gradients(purrr::transpose(list(
encoder_gradients, encoder$variables
)),
global_step = tf$train$get_or_create_global_step())
optimizer$apply_gradients(purrr::transpose(list(
decoder_gradients, decoder$variables
)),
global_step = tf$train$get_or_create_global_step())
})
cat(
glue(
"Losses (epoch): {epoch}:",
" {(as.numeric(total_loss_nll)/batches_per_epoch) %>% spherical(4)} nll",
" {(as.numeric(total_loss_kl)/batches_per_epoch) %>% spherical(4)} kl",
" {(as.numeric(total_loss)/batches_per_epoch) %>% spherical(4)} whole"
),
"n"
)
}
Above, playing around with the encoder and the decoder, we have already seen how
approx_posterior <- encoder(x)
gives us a distribution we can sample from. We use it to obtain samples from the approximate posterior:
approx_posterior_sample <- approx_posterior$sample()
We take these samples and feed them to the decoder, which gives us on-off likelihoods for image pixels.
decoder_likelihood <- decoder(approx_posterior_sample)
Now the loss consists of the usual ELBO components: reconstruction loss and KL divergence.
The reconstruction loss we obtain directly from TFP, using the learned decoder distribution to assess the likelihood of the original input.
nll <- -decoder_likelihood$log_prob(x)
avg_nll <- tf$reduce_mean(nll)
The KL loss we get from compute_kl_loss, the helper function we saw above:
kl_loss <- compute_kl_loss(
latent_prior,
approx_posterior,
approx_posterior_sample
)
We add both and arrive at the overall VAE loss:
loss <- kl_loss + avg_nll
Apart from these changes due to using TFP, the training process is just normal backprop, the way it looks using eager execution.
Now let's see how, instead of using the standard isotropic Gaussian, we could learn a mixture of Gaussians.
The choice of the number of distributions here is fairly arbitrary. Just as with latent_dim, you might want to experiment and find out what works best on your dataset.
mixture_components <- 16
learnable_prior_model <- function(name = NULL, latent_dim, mixture_components) {
keras_model_custom(name = name, function(self) {
self$loc <-
tf$get_variable(
name = "loc",
shape = list(mixture_components, latent_dim),
dtype = tf$float32
)
self$raw_scale_diag <- tf$get_variable(
name = "raw_scale_diag",
shape = c(mixture_components, latent_dim),
dtype = tf$float32
)
self$mixture_logits <-
tf$get_variable(
name = "mixture_logits",
shape = c(mixture_components),
dtype = tf$float32
)
function (x, mask = NULL) {
tfd$MixtureSameFamily(
components_distribution = tfd$MultivariateNormalDiag(
loc = self$loc,
scale_diag = tf$nn$softplus(self$raw_scale_diag)
),
mixture_distribution = tfd$Categorical(logits = self$mixture_logits)
)
}
})
}
In TFP terminology, components_distribution is the underlying distribution type, and mixture_distribution holds the probabilities that individual components are chosen.
Note how self$loc, self$raw_scale_diag and self$mixture_logits are TensorFlow Variables and thus persistent and updatable by backprop.
Now we create the model.
latent_prior_model <- learnable_prior_model(
latent_dim = latent_dim,
mixture_components = mixture_components
)
How do we obtain a latent prior distribution we can sample from? A bit unusually, this model will be called without an input:
latent_prior <- latent_prior_model(NULL)
latent_prior
tfp.distributions.MixtureSameFamily(
"MixtureSameFamily/", batch_shape=(), event_shape=(2,), dtype=float32
)
Here now is the complete training loop. Note how we have a third model to backprop through.
for (epoch in seq_len(num_epochs)) {
iter <- make_iterator_one_shot(train_dataset)
total_loss <- 0
total_loss_nll <- 0
total_loss_kl <- 0
until_out_of_range({
x <- iterator_get_next(iter)
with(tf$GradientTape(persistent = TRUE) %as% tape, {
approx_posterior <- encoder(x)
approx_posterior_sample <- approx_posterior$sample()
decoder_likelihood <- decoder(approx_posterior_sample)
nll <- -decoder_likelihood$log_prob(x)
avg_nll <- tf$reduce_mean(nll)
latent_prior <- latent_prior_model(NULL)
kl_loss <- compute_kl_loss(
latent_prior,
approx_posterior,
approx_posterior_sample
)
loss <- kl_loss + avg_nll
})
total_loss <- total_loss + loss
total_loss_nll <- total_loss_nll + avg_nll
total_loss_kl <- total_loss_kl + kl_loss
encoder_gradients <- tape$gradient(loss, encoder$variables)
decoder_gradients <- tape$gradient(loss, decoder$variables)
prior_gradients <-
tape$gradient(loss, latent_prior_model$variables)
optimizer$apply_gradients(purrr::transpose(list(
encoder_gradients, encoder$variables
)),
global_step = tf$train$get_or_create_global_step())
optimizer$apply_gradients(purrr::transpose(list(
decoder_gradients, decoder$variables
)),
global_step = tf$train$get_or_create_global_step())
optimizer$apply_gradients(purrr::transpose(list(
prior_gradients, latent_prior_model$variables
)),
global_step = tf$train$get_or_create_global_step())
})
checkpoint$save(file_prefix = checkpoint_prefix)
cat(
glue(
"Losses (epoch): {epoch}:",
" {(as.numeric(total_loss_nll)/batches_per_epoch) %>% spherical(4)} nll",
" {(as.numeric(total_loss_kl)/batches_per_epoch) %>% spherical(4)} kl",
" {(as.numeric(total_loss)/batches_per_epoch) %>% spherical(4)} whole"
),
"n"
)
}
And that's it! For us, both VAEs yielded similar results, and we did not experience great differences when experimenting with latent dimensionality and the number of mixture components. But again, we wouldn't want to generalize to other datasets, architectures, etc.
Speaking of results, how do they look? Here we see letters generated after 40 epochs of training. On the left are random letters, and on the right, the usual VAE grid display of latent space.
Hopefully, we've succeeded in showing that TensorFlow Probability, eager execution, and Keras make for an attractive combination! If you relate the total amount of code required to the complexity of the task, as well as the depth of the concepts involved, this should come across as a pretty concise implementation.
In the near future, we plan to follow up with more involved applications of TensorFlow Probability, mostly from the area of representation learning. Stay tuned!