Adding uncertainty estimates to Keras models with tfprobability

About six months ago, we showed how to create a custom wrapper to obtain uncertainty estimates from a Keras network. Today we present a less laborious, as well as faster-running, way using tfprobability, the R wrapper to TensorFlow Probability. Like most posts on this blog, this one won't be short, so let's quickly state what you can expect in return for your reading time.

What to expect from this post

Starting from what not to expect: There won't be a recipe that tells you how exactly to set all parameters involved in order to report the "right" uncertainty measures. But then, what are the "right" uncertainty measures? Unless you happen to work with a method that has no (hyper-)parameters to tweak, there will always be questions about how to report uncertainty.

What you can expect, though, is an introduction to obtaining uncertainty estimates for Keras networks, as well as an empirical report of how tweaking (hyper-)parameters may affect the results. As in the aforementioned post, we perform our tests on both a simulated and a real dataset, the Combined Cycle Power Plant Data Set. At the end, instead of strict rules, you should have acquired some intuition that will transfer to other real-world datasets.

Did you notice our talking about Keras networks above? Indeed this post has an additional goal: So far, we haven't really discussed yet how tfprobability goes together with keras. Now we finally do (in short: they work together seamlessly).

Finally, the notions of aleatoric and epistemic uncertainty, which may have stayed a bit abstract in the prior post, should get much more concrete here.

Aleatoric vs. epistemic uncertainty

Somehow reminiscent of the classic decomposition of generalization error into bias and variance, splitting uncertainty into its epistemic and aleatoric constituents separates an irreducible from a reducible part.

The reducible part relates to imperfection in the model: In theory, if our model were perfect, epistemic uncertainty would vanish. Put differently, if the training data were unlimited – or if they comprised the whole population – we could just add capacity to the model until we've obtained a perfect fit.

In contrast, normally there is variation in our measurements. There may be one true process that determines my resting heart rate; nonetheless, actual measurements will vary over time. There is nothing to be done about this: This is the aleatoric part that just stays, to be factored into our expectations.

Now reading this, you might be thinking: "Wouldn't a model that actually were perfect capture those pseudo-random fluctuations?" We'll leave that philosophical question be; instead, we'll try to illustrate the usefulness of this distinction by example, in a practical way. In a nutshell, viewing a model's aleatoric uncertainty output should caution us to factor in appropriate deviations when making our predictions, while inspecting epistemic uncertainty should help us rethink the appropriateness of the chosen model.

Now let's dive in and see how we may accomplish our goal with tfprobability. We start with the simulated dataset.

Uncertainty estimates on simulated data

Dataset

We re-use the dataset from the Google TensorFlow Probability team's blog post on the same topic, with one exception: We extend the range of the independent variable a bit on the negative side, to better demonstrate the different methods' behaviors.

Here is the data-generating process. We also get library loading out of the way. Like previous posts on tfprobability, this one too features recently added functionality, so please use the development versions of tensorflow and tfprobability, as well as keras. Call install_tensorflow(version = "nightly") to obtain a current nightly build of TensorFlow and TensorFlow Probability:

# make sure we use the development versions of tensorflow, tfprobability and keras
devtools::install_github("rstudio/tensorflow")
devtools::install_github("rstudio/tfprobability")
devtools::install_github("rstudio/keras")

# and that we use a nightly build of TensorFlow and TensorFlow Probability
tensorflow::install_tensorflow(version = "nightly")

library(tensorflow)
library(tfprobability)
library(keras)

library(dplyr)
library(tidyr)
library(ggplot2)

# make sure this code is compatible with TensorFlow 2.0
tf$compat$v1$enable_v2_behavior()

# generate the data
x_min <- -40
x_max <- 60
n <- 150
w0 <- 0.125
b0 <- 5

normalize <- function(x) (x - x_min) / (x_max - x_min)

# training data; predictor
x <- x_min + (x_max - x_min) * runif(n) %>% as.matrix()

# training data; target
eps <- rnorm(n) * (3 * (0.25 + (normalize(x)) ^ 2))
y <- (w0 * x * (1 + sin(x)) + b0) + eps

# test data (predictor)
x_test <- seq(x_min, x_max, length.out = n) %>% as.matrix()

How does the data look?

ggplot(data.frame(x = x, y = y), aes(x, y)) + geom_point()

Simulated data

Figure 1: Simulated data

The task here is single-predictor regression, which in principle we can achieve using Keras dense layers.
Let's see how to enhance this by indicating uncertainty, starting from the aleatoric type.

Aleatoric uncertainty

Aleatoric uncertainty, by definition, is not a statement about the model. So why not have the model learn the uncertainty inherent in the data?

This is exactly how aleatoric uncertainty is operationalized in this approach. Instead of a single output per input – the predicted mean of the regression – here we have two outputs: one for the mean, and one for the standard deviation.

How will we use these? Until recently, we would have had to roll our own logic. Now with tfprobability, we make the network output not tensors, but distributions – put differently, we make the last layer a distribution layer.

Distribution layers are Keras layers, but contributed by tfprobability. The awesome thing about them is that we can train them with just tensors as targets, as usual: no need to compute probabilities ourselves.

Several specialized distribution layers exist, such as layer_kl_divergence_add_loss, layer_independent_bernoulli, or layer_mixture_same_family, but the most general one is layer_distribution_lambda. layer_distribution_lambda takes as inputs the preceding layer and outputs a distribution. In order to be able to do so, we need to tell it how to make use of the preceding layer's activations.

In our case, at some point we will want to have a dense layer with two units.

... %>% layer_dense(units = 2, activation = "linear") %>%

Then layer_distribution_lambda will use the first unit as the mean of a normal distribution, and the second as its standard deviation.

layer_distribution_lambda(function(x)
    tfd_normal(loc = x[, 1, drop = FALSE],
               scale = 1e-3 + tf$math$softplus(x[, 2, drop = FALSE])
               )
)

Here is the complete model we use. We insert an additional dense layer in front, with a relu activation, to give the model a bit more freedom and capacity. We discuss this, as well as that scale = ... line, as soon as we've finished our walkthrough of model training.

model <- keras_model_sequential() %>%
  layer_dense(units = 8, activation = "relu") %>%
  layer_dense(units = 2, activation = "linear") %>%
  layer_distribution_lambda(function(x)
    tfd_normal(loc = x[, 1, drop = FALSE],
               # ignore on first read, we'll come back to this
               # scale = 1e-3 + 0.05 * tf$math$softplus(x[, 2, drop = FALSE])
               scale = 1e-3 + tf$math$softplus(x[, 2, drop = FALSE])
               )
  )

For a model that outputs a distribution, the loss is the negative log likelihood given the target data.

negloglik <- function(y, model) - (model %>% tfd_log_prob(y))

We can now compile and fit the model.

learning_rate <- 0.01
model %>% compile(optimizer = optimizer_adam(lr = learning_rate), loss = negloglik)

model %>% fit(x, y, epochs = 1000)

We now call the model on the test data to obtain the predictions. The predictions actually are distributions, and we have 150 of them, one for each data point:

yhat <- model(tf$constant(x_test))

tfp.distributions.Normal("sequential/distribution_lambda/Normal/",
batch_shape=[150, 1], event_shape=[], dtype=float32)
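
Being full-fledged distribution objects, the predictions support more than extracting summary statistics. As a quick aside – an illustration of ours, not part of the original workflow – we could, e.g., sample from them:

# illustrative aside: draw 5 samples from each of the 150 predictive distributions
samples <- yhat %>% tfd_sample(5L)
samples$shape # TensorShape([5, 150, 1])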

To obtain the means and standard deviations – the latter being the measure of aleatoric uncertainty we're interested in – we just call tfd_mean and tfd_stddev on these distributions.
That will give us the predicted mean, as well as the predicted standard deviation, per data point.

mean <- yhat %>% tfd_mean()
sd <- yhat %>% tfd_stddev()

Let's visualize this. Here are the actual test data points, the predicted means, as well as confidence bands indicating the mean estimate plus/minus two standard deviations.

ggplot(data.frame(
  x = x,
  y = y,
  mean = as.numeric(mean),
  sd = as.numeric(sd)
),
aes(x, y)) +
  geom_point() +
  geom_line(aes(x = x_test, y = mean), color = "violet", size = 1.5) +
  geom_ribbon(aes(
    x = x_test,
    ymin = mean - 2 * sd,
    ymax = mean + 2 * sd
  ),
  alpha = 0.2,
  fill = "grey")

Aleatoric uncertainty on simulated data, using relu activation in the first dense layer.

Figure 2: Aleatoric uncertainty on simulated data, using relu activation in the first dense layer.

This looks pretty reasonable. What if we had used linear activation in the first layer? Meaning, what if the model had looked like this:

model <- keras_model_sequential() %>%
  layer_dense(units = 8, activation = "linear") %>%
  layer_dense(units = 2, activation = "linear") %>%
  layer_distribution_lambda(function(x)
    tfd_normal(loc = x[, 1, drop = FALSE],
               scale = 1e-3 + 0.05 * tf$math$softplus(x[, 2, drop = FALSE])
               )
  )

This time, the model doesn't capture the "form" of the data that well, as we've disallowed any nonlinearities.


Aleatoric uncertainty on simulated data, using linear activation in the first dense layer.

Figure 3: Aleatoric uncertainty on simulated data, using linear activation in the first dense layer.

Using linear activations only, we also have to do more experimenting with the scale = ... line to get the result to look "right". With relu, on the other hand, results are pretty robust to changes in how scale is computed. Which activation do we choose? If our goal is to adequately model variation in the data, we can just choose relu – and leave assessing uncertainty in the model to a different technique (the epistemic uncertainty that's up next).
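
For concreteness, here are the two scale parameterizations appearing in the models above, written as standalone helpers – the 1e-3 constant keeps the standard deviation strictly positive, and the 0.05 multiplier dampens how strongly the second unit's activations drive the spread (both values are starting points to tweak, not magic numbers):

# the two scale computations used above, as standalone helpers
# (the small constant keeps the standard deviation strictly positive)
scale_plain  <- function(x) 1e-3 + tf$math$softplus(x[, 2, drop = FALSE])
scale_damped <- function(x) 1e-3 + 0.05 * tf$math$softplus(x[, 2, drop = FALSE])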

Overall, it seems like aleatoric uncertainty is the easy part. We want the network to learn the variation inherent in the data, which it does. What do we gain? Instead of obtaining just point estimates, which in this example might turn out pretty bad in the two fan-like areas of the data on the left and right sides, we learn about the spread as well. We'll thus be appropriately cautious depending on what input range we're making predictions for.

Epistemic uncertainty

Now our focus is on the model. Given a specific model (e.g., one from the linear family), what kind of data does it say conforms to its expectations?

To answer this question, we make use of a variational-dense layer.
This is again a Keras layer provided by tfprobability. Internally, it works by minimizing the evidence lower bound (ELBO), thus striving to find an approximative posterior that does two things:

  1. fit the actual data well (put differently: achieve a high log likelihood), and
  2. stay close to a prior (as measured by KL divergence).
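
Schematically, the quantity being minimized is thus the negative log likelihood plus a (weighted) KL term. Here is a conceptual sketch of that decomposition – plain R, not tfprobability's actual implementation:

# conceptual sketch: loss = NLL + kl_weight * KL(posterior || prior)
elbo_loss <- function(nll, kl_div, kl_weight) {
  nll + kl_weight * kl_div
}

# e.g., with hypothetical values for a dataset of n = 150 points
elbo_loss(nll = 1.7, kl_div = 42, kl_weight = 1 / 150)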

As users, we actually specify the form of the posterior as well as that of the prior. Here is how a prior could look.

prior_trainable <-
  function(kernel_size,
           bias_size = 0,
           dtype = NULL) {
    n <- kernel_size + bias_size
    keras_model_sequential() %>%
      # we'll comment on this soon
      # layer_variable(n, dtype = dtype, trainable = FALSE) %>%
      layer_variable(n, dtype = dtype, trainable = TRUE) %>%
      layer_distribution_lambda(function(t) {
        tfd_independent(tfd_normal(loc = t, scale = 1),
                        reinterpreted_batch_ndims = 1)
      })
  }

This prior is itself a Keras model, containing a layer that wraps a variable and a layer_distribution_lambda, that type of distribution-yielding layer we've just encountered above. The variable layer could be fixed (non-trainable) or trainable, corresponding to a genuine prior or a prior learned from the data in an empirical Bayes-like way. The distribution layer outputs a normal distribution since we're in a regression setting.

The posterior too is a Keras model – definitely trainable this time. It too outputs a normal distribution:

posterior_mean_field <-
  function(kernel_size,
           bias_size = 0,
           dtype = NULL) {
    n <- kernel_size + bias_size
    c <- log(expm1(1))
    keras_model_sequential(list(
      layer_variable(shape = 2 * n, dtype = dtype),
      layer_distribution_lambda(
        make_distribution_fn = function(t) {
          tfd_independent(tfd_normal(
            loc = t[1:n],
            scale = 1e-5 + tf$nn$softplus(c + t[(n + 1):(2 * n)])
            ), reinterpreted_batch_ndims = 1)
        }
      )
    ))
  }

Now that we've defined both, we can set up the model's layers. The first one, a variational-dense layer, has a single unit. The ensuing distribution layer then takes that unit's output and uses it as the mean of a normal distribution – whose scale is fixed at 1:

model <- keras_model_sequential() %>%
  layer_dense_variational(
    units = 1,
    make_posterior_fn = posterior_mean_field,
    make_prior_fn = prior_trainable,
    kl_weight = 1 / n
  ) %>%
  layer_distribution_lambda(function(x)
    tfd_normal(loc = x, scale = 1))

You may have noticed one argument to layer_dense_variational we haven't discussed yet, kl_weight.
This is used to scale the KL divergence's contribution to the total loss, and normally should equal one over the number of data points.

Training the model is straightforward. As users, we only specify the negative log likelihood part of the loss; the KL divergence part is taken care of transparently by the framework.

negloglik <- function(y, model) - (model %>% tfd_log_prob(y))
model %>% compile(optimizer = optimizer_adam(lr = learning_rate), loss = negloglik)
model %>% fit(x, y, epochs = 1000)

Because of the stochasticity inherent in a variational-dense layer, each time we call this model, we obtain different results: different normal distributions, in this case.
To obtain the uncertainty estimates we're looking for, we therefore call the model a bunch of times – 100, say:

yhats <- purrr::map(1:100, function(x) model(tf$constant(x_test)))

We can now plot these 100 predictions – lines, in this case, as there are no nonlinearities:

means <-
  purrr::map(yhats, purrr::compose(as.matrix, tfd_mean)) %>% abind::abind()

lines <- data.frame(cbind(x_test, means)) %>%
  gather(key = run, value = value, -X1)

mean <- apply(means, 1, mean)

ggplot(data.frame(x = x, y = y, mean = as.numeric(mean)), aes(x, y)) +
  geom_point() +
  geom_line(aes(x = x_test, y = mean), color = "violet", size = 1.5) +
  geom_line(
    data = lines,
    aes(x = X1, y = value, color = run),
    alpha = 0.3,
    size = 0.5
  ) +
  theme(legend.position = "none")

Epistemic uncertainty on simulated data, using linear activation in the variational-dense layer.

Figure 4: Epistemic uncertainty on simulated data, using linear activation in the variational-dense layer.

What we see here are essentially different models, consistent with the assumptions built into the architecture. What we're not accounting for is the spread in the data. Can we do both? We can; but first, let's comment on a few choices that were made and see how they affect the results.

To keep this post from growing to infinite size, we've refrained from performing a systematic experiment; please take what follows not as generalizable statements, but as pointers to things you will want to keep in mind in your own ventures. Especially, no (hyper-)parameter is an island; they can interact in unforeseen ways.

After those words of caution, here are some things we noticed.

  1. One question you might ask: Before, in the aleatoric uncertainty setup, we added an additional dense layer to the model, with relu activation. What if we did this here?
    Firstly, we're not adding any additional, non-variational layers, in order to keep the setup "fully Bayesian" – we want priors at every level. As to using relu in layer_dense_variational, we did try that, and the results look pretty similar:

Epistemic uncertainty on simulated data, using relu activation in the variational-dense layer.

Figure 5: Epistemic uncertainty on simulated data, using relu activation in the variational-dense layer.

However, things look pretty different if we drastically reduce training time … which brings us to the next observation.

  2. Unlike in the aleatoric setup, the number of training epochs matters a lot. If we train, quote unquote, too long, the posterior estimates will get closer and closer to the posterior mean: we lose uncertainty. What happens if we train "too short" is even more notable. Here are the results for the linear-activation as well as the relu-activation cases:

Epistemic uncertainty on simulated data if we train for 100 epochs only. Left: linear activation. Right: relu activation.

Figure 6: Epistemic uncertainty on simulated data if we train for 100 epochs only. Left: linear activation. Right: relu activation.

Interestingly, both model families look very different now, and while the linear-activation family looks more reasonable at first, it still considers an overall negative slope consistent with the data.

So how many epochs are "long enough"? From observation, we'd say that a working heuristic should probably be based on the rate of loss reduction. But certainly, it will make sense to try different numbers of epochs and check the effect on model behavior. As an aside, monitoring estimates over training time may even yield important insights into the assumptions built into a model (e.g., the effect of different activation functions).
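
One way to operationalize such a loss-reduction heuristic is Keras's standard early-stopping callback – a sketch, where the patience value of 50 is an arbitrary choice, and where it is worth double-checking that stopping early does not leave uncertainty estimates in the inflated state just described:

# hypothetical: stop once the training loss has not improved for 50 epochs
model %>% fit(
  x, y,
  epochs = 1000,
  callbacks = list(callback_early_stopping(monitor = "loss", patience = 50))
)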

  3. As important as the number of epochs trained, and similar in effect, is the learning rate. If we replace the learning rate in this setup by 0.001, results will look similar to what we saw above for the epochs = 100 case. Again, we will want to try different learning rates and make sure we train the model "to completion" in some reasonable sense.

  4. To conclude this section, let's quickly look at what happens if we vary two other parameters. What if the prior were non-trainable (see the commented line above)? And what if we scaled the importance of the KL divergence (kl_weight in layer_dense_variational's argument list) differently, replacing kl_weight = 1/n by kl_weight = 1 (or equivalently, removing it)? Here are the respective results for an otherwise-default setup, with the corresponding code tweaks shown after the figure. They don't lend themselves to generalization – on different (e.g., bigger!) datasets the results will most certainly look different – but they are definitely interesting to observe.


Epistemic uncertainty on simulated data. Left: kl_weight = 1. Right: prior non-trainable.

Figure 7: Epistemic uncertainty on simulated data. Left: kl_weight = 1. Right: prior non-trainable.
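
In code, those two variations amount to small tweaks of the definitions above (a sketch):

# variation 1: give the KL divergence full weight
model <- keras_model_sequential() %>%
  layer_dense_variational(
    units = 1,
    make_posterior_fn = posterior_mean_field,
    make_prior_fn = prior_trainable,
    kl_weight = 1 # instead of 1 / n
  ) %>%
  layer_distribution_lambda(function(x)
    tfd_normal(loc = x, scale = 1))

# variation 2: a non-trainable prior - in prior_trainable, use
# layer_variable(n, dtype = dtype, trainable = FALSE)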

Now let's come back to the question: We've modeled spread in the data, we've peeked into the heart of the model – can we do both at the same time?

We can, if we combine both approaches. We add an additional unit to the variational-dense layer and use it to learn the variance: once for each "sub-model" contained in the model.

Combining both aleatoric and epistemic uncertainty

Reusing the prior and posterior from above, this is how the final model looks:

model <- keras_model_sequential() %>%
  layer_dense_variational(
    units = 2,
    make_posterior_fn = posterior_mean_field,
    make_prior_fn = prior_trainable,
    kl_weight = 1 / n
  ) %>%
  layer_distribution_lambda(function(x)
    tfd_normal(loc = x[, 1, drop = FALSE],
               scale = 1e-3 + tf$math$softplus(0.01 * x[, 2, drop = FALSE])
               )
    )

We train this model just like the epistemic-uncertainty-only one. We then obtain a measure of uncertainty per predicted line. Or in the terms we used above, we now have an ensemble of models, each with its own indication of spread in the data. Here is a way we could display this – each colored line is the mean of a distribution, surrounded by a confidence band indicating +/- two standard deviations.

yhats <- purrr::map(1:100, function(x) model(tf$constant(x_test)))
means <-
  purrr::map(yhats, purrr::compose(as.matrix, tfd_mean)) %>% abind::abind()
sds <-
  purrr::map(yhats, purrr::compose(as.matrix, tfd_stddev)) %>% abind::abind()

means_gathered <- data.frame(cbind(x_test, means)) %>%
  gather(key = run, value = mean_val, -X1)
sds_gathered <- data.frame(cbind(x_test, sds)) %>%
  gather(key = run, value = sd_val, -X1)

lines <-
  means_gathered %>% inner_join(sds_gathered, by = c("X1", "run"))
mean <- apply(means, 1, mean)

ggplot(data.frame(x = x, y = y, mean = as.numeric(mean)), aes(x, y)) +
  geom_point() +
  theme(legend.position = "none") +
  geom_line(aes(x = x_test, y = mean), color = "violet", size = 1.5) +
  geom_line(
    data = lines,
    aes(x = X1, y = mean_val, color = run),
    alpha = 0.6,
    size = 0.5
  ) +
  geom_ribbon(
    data = lines,
    aes(
      x = X1,
      ymin = mean_val - 2 * sd_val,
      ymax = mean_val + 2 * sd_val,
      group = run
    ),
    alpha = 0.05,
    fill = "grey",
    inherit.aes = FALSE
  )

Displaying both epistemic and aleatoric uncertainty on the simulated dataset.

Figure 8: Displaying both epistemic and aleatoric uncertainty on the simulated dataset.

Nice! This looks like something we could report.

As you might imagine, this model, too, is sensitive to how long (think: number of epochs) or how fast (think: learning rate) we train it. And compared to the epistemic-uncertainty-only model, there is an additional choice to be made here: the scaling of the previous layer's activation – the 0.01 in the scale argument to tfd_normal:

scale = 1e-3 + tf$math$softplus(0.01 * x[, 2, drop = FALSE])

Keeping everything else constant, here we vary that parameter between 0.01 and 0.05:


Epistemic plus aleatoric uncertainty on the simulated dataset: Varying the scale argument.

Figure 9: Epistemic plus aleatoric uncertainty on the simulated dataset: Varying the scale argument.

Evidently, this is another parameter we should be prepared to experiment with.
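
A convenient way to run such comparisons is to factor model construction into a helper function; build_model below is our own hypothetical addition, not code from the original workflow:

# hypothetical helper: rebuild the combined model for a given scale multiplier
build_model <- function(scale_mult) {
  keras_model_sequential() %>%
    layer_dense_variational(
      units = 2,
      make_posterior_fn = posterior_mean_field,
      make_prior_fn = prior_trainable,
      kl_weight = 1 / n
    ) %>%
    layer_distribution_lambda(function(x)
      tfd_normal(loc = x[, 1, drop = FALSE],
                 scale = 1e-3 + tf$math$softplus(scale_mult * x[, 2, drop = FALSE])))
}

model_001 <- build_model(0.01)
model_005 <- build_model(0.05)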

Now that we've introduced all three ways of presenting uncertainty – aleatoric only, epistemic only, or both – let's see them on the aforementioned Combined Cycle Power Plant Data Set. Please see our previous post on uncertainty for a quick characterization, as well as visualization, of the dataset.

Combined Cycle Power Plant Data Set

To keep this post at a digestible length, we'll refrain from trying as many alternatives as with the simulated data and mainly stick with what worked well there. This should also give us an idea of how well those "defaults" generalize. We separately inspect two scenarios: the single-predictor setup (using each of the four available predictors alone), and the complete one (using all four predictors at once).

The dataset is loaded just as in the previous post.
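
For readers who want to follow along without switching posts, here is a minimal loading-and-splitting sketch. The file name comes from the UCI repository; the seed, split proportion, and scaling are assumptions of ours – the previous post's preprocessing is authoritative:

# minimal sketch; see the previous post for the actual preprocessing
library(readxl)

# UCI file with predictors AT, V, AP, RH and target PE
df <- read_excel("CCPP/Folds5x2_pp.xlsx")

set.seed(777)
train_idx <- sample(nrow(df), 0.8 * nrow(df))

dm <- scale(as.matrix(df)) # assumption: work with standardized values
X_train <- dm[train_idx, 1:4]
y_train <- dm[train_idx, 5, drop = FALSE]
X_val <- dm[-train_idx, 1:4]
y_val <- dm[-train_idx, 5, drop = FALSE]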

First we look at the single-predictor case, starting from aleatoric uncertainty.

Single predictor: Aleatoric uncertainty

Here is the "default" aleatoric model again. We also duplicate the plotting code here for the reader's convenience.

n <- nrow(X_train) # 7654
n_epochs <- 10 # we need fewer epochs because the dataset is much bigger

batch_size <- 100

learning_rate <- 0.01

# variable to fit - change to 2, 3, 4 to get the other predictors
i <- 1

model <- keras_model_sequential() %>%
  layer_dense(units = 16, activation = "relu") %>%
  layer_dense(units = 2, activation = "linear") %>%
  layer_distribution_lambda(function(x)
    tfd_normal(loc = x[, 1, drop = FALSE],
               scale = tf$math$softplus(x[, 2, drop = FALSE])
               )
    )

negloglik <- function(y, model) - (model %>% tfd_log_prob(y))

model %>% compile(optimizer = optimizer_adam(lr = learning_rate), loss = negloglik)

hist <-
  model %>% fit(
    X_train[, i, drop = FALSE],
    y_train,
    validation_data = list(X_val[, i, drop = FALSE], y_val),
    epochs = n_epochs,
    batch_size = batch_size
  )

yhat <- model(tf$constant(X_val[, i, drop = FALSE]))

mean <- yhat %>% tfd_mean()
sd <- yhat %>% tfd_stddev()

ggplot(data.frame(
  x = X_val[, i],
  y = y_val,
  mean = as.numeric(mean),
  sd = as.numeric(sd)
),
aes(x, y)) +
  geom_point() +
  geom_line(aes(x = x, y = mean), color = "violet", size = 1.5) +
  geom_ribbon(aes(
    x = x,
    ymin = mean - 2 * sd,
    ymax = mean + 2 * sd
  ),
  alpha = 0.4,
  fill = "grey")

How well does this work?


Aleatoric uncertainty on the Combined Cycle Power Plant Data Set; single predictors.

Figure 10: Aleatoric uncertainty on the Combined Cycle Power Plant Data Set; single predictors.

This looks pretty good, we'd say! How about epistemic uncertainty?

Single predictor: Epistemic uncertainty

Here’s the code:

posterior_mean_field <-
  function(kernel_size,
           bias_size = 0,
           dtype = NULL) {
    n <- kernel_size + bias_size
    c <- log(expm1(1))
    keras_model_sequential(list(
      layer_variable(shape = 2 * n, dtype = dtype),
      layer_distribution_lambda(
        make_distribution_fn = function(t) {
          tfd_independent(tfd_normal(
            loc = t[1:n],
            scale = 1e-5 + tf$nn$softplus(c + t[(n + 1):(2 * n)])
          ), reinterpreted_batch_ndims = 1)
        }
      )
    ))
  }

prior_trainable <-
  function(kernel_size,
           bias_size = 0,
           dtype = NULL) {
    n <- kernel_size + bias_size
    keras_model_sequential() %>%
      layer_variable(n, dtype = dtype, trainable = TRUE) %>%
      layer_distribution_lambda(function(t) {
        tfd_independent(tfd_normal(loc = t, scale = 1),
                        reinterpreted_batch_ndims = 1)
      })
  }

model <- keras_model_sequential() %>%
  layer_dense_variational(
    units = 1,
    make_posterior_fn = posterior_mean_field,
    make_prior_fn = prior_trainable,
    kl_weight = 1 / n,
    activation = "linear"
  ) %>%
  layer_distribution_lambda(function(x)
    tfd_normal(loc = x, scale = 1))

negloglik <- function(y, model) - (model %>% tfd_log_prob(y))
model %>% compile(optimizer = optimizer_adam(lr = learning_rate), loss = negloglik)
hist <-
  model %>% fit(
    X_train[, i, drop = FALSE],
    y_train,
    validation_data = list(X_val[, i, drop = FALSE], y_val),
    epochs = n_epochs,
    batch_size = batch_size
  )

yhats <- purrr::map(1:100, function(x)
  model(tf$constant(X_val[, i, drop = FALSE])))
  
means <-
  purrr::map(yhats, purrr::compose(as.matrix, tfd_mean)) %>% abind::abind()

lines <- data.frame(cbind(X_val[, i], means)) %>%
  gather(key = run, value = value, -X1)

mean <- apply(means, 1, mean)
ggplot(data.frame(x = X_val[, i], y = y_val, mean = as.numeric(mean)), aes(x, y)) +
  geom_point() +
  geom_line(aes(x = X_val[, i], y = mean), color = "violet", size = 1.5) +
  geom_line(
    data = lines,
    aes(x = X1, y = value, color = run),
    alpha = 0.3,
    size = 0.5
  ) +
  theme(legend.position = "none")

And this is the result.


Epistemic uncertainty on the Combined Cycle Power Plant Data Set; single predictors.

Figure 11: Epistemic uncertainty on the Combined Cycle Power Plant Data Set; single predictors.

As with the simulated data, the linear models seem to "do the right thing". And here too, we think we will want to augment this with the spread in the data: thus, on to approach number three.

Single predictor: Combining both types

Here we go. Again, posterior_mean_field and prior_trainable look just like in the epistemic-only case.

model <- keras_model_sequential() %>%
  layer_dense_variational(
    units = 2,
    make_posterior_fn = posterior_mean_field,
    make_prior_fn = prior_trainable,
    kl_weight = 1 / n,
    activation = "linear"
  ) %>%
  layer_distribution_lambda(function(x)
    tfd_normal(loc = x[, 1, drop = FALSE],
               scale = 1e-3 + tf$math$softplus(0.01 * x[, 2, drop = FALSE])))

negloglik <- function(y, model) - (model %>% tfd_log_prob(y))
model %>% compile(optimizer = optimizer_adam(lr = learning_rate), loss = negloglik)
hist <-
  model %>% fit(
    X_train[, i, drop = FALSE],
    y_train,
    validation_data = list(X_val[, i, drop = FALSE], y_val),
    epochs = n_epochs,
    batch_size = batch_size
  )

yhats <- purrr::map(1:100, function(x)
  model(tf$constant(X_val[, i, drop = FALSE])))
means <-
  purrr::map(yhats, purrr::compose(as.matrix, tfd_mean)) %>% abind::abind()
sds <-
  purrr::map(yhats, purrr::compose(as.matrix, tfd_stddev)) %>% abind::abind()

means_gathered <- data.frame(cbind(X_val[, i], means)) %>%
  gather(key = run, value = mean_val, -X1)
sds_gathered <- data.frame(cbind(X_val[, i], sds)) %>%
  gather(key = run, value = sd_val, -X1)

lines <-
  means_gathered %>% inner_join(sds_gathered, by = c("X1", "run"))

mean <- apply(means, 1, mean)

# lines <- lines %>% filter(run == "X3" | run == "X4")

ggplot(data.frame(x = X_val[, i], y = y_val, mean = as.numeric(mean)), aes(x, y)) +
  geom_point() +
  theme(legend.position = "none") +
  geom_line(aes(x = X_val[, i], y = mean), color = "violet", size = 1.5) +
  geom_line(
    data = lines,
    aes(x = X1, y = mean_val, color = run),
    alpha = 0.2,
    size = 0.5
  ) +
  geom_ribbon(
    data = lines,
    aes(
      x = X1,
      ymin = mean_val - 2 * sd_val,
      ymax = mean_val + 2 * sd_val,
      group = run
    ),
    alpha = 0.01,
    fill = "grey",
    inherit.aes = FALSE
  )

And the output?


Combined uncertainty on the Combined Cycle Power Plant Data Set; single predictors.

Figure 12: Combined uncertainty on the Combined Cycle Power Plant Data Set; single predictors.

This looks useful! Let's wrap up with our final test case: using all four predictors together.

All predictors

The training code used in this scenario looks just like before, apart from our feeding all predictors to the model. For plotting, we resort to displaying the first principal component on the x-axis – this makes the plots look noisier than before. We also display fewer lines for the epistemic and epistemic-plus-aleatoric cases (20 instead of 100). Here are the results:


Uncertainty (aleatoric, epistemic, both) on the Combined Cycle Power Plant Data Set; all predictors.

Figure 13: Uncertainty (aleatoric, epistemic, both) on the Combined Cycle Power Plant Data Set; all predictors.
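
As to the principal component just mentioned, it could be computed with base R's prcomp – a sketch, since this post does not show the original plotting code for the all-predictors case:

# project the validation predictors onto their first principal component
pca <- prcomp(X_val, center = TRUE, scale. = TRUE)
x_pc1 <- pca$x[, 1]
# x_pc1 then takes the place of X_val[, i] in the plotting code above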

Conclusion

Where does this leave us? Compared to the learnable-dropout approach described in the previous post, the way presented here is a lot easier, faster, and more intuitively understandable.
The methods per se are so easy to use that in this first introductory post, we could already afford to explore alternatives: something we had no time for in that earlier exposition.

In fact, we hope this post leaves you prepared to run your own experiments, on your own data.
Obviously, you will have to make decisions, but isn't that the way it is in data science? There's no way around making decisions; we should just be prepared to justify them …
Thanks for reading!
