FNN-VAE for noisy time series forecasting



This post did not end up quite the way I’d imagined. A quick follow-up on the recent Time series prediction with
FNN-LSTM, it was supposed to demonstrate how noisy time series (so common in
practice) could profit from a change in architecture: Instead of FNN-LSTM, an LSTM autoencoder regularized by false nearest
neighbors (FNN) loss, use FNN-VAE, a variational autoencoder constrained by the same. However, FNN-VAE did not seem to handle
noise better than FNN-LSTM. No plot, no post, then?

On the other hand, this is not a scientific study, with hypothesis and experimental setup all preregistered; all that really
matters is whether there is something useful to report. And it looks like there is.

Firstly, FNN-VAE, while on par performance-wise with FNN-LSTM, is far superior in that other meaning of “performance”:
Training goes a lot faster for FNN-VAE.

Secondly, while we don’t see much difference between FNN-LSTM and FNN-VAE, we do see a clear impact of using FNN loss. Adding in FNN loss strongly reduces mean squared error with respect to the underlying (denoised) series, especially in the case of VAE, but for LSTM as well. This is of particular interest with VAE, as it comes with a regularizer
out-of-the-box, namely, Kullback-Leibler (KL) divergence.

Of course, we don’t claim that similar results will always be obtained on other noisy series; nor did we tune any of
the models “to death.” For what could be the intent of such a post but to show our readers interesting (and promising) ideas
to pursue in their own experimentation?

The context

This post is the third in a mini-series.

In Deep attractors: Where deep learning meets chaos, we
explained, with a substantial detour into chaos theory, the idea of FNN loss, introduced in (Gilpin 2020). Please consult
that first post for theoretical background and intuitions behind the technique.

The subsequent post, Time series prediction with FNN-LSTM, showed
how to use an LSTM autoencoder, constrained by FNN loss, for forecasting (as opposed to reconstructing an attractor). The results were stunning: In multi-step prediction (12-120 steps, with that number varying by
dataset), the short-term forecasts were drastically improved by adding in FNN regularization. See that second post for
experimental setup and results on four very different, non-synthetic datasets.

Today, we show how to replace the LSTM autoencoder by a (convolutional) VAE. In light of the experimentation results,
already hinted at above, it is completely plausible that the “variational” part is not even so important here, and that a
convolutional autoencoder with just MSE loss would have performed just as well on those data. In fact, to find out, it’s
enough to remove the call to reparameterize() and multiply the KL component of the loss by 0. (We leave this to the
reader, to keep the post at reasonable length.)

One last piece of context, in case you haven’t read the two previous posts and would like to jump in here directly. We’re
doing time series forecasting; so why this talk of autoencoders? Shouldn’t we just be comparing an LSTM (or some other type of
RNN, for that matter) to a convnet? In fact, the necessity of a latent representation is due to the very idea of FNN: The
latent code is supposed to reflect the true attractor of a dynamical system. That is, if the attractor of the underlying
system is roughly two-dimensional, we hope to find that just two of the latent variables have considerable variance. (This
reasoning is explained in a lot of detail in the previous posts.)

FNN-VAE

So, let’s start with the code for our new model.

The encoder takes the time series, of format batch_size x num_timesteps x num_features just like in the LSTM case, and
produces a flat, 10-dimensional output: the latent code, which FNN loss is computed on.

library(tensorflow)
library(keras)
library(tfdatasets)
library(tfautograph)
library(reticulate)
library(purrr)

vae_encoder_model <- function(n_timesteps,
                               n_features,
                               n_latent,
                               name = NULL) {
  keras_model_custom(name = name, function(self) {
    self$conv1 <- layer_conv_1d(kernel_size = 3,
                                filters = 16,
                                strides = 2)
    self$act1 <- layer_activation_leaky_relu()
    self$batchnorm1 <- layer_batch_normalization()
    self$conv2 <- layer_conv_1d(kernel_size = 7,
                                filters = 32,
                                strides = 2)
    self$act2 <- layer_activation_leaky_relu()
    self$batchnorm2 <- layer_batch_normalization()
    self$conv3 <- layer_conv_1d(kernel_size = 9,
                                filters = 64,
                                strides = 2)
    self$act3 <- layer_activation_leaky_relu()
    self$batchnorm3 <- layer_batch_normalization()
    self$conv4 <- layer_conv_1d(
      kernel_size = 9,
      filters = n_latent,
      strides = 2,
      activation = "linear" 
    )
    self$batchnorm4 <- layer_batch_normalization()
    self$flat <- layer_flatten()
    
    function (x, mask = NULL) {
      x %>%
        self$conv1() %>%
        self$act1() %>%
        self$batchnorm1() %>%
        self$conv2() %>%
        self$act2() %>%
        self$batchnorm2() %>%
        self$conv3() %>%
        self$act3() %>%
        self$batchnorm3() %>%
        self$conv4() %>%
        self$batchnorm4() %>%
        self$flat()
    }
  })
}

The decoder starts from this flat representation and decompresses it into a time sequence. In both encoder and decoder
(de-)conv layers, parameters are chosen to handle a sequence length (num_timesteps) of 120, which is what we’ll use for
prediction below.

vae_decoder_model <- function(n_timesteps,
                               n_features,
                               n_latent,
                               name = NULL) {
  keras_model_custom(name = name, function(self) {
    self$reshape <- layer_reshape(target_shape = c(1, n_latent))
    self$conv1 <- layer_conv_1d_transpose(kernel_size = 15,
                                          filters = 64,
                                          strides = 3)
    self$act1 <- layer_activation_leaky_relu()
    self$batchnorm1 <- layer_batch_normalization()
    self$conv2 <- layer_conv_1d_transpose(kernel_size = 11,
                                          filters = 32,
                                          strides = 3)
    self$act2 <- layer_activation_leaky_relu()
    self$batchnorm2 <- layer_batch_normalization()
    self$conv3 <- layer_conv_1d_transpose(
      kernel_size = 9,
      filters = 16,
      strides = 2,
      output_padding = 1
    )
    self$act3 <- layer_activation_leaky_relu()
    self$batchnorm3 <- layer_batch_normalization()
    self$conv4 <- layer_conv_1d_transpose(
      kernel_size = 7,
      filters = 1,
      strides = 1,
      activation = "linear"
    )
    self$batchnorm4 <- layer_batch_normalization()
    
    function (x, mask = NULL) {
      x %>%
        self$reshape() %>%
        self$conv1() %>%
        self$act1() %>%
        self$batchnorm1() %>%
        self$conv2() %>%
        self$act2() %>%
        self$batchnorm2() %>%
        self$conv3() %>%
        self$act3() %>%
        self$batchnorm3() %>%
        self$conv4() %>%
        self$batchnorm4()
    }
  })
}
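
As a quick sanity check on the claim that these layers fit a sequence length of 120, the output lengths can be traced by hand. The sketch below is plain R and does not touch Keras. It assumes that all layers use "valid" padding (no padding argument is set in the code above) and that kernel sizes are at least as large as the strides, in which case the usual length formulas for 1-d (transposed) convolutions apply. The helper names conv_out() and deconv_out() are made up for this illustration.

```r
# Output length of a 1-d convolution with "valid" padding (assumed here).
conv_out <- function(len, kernel_size, stride) {
  floor((len - kernel_size) / stride) + 1
}

# Output length of a 1-d transposed convolution with "valid" padding,
# valid as long as kernel_size >= stride.
deconv_out <- function(len, kernel_size, stride, output_padding = 0) {
  (len - 1) * stride + kernel_size + output_padding
}

# Encoder: kernel sizes and strides as in vae_encoder_model().
enc_kernels <- c(3, 7, 9, 9)
enc_strides <- c(2, 2, 2, 2)
enc_lengths <- Reduce(
  function(len, i) conv_out(len, enc_kernels[i], enc_strides[i]),
  x = seq_along(enc_kernels),
  init = 120,
  accumulate = TRUE
)

# Decoder: kernel sizes, strides and output padding as in vae_decoder_model().
dec_kernels <- c(15, 11, 9, 7)
dec_strides <- c(3, 3, 2, 1)
dec_out_pad <- c(0, 0, 1, 0)
dec_lengths <- Reduce(
  function(len, i) deconv_out(len, dec_kernels[i], dec_strides[i], dec_out_pad[i]),
  x = seq_along(dec_kernels),
  init = 1,
  accumulate = TRUE
)

enc_lengths  # sequence length after each encoder layer, starting at 120
dec_lengths  # sequence length after each decoder layer, starting at 1
```

Under these assumptions the encoder shrinks 120 timesteps to a single step with n_latent channels (which layer_flatten() turns into the latent code), and the decoder expands that single step back to 120.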

Note that even though we called these constructors vae_encoder_model() and vae_decoder_model(), there is nothing
variational to these models per se; they are really just an encoder and a decoder, respectively. Metamorphosis into a VAE will
happen in the training procedure; in fact, the only two things that will make this a VAE are going to be the
reparameterization of the latent layer and the added-in KL loss.
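
To make these two ingredients concrete outside of TensorFlow, here is a minimal base-R sketch. It assumes a diagonal Gaussian posterior and a standard normal prior, which is the textbook VAE setup; the function names reparameterize_r() and kl_to_std_normal() are hypothetical and not part of this post's code. Keep in mind that the training routine in this post leaves logvar at its default of 0 and computes its KL term on the sampled z, so the sketch illustrates the idea and is not a line-by-line equivalent.

```r
# Reparameterization trick: z = mean + sd * eps, with eps ~ N(0, 1).
# Sampling is moved into eps so that z stays a deterministic function of mean and logvar.
reparameterize_r <- function(mean, logvar = 0) {
  eps <- rnorm(length(mean))
  mean + exp(0.5 * logvar) * eps
}

# Closed-form KL divergence between N(mean, exp(logvar)) and N(0, 1),
# summed over latent dimensions.
kl_to_std_normal <- function(mean, logvar = 0) {
  -0.5 * sum(1 + logvar - mean^2 - exp(logvar))
}

set.seed(42)
mu <- c(0.5, -1, 0, 2)

z <- reparameterize_r(mu)     # one noisy sample around mu
kl_mu <- kl_to_std_normal(mu) # with logvar = 0 this reduces to 0.5 * sum(mu^2)
kl_zero <- kl_to_std_normal(rep(0, 4)) # zero when the posterior equals the prior
```

The KL term thus pulls the latent means towards zero, which is the "out-of-the-box" regularization mentioned in the introduction.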

Speaking of training, these are the routines we’ll call. The function to compute FNN loss, loss_false_nn(), can be found in
both of the abovementioned predecessor posts; we kindly ask the reader to copy it from one of these places.

# to reparameterize encoder output before calling decoder
reparameterize <- function(mean, logvar = 0) {
  eps <- k_random_normal(shape = n_latent)
  eps * k_exp(logvar * 0.5) + mean
}

# loss has 3 components: NLL, KL, and FNN
# otherwise, this is just normal TF2-style training 
train_step_vae <- function(batch) {
  with (tf$GradientTape(persistent = TRUE) %as% tape, {
    code <- encoder(batch[[1]])
    z <- reparameterize(code)
    prediction <- decoder(z)
    
    l_mse <- mse_loss(batch[[2]], prediction)
    # see loss_false_nn in 2 previous posts
    l_fnn <- loss_false_nn(code)
    # KL divergence to a standard normal
    l_kl <- -0.5 * k_mean(1 - k_square(z))
    # overall loss is a weighted sum of all 3 components
    loss <- l_mse + fnn_weight * l_fnn + kl_weight * l_kl
  })
  
  encoder_gradients <-
    tape$gradient(loss, encoder$trainable_variables)
  decoder_gradients <-
    tape$gradient(loss, decoder$trainable_variables)
  
  optimizer$apply_gradients(purrr::transpose(list(
    encoder_gradients, encoder$trainable_variables
  )))
  optimizer$apply_gradients(purrr::transpose(list(
    decoder_gradients, decoder$trainable_variables
  )))
  
  train_loss(loss)
  train_mse(l_mse)
  train_fnn(l_fnn)
  train_kl(l_kl)
}

# wrap it all in autograph
training_loop_vae <- tf_function(autograph(function(ds_train) {
  
  for (batch in ds_train) {
    train_step_vae(batch) 
  }
  
  tf$print("Loss: ", train_loss$result())
  tf$print("MSE: ", train_mse$result())
  tf$print("FNN loss: ", train_fnn$result())
  tf$print("KL loss: ", train_kl$result())
  
  train_loss$reset_states()
  train_mse$reset_states()
  train_fnn$reset_states()
  train_kl$reset_states()
  
}))

To finish up the model section, here is the actual training code. This is nearly identical to what we did for FNN-LSTM before.

n_latent <- 10L
n_features <- 1

encoder <- vae_encoder_model(n_timesteps,
                         n_features,
                         n_latent)

decoder <- vae_decoder_model(n_timesteps,
                         n_features,
                         n_latent)
mse_loss <-
  tf$keras$losses$MeanSquaredError(reduction = tf$keras$losses$Reduction$SUM)

train_loss <- tf$keras$metrics$Mean(name = 'train_loss')
train_fnn <- tf$keras$metrics$Mean(name = 'train_fnn')
train_mse <-  tf$keras$metrics$Mean(name = 'train_mse')
train_kl <-  tf$keras$metrics$Mean(name = 'train_kl')

fnn_multiplier <- 1 # default value used in nearly all cases (see text)
fnn_weight <- fnn_multiplier * nrow(x_train)/batch_size

kl_weight <- 1

optimizer <- optimizer_adam(lr = 1e-3)

for (epoch in 1:100) {
  cat("Epoch: ", epoch, " -----------\n")
  training_loop_vae(ds_train)
 
  test_batch <- as_iterator(ds_test) %>% iter_next()
  encoded <- encoder(test_batch[[1]][1:1000])
  test_var <- tf$math$reduce_variance(encoded, axis = 0L)
  print(test_var %>% as.numeric() %>% round(5))
}

Experimental setup and data

The idea was to add white noise to a deterministic series. This time, the Roessler
system
was chosen, mainly for the prettiness of its attractor, apparent
even in its two-dimensional projections:


Figure 1: Roessler attractor, two-dimensional projections.

Like we did for the Lorenz system in the first part of this series, we use deSolve to generate data from the Roessler
equations.
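
For reference, this is the system of differential equations that the roessler() function in the following code encodes, written out with the post's parameter names (a = 0.2, b = 0.2, c = 5.7):

```latex
\begin{aligned}
\frac{dx}{dt} &= -y - z \\
\frac{dy}{dt} &= x + a\,y \\
\frac{dz}{dt} &= b + z\,(x - c)
\end{aligned}
```

Only the x component of the solution is used as the time series to forecast.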

library(deSolve)
library(tibble) # provides as_tibble(), used below

parameters <- c(a = .2,
                b = .2,
                c = 5.7)

initial_state <-
  c(x = 1,
    y = 1,
    z = 1.05)

roessler <- function(t, state, parameters) {
  with(as.list(c(state, parameters)), {
    dx <- -y - z
    dy <- x + a * y
    dz <- b + z * (x - c)
    
    list(c(dx, dy, dz))
  })
}

times <- seq(0, 2500, length.out = 20000)

roessler_ts <-
  ode(
    y = initial_state,
    times = times,
    func = roessler,
    parms = parameters,
    method = "lsoda"
  ) %>% unclass() %>% as_tibble()

n <- 10000
roessler <- roessler_ts$x[1:n]

roessler <- scale(roessler)

Then, noise is added, to the desired degree, by drawing from a normal distribution, centered at zero, with standard deviations
varying between 1 and 2.5.

# add noise
noise <- 1 # also used 1.5, 2, 2.5
roessler <- roessler + rnorm(10000, mean = 0, sd = noise)

Here you can compare effects of not adding any noise (top), standard deviation-1 (middle), and standard deviation-2.5 Gaussian noise (bottom):


Figure 2: Roessler series with added noise. Top: none. Middle: SD = 1. Bottom: SD = 2.5.

Otherwise, preprocessing proceeds as in the previous posts. In the upcoming results section, we’ll compare forecasts not just
to the “real,” after noise addition, test split of the data, but also to the underlying Roessler system, that is, the thing
we’re really interested in. (Just that in the real world, we can’t do that check.) This second test set is prepared for
forecasting just like the other one; to avoid duplication we don’t reproduce the code.

n_timesteps <- 120
batch_size <- 32

gen_timesteps <- function(x, n_timesteps) {
  do.call(rbind,
          purrr::map(seq_along(x),
                     function(i) {
                       start <- i
                       end <- i + n_timesteps - 1
                       out <- x[start:end]
                       out
                     })
  ) %>%
    na.omit()
}

train <- gen_timesteps(roessler[1:(n/2)], 2 * n_timesteps)
test <- gen_timesteps(roessler[(n/2):n], 2 * n_timesteps) 

dim(train) <- c(dim(train), 1)
dim(test) <- c(dim(test), 1)

x_train <- train[ , 1:n_timesteps, , drop = FALSE]
y_train <- train[ , (n_timesteps + 1):(2*n_timesteps), , drop = FALSE]

ds_train <- tensor_slices_dataset(list(x_train, y_train)) %>%
  dataset_shuffle(nrow(x_train)) %>%
  dataset_batch(batch_size)

x_test <- test[ , 1:n_timesteps, , drop = FALSE]
y_test <- test[ , (n_timesteps + 1):(2*n_timesteps), , drop = FALSE]

ds_test <- tensor_slices_dataset(list(x_test, y_test)) %>%
  dataset_batch(nrow(x_test))
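
To see what this windowing does, it can help to run it on a toy vector. The following sketch is a base-R rewrite for illustration only: gen_windows() is a hypothetical stand-in for gen_timesteps() above without the purrr dependency, applied to the numbers 1 to 10 with a window of 4, which is then split into two input steps and two target steps, mirroring the 120/120 split used for the real data.

```r
# Build all overlapping windows of length n_window. Windows that run past
# the end of x contain NA and are dropped, as na.omit() does in gen_timesteps().
gen_windows <- function(x, n_window) {
  windows <- do.call(
    rbind,
    lapply(seq_along(x), function(i) x[i:(i + n_window - 1)])
  )
  windows[stats::complete.cases(windows), , drop = FALSE]
}

toy <- 1:10
n_steps <- 2
windows <- gen_windows(toy, 2 * n_steps)

# First half of each window is the input, second half the forecasting target.
x_toy <- windows[, 1:n_steps, drop = FALSE]
y_toy <- windows[, (n_steps + 1):(2 * n_steps), drop = FALSE]

dim(windows)  # number of complete windows x window length
x_toy[1, ]    # first input window
y_toy[1, ]    # the steps that immediately follow it
```

By the same logic, the 5000 training observations with a window of 2 * 120 = 240 yield 5000 - 240 + 1 = 4761 overlapping training windows.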

Results

The LSTM used for comparison with the VAE described above is identical to the architecture employed in the previous post.
While with the VAE, an fnn_multiplier of 1 yielded sufficient regularization for all noise levels, some more experimentation
was needed for the LSTM: At noise levels 2 and 2.5, that multiplier was set to 5.

As a result, in all cases, there was one latent variable with high variance and a second one of minor importance. For all
others, variance was close to 0.

In all cases here means: In all cases where FNN regularization was used. As already hinted at in the introduction, the main
regularizing factor providing robustness to noise here seems to be FNN loss, not KL divergence. So for all noise levels,
besides FNN-regularized LSTM and VAE models we also tested their non-constrained counterparts.

Low noise

Seeing how all models did superbly on the original deterministic series, a noise level of 1 can almost be treated as
a baseline. Here you see sixteen 120-timestep predictions from both regularized models, FNN-VAE (dark blue), and FNN-LSTM
(orange). The noisy test data, both input (x, 120 steps) and output (y, 120 steps) are displayed in (blue-ish) grey. In
green, also spanning the whole sequence, we have the original Roessler data, the way they would look had no noise been added.


Figure 3: Roessler series with added Gaussian noise of standard deviation 1. Grey: actual (noisy) test data. Green: underlying Roessler system. Orange: Predictions from FNN-LSTM. Dark blue: Predictions from FNN-VAE.

Despite the noise, forecasts from both models look excellent. Is this due to the FNN regularizer?

Looking at forecasts from their unregularized counterparts, we have to admit these do not look any worse. (For better
comparability, the sixteen sequences to forecast were initially picked at random, but then used to compare all models and
conditions.)


Figure 4: Roessler series with added Gaussian noise of standard deviation 1. Grey: actual (noisy) test data. Green: underlying Roessler system. Orange: Predictions from unregularized LSTM. Dark blue: Predictions from unregularized VAE.

What happens when we add more noise?

Substantial noise

Between noise levels 1.5 and 2, something changed, or became noticeable from visual inspection. Let’s jump directly to the
highest-used level though: 2.5.

Here first are predictions obtained from the unregularized models.


Figure 5: Roessler series with added Gaussian noise of standard deviation 2.5. Grey: actual (noisy) test data. Green: underlying Roessler system. Orange: Predictions from unregularized LSTM. Dark blue: Predictions from unregularized VAE.

Both LSTM and VAE get “distracted” a bit too much by the noise, the latter to an even higher degree. This leads to cases
where predictions strongly “overshoot” the underlying non-noisy rhythm. This is not surprising, of course: They were trained
on the noisy version; predicting fluctuations is what they learned.

Do we see the same with the FNN models?


Figure 6: Roessler series with added Gaussian noise of standard deviation 2.5. Grey: actual (noisy) test data. Green: underlying Roessler system. Orange: Predictions from FNN-LSTM. Dark blue: Predictions from FNN-VAE.

Interestingly, we see a much better fit to the underlying Roessler system now! Especially the VAE model, FNN-VAE, surprises
with a whole new smoothness of predictions; but FNN-LSTM turns up much smoother forecasts as well.

“Smooth, fitting the system…” By now you may be wondering: when are we going to come up with more quantitative
assertions? If quantitative implies “mean squared error” (MSE), and if MSE is taken to be some divergence between forecasts
and the true target from the test set, the answer is that this MSE doesn’t differ much between any of the four architectures.
Put differently, it is mostly a function of noise level.

However, we could argue that what we’re really interested in is how well a model forecasts the underlying process. And there,
we see differences.

In the following plot, we contrast MSEs obtained for the four model types (grey: VAE; orange: LSTM; dark blue: FNN-VAE; green:
FNN-LSTM). The rows reflect noise levels (1, 1.5, 2, 2.5); the columns represent MSE in relation to the noisy (“real”) target
(left) on the one hand, and in relation to the underlying system on the other (right). For better visibility of the effect,
MSEs have been normalized as fractions of the maximum MSE in a category.

So, if we want to predict signal plus noise (left), it is not extremely critical whether we use FNN or not. But if we want to
predict the signal only (right), with increasing noise in the data FNN loss becomes increasingly effective. This effect is far
stronger for VAE vs. FNN-VAE than for LSTM vs. FNN-LSTM: The distance between the grey line (VAE) and the dark blue one
(FNN-VAE) becomes larger and larger as we add more noise.


Figure 7: Normalized MSEs obtained for the four model types (grey: VAE; orange: LSTM; dark blue: FNN-VAE; green: FNN-LSTM). Rows are noise levels (1, 1.5, 2, 2.5); columns are MSE as related to the real target (left) and the underlying system (right).
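
The two columns of this plot correspond to two different reference series for the same forecasts. The base-R sketch below is a toy illustration, not the evaluation code used for the post: it uses a sine wave as a stand-in for the underlying system, and the names mse(), signal, target_noisy and the two made-up "forecasts" are assumptions. It shows why the error against the noisy target is dominated by the noise level: even a forecast that reproduces the signal perfectly cannot get below the noise variance.

```r
mse <- function(truth, forecast) mean((truth - forecast)^2)

set.seed(777)
n <- 10000
noise_sd <- 2.5

signal <- sin(seq(0, 100, length.out = n))          # stand-in for the clean series
target_noisy <- signal + rnorm(n, mean = 0, sd = noise_sd)

smooth_forecast <- signal                           # idealized, noise-free forecast
jittery_forecast <- signal + rnorm(n, sd = 1)       # forecast that adds fluctuations

# Error against the noisy ("real") target: both are dominated by noise_sd^2.
mse_noisy <- c(smooth = mse(target_noisy, smooth_forecast),
               jittery = mse(target_noisy, jittery_forecast))

# Error against the underlying signal: here the two forecasts differ clearly.
mse_clean <- c(smooth = mse(signal, smooth_forecast),
               jittery = mse(signal, jittery_forecast))

# Normalize within each category as a fraction of the maximum, as in the plot.
mse_noisy / max(mse_noisy)
mse_clean / max(mse_clean)
```

In this toy setting the normalized errors are close together in the first category and far apart in the second, which is the same qualitative pattern as in the left and right columns above.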

Summing up

Our experiments show that when noise is likely to obscure measurements from an underlying deterministic system, FNN
regularization can strongly improve forecasts. This is the case especially for convolutional VAEs, and probably convolutional
autoencoders in general. And if an FNN-constrained VAE performs as well, for time series prediction, as an LSTM, there is a
strong incentive to use the convolutional model: It trains significantly faster.

With that, we conclude our mini-series on FNN-regularized models. As always, we’d love to hear from you if you were able to
make use of this in your own work!

Thanks for reading!

Gilpin, William. 2020. “Deep Reconstruction of Strange Attractors from Time Series.” https://arxiv.org/abs/2002.05909.
