RStudio AI Blog: torch time series, take three: Sequence-to-sequence prediction



Today, we continue our exploration of multi-step time-series forecasting with torch. This post is the third in a series.

  • Initially, we covered basics of recurrent neural networks (RNNs), and trained a model to predict the very next value in a sequence. We also found we could forecast quite a few steps ahead by feeding back individual predictions in a loop.

  • Next, we built a model “natively” for multi-step prediction. A small multi-layer-perceptron (MLP) was used to project RNN output to several time points in the future.

Of both approaches, the latter was the more successful. But conceptually, it has an unsatisfying touch to it: When the MLP extrapolates and generates output for, say, ten consecutive points in time, there is no causal relation between those. (Imagine a weather forecast for ten days that never got updated.)

Now, we’d like to try something more intuitively appealing. The input is a sequence; the output is a sequence. In natural language processing (NLP), this type of task is very common: It’s exactly the kind of situation we see with machine translation or summarization.

Quite fittingly, the types of models employed to these ends are named sequence-to-sequence models (often abbreviated seq2seq). In a nutshell, they split up the task into two components: an encoding and a decoding part. The former is done just once per input-target pair. The latter is done in a loop, as in our first try. But the decoder has more information at its disposal: At each iteration, its processing is based on the previous prediction as well as the previous state. That previous state will be the encoder’s when a loop is started, and the decoder’s own ever thereafter.

Before discussing the model in detail, we need to adapt our data input mechanism.

We continue working with vic_elec, provided by tsibbledata.

Again, the dataset definition in the current post looks a bit different from the way it did before; it’s the shape of the target that differs. This time, y equals x, shifted to the left by one.

The reason we do this is owed to the way we’re going to train the network. With seq2seq, people often use a technique called “teacher forcing” where, instead of feeding back its own prediction into the decoder module, you pass it the value it should have predicted. To be clear, that’s done during training only, and to a configurable degree.

library(torch)
library(tidyverse)
library(tsibble)
library(tsibbledata)
library(lubridate)
library(fable)
library(zeallot)

n_timesteps <- 7 * 24 * 2
n_forecast <- n_timesteps

vic_elec_get_year <- function(year, month = NULL) {
  vic_elec %>%
    filter(year(Date) == year, month(Date) == if (is.null(month)) month(Date) else month) %>%
    as_tibble() %>%
    select(Demand)
}

elec_train <- vic_elec_get_year(2012) %>% as.matrix()
elec_valid <- vic_elec_get_year(2013) %>% as.matrix()
elec_test <- vic_elec_get_year(2014, 1) %>% as.matrix()

train_mean <- mean(elec_train)
train_sd <- sd(elec_train)

elec_dataset <- dataset(
  name = "elec_dataset",
  
  initialize = function(x, n_timesteps, sample_frac = 1) {
    
    self$n_timesteps <- n_timesteps
    # standardize using training-set statistics
    self$x <- torch_tensor((x - train_mean) / train_sd)
    
    n <- length(self$x) - self$n_timesteps - 1
    
    # sorted random subset of admissible start positions
    self$starts <- sort(sample.int(
      n = n,
      size = n * sample_frac
    ))
    
  },
  
  .getitem = function(i) {
    
    start <- self$starts[i]
    end <- start + self$n_timesteps - 1
    lag <- 1
    
    # y is x, shifted to the left by one
    list(
      x = self$x[start:end],
      y = self$x[(start+lag):(end+lag)]$squeeze(2)
    )
    
  },
  
  .length = function() {
    length(self$starts) 
  }
)
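
To make the shift between x and y concrete, here is a minimal base-R sketch of the same index arithmetic on a toy vector. The helper get_window() and the data are made up for illustration; the real dataset works on torch tensors.

```r
# Toy stand-in for the standardized series (hypothetical data, not vic_elec)
x <- 1:10
n_timesteps <- 4

# Same index arithmetic as in .getitem(); get_window() is a hypothetical helper
get_window <- function(x, start, n_timesteps, lag = 1) {
  end <- start + n_timesteps - 1
  list(
    x = x[start:end],
    y = x[(start + lag):(end + lag)]
  )
}

# Number of start positions, computed as in initialize()
n_starts <- length(x) - n_timesteps - 1

w <- get_window(x, start = 3, n_timesteps = n_timesteps)
w$x  # elements 3 to 6
w$y  # elements 4 to 7: the same window, moved one step ahead
```

Every element of y is thus the value that immediately follows the corresponding element of x, which is what lets y double as ground truth for teacher forcing.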

Dataset as well as dataloader instantiations then can proceed as before.

batch_size <- 32

train_ds <- elec_dataset(elec_train, n_timesteps, sample_frac = 0.5)
train_dl <- train_ds %>% dataloader(batch_size = batch_size, shuffle = TRUE)

valid_ds <- elec_dataset(elec_valid, n_timesteps, sample_frac = 0.5)
valid_dl <- valid_ds %>% dataloader(batch_size = batch_size)

test_ds <- elec_dataset(elec_test, n_timesteps)
test_dl <- test_ds %>% dataloader(batch_size = 1)

Technically, the model consists of three modules: the aforementioned encoder and decoder, and the seq2seq module that orchestrates them.

Encoder

The encoder takes its input and runs it through an RNN. Of the two things returned by a recurrent neural network, outputs and state, so far we’ve only been using output. This time, we do the opposite: We throw away the outputs, and only return the state.

If the RNN in question is a GRU (and assuming that of the outputs, we take just the final time step, which is what we’ve been doing throughout), there really is no difference: The final state equals the final output. If it’s an LSTM, however, there is a second kind of state, the “cell state”. In that case, returning the state instead of the final output will carry more information.

encoder_module <- nn_module(
  
  initialize = function(type, input_size, hidden_size, num_layers = 1, dropout = 0) {
    
    self$type <- type
    
    self$rnn <- if (self$type == "gru") {
      nn_gru(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        dropout = dropout,
        batch_first = TRUE
      )
    } else {
      nn_lstm(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        dropout = dropout,
        batch_first = TRUE
      )
    }
    
  },
  
  forward = function(x) {
    
    x <- self$rnn(x)
    
    # return last states for all layers
    # per layer, a single tensor for GRU, a list of two tensors for LSTM
    x <- x[[2]]
    x
    
  }
  
)

Decoder

In the decoder, just like in the encoder, the main component is an RNN. In contrast to previously-shown architectures, though, it does not just return a prediction. It also reports back the RNN’s final state.

decoder_module <- nn_module(
  
  initialize = function(type, input_size, hidden_size, num_layers = 1) {
    
    self$type <- type
    
    self$rnn <- if (self$type == "gru") {
      nn_gru(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        batch_first = TRUE
      )
    } else {
      nn_lstm(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        batch_first = TRUE
      )
    }
    
    self$linear <- nn_linear(hidden_size, 1)
    
  },
  
  forward = function(x, state) {
    
    # input to forward:
    # x is (batch_size, 1, 1)
    # state is (1, batch_size, hidden_size)
    x <- self$rnn(x, state)
    
    # break up RNN return values
    # output is (batch_size, 1, hidden_size)
    # next_hidden has the same shape as state (for a single-layer GRU)
    c(output, next_hidden) %<-% x
    
    output <- output$squeeze(2)
    output <- self$linear(output)
    
    list(output, next_hidden)
    
  }
  
)

seq2seq module

seq2seq is where the action happens. The plan is to encode once, then call the decoder in a loop.

If you look back to decoder forward(), you see that it takes two arguments: x and state.

Depending on the context, x corresponds to one of three things: final input, preceding prediction, or prior ground truth.

  • The very first time the decoder is called on an input sequence, x maps to the final input value. This is different from a task like machine translation, where you would pass in a start token. With time series, though, we’d like to continue where the actual measurements stop.

  • In further calls, we want the decoder to continue from its most recent prediction. It is only logical, thus, to pass back the preceding forecast.

  • That said, in NLP a technique called “teacher forcing” is commonly used to speed up training. With teacher forcing, instead of the forecast we pass the actual ground truth, the thing the decoder should have predicted. We do that only in a configurable fraction of cases, and, naturally, only while training. The rationale behind this technique is that without this form of re-calibration, consecutive prediction errors can quickly erase any remaining signal.

state, too, is polyvalent. But here, there are just two possibilities: encoder state and decoder state.

  • The first time the decoder is called, it is “seeded” with the final state from the encoder. Note how this is the only time we make use of the encoding.

  • From then on, the decoder’s own previous state will be passed. Remember how it returns two values, forecast and state?
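
The control flow just described can be sketched in base R, without torch. Everything here is a stand-in: toy_decoder() is a made-up one-step update (not a neural network), and decode() is a hypothetical helper that only mirrors how x and state are chosen at each step.

```r
# Hypothetical one-step "decoder": returns a prediction and an updated state.
# It stands in for the decoder module and is not a neural network.
toy_decoder <- function(input, state) {
  new_state <- 0.5 * state + 0.5 * input
  list(pred = new_state, state = new_state)
}

decode <- function(last_input, encoder_state, y, n_forecast, teacher_forcing_ratio) {
  outputs <- numeric(n_forecast)

  # first call: final input value plus the encoder's state
  out <- toy_decoder(last_input, encoder_state)
  outputs[1] <- out$pred

  for (t in seq_len(n_forecast)[-1]) {
    # with probability teacher_forcing_ratio, feed the ground truth;
    # otherwise, feed back the previous prediction
    teacher_forcing <- runif(1) < teacher_forcing_ratio
    input <- if (teacher_forcing) y[t - 1] else out$pred
    # later calls: the decoder's own previous state
    out <- toy_decoder(input, out$state)
    outputs[t] <- out$pred
  }
  outputs
}

y <- c(1, 2, 3, 4)
# ratio 0: the decoder only ever sees its own predictions
free_running <- decode(0, 0, y, n_forecast = 4, teacher_forcing_ratio = 0)
# ratio 1: from the second step on, the decoder always sees the ground truth
forced <- decode(0, 0, y, n_forecast = 4, teacher_forcing_ratio = 1)
```

With a ratio of 0 the toy decoder never leaves its starting value, while with a ratio of 1 it is pulled towards y at every step; this is the re-calibration effect in miniature.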

seq2seq_module <- nn_module(
  
  initialize = function(type, input_size, hidden_size, n_forecast, num_layers = 1, encoder_dropout = 0) {
    
    self$encoder <- encoder_module(type = type, input_size = input_size,
                                   hidden_size = hidden_size, num_layers = num_layers,
                                   dropout = encoder_dropout)
    self$decoder <- decoder_module(type = type, input_size = input_size,
                                   hidden_size = hidden_size, num_layers = num_layers)
    self$n_forecast <- n_forecast
    
  },
  
  forward = function(x, y, teacher_forcing_ratio) {
    
    # prepare empty output
    outputs <- torch_zeros(dim(x)[1], self$n_forecast)$to(device = device)
    
    # encode current input sequence
    hidden <- self$encoder(x)
    
    # prime decoder with final input value and hidden state from the encoder
    out <- self$decoder(x[ , n_timesteps, , drop = FALSE], hidden)
    
    # decompose into predictions and decoder state
    # pred is (batch_size, 1)
    # state is (1, batch_size, hidden_size)
    c(pred, state) %<-% out
    
    # store first prediction
    outputs[ , 1] <- pred$squeeze(2)
    
    # iterate to generate remaining forecasts
    for (t in 2:self$n_forecast) {
      
      # call decoder on either ground truth or previous prediction, plus previous decoder state
      teacher_forcing <- runif(1) < teacher_forcing_ratio
      input <- if (teacher_forcing) y[ , t - 1, drop = FALSE] else pred
      input <- input$unsqueeze(3)
      out <- self$decoder(input, state)
      
      # again, decompose decoder return values
      c(pred, state) %<-% out
      # and store current prediction
      outputs[ , t] <- pred$squeeze(2)
    }
    outputs
  }
  
)

net <- seq2seq_module("gru", input_size = 1, hidden_size = 32, n_forecast = n_forecast)

# training RNNs on the GPU currently prints a warning that may clutter 
# the console
# see https://github.com/mlverse/torch/issues/461
# alternatively, use 
# device <- "cpu"
device <- torch_device(if (cuda_is_available()) "cuda" else "cpu")

net <- net$to(device = device)

The training procedure is mainly unchanged. We do, however, need to decide on teacher_forcing_ratio, the probability with which, at each decoding step, the ground truth is fed in instead of the previous prediction. In valid_batch(), this should always be 0, while in train_batch(), it’s up to us (or rather, experimentation). Here, we set it to 0.3.

optimizer <- optim_adam(net$parameters, lr = 0.001)

num_epochs <- 50

train_batch <- function(b, teacher_forcing_ratio) {
  
  optimizer$zero_grad()
  output <- net(b$x$to(device = device), b$y$to(device = device), teacher_forcing_ratio)
  target <- b$y$to(device = device)
  
  loss <- nnf_mse_loss(output, target)
  loss$backward()
  optimizer$step()
  
  loss$item()
  
}

valid_batch <- function(b, teacher_forcing_ratio = 0) {
  
  output <- net(b$x$to(device = device), b$y$to(device = device), teacher_forcing_ratio)
  target <- b$y$to(device = device)
  
  loss <- nnf_mse_loss(output, target)
  
  loss$item()
  
}

for (epoch in 1:num_epochs) {
  
  net$train()
  train_loss <- c()
  
  coro::loop(for (b in train_dl) {
    loss <- train_batch(b, teacher_forcing_ratio = 0.3)
    train_loss <- c(train_loss, loss)
  })
  
  cat(sprintf("\nEpoch %d, training: loss: %3.5f \n", epoch, mean(train_loss)))
  
  net$eval()
  valid_loss <- c()
  
  coro::loop(for (b in valid_dl) {
    loss <- valid_batch(b)
    valid_loss <- c(valid_loss, loss)
  })
  
  cat(sprintf("\nEpoch %d, validation: loss: %3.5f \n", epoch, mean(valid_loss)))
}

Epoch 1, training: loss: 0.37961 

Epoch 1, validation: loss: 1.10699 

Epoch 2, training: loss: 0.19355 

Epoch 2, validation: loss: 1.26462 

# ...
# ...

Epoch 49, training: loss: 0.03233 

Epoch 49, validation: loss: 0.62286 

Epoch 50, training: loss: 0.03091 

Epoch 50, validation: loss: 0.54457

It’s interesting to compare performances for different settings of teacher_forcing_ratio. With a setting of 0.5, training loss decreases a lot more slowly; the opposite is seen with a setting of 0. Validation loss, however, is not affected significantly.

The code to inspect test-set forecasts is unchanged.

net$eval()

test_preds <- vector(mode = "list", length = length(test_dl))

i <- 1

coro::loop(for (b in test_dl) {
  
  output <- net(b$x$to(device = device), b$y$to(device = device), teacher_forcing_ratio = 0)
  preds <- as.numeric(output)
  
  test_preds[[i]] <- preds
  i <<- i + 1
  
})

vic_elec_jan_2014 <- vic_elec %>%
  filter(year(Date) == 2014, month(Date) == 1)

# pad each forecast with NAs so it lines up with the full month
test_pred1 <- test_preds[[1]]
test_pred1 <- c(rep(NA, n_timesteps), test_pred1, rep(NA, nrow(vic_elec_jan_2014) - n_timesteps - n_forecast))

test_pred2 <- test_preds[[408]]
test_pred2 <- c(rep(NA, n_timesteps + 407), test_pred2, rep(NA, nrow(vic_elec_jan_2014) - 407 - n_timesteps - n_forecast))

test_pred3 <- test_preds[[817]]
test_pred3 <- c(rep(NA, nrow(vic_elec_jan_2014) - n_forecast), test_pred3)


# undo standardization before plotting
preds_ts <- vic_elec_jan_2014 %>%
  select(Demand) %>%
  add_column(
    mlp_ex_1 = test_pred1 * train_sd + train_mean,
    mlp_ex_2 = test_pred2 * train_sd + train_mean,
    mlp_ex_3 = test_pred3 * train_sd + train_mean) %>%
  pivot_longer(-Time) %>%
  update_tsibble(key = name)


preds_ts %>%
  autoplot() +
  scale_colour_manual(values = c("#08c5d1", "#00353f", "#ffbf66", "#d46f4d")) +
  theme_minimal()
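
The NA padding above follows one rule: forecast number i starts right after its input window, that is, n_timesteps + (i - 1) positions into the month. A small base-R sketch makes the arithmetic explicit. The helpers pad_forecast() and unscale() are hypothetical, the forecast is a placeholder rather than model output, and the row count assumes half-hourly readings over the 31 days of January.

```r
# Hypothetical helper: place forecast number i on the full test-month axis
pad_forecast <- function(pred, i, n_total, n_timesteps) {
  n_before <- n_timesteps + (i - 1)
  n_after <- n_total - n_before - length(pred)
  stopifnot(n_after >= 0)
  c(rep(NA_real_, n_before), pred, rep(NA_real_, n_after))
}

# Hypothetical helper: undo the standardization applied in the dataset
unscale <- function(z, mean, sd) z * sd + mean

n_total <- 31 * 48            # assumed: half-hourly readings in January
n_timesteps <- 7 * 24 * 2
pred <- rep(0, n_timesteps)   # placeholder forecast, not model output

padded_first <- pad_forecast(pred, 1, n_total, n_timesteps)
padded_last <- pad_forecast(pred, 817, n_total, n_timesteps)
```

Under these assumptions, the 817th forecast is the last one that still fits into the month, which matches the third example being padded on the left only.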

Figure 1: One-week-ahead predictions for January, 2014.

Comparing this to the forecast obtained from last time’s RNN-MLP combo, we don’t see much of a difference. Is this surprising? To me it is. If asked to speculate about the reason, I would probably say this: In all of the architectures we’ve used so far, the main carrier of information has been the final hidden state of the RNN (one and only RNN in the two previous setups, encoder RNN in this one). It will be interesting to see what happens in the last part of this series, when we augment the encoder-decoder architecture by attention.

Thanks for reading!

Photo by Suzuha Kozuki on Unsplash
