RStudio AI Blog: Variational convnets with tfprobability


A bit more than a year ago, in his beautiful guest post, Nick Strayer showed how to classify a set of everyday activities using smartphone-recorded gyroscope and accelerometer data. Accuracy was very good, but Nick went on to inspect classification results more closely. Were there activities more prone to misclassification than others? And how about those erroneous results: Did the network report them with equal, or less confidence than those that were correct?

Technically, when we speak of confidence in that manner, we’re referring to the score obtained for the “winning” class after softmax activation. If that winning score is 0.9, we might say “the network is sure that’s a gentoo penguin”; if it’s 0.2, we’d instead conclude “to the network, neither option seemed fitting, but cheetah looked best.”
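To make the notion of a “winning” softmax score concrete, here is a minimal base R sketch. The logits and class names are made up for illustration and are not taken from any model in this post.

```r
# Softmax turns raw scores (logits) into non-negative values that sum to 1.
softmax <- function(logits) {
  shifted <- logits - max(logits)  # subtract the max for numerical stability
  exp(shifted) / sum(exp(shifted))
}

# Hypothetical logits for three made-up classes
logits <- c(gentoo = 4.0, adelie = 1.5, chinstrap = 0.5)
scores <- softmax(logits)

# The "confidence" discussed above is simply the largest softmax score
winner <- names(scores)[which.max(scores)]
confidence <- max(scores)
```

Note that this score is a single number per class; it says nothing about how much it would vary if the weights themselves were uncertain.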

This use of “confidence” is convincing, but it has nothing to do with confidence (or credibility, or prediction, what have you) intervals. What we’d really like to be able to do is put distributions over the network’s weights and make it Bayesian. Using tfprobability’s variational Keras-compatible layers, this is something we actually can do.

Adding uncertainty estimates to Keras models with tfprobability shows how to use a variational dense layer to obtain estimates of epistemic uncertainty. In this post, we modify the convnet used in Nick’s post to be variational throughout. Before we start, let’s quickly summarize the task.

The task

To create the Smartphone-Based Recognition of Human Activities and Postural Transitions Data Set (Reyes-Ortiz et al. 2016), the researchers had subjects walk, sit, stand, and transition from one of those activities to another. Meanwhile, two types of smartphone sensors were used to record motion data: Accelerometers measure linear acceleration in three dimensions, while gyroscopes are used to track angular velocity around the coordinate axes. Here are the respective raw sensor data for six types of activities from Nick’s original post:

Just like Nick, we’re going to zoom in on those six types of activity, and try to infer them from the sensor data. Some data wrangling is needed to get the dataset into a form we can work with; here we’ll build on Nick’s post, and effectively start from the data nicely pre-processed and split up into training and test sets:

Observations: 289
Variables: 6
$ experiment    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 13, 14, 17, 18, 19, 2…
$ userId        <int> 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 7, 7, 9, 9, 10, 10, 11…
$ activity      <int> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7…
$ data          <list> [<data.frame[160 x 6]>, <data.frame[206 x 6]>, <dat…
$ activityName  <fct> STAND_TO_SIT, STAND_TO_SIT, STAND_TO_SIT, STAND_TO_S…
$ observationId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 13, 14, 17, 18, 19, 2…
Observations: 69
Variables: 6
$ experiment    <int> 11, 12, 15, 16, 32, 33, 42, 43, 52, 53, 56, 57, 11, …
$ userId        <int> 6, 6, 8, 8, 16, 16, 21, 21, 26, 26, 28, 28, 6, 6, 8,…
$ activity      <int> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8…
$ data          <list> [<data.frame[185 x 6]>, <data.frame[151 x 6]>, <dat…
$ activityName  <fct> STAND_TO_SIT, STAND_TO_SIT, STAND_TO_SIT, STAND_TO_S…
$ observationId <int> 11, 12, 15, 16, 31, 32, 41, 42, 51, 52, 55, 56, 71, …

The code required to arrive at this stage (copied from Nick’s post) may be found in the appendix at the bottom of this page.

Training pipeline

The dataset in question is small enough to fit in memory, but yours might not be, so it can’t hurt to see some streaming in action. Besides, it’s probably safe to say that with TensorFlow 2.0, tfdatasets pipelines are the way to feed data to a model.

Once the code listed in the appendix has run, the sensor data is to be found in train_data$data, a list column containing data.frames where each row corresponds to a point in time and each column holds one of the measurements. However, not all time series (recordings) are of the same length; we thus follow the original post to pad all series to length pad_size (= 338). The expected shape of training batches will then be (batch_size, pad_size, 6).
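As a rough illustration of what padding does to a single recording, here is a base R sketch. The helper pad_pre is hypothetical (it is not part of keras); it assumes zeros are prepended and overlong series are cut from the front, which is how we understand the pad_sequences defaults and which matches the leading zero rows in the sanity check further down.

```r
# Pre-pad a (time steps x channels) matrix with zero rows up to pad_size;
# if the recording is longer than pad_size, keep only its last pad_size rows.
pad_pre <- function(m, pad_size) {
  n <- nrow(m)
  if (n >= pad_size) {
    return(m[(n - pad_size + 1):n, , drop = FALSE])
  }
  rbind(matrix(0, nrow = pad_size - n, ncol = ncol(m)), m)
}

pad_size <- 338

# A made-up recording: 160 time steps, 6 sensor channels
recording <- matrix(rnorm(160 * 6), nrow = 160, ncol = 6)
padded <- pad_pre(recording, pad_size)

dim(padded)  # pad_size rows, 6 columns
```

Stacking batch_size such matrices gives the (batch_size, pad_size, 6) shape mentioned above.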

We initially create our training dataset:

train_x <- train_data$data %>% 
  map(as.matrix) %>%
  pad_sequences(maxlen = pad_size, dtype = "float32") %>%
  tensor_slices_dataset() 

train_y <- train_data$activity %>% 
  one_hot_classes() %>% 
  tensor_slices_dataset()

train_dataset <- zip_datasets(train_x, train_y)
train_dataset
<ZipDataset shapes: ((338, 6), (6,)), types: (tf.float64, tf.float64)>
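The (6,) label shape comes from one-hot encoding the activity codes. A base R sketch of that step, assuming (as in the appendix) that the six activities of interest are coded 7 through 12; the helper one_hot is hypothetical and only mimics what one_hot_classes does via to_categorical:

```r
# Map activity codes 7..12 to rows of a 6 x 6 identity matrix.
one_hot <- function(activity, first_class = 7, num_classes = 6) {
  idx <- activity - first_class + 1  # 7..12 -> 1..6
  diag(num_classes)[idx, , drop = FALSE]
}

encoded <- one_hot(c(7, 8, 12))
encoded
```

Each row contains a single 1, in the column that corresponds to the activity.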

Then shuffle and batch it:

n_train <- nrow(train_data)
# the highest possible batch size for this dataset
# chosen because it yielded the best performance
# alternatively, experiment with e.g. different learning rates, ...
batch_size <- n_train

train_dataset <- train_dataset %>% 
  dataset_shuffle(n_train) %>%
  dataset_batch(batch_size)
train_dataset
<BatchDataset shapes: ((None, 338, 6), (None, 6)), types: (tf.float64, tf.float64)>

Same for the test data.

test_x <- test_data$data %>% 
  map(as.matrix) %>%
  pad_sequences(maxlen = pad_size, dtype = "float32") %>%
  tensor_slices_dataset() 

test_y <- test_data$activity %>% 
  one_hot_classes() %>% 
  tensor_slices_dataset()

n_test <- nrow(test_data)
test_dataset <- zip_datasets(test_x, test_y) %>%
  dataset_batch(n_test)

Using tfdatasets does not mean we cannot run a quick sanity check on our data:

first <- test_dataset %>% 
  reticulate::as_iterator() %>% 
  # get first batch (= whole test set, in our case)
  reticulate::iter_next() %>%
  # predictors only
  .[[1]] %>% 
  # first item in batch
  .[1,,]
first
tf.Tensor(
[[ 0.          0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.          0.        ]
 ...
 [ 1.00416672  0.2375      0.12916666 -0.40225476 -0.20463985 -0.14782938]
 [ 1.04166663  0.26944447  0.12777779 -0.26755899 -0.02779437 -0.1441642 ]
 [ 1.0250001   0.27083334  0.15277778 -0.19639318  0.35094208 -0.16249016]],
 shape=(338, 6), dtype=float64)

Now let’s build the network.

A variational convnet

We build on the straightforward convolutional architecture from Nick’s post, just making minor modifications to kernel sizes and numbers of filters. We also throw out all dropout layers; no additional regularization is needed on top of the priors applied to the weights.

Note the following about the “Bayesified” network.

  • Each layer is variational in nature, the convolutional ones (layer_conv_1d_flipout) as well as the dense layers (layer_dense_flipout).

  • With variational layers, we can specify the prior weight distribution as well as the form of the posterior; here the defaults are used, resulting in a standard normal prior and a default mean-field posterior.

  • Likewise, the user may influence the divergence function used to assess the mismatch between prior and posterior; in this case, we actually take some action: We scale the (default) KL divergence by the number of samples in the training set.

  • One last thing to note is the output layer. It is a distribution layer, that is, a layer wrapping a distribution, where wrapping means: Training the network is business as usual, but predictions are distributions, one for each data point.
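To get a feeling for the scaled KL term, here is a base R sketch using the closed-form KL divergence between a factorized normal posterior and a standard normal prior. The posterior means and standard deviations are made up; in the actual network, the variational layers compute this quantity internally.

```r
# KL( N(mu, sigma^2) || N(0, 1) ) for a single weight, in closed form
kl_normal_std <- function(mu, sigma) {
  -log(sigma) + (sigma^2 + mu^2) / 2 - 0.5
}

set.seed(1)
mu <- rnorm(10, sd = 0.1)   # hypothetical posterior means of 10 weights
sigma <- rep(0.5, 10)       # hypothetical posterior standard deviations

n_train <- 289              # number of training observations in this post

# With a mean-field posterior, the layer's KL is the sum over its weights ...
kl_total <- sum(kl_normal_std(mu, sigma))
# ... and scaling divides it by the number of training examples
kl_scaled <- kl_total / n_train
```

Dividing by the training set size puts the KL term on a per-example scale, so it can be added to a per-example negative log likelihood.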

library(tfprobability)

num_classes <- 6

# scale the KL divergence by number of training examples
n <- n_train %>% tf$cast(tf$float32)
kl_div <- function(q, p, unused)
  tfd_kl_divergence(q, p) / n

model <- keras_model_sequential()
model %>% 
  layer_conv_1d_flipout(
    filters = 12,
    kernel_size = 3, 
    activation = "relu",
    kernel_divergence_fn = kl_div
  ) %>%
  layer_conv_1d_flipout(
    filters = 24,
    kernel_size = 5, 
    activation = "relu",
    kernel_divergence_fn = kl_div
  ) %>%
  layer_conv_1d_flipout(
    filters = 48,
    kernel_size = 7, 
    activation = "relu",
    kernel_divergence_fn = kl_div
  ) %>%
  layer_global_average_pooling_1d() %>% 
  layer_dense_flipout(
    units = 48,
    activation = "relu",
    kernel_divergence_fn = kl_div
  ) %>% 
  layer_dense_flipout(
    num_classes, 
    kernel_divergence_fn = kl_div,
    name = "dense_output"
  ) %>%
  layer_one_hot_categorical(event_size = num_classes)

We tell the network to minimize the negative log likelihood.

nll <- function(y, model) - (model %>% tfd_log_prob(y))

This will become part of the loss. The way we set up this example, this is not its most substantial part though. Here, what dominates the loss is the sum of the KL divergences, added (automatically) to model$losses.
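For a single observation, the negative log likelihood under a one-hot categorical distribution reduces to minus the log-softmax of the true class. A base R sketch with made-up logits (nll_one_hot is a hypothetical helper, not a tfprobability function):

```r
# Numerically stable log-softmax
log_softmax <- function(logits) {
  shifted <- logits - max(logits)
  shifted - log(sum(exp(shifted)))
}

# Negative log likelihood of a one-hot target y given logits
nll_one_hot <- function(y, logits) -sum(y * log_softmax(logits))

y <- c(1, 0, 0, 0, 0, 0)              # true class is the first one
confident <- c(5, 0, 0, 0, 0, 0)      # logits strongly favoring class 1
unsure <- rep(0, 6)                   # all classes equally likely

nll_confident <- nll_one_hot(y, confident)
nll_unsure <- nll_one_hot(y, unsure)  # equals log(6)
```

The total loss being minimized is this term plus the sum of the scaled KL divergences contributed by the variational layers.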

In a setup like this, it’s interesting to monitor both parts of the loss separately. We can do this by means of two metrics:

# the KL part of the loss
kl_part <-  function(y_true, y_pred) {
    kl <- tf$reduce_sum(model$losses)
    kl
}

# the NLL part
nll_part <- function(y_true, y_pred) {
    cat_dist <- tfd_one_hot_categorical(logits = y_pred)
    nll <- - (cat_dist %>% tfd_log_prob(y_true) %>% tf$reduce_mean())
    nll
}

We train considerably longer than Nick did in the original post, allowing for early stopping though.

model %>% compile(
  optimizer = "rmsprop",
  loss = nll,
  metrics = c("accuracy", 
              custom_metric("kl_part", kl_part),
              custom_metric("nll_part", nll_part)),
  experimental_run_tf_function = FALSE
)

train_history <- model %>% fit(
  train_dataset,
  epochs = 1000,
  validation_data = test_dataset,
  callbacks = list(
    callback_early_stopping(patience = 10)
  )
)
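The patience argument deserves a short illustration. The following base R sketch mimics the stopping rule as we understand it: training halts once the monitored validation loss has failed to improve for patience consecutive epochs. The function early_stop_epoch and the loss values are made up for illustration; they are not part of keras.

```r
# Return the epoch at which training would stop, or NA if it runs to the end.
early_stop_epoch <- function(val_loss, patience = 10) {
  best <- Inf
  wait <- 0
  for (epoch in seq_along(val_loss)) {
    if (val_loss[epoch] < best) {
      best <- val_loss[epoch]  # improvement: remember it, reset the counter
      wait <- 0
    } else {
      wait <- wait + 1         # no improvement in this epoch
      if (wait >= patience) return(epoch)
    }
  }
  NA_integer_
}

# Made-up validation losses: the best value is reached in epoch 3
val_loss <- c(1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.74)
stop_at <- early_stop_epoch(val_loss, patience = 3)
```

With a budget of 1000 epochs and patience of 10, training thus ends much earlier if the validation loss plateaus.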

While the overall loss declines linearly (and probably would for many more epochs), this is not the case for classification accuracy or the NLL part of the loss:

Final accuracy is not as high as in the non-variational setup, though still not bad for a six-class problem. We see that without any additional regularization, there is very little overfitting to the training data.

Now how do we obtain predictions from this model?

Probabilistic predictions

Though we won’t go into this here, it’s good to know that we can access more than just the output distributions; through their kernel_posterior attribute, we can access the hidden layers’ posterior weight distributions as well.

Given the small size of the test set, we compute all predictions at once. The predictions are now categorical distributions, one for each sample in the batch:

test_data_all <- dataset_collect(test_dataset) %>% { .[[1]][[1]]}

one_shot_preds <- model(test_data_all) 

one_shot_preds
tfp.distributions.OneHotCategorical(
 "sequential_one_hot_categorical_OneHotCategorical_OneHotCategorical",
 batch_shape=[69], event_shape=[6], dtype=float32)

We prefixed those predictions with one_shot to indicate their noisy nature: These are predictions obtained on a single pass through the network, all layer weights being sampled from their respective posteriors.

From the predicted distributions, we calculate mean and standard deviation per (test) sample.

one_shot_means <- tfd_mean(one_shot_preds) %>% 
  as.matrix() %>%
  as_tibble() %>% 
  mutate(obs = 1:n()) %>% 
  gather(class, mean, -obs) 

one_shot_sds <- tfd_stddev(one_shot_preds) %>% 
  as.matrix() %>%
  as_tibble() %>% 
  mutate(obs = 1:n()) %>% 
  gather(class, sd, -obs) 
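A side note on these one-shot standard deviations: in a one-hot categorical distribution, each class indicator is a Bernoulli variable, so its standard deviation is fully determined by the class probability. Assuming tfd_stddev returns this analytic quantity, a base R sketch reproduces the sd values of the results table shown further down from its (rounded) means:

```r
# Standard deviation of a Bernoulli indicator with success probability p
one_hot_sd <- function(p) sqrt(p * (1 - p))

# Rounded class probabilities for the first test observation (see table below)
p <- c(0.945, 0.0534, 0.00114)
sds <- one_hot_sd(p)

round(sds, 3)  # close to the reported 0.227, 0.225, 0.0338, up to rounding of p
```

In other words, this standard deviation is largest for probabilities near 0.5 and carries no information beyond the mean itself.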

The standard deviations thus obtained could be said to reflect the overall predictive uncertainty. We can estimate another kind of uncertainty, called epistemic, by making a number of passes through the network and then calculating, again per test sample, the standard deviations of the predicted means.

mc_preds <- purrr::map(1:100, function(x) {
  preds <- model(test_data_all)
  tfd_mean(preds) %>% as.matrix()
})

mc_sds <- abind::abind(mc_preds, along = 3) %>% 
  apply(c(1,2), sd) %>% 
  as_tibble() %>%
  mutate(obs = 1:n()) %>% 
  gather(class, mc_sd, -obs) 
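The stacking and reduction steps can be tried out without a trained network. In the following base R sketch, a made-up generator of noisy class probabilities stands in for the repeated forward passes, and simplify2array plays the role of abind; only the array handling mirrors the code above.

```r
set.seed(42)
n_passes <- 100
n_obs <- 4
n_classes <- 6

# Row-wise softmax for a matrix of logits
softmax_rows <- function(m) {
  e <- exp(m - apply(m, 1, max))
  e / rowSums(e)
}

# Stand-in for repeated forward passes: fixed logits plus fresh noise per pass
base_logits <- matrix(rnorm(n_obs * n_classes), n_obs, n_classes)
mc_preds <- lapply(seq_len(n_passes), function(i) {
  noise <- matrix(rnorm(n_obs * n_classes, sd = 0.3), n_obs, n_classes)
  softmax_rows(base_logits + noise)
})

# Stack to n_obs x n_classes x n_passes, then take sd over the passes
stacked <- simplify2array(mc_preds)
mc_sd <- apply(stacked, c(1, 2), sd)

dim(mc_sd)  # one standard deviation per observation and class
```

Unlike the one-shot standard deviations, these values reflect how much the predicted probabilities vary from pass to pass.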

Putting it all together, we have

pred_data <- one_shot_means %>%
  inner_join(one_shot_sds, by = c("obs", "class")) %>% 
  inner_join(mc_sds, by = c("obs", "class")) %>% 
  right_join(one_hot_to_label, by = "class") %>% 
  arrange(obs)

pred_data
# A tibble: 414 x 6
     obs class       mean      sd    mc_sd label       
   <int> <chr>      <dbl>   <dbl>    <dbl> <fct>       
 1     1 V1    0.945      0.227   0.0743   STAND_TO_SIT
 2     1 V2    0.0534     0.225   0.0675   SIT_TO_STAND
 3     1 V3    0.00114    0.0338  0.0346   SIT_TO_LIE  
 4     1 V4    0.00000238 0.00154 0.000336 LIE_TO_SIT  
 5     1 V5    0.0000132  0.00363 0.00164  STAND_TO_LIE
 6     1 V6    0.0000305  0.00553 0.00398  LIE_TO_STAND
 7     2 V1    0.993      0.0813  0.149    STAND_TO_SIT
 8     2 V2    0.00153    0.0390  0.102    SIT_TO_STAND
 9     2 V3    0.00476    0.0688  0.108    SIT_TO_LIE  
10     2 V4    0.00000172 0.00131 0.000613 LIE_TO_SIT  
# … with 404 extra rows

Comparing predictions to the ground truth:

eval_table <- pred_data %>% 
  group_by(obs) %>% 
  summarise(
    maxprob = max(mean),
    maxprob_sd = sd[mean == maxprob],
    maxprob_mc_sd = mc_sd[mean == maxprob],
    predicted = label[mean == maxprob]
  ) %>% 
  mutate(
    truth = test_data$activityName,
    correct = truth == predicted
  ) 

eval_table %>% print(n = 20)
# A tibble: 69 x 7
     obs maxprob maxprob_sd maxprob_mc_sd predicted    truth        correct
   <int>   <dbl>      <dbl>         <dbl> <fct>        <fct>        <lgl>  
 1     1   0.945     0.227         0.0743 STAND_TO_SIT STAND_TO_SIT TRUE   
 2     2   0.993     0.0813        0.149  STAND_TO_SIT STAND_TO_SIT TRUE   
 3     3   0.733     0.443         0.131  STAND_TO_SIT STAND_TO_SIT TRUE   
 4     4   0.796     0.403         0.138  STAND_TO_SIT STAND_TO_SIT TRUE   
 5     5   0.843     0.364         0.358  SIT_TO_STAND STAND_TO_SIT FALSE  
 6     6   0.816     0.387         0.176  SIT_TO_STAND STAND_TO_SIT FALSE  
 7     7   0.600     0.490         0.370  STAND_TO_SIT STAND_TO_SIT TRUE   
 8     8   0.941     0.236         0.0851 STAND_TO_SIT STAND_TO_SIT TRUE   
 9     9   0.853     0.355         0.274  SIT_TO_STAND STAND_TO_SIT FALSE  
10    10   0.961     0.195         0.195  STAND_TO_SIT STAND_TO_SIT TRUE   
11    11   0.918     0.275         0.168  STAND_TO_SIT STAND_TO_SIT TRUE   
12    12   0.957     0.203         0.150  STAND_TO_SIT STAND_TO_SIT TRUE   
13    13   0.987     0.114         0.188  SIT_TO_STAND SIT_TO_STAND TRUE   
14    14   0.974     0.160         0.248  SIT_TO_STAND SIT_TO_STAND TRUE   
15    15   0.996     0.0657        0.0534 SIT_TO_STAND SIT_TO_STAND TRUE   
16    16   0.886     0.318         0.0868 SIT_TO_STAND SIT_TO_STAND TRUE   
17    17   0.773     0.419         0.173  SIT_TO_STAND SIT_TO_STAND TRUE   
18    18   0.998     0.0444        0.222  SIT_TO_STAND SIT_TO_STAND TRUE   
19    19   0.885     0.319         0.161  SIT_TO_STAND SIT_TO_STAND TRUE   
20    20   0.930     0.255         0.271  SIT_TO_STAND SIT_TO_STAND TRUE   
# … with 49 extra rows

Are standard deviations higher for misclassifications?

eval_table %>% 
  group_by(correct) %>% 
  summarise(count = n(),
            avg_mean = mean(maxprob),
            avg_sd = mean(maxprob_sd),
            avg_mc_sd = mean(maxprob_mc_sd))
# A tibble: 2 x 5
  correct count avg_mean avg_sd avg_mc_sd
  <lgl>   <int>    <dbl>  <dbl>     <dbl>
1 FALSE      19    0.775  0.380     0.237
2 TRUE       50    0.879  0.264     0.183

They are; though perhaps not to the extent we might desire.

With just six classes, we can also inspect standard deviations on the individual prediction-target pairings level.

eval_table %>% 
  group_by(truth, predicted) %>% 
  summarise(cnt = n(),
            avg_mean = mean(maxprob),
            avg_sd = mean(maxprob_sd),
            avg_mc_sd = mean(maxprob_mc_sd)) %>% 
  mutate(correct = truth == predicted) %>%
  arrange(desc(cnt), avg_mc_sd) 
# A tibble: 14 x 7
# Groups:   truth [6]
   truth        predicted      cnt avg_mean avg_sd avg_mc_sd correct
   <fct>        <fct>        <int>    <dbl>  <dbl>     <dbl> <lgl>  
 1 SIT_TO_STAND SIT_TO_STAND    12    0.935  0.205    0.184  TRUE   
 2 STAND_TO_SIT STAND_TO_SIT     9    0.871  0.284    0.162  TRUE   
 3 LIE_TO_SIT   LIE_TO_SIT       9    0.765  0.377    0.216  TRUE   
 4 SIT_TO_LIE   SIT_TO_LIE       8    0.908  0.254    0.187  TRUE   
 5 STAND_TO_LIE STAND_TO_LIE     7    0.956  0.144    0.132  TRUE   
 6 LIE_TO_STAND LIE_TO_STAND     5    0.809  0.353    0.227  TRUE   
 7 SIT_TO_LIE   STAND_TO_LIE     4    0.685  0.436    0.233  FALSE  
 8 LIE_TO_STAND SIT_TO_STAND     4    0.909  0.271    0.282  FALSE  
 9 STAND_TO_LIE SIT_TO_LIE       3    0.852  0.337    0.238  FALSE  
10 STAND_TO_SIT SIT_TO_STAND     3    0.837  0.368    0.269  FALSE  
11 LIE_TO_STAND LIE_TO_SIT       2    0.689  0.454    0.233  FALSE  
12 LIE_TO_SIT   STAND_TO_SIT     1    0.548  0.498    0.0805 FALSE  
13 SIT_TO_STAND LIE_TO_STAND     1    0.530  0.499    0.134  FALSE  
14 LIE_TO_SIT   LIE_TO_STAND     1    0.824  0.381    0.231  FALSE  

Again, we see higher standard deviations for wrong predictions, but not to a high degree.

Conclusion

We’ve shown how to build, train, and obtain predictions from a fully variational convnet. Evidently, there is room for experimentation: Alternative layer implementations exist; a different prior could be specified; the divergence could be calculated differently; and the usual neural network hyperparameter tuning options apply.

Then, there’s the question of consequences (or: decision making). What is going to happen in high-uncertainty cases, what even is a high-uncertainty case? Naturally, questions like these are out-of-scope for this post, yet of essential importance in real-world applications.

Thanks for reading!

Appendix

To be executed before running this post’s code. Copied from Classifying physical activity from smartphone data.

library(keras)     
library(tidyverse) 

activity_labels <- read.table("data/activity_labels.txt", 
                             col.names = c("number", "label")) 

one_hot_to_label <- activity_labels %>% 
  mutate(number = number - 7) %>% 
  filter(number >= 0) %>% 
  mutate(class = paste0("V",number + 1)) %>% 
  select(-number)

labels <- read.table(
  "data/RawData/labels.txt",
  col.names = c("experiment", "userId", "activity", "startPos", "endPos")
)

dataFiles <- list.files("data/RawData")
dataFiles %>% head()

fileInfo <- data_frame(
  filePath = dataFiles
) %>%
  filter(filePath != "labels.txt") %>%
  separate(filePath, sep = '_',
           into = c("type", "experiment", "userId"),
           remove = FALSE) %>%
  mutate(
    experiment = str_remove(experiment, "exp"),
    userId = str_remove_all(userId, "user|.txt")
  ) %>%
  spread(type, filePath)

# Read contents of single file to a dataframe with accelerometer and gyro data.
readInData <- function(experiment, userId){
  genFilePath = function(type) {
    paste0("data/RawData/", type, "_exp",experiment, "_user", userId, ".txt")
  }
  bind_cols(
    read.table(genFilePath("acc"), col.names = c("a_x", "a_y", "a_z")),
    read.table(genFilePath("gyro"), col.names = c("g_x", "g_y", "g_z"))
  )
}

# Function to read a given file and get the observations contained along
# with their classes.
loadFileData <- function(curExperiment, curUserId) {

  # load sensor data from file into dataframe
  allData <- readInData(curExperiment, curUserId)
  extractObservation <- function(startPos, endPos){
    allData[startPos:endPos,]
  }

  # get observation locations in this file from labels dataframe
  dataLabels <- labels %>%
    filter(userId == as.integer(curUserId),
           experiment == as.integer(curExperiment))

  # extract observations as dataframes and save as a column in dataframe.
  dataLabels %>%
    mutate(
      data = map2(startPos, endPos, extractObservation)
    ) %>%
    select(-startPos, -endPos)
}

# scan through all experiment and userId combos and gather data into a dataframe.
allObservations <- map2_df(fileInfo$experiment, fileInfo$userId, loadFileData) %>%
  right_join(activity_labels, by = c("activity" = "number")) %>%
  rename(activityName = label)

write_rds(allObservations, "allObservations.rds")

allObservations <- readRDS("allObservations.rds")

desiredActivities <- c(
  "STAND_TO_SIT", "SIT_TO_STAND", "SIT_TO_LIE", 
  "LIE_TO_SIT", "STAND_TO_LIE", "LIE_TO_STAND"  
)

filteredObservations <- allObservations %>% 
  filter(activityName %in% desiredActivities) %>% 
  mutate(observationId = 1:n())

# get all users
userIds <- allObservations$userId %>% unique()

# randomly choose 24 (80% of 30 individuals) for training
set.seed(42) # seed for reproducibility
trainIds <- sample(userIds, size = 24)

# set the rest of the users to the testing set
testIds <- setdiff(userIds,trainIds)

# filter data. 
# note S.K.: renamed to train_data for consistency with 
# variable naming used in this post
train_data <- filteredObservations %>% 
  filter(userId %in% trainIds)

# note S.K.: renamed to test_data for consistency with 
# variable naming used in this post
test_data <- filteredObservations %>% 
  filter(userId %in% testIds)

# note S.K.: renamed to pad_size for consistency with 
# variable naming used in this post
pad_size <- train_data$data %>% 
  map_int(nrow) %>% 
  quantile(p = 0.98) %>% 
  ceiling()

# note S.K.: renamed to one_hot_classes for consistency with 
# variable naming used in this post
one_hot_classes <- . %>% 
  {. - 7} %>%        # bring integers down to 0-5 from 7-12
  to_categorical()   # One-hot encode
Reyes-Ortiz, Jorge-L., Luca Oneto, Albert Samà, Xavier Parra, and Davide Anguita. 2016. “Transition-Aware Human Activity Recognition Using Smartphones.” Neurocomput. 171 (C): 754–67. https://doi.org/10.1016/j.neucom.2015.07.085.
