This article translates Daniel Falbel’s ‘Simple Audio Classification’ article from tensorflow/keras to torch/torchaudio. The main goal is to introduce torchaudio and illustrate its contributions to the torch ecosystem. Here, we focus on a popular dataset, the audio loader and the spectrogram transformer. An interesting side product is the parallel between torch and tensorflow, showing sometimes the differences, sometimes the similarities between them.
Downloading and Importing
torchaudio has the speechcommand_dataset built in. It filters out background_noise by default and lets us choose between versions v0.01 and v0.02.
# set an existing folder here to cache the dataset
DATASETS_PATH <- "~/datasets/"
# 1.4GB download
df <- speechcommand_dataset(
  root = DATASETS_PATH,
  url = "speech_commands_v0.01",
  download = TRUE
)
# expect folder: _background_noise_
df$EXCEPT_FOLDER
# [1] "_background_noise_"
# number of audio files
length(df)
# [1] 64721
# a sample
sample <- df[1]
sample$waveform[, 1:10]
torch_tensor
0.0001 *
0.9155 0.3052 1.8311 1.8311 -0.3052 0.3052 2.4414 0.9155 -0.9155 -0.6104
[ CPUFloatType{1,10} ]
sample$sample_rate
# 16000
sample$label
# bed
plot(sample$waveform[1], type = "l", col = "royalblue", main = sample$label)
Classes
 [1] "bed"    "bird"   "cat"    "dog"    "down"   "eight"  "five"
 [8] "four"   "go"     "happy"  "house"  "left"   "marvin" "nine"
[15] "no"     "off"    "on"     "one"    "right"  "seven"  "sheila"
[22] "six"    "stop"   "three"  "tree"   "two"    "up"     "wow"
[29] "yes"    "zero"
Generator Dataloader
torch::dataloader has the same task as data_generator defined in the original article. It is responsible for preparing batches (including shuffling, padding, one-hot encoding, etc.) and for taking care of parallelism and device I/O orchestration.
In torch we do this by passing the train/test subset to torch::dataloader and encapsulating all the batch setup logic inside a collate_fn() function.
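The text refers to train_subset and test_subset without showing how they are built. A minimal sketch of one way to obtain them follows, assuming a random 70/30 split; the split ratio and the seed are assumptions, not taken from this article. A 70% share is at least consistent with the 354 batches of 128 items that show up in the training log further down.

```r
set.seed(6)                      # arbitrary seed, for reproducibility only
n_items  <- 64721                # length(df) as reported above
id_train <- sample(n_items, size = floor(0.7 * n_items))  # assumed 70% train
id_test  <- setdiff(seq_len(n_items), id_train)

# Number of training batches at batch_size = 128
n_batches <- ceiling(length(id_train) / 128)

# With torch loaded and `df` available, the subsets could then be built as:
# train_subset <- torch::dataset_subset(df, id_train)
# test_subset  <- torch::dataset_subset(df, id_test)
```

The index vectors are plain integers, so they can be inspected or saved independently of the dataset object.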
At this point, dataloader(train_subset) would not work because the samples are not padded. So we need to build our own collate_fn() with the padding strategy.
I suggest using the following approach when implementing the collate_fn():
- begin with collate_fn <- function(batch) browser().
- instantiate dataloader with the collate_fn().
- create an environment by calling enumerate(dataloader) so you can ask to retrieve a batch from dataloader.
- run environment[[1]][[1]]. Now you should be sent inside collate_fn() with access to the batch input object.
- build the logic.
collate_fn <- function(batch) {
browser()
}
ds_train <- dataloader(
train_subset,
batch_size = 32,
shuffle = TRUE,
collate_fn = collate_fn
)
ds_train_env <- enumerate(ds_train)
ds_train_env[[1]][[1]]
The final collate_fn() pads the waveform to length 16001 and then stacks everything up together. At this point there are no spectrograms yet. We are going to make the spectrogram transformation a part of the model architecture.
pad_sequence <- function(batch) {
  # Make all tensors in a batch the same length by padding with zeros
  batch <- lapply(batch, function(x) x$t())
  batch <- torch::nn_utils_rnn_pad_sequence(batch, batch_first = TRUE, padding_value = 0.)
  return(batch$permute(c(1, 3, 2)))
}
# Final collate_fn
collate_fn <- function(batch) {
  # Input structure:
  # list of 32 lists: list(waveform, sample_rate, label, speaker_id, utterance_number)
  # Transpose it
  batch <- purrr::transpose(batch)
  tensors <- batch$waveform
  targets <- batch$label_index
  # Group the list of tensors into a batched tensor
  tensors <- pad_sequence(tensors)
  # target encoding
  targets <- torch::torch_stack(targets)
  list(tensors = tensors, targets = targets) # tensors: (batch_size, 1, 16001)
}
Batch structure is:
- batch[[1]]: waveforms, a tensor with dimension (32, 1, 16001)
- batch[[2]]: targets, a tensor with dimension (32, 1)
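To make the padding idea concrete without needing torch, here is a small base-R sketch of the same strategy: every sequence is right-padded with zeros to the length of the longest one, then the results are stacked. The helper pad_to_longest() and the toy vectors are hypothetical and only illustrate the principle; the actual work in collate_fn() is done by torch::nn_utils_rnn_pad_sequence().

```r
# Hypothetical stand-ins for three waveforms of unequal length
waves <- list(c(0.1, 0.2, 0.3), c(0.5, 0.6), c(0.9))

pad_to_longest <- function(waves, value = 0) {
  max_len <- max(lengths(waves))
  # Right-pad each vector with `value` up to the longest length
  padded <- vapply(
    waves,
    function(w) c(w, rep(value, max_len - length(w))),
    numeric(max_len)
  )
  t(padded)  # one row per waveform, one column per time step
}

batch_matrix <- pad_to_longest(waves)
dim(batch_matrix)
```

Padding on the right keeps the start of each recording aligned across the batch, which is also what the tensor version does.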
Also, torchaudio comes with 3 loaders, av_loader, tuner_loader, and audiofile_loader, with more to come. set_audio_backend() is used to set one of them as the audio loader. Their performance differs based on audio format (mp3 or wav). There is no perfect world yet: tuner_loader is best for mp3, audiofile_loader is best for wav, but neither of them has the option of partially loading a sample from an audio file without bringing all the data into memory first.
For a given audio backend we need to pass it to each worker through the worker_init_fn() argument.
ds_train <- dataloader(
train_subset,
batch_size = 128,
shuffle = TRUE,
collate_fn = collate_fn,
num_workers = 16,
worker_init_fn = function(.) {torchaudio::set_audio_backend("audiofile_loader")},
worker_globals = c("pad_sequence") # pad_sequence is required for collate_fn
)
ds_test <- dataloader(
test_subset,
batch_size = 64,
shuffle = FALSE,
collate_fn = collate_fn,
num_workers = 8,
worker_globals = c("pad_sequence") # pad_sequence is required for collate_fn
)
Model definition
Instead of keras::keras_model_sequential(), we are going to define a torch::nn_module(). As referenced by the original article, the model is based on this architecture for MNIST from this tutorial, and I’ll call it ‘DanielNN’.
dan_nn <- torch::nn_module(
"DanielNN",
initialize = function(
window_size_ms = 30,
window_stride_ms = 10
) {
# spectrogram spec
window_size <- as.integer(16000*window_size_ms/1000)
stride <- as.integer(16000*window_stride_ms/1000)
fft_size <- as.integer(2^trunc(log(window_size, 2) + 1))
n_chunks <- length(seq(0, 16000, stride))
self$spectrogram <- torchaudio::transform_spectrogram(
n_fft = fft_size,
win_length = window_size,
hop_length = stride,
normalized = TRUE,
power = 2
)
# convs 2D
self$conv1 <- torch::nn_conv2d(in_channels = 1, out_channels = 32, kernel_size = c(3,3))
self$conv2 <- torch::nn_conv2d(in_channels = 32, out_channels = 64, kernel_size = c(3,3))
self$conv3 <- torch::nn_conv2d(in_channels = 64, out_channels = 128, kernel_size = c(3,3))
self$conv4 <- torch::nn_conv2d(in_channels = 128, out_channels = 256, kernel_size = c(3,3))
# denses
self$dense1 <- torch::nn_linear(in_features = 14336, out_features = 128)
self$dense2 <- torch::nn_linear(in_features = 128, out_features = 30)
},
forward = function(x) {
x %>% # (64, 1, 16001)
self$spectrogram() %>% # (64, 1, 257, 101)
torch::torch_add(0.01) %>%
torch::torch_log() %>%
self$conv1() %>%
torch::nnf_relu() %>%
torch::nnf_max_pool2d(kernel_size = c(2,2)) %>%
self$conv2() %>%
torch::nnf_relu() %>%
torch::nnf_max_pool2d(kernel_size = c(2,2)) %>%
self$conv3() %>%
torch::nnf_relu() %>%
torch::nnf_max_pool2d(kernel_size = c(2,2)) %>%
self$conv4() %>%
torch::nnf_relu() %>%
torch::nnf_max_pool2d(kernel_size = c(2,2)) %>%
torch::nnf_dropout(p = 0.25) %>%
torch::torch_flatten(start_dim = 2) %>%
self$dense1() %>%
torch::nnf_relu() %>%
torch::nnf_dropout(p = 0.5) %>%
self$dense2()
}
)
model <- dan_nn()
device <- torch::torch_device(if(torch::cuda_is_available()) "cuda" else "cpu")
model$to(device = device)
print(model)
An `nn_module` containing 2,226,846 parameters.
── Modules ──────────────────────────────────────────────────────
● spectrogram: <Spectrogram> #0 parameters
● conv1: <nn_conv2d> #320 parameters
● conv2: <nn_conv2d> #18,496 parameters
● conv3: <nn_conv2d> #73,856 parameters
● conv4: <nn_conv2d> #295,168 parameters
● dense1: <nn_linear> #1,835,136 parameters
● dense2: <nn_linear> #3,870 parameters
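The hard-coded in_features = 14336 and the parameter counts printed above can be checked with plain arithmetic. The sketch below does this in base R under two assumptions that are not stated in the article but match the shapes in the forward() comments: the spectrogram uses centred frames (so the frame count is n_samples %/% stride + 1), and the 3x3 convolutions use no padding.

```r
# Spectrogram settings, mirroring the defaults in the module above
sample_rate <- 16000
n_samples   <- 16001                                         # padded waveform length
window_size <- as.integer(sample_rate * 30 / 1000)           # 480 samples
stride      <- as.integer(sample_rate * 10 / 1000)           # 160 samples
fft_size    <- as.integer(2^trunc(log(window_size, 2) + 1))  # next power of two

n_freq   <- fft_size %/% 2 + 1        # one-sided FFT bins
n_frames <- n_samples %/% stride + 1  # assumes centred (padded) frames

# A 3x3 convolution without padding shrinks each side by 2;
# 2x2 max pooling then halves it, rounding down.
conv_pool <- function(size) (size - 2) %/% 2
shape <- c(n_freq, n_frames)
for (i in 1:4) shape <- conv_pool(shape)

flat_features <- 256 * prod(shape)    # channels of conv4 times remaining grid

# Weights plus biases per layer
conv_params  <- function(c_in, c_out, k = 3) c_in * c_out * k * k + c_out
dense_params <- function(n_in, n_out) n_in * n_out + n_out
total_params <- conv_params(1, 32) + conv_params(32, 64) +
  conv_params(64, 128) + conv_params(128, 256) +
  dense_params(flat_features, 128) + dense_params(128, 30)
```

Changing window_size_ms or window_stride_ms changes n_freq or n_frames, so in_features of dense1 would have to be recomputed in the same way.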
Model fitting
Unlike in tensorflow, there is no model %>% compile(...) step in torch, so we are going to set the loss criterion, optimizer strategy and evaluation metrics explicitly in the training loop.
loss_criterion <- torch::nn_cross_entropy_loss()
optimizer <- torch::optim_adadelta(model$parameters, rho = 0.95, eps = 1e-7)
metrics <- list(acc = yardstick::accuracy_vec)
Training loop
library(glue)
library(progress)
pred_to_r <- function(x) {
  classes <- factor(df$classes)
  classes[as.numeric(x$to(device = "cpu"))]
}
set_progress_bar <- function(total) {
  progress_bar$new(
    total = total, clear = FALSE, width = 70,
    format = ":current/:total [:bar] - :elapsed - loss: :loss - acc: :acc"
  )
}
epochs <- 20
losses <- c()
accs <- c()
for(epoch in seq_len(epochs)) {
  pb <- set_progress_bar(length(ds_train))
  pb$message(glue("Epoch {epoch}/{epochs}"))
  coro::loop(for(batch in ds_train) {
    optimizer$zero_grad()
    predictions <- model(batch[[1]]$to(device = device))
    targets <- batch[[2]]$to(device = device)
    loss <- loss_criterion(predictions, targets)
    loss$backward()
    optimizer$step()
    # eval reports
    prediction_r <- pred_to_r(predictions$argmax(dim = 2))
    targets_r <- pred_to_r(targets)
    acc <- metrics$acc(targets_r, prediction_r)
    accs <- c(accs, acc)
    loss_r <- as.numeric(loss$item())
    losses <- c(losses, loss_r)
    pb$tick(tokens = list(loss = round(mean(losses), 4), acc = round(mean(accs), 4)))
  })
}
# test
predictions_r <- c()
targets_r <- c()
coro::loop(for(batch_test in ds_test) {
  predictions <- model(batch_test[[1]]$to(device = device))
  targets <- batch_test[[2]]$to(device = device)
  predictions_r <- c(predictions_r, pred_to_r(predictions$argmax(dim = 2)))
  targets_r <- c(targets_r, pred_to_r(targets))
})
val_acc <- metrics$acc(factor(targets_r, levels = 1:30), factor(predictions_r, levels = 1:30))
cat(glue("val_acc: {val_acc}\n\n"))
Epoch 1/20
[W SpectralOps.cpp:590] Warning: The function torch.rfft is deprecated and will be removed in a future PyTorch release. Use the new torch.fft module functions, instead, by importing torch.fft and calling torch.fft.fft or torch.fft.rfft. (function operator())
354/354 [=========================] - 1m - loss: 2.6102 - acc: 0.2333
Epoch 2/20
354/354 [=========================] - 1m - loss: 1.9779 - acc: 0.4138
Epoch 3/20
354/354 [============================] - 1m - loss: 1.62 - acc: 0.519
Epoch 4/20
354/354 [=========================] - 1m - loss: 1.3926 - acc: 0.5859
Epoch 5/20
354/354 [==========================] - 1m - loss: 1.2334 - acc: 0.633
Epoch 6/20
354/354 [=========================] - 1m - loss: 1.1135 - acc: 0.6685
Epoch 7/20
354/354 [=========================] - 1m - loss: 1.0199 - acc: 0.6961
Epoch 8/20
354/354 [=========================] - 1m - loss: 0.9444 - acc: 0.7181
Epoch 9/20
354/354 [=========================] - 1m - loss: 0.8816 - acc: 0.7365
Epoch 10/20
354/354 [=========================] - 1m - loss: 0.8278 - acc: 0.7524
Epoch 11/20
354/354 [=========================] - 1m - loss: 0.7818 - acc: 0.7659
Epoch 12/20
354/354 [=========================] - 1m - loss: 0.7413 - acc: 0.7778
Epoch 13/20
354/354 [=========================] - 1m - loss: 0.7064 - acc: 0.7881
Epoch 14/20
354/354 [=========================] - 1m - loss: 0.6751 - acc: 0.7974
Epoch 15/20
354/354 [=========================] - 1m - loss: 0.6469 - acc: 0.8058
Epoch 16/20
354/354 [=========================] - 1m - loss: 0.6216 - acc: 0.8133
Epoch 17/20
354/354 [=========================] - 1m - loss: 0.5985 - acc: 0.8202
Epoch 18/20
354/354 [=========================] - 1m - loss: 0.5774 - acc: 0.8263
Epoch 19/20
354/354 [==========================] - 1m - loss: 0.5582 - acc: 0.832
Epoch 20/20
354/354 [=========================] - 1m - loss: 0.5403 - acc: 0.8374
val_acc: 0.876705979296493
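One detail worth keeping in mind when reading this log: the loss and acc shown in the progress bar are means over all batches seen so far, because the losses and accs vectors are never reset between epochs. A running mean lags behind the most recent values while accuracy is improving, which may partly explain why the final training figure sits below val_acc. The sketch below illustrates the effect with made-up per-epoch accuracies; the numbers are hypothetical and not taken from the run above.

```r
# Hypothetical per-epoch accuracies, for illustration only
epoch_acc <- c(0.20, 0.60, 0.80, 0.90)

# What a never-reset accumulator would report at the end of each epoch
running_mean <- cumsum(epoch_acc) / seq_along(epoch_acc)

# Gap between the latest epoch and the reported running mean
lag <- epoch_acc - running_mean
```

Resetting the two accumulators at the start of each epoch would make the progress bar report per-epoch figures instead.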
Making predictions
We already have all predictions calculated for test_subset, so let’s recreate the alluvial plot from the original article.
library(dplyr)
library(alluvial)
df_validation <- data.frame(
  pred_class = df$classes[predictions_r],
  class = df$classes[targets_r]
)
x <- df_validation %>%
  mutate(correct = pred_class == class) %>%
  count(pred_class, class, correct)
alluvial(
  x %>% select(class, pred_class),
  freq = x$n,
  col = ifelse(x$correct, "lightblue", "red"),
  border = ifelse(x$correct, "lightblue", "red"),
  alpha = 0.6,
  hide = x$n < 20
)
Model accuracy is 87.7%, somewhat worse than the tensorflow version from the original post. Nevertheless, all conclusions from the original post still hold.
Reuse
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: “Figure from …”.
Citation
For attribution, please cite this work as
Damiani (2021, Feb. 4). RStudio AI Blog: Simple audio classification with torch. Retrieved from https://blogs.rstudio.com/tensorflow/posts/2021-02-04-simple-audio-classification-with-torch/
BibTeX citation
@misc{athossimpleaudioclassification, author = {Damiani, Athos}, title = {RStudio AI Blog: Simple audio classification with torch}, url = {https://blogs.rstudio.com/tensorflow/posts/2021-02-04-simple-audio-classification-with-torch/}, year = {2021} }