In image captioning, an algorithm is given an image and tasked with producing a sensible caption. It is a challenging task for a number of reasons, not least because it involves a notion of saliency or relevance. This is why recent deep learning approaches mostly include some "attention" mechanism (sometimes even more than one) to help focusing on relevant image features.
In this post, we demonstrate a formulation of image captioning as an encoder-decoder problem, enhanced by spatial attention over image grid cells. The idea comes from a recent paper on Neural Image Caption Generation with Visual Attention (Xu et al. 2015), and employs the same kind of attention algorithm as detailed in our post on machine translation.
We're porting Python code from a recent Google Colaboratory notebook, using Keras with TensorFlow eager execution to simplify our lives.
Prerequisites
The code shown here will work with the current CRAN versions of tensorflow, keras, and tfdatasets.
Check that you're using at least version 1.9 of TensorFlow. If that isn't the case, as of this writing, the following should get you version 1.10:
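# a minimal sketch: install_tensorflow() ships with the tensorflow R package
# and installs the current TensorFlow release
library(tensorflow)
install_tensorflow()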
When loading libraries, please make sure you're executing the first 4 lines in this exact order.
We want to make sure we're using the TensorFlow implementation of Keras (tf.keras in Python land), and we have to enable eager execution before using TensorFlow in any way.
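Here is the kind of setup we have in mind (a sketch; the list of additional helper packages is our own selection, based on the functions used below):
# the first four lines need to run in exactly this order:
# pick the TensorFlow implementation of Keras, then enable eager execution
# before TensorFlow is used in any other way
library(keras)
use_implementation("tensorflow")
library(tensorflow)
tfe_enable_eager_execution()

# further packages used throughout this post (our selection; see the companion code)
library(tfdatasets)
library(reticulate)
library(purrr)
library(stringr)
library(dplyr)
library(rjson)

# numpy, accessed via reticulate, is used to save and load the extracted image features
np <- import("numpy")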
No need to copy-paste any code snippets – you'll find the complete code (in the order required for execution) here: eager-image-captioning.R.
The dataset
MS-COCO ("Common Objects in Context") is one of, perhaps the, reference dataset in image captioning (and in object detection and segmentation, too).
We'll be using the training images and annotations from 2014 – be warned: depending on your location, the download can take a long time.
After unpacking, let's define where the images and captions are.
annotation_file <- "train2014/annotations/captions_train2014.json"
image_path <- "train2014/train2014"
The annotations are in JSON format, and there are 414113 of them! Luckily for us, we didn't have to download that many images – every image comes with 5 different captions, for better generalizability.
annotations <- fromJSON(file = annotation_file)
annot_captions <- annotations[[4]]
num_captions <- length(annot_captions)
We store both annotations and image paths in lists, for later loading.
all_captions <- vector(mode = "list", length = num_captions)
all_img_names <- vector(mode = "list", length = num_captions)

for (i in seq_len(num_captions)) {
  caption <- paste0("<start> ",
                    annot_captions[[i]][["caption"]],
                    " <end>"
                    )
  image_id <- annot_captions[[i]][["image_id"]]
  full_coco_image_path <- sprintf(
    "%s/COCO_train2014_%012d.jpg",
    image_path,
    image_id
  )
  all_img_names[[i]] <- full_coco_image_path
  all_captions[[i]] <- caption
}
Depending on your computing environment, you will certainly want to restrict the number of examples used.
This post will use 30000 captioned images, chosen randomly, with 20% set aside for validation.
Below, we take a random sample and split it into training and validation parts. The companion code will also store the indices on disk, so you can pick them up later for verification and analysis (a sketch of that follows the snippet below).
num_examples <- 30000
random_sample <- sample(1:num_captions, size = num_examples)
train_indices <- sample(random_sample, size = length(random_sample) * 0.8)
validation_indices <- setdiff(random_sample, train_indices)
sample_captions <- all_captions[random_sample]
sample_images <- all_img_names[random_sample]
train_captions <- all_captions[train_indices]
train_images <- all_img_names[train_indices]
validation_captions <- all_captions[validation_indices]
validation_images <- all_img_names[validation_indices]
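For reference, persisting the split can be as simple as this (the file names are our own choice; the companion code does the equivalent):
# store the sampled indices so the same split can be reused later
saveRDS(train_indices, "train_indices.rds")
saveRDS(validation_indices, "validation_indices.rds")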
Interlude
Before really diving into the technical stuff, let's take a moment to reflect on this task.
In typical image-related deep learning walk-throughs, we're used to seeing well-defined problems – even if, in some cases, the solution may be hard to find. Take, for example, the stereotypical dog vs. cat problem. Some dogs may look like cats and some cats may look like dogs, but that's about it: All in all, in the usual world we live in, it should be a more or less binary question.
If, on the other hand, we ask people to describe what they see in a scene, it is to be expected from the outset that we will get different answers. Still, how much consensus there is will very much depend on the concrete dataset we're using.
Let's take a look at some picks from the very first 20 training items sampled randomly above.
Now this image doesn't leave much room for deciding what to focus on, and it got a very factual caption indeed: "There is a plate with one slice of bacon a half of orange and bread." If the dataset were all like this, we'd think a machine learning algorithm should do quite well here.
Picking another one from the first 20:
What would be salient information to you here? The caption provided goes "A smiling little boy has a checkered shirt."
Is the look of the shirt as important as that? You might just as well focus on the scenery – or even on something on a completely different level: the age of the photo, or it being an analog one.
Let's take a final example.
What would you say about this scene? The official label we sampled here is "A group of people posing in a funny way for the camera." Well …
Please don't forget that for each image, the dataset includes 5 different captions (although our n = 30000 sample probably won't include all of them).
So this isn't saying the dataset is biased – not at all. Instead, we want to point out the ambiguities and difficulties inherent in the task. Actually, given these difficulties, it's all the more amazing that the task we're tackling here – having a network automatically generate image captions – should be possible at all!
Now let's see how we can do this.
For the encoding part of our encoder-decoder network, we will make use of InceptionV3 to extract image features. In principle, which features to extract is up to experimentation – here we just use the last layer before the fully connected top:
image_model <- application_inception_v3(
include_top = FALSE,
weights = "imagenet"
)
For an image size of 299×299, the output will be of size (batch_size, 8, 8, 2048); that is, we are making use of 2048 feature maps.
InceptionV3 being a "big model," where every pass through the model takes time, we want to precompute the features in advance and store them on disk.
We'll use tfdatasets to stream images to the model. This means all our preprocessing has to employ TensorFlow functions: that's why we're not using the more familiar image_load from keras below.
Our custom load_image will read in, resize and preprocess the images as required for use with InceptionV3.
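A sketch of such a function, consistent with how load_image is used in the snippets below (the exact definition lives in the companion code):
load_image <- function(image_path) {
  # read, resize to 299x299 and apply Inception V3 preprocessing,
  # using TensorFlow ops only so it can run inside the tfdatasets pipeline
  img <-
    tf$read_file(image_path) %>%
    tf$image$decode_jpeg(channels = 3) %>%
    tf$image$resize_images(c(299L, 299L)) %>%
    tf$keras$applications$inception_v3$preprocess_input()
  # keep the path around so the features can be saved under a matching name
  list(img, image_path)
}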
Now we’re prepared to avoid wasting the extracted options to disk. The (batch_size, 8, 8, 2048)
-sized options shall be flattened to (batch_size, 64, 2048)
. The latter form is what our encoder, quickly to be mentioned, will obtain as enter.
preencode <- unique(sample_images) %>% unlist() %>% sort()
num_unique <- length(preencode)

# adapt this according to your system's capacities
batch_size_4save <- 1
image_dataset <-
  tensor_slices_dataset(preencode) %>%
  dataset_map(load_image) %>%
  dataset_batch(batch_size_4save)

save_iter <- make_iterator_one_shot(image_dataset)
save_count <- 0

until_out_of_range({
  save_count <- save_count + batch_size_4save
  batch_4save <- save_iter$get_next()
  img <- batch_4save[[1]]
  path <- batch_4save[[2]]
  batch_features <- image_model(img)
  batch_features <- tf$reshape(
    batch_features,
    list(dim(batch_features)[1], -1L, dim(batch_features)[4])
  )
  for (i in 1:dim(batch_features)[1]) {
    np$save(path[i]$numpy()$decode("utf-8"),
            batch_features[i, , ]$numpy())
  }
})
Before we get to the encoder and decoder models though, we need to take care of the captions.
Processing the captions
We’re utilizing keras text_tokenizer
and the textual content processing features texts_to_sequences
and pad_sequences
to remodel ascii textual content right into a matrix.
# we will use the 5000 most frequent words only
top_k <- 5000
tokenizer <- text_tokenizer(
  num_words = top_k,
  oov_token = "<unk>",
  filters = '!"#$%&()*+.,-/:;=?@[]^_`~ ')
tokenizer$fit_on_texts(sample_captions)

train_captions_tokenized <-
  tokenizer %>% texts_to_sequences(train_captions)
validation_captions_tokenized <-
  tokenizer %>% texts_to_sequences(validation_captions)

# pad_sequences will use 0 to pad all captions to the same length
tokenizer$word_index["<pad>"] <- 0

# create a lookup dataframe that allows us to go in both directions
word_index_df <- data.frame(
  word = tokenizer$word_index %>% names(),
  index = tokenizer$word_index %>% unlist(use.names = FALSE),
  stringsAsFactors = FALSE
)
word_index_df <- word_index_df %>% arrange(index)
decode_caption <- function(text) {
  paste(map(text, function(number)
    word_index_df %>%
      filter(index == number) %>%
      select(word) %>%
      pull()),
    collapse = " ")
}
# pad all sequences to the same length (the maximum length, in our case)
# could experiment with shorter padding (truncating the very longest captions)
caption_lengths <- map(
  all_captions[1:num_examples],
  function(c) str_split(c, " ")[[1]] %>% length()
) %>% unlist()
max_length <- fivenum(caption_lengths)[5]

train_captions_padded <- pad_sequences(
  train_captions_tokenized,
  maxlen = max_length,
  padding = "post",
  truncating = "post"
)
validation_captions_padded <- pad_sequences(
  validation_captions_tokenized,
  maxlen = max_length,
  padding = "post",
  truncating = "post"
)
Loading the data for training
Now that we’ve taken care of pre-extracting the options and preprocessing the captions, we want a strategy to stream them to our captioning mannequin. For that, we’re utilizing tensor_slices_dataset
from tfdatasets, passing within the record of paths to the photographs and the preprocessed captions. Loading the photographs is then carried out as a TensorFlow graph operation (utilizing tf$pyfunc).
The authentic Colab code additionally shuffles the information on each iteration. Depending in your {hardware}, this will likely take a very long time, and given the dimensions of the dataset it’s not strictly essential to get cheap outcomes. (The outcomes reported beneath have been obtained with out shuffling.)
batch_size <- 10
buffer_size <- num_examples
map_func <- function(img_name, cap) {
  p <- paste0(img_name$decode("utf-8"), ".npy")
  img_tensor <- np$load(p)
  img_tensor <- tf$cast(img_tensor, tf$float32)
  list(img_tensor, cap)
}

train_dataset <-
  tensor_slices_dataset(list(train_images, train_captions_padded)) %>%
  dataset_map(
    function(item1, item2) tf$py_func(map_func, list(item1, item2), list(tf$float32, tf$int32))
  ) %>%
  # optionally shuffle the dataset
  # dataset_shuffle(buffer_size) %>%
  dataset_batch(batch_size)
Captioning model
The model is basically the same as the one discussed in the machine translation post. Please refer to that article for an explanation of the concepts, as well as a detailed walk-through of the tensor shapes involved at every step. Here, we provide the tensor shapes as comments in the code snippets, for quick overview/comparison.
However, if you develop your own models, with eager execution you can simply insert debugging/logging statements at arbitrary places in the code – even in model definitions. So you can have a function along the following lines (a sketch; the companion code contains the exact helper used in the snippets below):
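# shape-logging helper (name and output format are our own sketch);
# prints the shape of x, labeled by context, whenever the debugshapes flag is on
maybecat <- function(context, x) {
  if (debugshapes) {
    cat(context, ": ", paste0(dim(x), collapse = " "), "\n", sep = "")
  }
}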
And if you now set
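# the flag checked by the helper sketched above (flag name is our assumption)
debugshapes <- TRUE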
you can trace – not only tensor shapes, but actual tensor values – through your models, as shown below for the encoder. (We don't display any further debugging statements after that, but the sample code has many more.)
Encoder
Now it's time to define some sizing-related hyperparameters and housekeeping variables:
# for encoder output
embedding_dim <- 256
# decoder (GRU) capacity
gru_units <- 512
# for decoder output
vocab_size <- top_k
# number of feature maps obtained from Inception V3
features_shape <- 2048
# shape of attention features (flattened from 8x8)
attention_features_shape <- 64
The encoder in this case is just a fully connected layer that takes in the features extracted from Inception V3 (in flattened form, as they were written to disk) and embeds them in 256-dimensional space.
cnn_encoder <- function(embedding_dim, name = NULL) {
  keras_model_custom(name = name, function(self) {
    self$fc <- layer_dense(units = embedding_dim, activation = "relu")

    function(x, mask = NULL) {
      # input shape: (batch_size, 64, features_shape)
      maybecat("encoder input", x)
      # shape after fc: (batch_size, 64, embedding_dim)
      x <- self$fc(x)
      maybecat("encoder output", x)
      x
    }
  })
}
Attention module
Unlike in the machine translation post, here the attention module is separated out into its own custom model.
The logic is the same though:
attention_module <- function(gru_units, name = NULL) {
  keras_model_custom(name = name, function(self) {
    self$W1 = layer_dense(units = gru_units)
    self$W2 = layer_dense(units = gru_units)
    self$V = layer_dense(units = 1)

    function(inputs, mask = NULL) {
      features <- inputs[[1]]
      hidden <- inputs[[2]]
      # features (CNN encoder output) shape == (batch_size, 64, embedding_dim)
      # hidden shape == (batch_size, gru_units)
      # hidden_with_time_axis shape == (batch_size, 1, gru_units)
      hidden_with_time_axis <- k_expand_dims(hidden, axis = 2)
      # score shape == (batch_size, 64, 1)
      score <- self$V(k_tanh(self$W1(features) + self$W2(hidden_with_time_axis)))
      # attention_weights shape == (batch_size, 64, 1)
      attention_weights <- k_softmax(score, axis = 2)
      # context_vector shape after sum == (batch_size, embedding_dim)
      context_vector <- k_sum(attention_weights * features, axis = 2)
      list(context_vector, attention_weights)
    }
  })
}
Decoder
At each time step, the decoder calls the attention module with the features it got from the encoder and its last hidden state, and receives back an attention vector. The attention vector gets concatenated with the current input and further processed by a GRU and two fully connected layers, the last of which gives us the (unnormalized) probabilities for the next word in the caption.
The current input at each time step here is the previous word: the correct one during training (teacher forcing), the last generated one during inference.
rnn_decoder <- function(embedding_dim, gru_units, vocab_size, name = NULL) {
  keras_model_custom(name = name, function(self) {
    self$gru_units <- gru_units
    self$embedding <- layer_embedding(input_dim = vocab_size,
                                      output_dim = embedding_dim)
    self$gru <- if (tf$test$is_gpu_available()) {
      layer_cudnn_gru(
        units = gru_units,
        return_sequences = TRUE,
        return_state = TRUE,
        recurrent_initializer = 'glorot_uniform'
      )
    } else {
      layer_gru(
        units = gru_units,
        return_sequences = TRUE,
        return_state = TRUE,
        recurrent_initializer = 'glorot_uniform'
      )
    }
    self$fc1 <- layer_dense(units = self$gru_units)
    self$fc2 <- layer_dense(units = vocab_size)
    self$attention <- attention_module(self$gru_units)

    function(inputs, mask = NULL) {
      x <- inputs[[1]]
      features <- inputs[[2]]
      hidden <- inputs[[3]]
      c(context_vector, attention_weights) %<-%
        self$attention(list(features, hidden))
      # x shape after passing through embedding == (batch_size, 1, embedding_dim)
      x <- self$embedding(x)
      # x shape after concatenation == (batch_size, 1, 2 * embedding_dim)
      x <- k_concatenate(list(k_expand_dims(context_vector, 2), x))
      # passing the concatenated vector to the GRU
      c(output, state) %<-% self$gru(x)
      # shape == (batch_size, 1, gru_units)
      x <- self$fc1(output)
      # x shape == (batch_size, gru_units)
      x <- k_reshape(x, c(-1, dim(x)[[3]]))
      # output shape == (batch_size, vocab_size)
      x <- self$fc2(x)
      list(x, state, attention_weights)
    }
  })
}
Loss function, and instantiating it all
Now that we've defined our model (made up of three custom models), we still need to actually instantiate it (to be precise: the two classes we will access from the outside, that is, the encoder and the decoder).
We also need to instantiate an optimizer (Adam will do), and define our loss function (categorical crossentropy).
Note that tf$nn$sparse_softmax_cross_entropy_with_logits expects raw logits instead of softmax activations, and that we are using the sparse variant because our labels are not one-hot-encoded.
encoder <- cnn_encoder(embedding_dim)
decoder <- rnn_decoder(embedding_dim, gru_units, vocab_size)
optimizer <- tf$train$AdamOptimizer()

cx_loss <- function(y_true, y_pred) {
  mask <- 1 - k_cast(y_true == 0L, dtype = "float32")
  loss <- tf$nn$sparse_softmax_cross_entropy_with_logits(
    labels = y_true,
    logits = y_pred
  ) * mask
  tf$reduce_mean(loss)
}
Training
Training the captioning model is a time-consuming process, and you will definitely want to save the model's weights!
How does this work with eager execution?
We create a tf$train$Checkpoint object, passing it the objects to be saved: in our case, the encoder, the decoder, and the optimizer. Later, at the end of each epoch, we will ask it to write the respective weights to disk.
restore_checkpoint <- FALSE
checkpoint_dir <- "./checkpoints_captions"
checkpoint_prefix <- file.path(checkpoint_dir, "ckpt")
checkpoint <- tf$train$Checkpoint(
optimizer = optimizer,
encoder = encoder,
decoder = decoder
)
As we’re simply beginning to prepare the mannequin, restore_checkpoint
is about to false. Later, restoring the weights shall be as simple as
if (restore_checkpoint) {
  checkpoint$restore(tf$train$latest_checkpoint(checkpoint_dir))
}
The training loop is structured just like in the machine translation case: we loop over epochs, batches, and the training targets, feeding in the correct previous word at every timestep.
Again, tf$GradientTape takes care of recording the forward pass and calculating the gradients, and the optimizer applies the gradients to the models' weights.
As each epoch ends, we also save the weights.
num_epochs <- 20

if (!restore_checkpoint) {
  for (epoch in seq_len(num_epochs)) {
    total_loss <- 0
    progress <- 0
    train_iter <- make_iterator_one_shot(train_dataset)

    until_out_of_range({
      batch <- iterator_get_next(train_iter)
      loss <- 0
      img_tensor <- batch[[1]]
      target_caption <- batch[[2]]
      dec_hidden <- k_zeros(c(batch_size, gru_units))
      dec_input <- k_expand_dims(
        rep(list(word_index_df[word_index_df$word == "<start>", "index"]),
            batch_size)
      )

      with(tf$GradientTape() %as% tape, {
        features <- encoder(img_tensor)
        for (t in seq_len(dim(target_caption)[2] - 1)) {
          c(preds, dec_hidden, weights) %<-%
            decoder(list(dec_input, features, dec_hidden))
          loss <- loss + cx_loss(target_caption[, t], preds)
          dec_input <- k_expand_dims(target_caption[, t])
        }
      })

      total_loss <-
        total_loss + loss / k_cast_to_floatx(dim(target_caption)[2])

      variables <- c(encoder$variables, decoder$variables)
      gradients <- tape$gradient(loss, variables)

      optimizer$apply_gradients(purrr::transpose(list(gradients, variables)),
                                global_step = tf$train$get_or_create_global_step()
      )
    })

    cat(paste0(
      "\n\nTotal loss (epoch): ",
      epoch,
      ": ",
      (total_loss / k_cast_to_floatx(buffer_size)) %>% as.double() %>% round(4),
      "\n"
    ))

    checkpoint$save(file_prefix = checkpoint_prefix)
  }
}
Peeking at results
Just like in the translation case, it's interesting to look at model performance during training. The companion code has that functionality integrated, so you can watch model progress for yourself.
The basic function here is get_caption: it gets passed the path to an image, loads it, obtains its features from Inception V3, and then asks the encoder-decoder model to generate a caption. If at any point the model produces the <end> symbol, we stop early. Otherwise, we continue until we hit the predefined maximum length.
get_caption <- function(image) {
  attention_matrix <- matrix(0, nrow = max_length, ncol = attention_features_shape)

  temp_input <- k_expand_dims(load_image(image)[[1]], 1)
  img_tensor_val <- image_model(temp_input)
  img_tensor_val <- k_reshape(
    img_tensor_val,
    list(dim(img_tensor_val)[1], -1, dim(img_tensor_val)[4])
  )
  features <- encoder(img_tensor_val)

  dec_hidden <- k_zeros(c(1, gru_units))
  dec_input <- k_expand_dims(
    list(word_index_df[word_index_df$word == "<start>", "index"])
  )

  result <- ""

  for (t in seq_len(max_length - 1)) {
    c(preds, dec_hidden, attention_weights) %<-%
      decoder(list(dec_input, features, dec_hidden))
    attention_weights <- k_reshape(attention_weights, c(-1))
    attention_matrix[t, ] <- attention_weights %>% as.double()

    pred_idx <-
      tf$multinomial(exp(preds), num_samples = 1)[1, 1] %>% as.double()
    pred_word <-
      word_index_df[word_index_df$index == pred_idx, "word"]

    if (pred_word == "<end>") {
      result <- paste(result, pred_word)
      attention_matrix <-
        attention_matrix[1:length(str_split(result, " ")[[1]]), , drop = FALSE]
      return(list(result, attention_matrix))
    } else {
      result <- paste(result, pred_word)
      dec_input <- k_expand_dims(list(pred_idx))
    }
  }

  list(str_trim(result), attention_matrix)
}
With that functionality in place, let's actually do this: peek at results while the network is learning!
We've picked 3 examples each from the training and validation sets. Here they are.
First, our picks from the training set:
Let's see the target captions:
- a herd of giraffe standing on top of a grass covered field
- a view of cards driving down a street
- the skateboarding flips his board off of the sidewalk
Interestingly, here we even have an illustration of how labeled datasets (like everything human) may contain errors. (The samples weren't picked for that; rather, they were chosen – without too much screening – for being reasonably unequivocal in their visual content.)
Now for the validation candidates.
and their official captions:
- a left handed pitcher throwing the ground ball
- a woman taking a bite of a slice of pizza in a restaraunt
- a woman hitting swinging a tennis racket at a tennis ball on a tennis court
(Again, any spelling peculiarities have not been introduced by us.)
Epoch 1
Now, what does our network produce after the first epoch? Remember that this means having seen each of the 24000 training images once.
First then, here are the captions for the training images:
a group of sheep standing in the grass
a group of cars driving down a street
a person is standing on a street
Not only is the syntax correct in every case, the content isn't that bad either!
How about the validation set?
a baseball player is playing baseball uniform is holding a baseball bat
a person is holding a table with a table with a table with a table with a table with a table with a table with a table with a table with a table with a table with a table with a table with a table
a tennis player is holding a tennis court
This certainly shows that the network has been able to generalize over – let's not call them concepts, but mappings between visual and textual entities, say. It's true that it will have seen some of these images before, because every image comes with several captions. You could be stricter in setting up your training and validation sets – but here, we don't really care about objective performance scores, and so it doesn't really matter.
Let's skip right to epoch 20, our last training epoch, and check for further improvements.
Epoch 20
This is what we get for the training images:
a group of many tall giraffe standing next to a sheep
a view of cards and white gloves on a street
a skateboarding flips his board
And this, for the validation images:
a baseball catcher and umpire hit a baseball game
a person is eating a sandwich
a female tennis player is in the court
I think we'd agree that this still leaves room for improvement – but then, we only trained for 20 epochs, and on a very small portion of the dataset.
In the code snippets above, you may have noticed the decoder returning an attention_matrix – which we haven't commented on so far.
Now, finally, just as in the translation example, let's take a look at what we can make of it.
Where does the network look?
We can visualize where the network is "looking" as it generates each word, by overlaying the original image with the attention matrix. This example is taken from the 4th epoch.
Here, whitish squares indicate areas receiving stronger focus. Compared to text-to-text translation, though, the mapping is inherently less straightforward – where does one "look" when generating words like "and," "the," or "in"?
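In case you'd like to produce such overlays yourself, here is a minimal sketch using base R graphics plus the jpeg package (plot_attention, the reshaping to the 8×8 grid and the upscaling choices are entirely our own, not taken from the companion code): each row of the attention matrix is mapped back to the feature grid and drawn, semi-transparently, on top of the image.
# minimal sketch (our own assumptions throughout)
plot_attention <- function(attention_matrix, image_path, result) {
  words <- str_split(str_trim(result), " ")[[1]]
  img <- jpeg::readJPEG(image_path)
  for (t in seq_along(words)) {
    # row t holds 64 weights; assume they came from a row-major 8 x 8 grid
    att <- matrix(attention_matrix[t, ], nrow = 8, ncol = 8, byrow = TRUE)
    # white squares whose opacity is proportional to the attention weight
    overlay <- as.raster(matrix(rgb(1, 1, 1, att / max(att)), nrow = 8, ncol = 8))
    plot(as.raster(img))
    title(main = words[t])
    rasterImage(overlay, xleft = 0, ybottom = 0,
                xright = ncol(img), ytop = nrow(img), interpolate = TRUE)
  }
}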
Conclusion
It probably goes without saying that much better results are to be expected when training on (much!) more data, and for much more time.
Apart from that, there are other options. The concept implemented here uses spatial attention over a uniform grid; that is, the attention mechanism guides the decoder where on the grid to look next when generating a caption.
However, this is not the only way, and it is not how it works with humans. A much more plausible approach is a combination of top-down and bottom-up attention. E.g., Anderson et al. (2017) use object detection techniques to isolate interesting objects bottom-up, and an LSTM stack wherein the first LSTM computes top-down attention guided by the output word generated by the second.
Another interesting approach involving attention is the multimodal attentive translator (Liu et al. 2017), where the image features are encoded and presented in a sequence, such that we end up with sequence models on both the encoding and the decoding sides.
Another alternative is to add a learned topic to the information input (Zhu, Xue, and Yuan 2018), which again is a top-down feature found in human cognition.
If you find one of these, or yet another, approach more convincing, an eager execution implementation in the style of the above will likely be a sound way of implementing it.