Last Updated on November 2, 2022
We have arrived at a point where we have implemented and tested the Transformer encoder and decoder separately, and we may now join the two together into a complete model. We will also see how to create padding and look-ahead masks by which we will suppress the input values that will not be considered in the encoder or decoder computations. Our end goal remains to apply the complete model to Natural Language Processing (NLP).
In this tutorial, you will discover how to implement the complete Transformer model and create padding and look-ahead masks.
After completing this tutorial, you will know:
- How to create a padding mask for the encoder and decoder
- How to create a look-ahead mask for the decoder
- How to join the Transformer encoder and decoder into a single model
- How to print out a summary of the encoder and decoder layers
Let's get started.
Tutorial Overview
This tutorial is divided into four parts; they are:
- Recap of the Transformer Architecture
- Masking
- Creating a Padding Mask
- Creating a Look-Ahead Mask
- Joining the Transformer Encoder and Decoder
- Creating an Instance of the Transformer Model
- Printing Out a Summary of the Encoder and Decoder Layers
Prerequisites
For this tutorial, we assume that you are already familiar with the Transformer encoder and decoder implemented in the previous tutorials.
Recap of the Transformer Architecture
Recall having seen that the Transformer architecture follows an encoder-decoder structure. The encoder, on the left-hand side, is tasked with mapping an input sequence to a sequence of continuous representations; the decoder, on the right-hand side, receives the output of the encoder together with the decoder output at the previous time step to generate an output sequence.
In generating an output sequence, the Transformer does not rely on recurrence and convolutions.
You have seen how to implement the Transformer encoder and decoder separately. In this tutorial, you will join the two into a complete Transformer model and apply padding and look-ahead masking to the input values.
Let's start first by discovering how to apply masking.
Kick-start your project with my book Building Transformer Models with Attention. It provides self-study tutorials with working code to guide you into building a fully-working transformer model that can translate sentences from one language to another...
Masking
Creating a Padding Mask
You should already be familiar with the importance of masking the input values before feeding them into the encoder and decoder.
As you will see when you proceed to train the Transformer model, the input sequences fed into the encoder and decoder will first be zero-padded up to a specific sequence length. The importance of having a padding mask is to make sure that these zero values are not processed along with the actual input values by both the encoder and decoder.
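As a quick illustration of what that zero-padding looks like (a standalone sketch, not part of the model code; the sequences and the length of 7 are made up), the Keras pad_sequences utility can append zeros until every sequence reaches a chosen length:

from tensorflow.keras.preprocessing.sequence import pad_sequences

# Two token-ID sequences of unequal length
sequences = [[1, 2, 3, 4], [5, 6]]

# Append zeros so that both sequences reach a length of 7
padded = pad_sequences(sequences, maxlen=7, padding='post')
print(padded)
# [[1 2 3 4 0 0 0]
#  [5 6 0 0 0 0 0]]

The first padded sequence is exactly the example array used with the padding mask below.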
Let's create the following function to generate a padding mask for both the encoder and decoder:
from tensorflow import math, cast, float32

def padding_mask(input):
    # Create mask which marks the zero padding values in the input by a 1
    mask = math.equal(input, 0)
    mask = cast(mask, float32)

    return mask
Upon receiving an input, this function will generate a tensor that marks by a value of one wherever the input contains a value of zero.
Hence, if you input the following array:
from numpy import array

input = array([1, 2, 3, 4, 0, 0, 0])
print(padding_mask(input))
Then the output of the padding_mask function would be the following:
tf.Tensor([0. 0. 0. 0. 1. 1. 1.], shape=(7,), dtype=float32)
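To see what those ones achieve, here is a minimal sketch (not taken from the attention code of the previous tutorials) of the standard way such a mask is applied inside scaled dot-product attention: a very large negative value is added to the scores of the masked positions, so the softmax assigns them attention weights of practically zero. The attention scores below are random dummies for illustration:

from numpy import array
from tensorflow import random
from tensorflow.keras.backend import softmax

# Dummy attention scores for a sequence of length 7
scores = random.normal((7, 7))

# Padding mask for the example input above
mask = padding_mask(array([1, 2, 3, 4, 0, 0, 0]))

# Push the masked scores towards -infinity before the softmax
weights = softmax(scores + (-1e9 * mask))
print(weights)  # the last three columns hold (near-)zero attention weights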
Creating a Look-Ahead Mask
A look-ahead mask is required to prevent the decoder from attending to succeeding words, such that the prediction for a particular word can only depend on known outputs for the words that come before it.
For this purpose, let's create the following function to generate a look-ahead mask for the decoder:
from tensorflow import linalg, ones

def lookahead_mask(shape):
    # Mask out future entries by marking them with a 1.0
    mask = 1 - linalg.band_part(ones((shape, shape)), -1, 0)

    return mask
You will pass to it the length of the decoder input. Let's make this length equal to 5, as an example:

print(lookahead_mask(5))
Then the output that the lookahead_mask function returns is the following:
tf.Tensor(
[[0. 1. 1. 1. 1.]
 [0. 0. 1. 1. 1.]
 [0. 0. 0. 1. 1.]
 [0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0.]], shape=(5, 5), dtype=float32)
Again, the one values mask out the entries that should not be used. In this way, the prediction of every word only depends on those that come before it.
Joining the Transformer Encoder and Decoder
Let's start by creating the class, TransformerModel, which inherits from the Model base class in Keras:
class TransformerModel(Model):
    def __init__(self, enc_vocab_size, dec_vocab_size, enc_seq_length, dec_seq_length, h, d_k, d_v, d_model, d_ff_inner, n, rate, **kwargs):
        super(TransformerModel, self).__init__(**kwargs)

        # Set up the encoder
        self.encoder = Encoder(enc_vocab_size, enc_seq_length, h, d_k, d_v, d_model, d_ff_inner, n, rate)

        # Set up the decoder
        self.decoder = Decoder(dec_vocab_size, dec_seq_length, h, d_k, d_v, d_model, d_ff_inner, n, rate)

        # Define the final dense layer
        self.model_last_layer = Dense(dec_vocab_size)
        ...
Our first step in creating the TransformerModel class is to initialize instances of the Encoder and Decoder classes implemented earlier and assign their outputs to the variables, encoder and decoder, respectively. If you saved these classes in separate Python scripts, do not forget to import them. I saved my code in the Python scripts encoder.py and decoder.py, so I need to import them accordingly.
You will also include one final dense layer that produces the final output, as in the Transformer architecture of Vaswani et al. (2017).
Next, you shall create the class method, call(), to feed the relevant inputs into the encoder and decoder.
A padding mask is first generated to mask the encoder input, as well as the encoder output, when this is fed into the second self-attention block of the decoder:
...
def call(self, encoder_input, decoder_input, training):

    # Create padding mask to mask the encoder inputs and the encoder outputs in the decoder
    enc_padding_mask = self.padding_mask(encoder_input)
    ...
A padding mask and a look-ahead mask are then generated to mask the decoder input. These are combined together through an element-wise maximum operation:
    ...
    # Create and combine padding and look-ahead masks to be fed into the decoder
    dec_in_padding_mask = self.padding_mask(decoder_input)
    dec_in_lookahead_mask = self.lookahead_mask(decoder_input.shape[1])
    dec_in_lookahead_mask = maximum(dec_in_padding_mask, dec_in_lookahead_mask)
    ...
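If it helps to see this combination in isolation, below is a small standalone sketch using the plain padding_mask() and lookahead_mask() functions from earlier (without the extra broadcast axes the class methods add; the dummy decoder input is made up). The element-wise maximum keeps a position masked if either of the two masks flags it:

from numpy import array
from tensorflow import maximum

# One decoder sequence of length 5, zero-padded in its last two positions
dec_input = array([[1, 2, 3, 0, 0]])

pad = padding_mask(dec_input)              # marks the zero padding with ones
look = lookahead_mask(dec_input.shape[1])  # marks the future positions with ones

# A position stays masked if either mask flags it
print(maximum(pad, look))

In the result, every row still hides the positions that come after it, and the two padded positions are hidden from every row.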
Next, the relevant inputs are fed into the encoder and decoder, and the Transformer model output is generated by feeding the decoder output into one final dense layer:
    ...
    # Feed the input into the encoder
    encoder_output = self.encoder(encoder_input, enc_padding_mask, training)

    # Feed the encoder output into the decoder
    decoder_output = self.decoder(decoder_input, encoder_output, dec_in_lookahead_mask, enc_padding_mask, training)

    # Pass the decoder output through a final dense layer
    model_output = self.model_last_layer(decoder_output)

    return model_output
Combining all the steps gives us the following complete code listing:
from encoder import Encoder
from decoder import Decoder
from tensorflow import math, cast, float32, linalg, ones, maximum, newaxis
from tensorflow.keras import Model
from tensorflow.keras.layers import Dense


class TransformerModel(Model):
    def __init__(self, enc_vocab_size, dec_vocab_size, enc_seq_length, dec_seq_length, h, d_k, d_v, d_model, d_ff_inner, n, rate, **kwargs):
        super(TransformerModel, self).__init__(**kwargs)

        # Set up the encoder
        self.encoder = Encoder(enc_vocab_size, enc_seq_length, h, d_k, d_v, d_model, d_ff_inner, n, rate)

        # Set up the decoder
        self.decoder = Decoder(dec_vocab_size, dec_seq_length, h, d_k, d_v, d_model, d_ff_inner, n, rate)

        # Define the final dense layer
        self.model_last_layer = Dense(dec_vocab_size)

    def padding_mask(self, input):
        # Create mask which marks the zero padding values in the input by a 1.0
        mask = math.equal(input, 0)
        mask = cast(mask, float32)

        # The shape of the mask should be broadcastable to the shape
        # of the attention weights that it will be masking later on
        return mask[:, newaxis, newaxis, :]

    def lookahead_mask(self, shape):
        # Mask out future entries by marking them with a 1.0
        mask = 1 - linalg.band_part(ones((shape, shape)), -1, 0)

        return mask

    def call(self, encoder_input, decoder_input, training):

        # Create padding mask to mask the encoder inputs and the encoder outputs in the decoder
        enc_padding_mask = self.padding_mask(encoder_input)

        # Create and combine padding and look-ahead masks to be fed into the decoder
        dec_in_padding_mask = self.padding_mask(decoder_input)
        dec_in_lookahead_mask = self.lookahead_mask(decoder_input.shape[1])
        dec_in_lookahead_mask = maximum(dec_in_padding_mask, dec_in_lookahead_mask)

        # Feed the input into the encoder
        encoder_output = self.encoder(encoder_input, enc_padding_mask, training)

        # Feed the encoder output into the decoder
        decoder_output = self.decoder(decoder_input, encoder_output, dec_in_lookahead_mask, enc_padding_mask, training)

        # Pass the decoder output through a final dense layer
        model_output = self.model_last_layer(decoder_output)

        return model_output
Note that you have made a small change to the output that is returned by the padding_mask function. Its shape is made broadcastable to the shape of the attention weight tensor that it will mask when you train the Transformer model.
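As a quick check of that change (a standalone sketch; the batch below is made up), the two extra newaxis indices turn a mask of shape (batch size, sequence length) into one of shape (batch size, 1, 1, sequence length), which can broadcast against attention weights of shape (batch size, number of heads, sequence length, sequence length):

from numpy import array
from tensorflow import math, cast, float32, newaxis

# A dummy batch of two zero-padded sequences of length 5
batch = array([[1, 2, 3, 0, 0],
               [4, 5, 0, 0, 0]])

# Same steps as the padding_mask() class method
mask = cast(math.equal(batch, 0), float32)
print(mask.shape)                          # (2, 5)
print(mask[:, newaxis, newaxis, :].shape)  # (2, 1, 1, 5)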
Creating an Instance of the Transformer Model
You will work with the parameter values specified in the paper, Attention Is All You Need, by Vaswani et al. (2017):
h = 8  # Number of self-attention heads
d_k = 64  # Dimensionality of the linearly projected queries and keys
d_v = 64  # Dimensionality of the linearly projected values
d_ff = 2048  # Dimensionality of the inner fully connected layer
d_model = 512  # Dimensionality of the model sub-layers' outputs
n = 6  # Number of layers in the encoder stack

dropout_rate = 0.1  # Frequency of dropping the input units in the dropout layers
...
As for the input-related parameters, you will work with dummy values for now until you arrive at the stage of training the complete Transformer model. At that point, you will use actual sentences:
...
enc_vocab_size = 20  # Vocabulary size for the encoder
dec_vocab_size = 20  # Vocabulary size for the decoder

enc_seq_length = 5  # Maximum length of the input sequence
dec_seq_length = 5  # Maximum length of the target sequence
...
You can now create an instance of the TransformerModel class as follows:
from model import TransformerModel

# Create model
training_model = TransformerModel(enc_vocab_size, dec_vocab_size, enc_seq_length, dec_seq_length, h, d_k, d_v, d_model, d_ff, n, dropout_rate)
The complete code listing is as follows:
from model import TransformerModel

enc_vocab_size = 20  # Vocabulary size for the encoder
dec_vocab_size = 20  # Vocabulary size for the decoder

enc_seq_length = 5  # Maximum length of the input sequence
dec_seq_length = 5  # Maximum length of the target sequence

h = 8  # Number of self-attention heads
d_k = 64  # Dimensionality of the linearly projected queries and keys
d_v = 64  # Dimensionality of the linearly projected values
d_ff = 2048  # Dimensionality of the inner fully connected layer
d_model = 512  # Dimensionality of the model sub-layers' outputs
n = 6  # Number of layers in the encoder stack

dropout_rate = 0.1  # Frequency of dropping the input units in the dropout layers

# Create model
training_model = TransformerModel(enc_vocab_size, dec_vocab_size, enc_seq_length, dec_seq_length, h, d_k, d_v, d_model, d_ff, n, dropout_rate)
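Before moving on to training, a quick optional sanity check (not part of the original listing) is to call the model on a batch of random token IDs and confirm the output shape. Assuming the Encoder and Decoder classes from the previous tutorials, the output should have shape (batch size, dec_seq_length, dec_vocab_size):

from numpy.random import randint

# Dummy batches of token IDs for the encoder and decoder inputs
enc_batch = randint(1, enc_vocab_size, size=(64, enc_seq_length))
dec_batch = randint(1, dec_vocab_size, size=(64, dec_seq_length))

# Run a forward pass through the complete model
output = training_model(enc_batch, dec_batch, training=True)
print(output.shape)  # expected: (64, 5, 20)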
Printing Out a Summary of the Encoder and Decoder Layers
You may also print out a summary of the encoder and decoder blocks of the Transformer model. Printing them out separately will allow you to see the details of their individual sub-layers. In order to do so, add the following line of code to the __init__() method of both the EncoderLayer and DecoderLayer classes:
self.build(input_shape=[None, sequence_length, d_model])
Then you need to add the following method to the EncoderLayer class:
def build_graph(self):
    input_layer = Input(shape=(self.sequence_length, self.d_model))
    return Model(inputs=[input_layer], outputs=self.call(input_layer, None, True))
And the following method to the DecoderLayer class:
def build_graph(self):
    input_layer = Input(shape=(self.sequence_length, self.d_model))
    return Model(inputs=[input_layer], outputs=self.call(input_layer, input_layer, None, None, True))
This results in the EncoderLayer class being modified as follows (the three dots under the call() method mean that its body remains the same as the one implemented earlier):
from tensorflow.keras.layers import Input
from tensorflow.keras import Model

class EncoderLayer(Layer):
    def __init__(self, sequence_length, h, d_k, d_v, d_model, d_ff, rate, **kwargs):
        super(EncoderLayer, self).__init__(**kwargs)
        self.build(input_shape=[None, sequence_length, d_model])
        self.d_model = d_model
        self.sequence_length = sequence_length
        self.multihead_attention = MultiHeadAttention(h, d_k, d_v, d_model)
        self.dropout1 = Dropout(rate)
        self.add_norm1 = AddNormalization()
        self.feed_forward = FeedForward(d_ff, d_model)
        self.dropout2 = Dropout(rate)
        self.add_norm2 = AddNormalization()

    def build_graph(self):
        input_layer = Input(shape=(self.sequence_length, self.d_model))
        return Model(inputs=[input_layer], outputs=self.call(input_layer, None, True))

    def call(self, x, padding_mask, training):
        ...
Similar changes can be made to the DecoderLayer class too.
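For completeness, here is a sketch of what the modified DecoderLayer might look like (the constructor arguments follow the decoder tutorial, the sub-layer attributes are elided, and the body of call() stays as implemented there):

from tensorflow.keras.layers import Input
from tensorflow.keras import Model

class DecoderLayer(Layer):
    def __init__(self, sequence_length, h, d_k, d_v, d_model, d_ff, rate, **kwargs):
        super(DecoderLayer, self).__init__(**kwargs)
        self.build(input_shape=[None, sequence_length, d_model])
        self.d_model = d_model
        self.sequence_length = sequence_length
        # ... the multi-head attention, dropout, add & norm, and feed-forward
        # sub-layers stay exactly as implemented in the decoder tutorial ...

    def build_graph(self):
        input_layer = Input(shape=(self.sequence_length, self.d_model))
        return Model(inputs=[input_layer], outputs=self.call(input_layer, input_layer, None, None, True))

    def call(self, x, encoder_output, lookahead_mask, padding_mask, training):
        ...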
Once you have the necessary changes in place, you can proceed to create instances of the EncoderLayer and DecoderLayer classes and print out their summaries as follows:
from encoder import EncoderLayer
from decoder import DecoderLayer

encoder = EncoderLayer(enc_seq_length, h, d_k, d_v, d_model, d_ff, dropout_rate)
encoder.build_graph().summary()

decoder = DecoderLayer(dec_seq_length, h, d_k, d_v, d_model, d_ff, dropout_rate)
decoder.build_graph().summary()
The resulting summary for the encoder is the following:
Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to
==================================================================================================
 input_1 (InputLayer)           [(None, 5, 512)]     0           []

 multi_head_attention_18 (Multi (None, 5, 512)       131776      ['input_1[0][0]',
 HeadAttention)                                                   'input_1[0][0]',
                                                                  'input_1[0][0]']

 dropout_32 (Dropout)           (None, 5, 512)       0           ['multi_head_attention_18[0][0]']

 add_normalization_30 (AddNorma (None, 5, 512)       1024        ['input_1[0][0]',
 lization)                                                        'dropout_32[0][0]']

 feed_forward_12 (FeedForward)  (None, 5, 512)       2099712     ['add_normalization_30[0][0]']

 dropout_33 (Dropout)           (None, 5, 512)       0           ['feed_forward_12[0][0]']

 add_normalization_31 (AddNorma (None, 5, 512)       1024        ['add_normalization_30[0][0]',
 lization)                                                        'dropout_33[0][0]']

==================================================================================================
Total params: 2,233,536
Trainable params: 2,233,536
Non-trainable params: 0
__________________________________________________________________________________________________
While the resulting summary for the decoder is the following:
Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to
==================================================================================================
 input_2 (InputLayer)           [(None, 5, 512)]     0           []

 multi_head_attention_19 (Multi (None, 5, 512)       131776      ['input_2[0][0]',
 HeadAttention)                                                   'input_2[0][0]',
                                                                  'input_2[0][0]']

 dropout_34 (Dropout)           (None, 5, 512)       0           ['multi_head_attention_19[0][0]']

 add_normalization_32 (AddNorma (None, 5, 512)       1024        ['input_2[0][0]',
 lization)                                                        'dropout_34[0][0]',
                                                                  'add_normalization_32[0][0]',
                                                                  'dropout_35[0][0]']

 multi_head_attention_20 (Multi (None, 5, 512)       131776      ['add_normalization_32[0][0]',
 HeadAttention)                                                   'input_2[0][0]',
                                                                  'input_2[0][0]']

 dropout_35 (Dropout)           (None, 5, 512)       0           ['multi_head_attention_20[0][0]']

 feed_forward_13 (FeedForward)  (None, 5, 512)       2099712     ['add_normalization_32[1][0]']

 dropout_36 (Dropout)           (None, 5, 512)       0           ['feed_forward_13[0][0]']

 add_normalization_34 (AddNorma (None, 5, 512)       1024        ['add_normalization_32[1][0]',
 lization)                                                        'dropout_36[0][0]']

==================================================================================================
Total params: 2,365,312
Trainable params: 2,365,312
Non-trainable params: 0
__________________________________________________________________________________________________
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Papers
- Attention Is All You Need, 2017
Summary
In this tutorial, you discovered how to implement the complete Transformer model and create padding and look-ahead masks.
Specifically, you learned:
- How to create a padding mask for the encoder and decoder
- How to create a look-ahead mask for the decoder
- How to join the Transformer encoder and decoder into a single model
- How to print out a summary of the encoder and decoder layers
Do you have any questions?
Ask your questions in the comments below, and I will do my best to answer.