LLaMA in R with Keras and TensorFlow


OpenAI's chatGPT has awakened a collective awareness of what Large
Language Models (LLMs) are capable of. With that awakening comes a daily
march of LLM news: new products, new features, new models, new
capabilities, (and new worries). It seems we're in the early stages of a
Cambrian explosion of LLMs and LLM-powered tools; it's not yet clear how
LLMs will impact and influence our professional and personal lives, but
it seems clear that they will, in some way.

Since LLMs are here to stay, it's worthwhile to take some time to
understand how these models work from a first-principles perspective.
Starting with the mechanics can help foster durable intuitions that will
inform our usage of these models now and in the future. (Especially if
the future is one where LLMs are a staple of the data scientist's
toolbox, as common as an lm() function call).

And what better way is there to learn than by doing. So with that
preamble, in this post we'll walk through an implementation of an LLM,
LLaMA (Touvron et al. 2023)
specifically, in TensorFlow and Keras, with the goal being to develop
understanding first, capability second.

Why LLaMA? With the sheer volume of LLM-related content and news out
there, it can seem daunting to know where to get started. Almost weekly
it seems there's a new model announced. Browsing some hubs of LLM
activity (HuggingFace,
TFHub,
reddit,
HackerNews) muddies the waters even
more. How to pick a specific model?

Of the many LLM-related news items in the past months, one that stands
head-and-shoulders above the crowd is the release of
LLaMA
,
a modern, foundational LLM made available to the public by Meta AI in
February 2023. On common benchmarks, LLaMA outperforms OpenAI's GPT-3,
while being substantially smaller (though still large).

LLaMA is a great starting place because it is a simple and modern
architecture, has excellent performance on benchmarks, and is open. The
model architecture has had just a few new ideas incorporated into it since
the original Transformer architecture first described in
Attention Is All You Need
published from Google (Vaswani et al. 2017). Four different sizes of
LLaMA were released: 7 billion and 13 billion parameter models
trained on 1 trillion tokens, and 33 billion and 65 billion parameter
models trained on 1.4 trillion tokens. This is an enormous amount of
training data these models have seen: the largest 65B model has been
trained on approximately the "Chinchilla
compute-optimum"
(Hoffmann et al. 2022)
number of tokens, while the smaller LLaMAs are substantially
beyond that optimum. In this blog post we'll focus on the smallest, 7B
parameter LLaMA model, which you can comfortably load locally and run on
CPU with only 64GB of RAM.

While not strictly necessary, to follow along locally, you'll probably
want to acquire the pre-trained LLaMA weights one
way
or
another. Note, the
weights do come with their own license, which you can preview
here.

So, without further ado, let's get started.

Setup

First, we'll want to install the required R and Python packages, and
configure a virtual environment:

remotes::install_github(c("rstudio/reticulate",
                          "rstudio/tensorflow",
                          "rstudio/keras"))
reticulate::virtualenv_create("./.venv", version = "3.10")
tensorflow::install_tensorflow(envname = "./.venv", version = "release")

With that out of the way, let's load some packages and prepare our R
session:

library(purrr)
library(envir)

library(tensorflow)
library(tfautograph)
library(keras)

use_virtualenv("./.venv")
options(tensorflow.extract.warn_tensors_passed_asis = FALSE)

attach_eval({
  import_from(glue, glue)
  import_from(jsonlite, read_json)
  import_from(withr, with_dir, with_options)
  import_from(keras$layers, Dense)
  np <- reticulate::import("numpy", convert = FALSE)

  seq_len0 <- function(x) seq.int(from = 0L, length.out = x)
})

If you've acquired the pre-trained weights, it'll be convenient to
convert them from the torch checkpoint format to something that's more
framework agnostic (you only need to do this once, of course):

# reticulate::py_install("torch", pip = TRUE)
torch <- reticulate::import("torch", convert = FALSE)
with_dir("~/github/facebookresearch/llama/weights/LLaMA/7B", {
  pretrained_weights <- torch$load("consolidated.00.pth",
                                   map_location = "cpu")
  for (name in names(pretrained_weights)) {
    filename <- sprintf("%s.npy", name)
    array <- pretrained_weights[[name]]$numpy()
    np$save(filename, array)
    message(glue(
      "wrote: '{basename(filename)}' with shape: {array$shape}"))
  }
})

We'll also define a helper function so we can avoid having to retype the
full path to our weights:

weights_path <- function(filename) normalizePath(file.path(
  "~/github/facebookresearch/llama/weights/LLaMA/",
  glue(filename, .envir = parent.frame())), mustWork = TRUE)

And load the model configuration parameters specific to the 7B LLaMA,
which we'll use to build the model.

params <- read_json(weights_path("7B/params.json"))
str(params)
List of 6
 $ dim        : int 4096
 $ multiple_of: int 256
 $ n_heads    : int 32
 $ n_layers   : int 32
 $ norm_eps   : num 1e-06
 $ vocab_size : int -1

Tokenizer

The first component of LLaMA is the tokenizer, which converts text to a
sequence of integers. The LLaMA model uses the
SentencePiece tokenizer from
Google. SentencePiece is available as a TensorFlow graph operation
through
tf_text.SentencepieceTokenizer,
and also as a Keras layer in
keras_nlp.tokenizers.SentencepieceTokenizer.
By choice of a coin flip, we'll use the lower-level tf_text interface.

tf_text <- reticulate::import("tensorflow_text")
tokenizer_path <- weights_path("tokenizer.model")
tokenizer <- tf_text$SentencepieceTokenizer(
  tf$io$gfile$GFile(tokenizer_path, "rb")$read(),
  add_bos = TRUE, add_eos = FALSE,
)

Let's try it out with a prompt:

prompt <- "The best way to attract bees"
tokenizer$tokenize(prompt)
tf.Tensor([    1   450  1900   982   304 13978   367   267], shape=(8), dtype=int32)
prompt |> tokenizer$tokenize() |> tokenizer$detokenize()
tf.Tensor(b'The best way to attract bees', shape=(), dtype=string)

Let's define a show_tokens() helper function and play with the
tokenizer a little.

show_tokens <- function(what) {
  if(is.character(what))
    token_ids <- what |> tokenizer$tokenize() |> as.integer()
  else
    token_ids <- as.integer(what)

  tokens <- token_ids |>
    map_chr(\(id) id |> as_tensor(shape = c(1)) |>
              tokenizer$detokenize() |> as.character())

  names(tokens) <- token_ids
  tokens
}


show_tokens(prompt)
        1       450      1900       982       304     13978       367       267
       ""     "The"    "best"     "way"      "to" "attract"      "be"      "es"

Note that "bees" is two tokens. Not every token corresponds to a word.
For example, one non-word token we can reliably expect to show up in a
tokenizer trained on a corpus of English text is "ing." However, when the
"ing" token shows up will not always follow your intuitions, because
common words get their own token id, even if they can be decomposed into
multiple tokens.

show_tokens("ing")
    1  2348
   "" "ing"
show_tokens("working")
        1      1985
       "" "working"
show_tokens("flexing")
     1   8525    292
    "" "flex"  "ing"
show_tokens("wonking")
     1   2113   9292
    ""  "won" "king"

Another thing to note about the tokenizer is that each token sequence
starts with token id 1. This is a special beginning-of-sequence
token that we requested be added when we loaded the tokenizer with
add_bos = TRUE. There are two other such special tokens that we'll
encounter later: an end-of-sequence special token with id 2, and an
unknown-token with id 0.

as.character(tokenizer$id_to_string(0L))
[1] "<unk>"
as.character(tokenizer$id_to_string(1L))
[1] "<s>"
as.character(tokenizer$id_to_string(2L))
[1] "</s>"
show_tokens(c(1, 0, 2))
    1     0     2
   "" " ⁇ "    ""

Overall, there are 32,000 tokens.

as.integer(tokenizer$vocab_size())
[1] 32000

One last observation is that the more frequently encountered tokens are
assigned lower ids.

show_tokens(seq(50, len = 10))
 50  51  52  53  54  55  56  57  58  59
"/" "0" "1" "2" "3" "4" "5" "6" "7" "8"
show_tokens(seq(100, len = 10))
100 101 102 103 104 105 106 107 108 109
"a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
show_tokens(seq(1000, len = 10))
   1000    1001    1002    1003    1004    1005    1006    1007    1008    1009
  "ied"    "ER"  "stat"   "fig"    "me"   "von" "inter"  "roid"  "ater" "their"
show_tokens(seq(10000, len = 10))
   10000    10001    10002    10003    10004    10005    10006    10007
   "ång"  "citep"    "Ill"   "rank" "sender"   "beim"    "рак" "compat"
   10008    10009
"happens"  "diese"
show_tokens(seq(20000, len = 10))
    20000     20001     20002     20003     20004     20005     20006     20007
  "admit" "Comment"     "стя"    "Vien"      "ці"  "permut"     "cgi"    "crít"
    20008     20009
"Console"    "ctic"
show_tokens(seq(to = as.integer(tokenizer$vocab_size()) - 1, len = 10))
31990 31991 31992 31993 31994 31995 31996 31997 31998 31999
  "ὀ"  "げ"  "べ"  "边"  "还"  "黃"  "왕"  "收"  "弘"  "给"

Moving on, the next step after tokenization is embedding. An embedding
layer is effectively a dictionary lookup that converts an integer (token
id) to a 1-d float array. For this we can use the standard keras
Embedding layer.

tok_embeddings <- keras$layers$Embedding(
  input_dim = tokenizer$vocab_size(),
  output_dim = params$dim,
  embeddings_initializer =
    \(...) np$load(weights_path("7B/tok_embeddings.weight.npy"))
)

tok_embeddings(3L) |> str()
<tf.Tensor: shape=(4096), dtype=float32, numpy=…>
prompt |> # "The best way to attract bees"
  tokenizer$tokenize() |>
  tok_embeddings() |>
  str()
<tf.Tensor: shape=(8, 4096), dtype=float32, numpy=…>
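
To see the "dictionary lookup" framing concretely, the embedding layer's only
weight is a (vocab_size, dim) matrix, and a lookup is just row indexing into
that matrix. A quick sketch (keeping in mind that token ids are 0-based while
R indexing is 1-based):

embedding_matrix <- tok_embeddings$get_weights()[[1]]   # a (32000, 4096) R array
# token id 3 should match row 4 of the matrix (0-based ids vs. 1-based R indexing)
all(as.array(tok_embeddings(3L)) == embedding_matrix[3 + 1, ])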

TransformerBlock

Once it's tokenized and embedded, the input then passes through the bulk
of the model, a sequence of repeating TransformerBlock layers. The 7B
model has 32 of these TransformerBlock layers, while the 65B model has
80 of them.

weights_path("7B/params.json")  |> read_json() |> _$n_layers
[1] 32
weights_path("65B/params.json") |> read_json() |> _$n_layers
[1] 80

Here is what the transformer block looks like:

TransformerBlock(keras$layers$Layer) %py_class% {
  initialize <- function(attn_head_size, attn_n_heads,
                         norm_eps = k_epsilon(), ...,
                         block_id = NULL) {
    super$initialize(...)

    self$attention <- Attention(attn_head_size, attn_n_heads,
                                block_id = block_id)

    self$feed_forward <- FeedForward(
      hidden_dim = 4 * attn_head_size * attn_n_heads,
      block_id = block_id)

    self$attention_norm <- RMSNorm(eps = norm_eps,
                                   block_id = block_id,
                                   feeds_into = "attention")
    self$feed_forward_norm <- RMSNorm(eps = norm_eps,
                                      block_id = block_id,
                                      feeds_into = "ffn")
  }

  call <- function(x) {

    # normalize x, then pass it through attention
    x2 <- x |>
      self$attention_norm() |>
      self$attention()

    # add residual
    x <- x + x2

    # normalize again, then pass through the feed-forward network
    x2 <- x |>
      self$feed_forward_norm() |>
      self$feed_forward()

    # add residual again
    x <- x + x2

    x
  }
}

While there is not a lot of code, there are a lot of ideas packed in
there. This block forms the main trunk of the model, so it's worth
taking the time to go through it slowly.

We implement the TransformerBlock as a subclassed
keras.layers.Layer. This gives us some niceties like the ability to
compose with other Keras layers, but these are mostly irrelevant to the
purpose of this blog post; we could just as easily implement this as,
for example, a vanilla R6 class. Our TransformerBlock class has two
methods: initialize, called when we first create the block, and
call, called when we run the forward pass of the block.

In initialize, we create 4 layers: an Attention layer, a
FeedForward layer, and 2 RMSNorm layers. We'll take a close look at
each of these soon, but even before we do so, we can see how they fit
together by looking at the TransformerBlock$call() method.

The call method has a few simple ideas. In no particular order, the
first one to observe is the composition pattern of adding residuals.

x2 <- x |> ...
x <- x + x2 # add residual x to x2

This is a common pattern that helps with model training, and especially
to help with the vanishing gradient
problem
. It's
a skip-connection in the otherwise linear sequence of matrix
transformations. It reinjects information (during the forward pass), and
gradients (during back propagation), back into the trunk. You can think
of these residual connections as freeing the learnable layers in-between
(the ... in the pseudo code) from the burden of having to
"pass-through" or "preserve" information in x, allowing the weights to
instead focus on learning transformations that are, (in corporatese
vernacular), value-adding.

The next composition pattern to note is the repeating usage of a
normalization layer:

x2 <- x |> norm() |> ...
x <- x + x2

There are many kinds of normalization layers, but to slightly
over-generalize, they can all be thought of as a stabilizer that helps
with training. Like their deep-learning cousins the regularizers, their
main function is to keep values passing through in a sensible range, in
the ball park of (-1, 1), typically. We'll take a closer look at
RMSNorm soon.

Stripped of two tricks that are mostly there to help the model train,
residuals and normalization, the core of the TransformerBlock is just
this:

x |> attention() |> feed_forward()

In a moment we'll see that feed_forward is a slightly fancier
variation of a conventional sequence of Dense layers. Before we get
there we can safely skip ahead to distill the following intuition: a
TransformerBlock is basically an Attention layer followed by a few
(fancy) dense layers, with some simple composition patterns (tricks)
that help with training. Attention is the heart of the model: it's the
most interesting, and also the most involved.

With that framing in place, let's go through and take a closer look at
RMSNorm and FeedForward, and then with the foundation in place, we'll
turn our attention to Attention.

RMSNorm

RMSNorm(keras$layers$Layer) %py_class% {
  initialize <-
    function(eps = 1e-6, ..., block_id = NULL, feeds_into = NULL) {
      super$initialize(...)
      self$eps <- eps
      self$block_id <- block_id
      self$feeds_into <- feeds_into
    }

  build <- function(input_shape) {
    # input_shape == (batch_size, seqlen, params$dim)
    # self$w will broadcast over batch_size and seqlen dims.
    # w_shape == (1, 1, params$dim)
    w_shape <- rep(1L, length(input_shape))
    w_shape[length(input_shape)] <- as.integer(input_shape) |> tail(1L)

    # define a local function that will load
    # the pretrained weights if we supplied `block_id` and `feeds_into`
    import_from({self}, block_id, feeds_into)
    initializer <- if (is.null(block_id))
      "ones"
      else if (block_id >= 0) {
        \(...) weights_path("7B/layers.{block_id}.{feeds_into}_norm.weight.npy") |>
               np$load() |> np$expand_dims(0:1)
      } else if (block_id == -1)
        # load weights for the final output normalization layer, which is not
        # part of a TransformerBlock
        \(...) weights_path("7B/norm.weight.npy") |>
               np$load() |> np$expand_dims(0:1)

    self$w <- self$add_weight(shape = w_shape,
                              initializer = initializer,
                              trainable = TRUE)
  }

  rrms <- function(x) {
    # reciprocal root mean square along the last axis
    x %>% # (batch_size, seqlen, n_features)
      tf$math$square() %>%
      tf$reduce_mean(axis = -1L, keepdims = TRUE) %>% # (batch_size, seqlen, 1)
      tf$math$add(self$eps) %>% # for numerical stability
      tf$math$rsqrt()
  }

  call <- function(x) {
    x * self$rrms(x) * self$w
  }
}

RMSNorm() has a single trainable tensor w. In the forward pass, each
value in the input is multiplied by the reciprocal-root-mean-square of
all the values in the feature axis and by w. Certainly a mouthful, but
just a simple sequence of arithmetic transformations in the end,
designed for the express purpose of adjusting the range of values
passing through.

Let’s kick the tires on it:

norm <- RMSNorm()
m <- matrix(c(0, 1,
              2, 3), nrow = 2)
norm(m)
tf.Tensor(
[[0.         1.4142132 ]
 [0.44721353 1.3416406 ]], shape=(2, 2), dtype=float32)
tf.Tensor(
[[0.         1.4142137 ]
 [0.44721362 1.3416408 ]], shape=(2, 2), dtype=float32)
tf.Tensor(
[[0.        1.4142137]
 [0.4472136 1.3416408]], shape=(2, 2), dtype=float32)
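
To make the arithmetic concrete, here is the first result above reproduced by
hand in plain R (ignoring the tiny eps term, and with w initialized to ones as
it is here); this is just a sanity check on the definition, not part of the
model:

rrms_by_hand <- function(row) 1 / sqrt(mean(row^2))
t(apply(m, 1, \(row) row * rrms_by_hand(row)))
          [,1]     [,2]
[1,] 0.0000000 1.414214
[2,] 0.4472136 1.341641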

FeedForward

Next up is FeedForward()

FeedForward(keras$layers$Layer) %py_class% {

  initialize <- function(hidden_dim, multiple_of = 256L,
                         ..., block_id = NULL) {
    super$initialize()

    if(!is.null(multiple_of)) {
      hidden_dim <- hidden_dim %>%
        { as.integer( . * (2/3)) } %>%
        { (. + multiple_of - 1) %/% multiple_of } %>%
        { . * multiple_of }
    }

    self$hidden_dim <- hidden_dim
    self$block_id <- block_id
  }

  build <- function(input_shape) {
    output_dim <- input_shape |> as.integer() |> tail(1)

    if(is.null(self$block_id))
      load_weight <- \(...) NULL
    else
      load_weight <- \(name) \(...) np$load(weights_path(
        "7B/layers.{self$block_id}.feed_forward.{name}.weight.npy"))$`T`

    self$w1 <- Dense(self$hidden_dim, use_bias = FALSE,
                     kernel_initializer = load_weight("w1"))
    self$w2 <- Dense(output_dim, use_bias = FALSE,
                     kernel_initializer = load_weight("w2"))
    self$w3 <- Dense(self$hidden_dim, use_bias = FALSE,
                     kernel_initializer = load_weight("w3"))

    super$build(input_shape)
  }

  call <- function(x) {
    import_from({self}, w1, w2, w3)
    import_from(tf$nn, silu)

    x %>%
      { silu(w1(.)) * w3(.) } %>% # SwiGLU
      w2()
  }

}

FeedForward consists of three Dense layers. initialize does some
simple arithmetic, munging on the input value hidden_dim to ensure the
size is a performant multiple of 256, and build is mostly boilerplate
for creating the layers and loading the weights.
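
For the 7B model, that arithmetic works out as follows (a quick check that
mirrors initialize(), starting from the hidden_dim = 4 * attn_head_size *
attn_n_heads value passed in by TransformerBlock):

hidden_dim <- 4L * params$dim                              # 16384, since head_size * n_heads == dim
hidden_dim <- as.integer(hidden_dim * (2/3))               # 10922
hidden_dim <- ((hidden_dim + 256L - 1L) %/% 256L) * 256L   # round up to a multiple of 256
hidden_dim
[1] 11008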

The novelty of FeedForward() is in the call() method, where rather
than composing the Dense layers in a conventional sequential model
with, say, ReLU activations in between and maybe some dropout, the
layers are composed to form a "SwiGLU" unit. The publication by Shazeer (2020)
of SwiGLU and other variations on GLU is an exemplar of the kinds
of explorations and improvements around the Transformer architecture
since its initial publication in
2017; a steady accretion of
improvements that has brought us to today. The FeedForward$call() is
just a single SwiGLU followed by a linear projection. In its essence,
it's a clever composition of three (learned) linear projections, an
element-wise multiplication, and a silu() activation
function.
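
For reference, silu(x) is simply x * sigmoid(x). Here is a minimal sketch of
the SwiGLU composition in plain R, with tiny made-up matrices standing in for
the learned w1, w2, and w3 projections (purely illustrative, not the model's
weights):

silu <- function(x) x * (1 / (1 + exp(-x)))   # x * sigmoid(x)

x  <- c(0.5, -1.2, 0.3)                 # 3 input features
W1 <- matrix(rnorm(4 * 3), nrow = 4)    # stands in for self$w1 (3 -> 4 hidden)
W3 <- matrix(rnorm(4 * 3), nrow = 4)    # stands in for self$w3 (3 -> 4 hidden)
W2 <- matrix(rnorm(3 * 4), nrow = 3)    # stands in for self$w2 (4 -> 3 output)

hidden <- silu(W1 %*% x) * (W3 %*% x)   # SwiGLU: gated, element-wise product
W2 %*% hidden                           # the final linear projection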

Perhaps the most surprising observation to make here is the relative
dearth of activation functions, or even non-linearities, not just in
FeedForward, but overall. The silu() in this feedforward, the
reciprocal-root-mean-square in RMSNorm(), and a softmax() in
Attention() are the only non-linear transformations in the whole
sequence of TransformerBlocks. Everything else is a linear
transformation!

Attention

Finally, let's turn our attention to Attention().

Attention(keras$layers$Layer) %py_class% {
  initialize <- function(head_size, n_heads,
                         ..., block_id = NULL) {
    super$initialize(...)

    self$head_size <- head_size
    self$n_heads <- n_heads

    if (is.null(block_id))
      load_weight <- function(name) NULL
    else
      load_weight <- \(name) \(...) np$load(weights_path(
        "7B/layers.{block_id}.attention.{name}.weight.npy"))$`T`

    Dense <- function(name) keras$layers$Dense(
      units = n_heads * head_size,
      use_bias = FALSE,
      kernel_initializer = load_weight(name)
    )

    self$wq <- Dense("wq")
    self$wk <- Dense("wk")
    self$wv <- Dense("wv")
    self$wo <- Dense("wo")
  }

  call <- function(x) {
    c(batch_size, seqlen, n_features) %<-% tf$unstack(tf$shape(x))

    # 1. project (linear transform) x into
    #    query, key, and value tensors
    # 2. reshape q k v, splitting out the last dim (n_features)
    #    into n_heads independent subspaces,
    #    each with size head_size.
    #    (n_features == head_size * n_heads)
    split_heads_shape <- c(batch_size, seqlen,
                           self$n_heads, self$head_size)
    q <- x |> self$wq() |> tf$reshape(split_heads_shape)
    k <- x |> self$wk() |> tf$reshape(split_heads_shape)
    v <- x |> self$wv() |> tf$reshape(split_heads_shape)

    # embed positional information in query and key
    # (bsz, seqlen, n_heads, head_size)
    q %<>% apply_rotary_embedding()
    k %<>% apply_rotary_embedding()

    # reshape:
    #   move heads out of the last 2 axes,
    #   so later matmuls are performed across the subspaces (heads)
    #   between (seqlen, head_size) axes
    v <- tf$transpose(v, c(0L, 2L, 1L, 3L)) # (bsz, n_heads, seqlen, head_size)
    q <- tf$transpose(q, c(0L, 2L, 1L, 3L)) # (bsz, n_heads, seqlen, head_size)
    k <- tf$transpose(k, c(0L, 2L, 3L, 1L)) # (bsz, n_heads, head_size, seqlen)

    # calculate and normalize attention scores
    scores <- q %*% k                       # (bsz, n_heads, seqlen, seqlen)
    scores <- scores / sqrt(self$head_size) # scale

    # apply causal mask, so the model can't "look ahead" during training
    mask <- make_mask(seqlen, dtype = scores$dtype)
    scores %<>% { . + mask }

    scores <- tf$nn$softmax(scores, axis = -1L)

    # adjust values tensor with attention scores
                      # scores (bsz, n_heads, seqlen, seqlen)
                      # v      (bsz, n_heads, seqlen, head_size)
    output <- scores %*% v   # (bsz, n_heads, seqlen, head_size)

    # combine heads back into a single features dim,
    # so Attention output_shape==input_shape
    output <- output |>
      tf$transpose(c(0L, 2L, 1L, 3L)) |> # (bsz, seqlen, n_heads, head_size)
      tf$reshape(tf$shape(x))            # (bsz, seqlen, n_heads * head_size)

    # one more trainable linear projection for good luck
    output <- self$wo(output) # (bsz, seqlen, n_heads * head_size)

    output
  }
}

Attention in LLaMA is similar but not identical to the Attention
described in the original Transformers
paper
(and available as a keras
builtin under keras$layers$MultiHeadAttention()). The core novelty is
the addition of the apply_rotary_embedding() function, which we'll
describe shortly. The additional novelty is balanced by the simplicity
from the fact that the layer is performing self-attention: we don't need
to pass in different query, key, and value tensors (or reason about what
that means), since the same input serves all three roles. Note that the
conventional MultiHeadAttention() layer is covered quite thoroughly in
the 2nd Edition of Deep Learning with R,
including a full implementation of attention in base R.

To develop an understanding of the mechanics in a layer like this, it's
helpful to temporarily unsee some of the minutiae that can act as a fog
obscuring the essence of the operation. In this instance, if we
temporarily strip out the transpose()s and reshape()s (as clever and
vital as they are), this is what's left:

call <- function(x) {
  # project x into query, key, and value tensors
  q <- x |> self$wq()
  k <- x |> self$wk()
  v <- x |> self$wv()

  # inject positional information into q and k
  q %<>% apply_rotary_embedding()
  k %<>% apply_rotary_embedding()

  # score every token pair, then normalize the scores into weights
  scores <- tf$nn$softmax(q %*% k / sqrt(self$head_size))

  # use the attention weights to adjust v
  output <- scores %*% v

  # one more trainable linear projection for good luck
  self$wo(output)
}

Returning to the transpose()s and reshape()s, you can observe that
their purpose is to make it so that the attention calculations are
performed across n_heads independent subspaces, rather than in a
single larger space. The same reasoning drives this decision as that
driving the usage of depthwise-separable convolutions in image models.
Empirically, for a fixed compute budget, factoring features into
independent subspaces performs better than doing the same core
operations in a single larger feature space. As with all things, there is
a balance to strike between n_heads (the number of subspaces) and
head_dim (the size of each subspace). The LLaMA authors have struck
the balance like this at the various model sizes:

lapply(c("7B", "13B", "30B", "65B"), \(size) {
  p <- read_json(weights_path("{size}/params.json"))
  with(p, list(llama_size = size,
               n_heads = n_heads,
               head_dim = dim %/% n_heads))
}) |> dplyr::bind_rows()
# A tibble: 4 × 3
  llama_size n_heads head_dim
  <chr>        <int>    <int>
1 7B              32      128
2 13B             40      128
3 30B             52      128
4 65B             64      128

Next let's turn our attention to the causal attention mask.

make_mask <- function(seqlen, dtype = k_floatx()) {
  x <- tf$range(seqlen)
  mask <- tf$where(x[, tf$newaxis] < x[tf$newaxis, ],
                   tf$constant(-Inf, dtype = dtype),
                   tf$constant(0, dtype = dtype))

  # broadcast over batch and heads dim
  mask[tf$newaxis, tf$newaxis, , ] # (1, 1, seqlen, seqlen)
}

The mask is a strictly upper triangular matrix filled with -Inf
values. Adding the mask to the attention scores prevents the model from
being able to "look ahead" and see the attention score for a token
pairing it hasn't seen yet at a particular position in the sequence.
This need for a mask is best thought of as a vestige from training,
an apparatus that the model needed to learn with and now can't function without.
During training, gradients are calculated for predictions from all
token positions in a sequence, including predictions of tokens where the correct
answer is right there, as the very next token in the same sequence. The mask
prevents the model from being able to cheat and look ahead into the future,
something it won't be able to do once we're running it for inference.

make_mask(seqlen = 5L)
tf.Tensor(
[[[[  0. -inf -inf -inf -inf]
   [  0.   0. -inf -inf -inf]
   [  0.   0.   0. -inf -inf]
   [  0.   0.   0.   0. -inf]
   [  0.   0.   0.   0.   0.]]]], shape=(1, 1, 5, 5), dtype=float32)
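
After the softmax, those -Inf entries become exact zeros, so each position can
only attend to itself and earlier positions. A quick sketch with a single
made-up row of scores:

scores <- as_tensor(rbind(c(1.2, 0.5, 2.0)), dtype = "float32")
mask   <- as_tensor(rbind(c(0, -Inf, -Inf)), dtype = "float32")
tf$nn$softmax(scores + mask)  # all the attention weight goes to the first (unmasked) position
tf.Tensor([[1. 0. 0.]], shape=(1, 3), dtype=float32)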

Rotary Position Embedding

Next let's turn our attention to apply_rotary_embedding(). This core
innovation was published by Su et al. (2022) in the paper titled
"RoFormer: Enhanced Transformer with Rotary Position Embedding".

Some context:

  • The bare Attention() mechanism doesn't leave any possibility for a
    token's position in a sequence to affect the attention scores, since
    only token-pairs are scored. Attention treats its input like a
    bag-of-tokens.

  • The position of a token in a sequence is clearly important, and the
    attention layer should have access to that information.

  • The absolute position of a token in a sequence is less important
    than the relative position between tokens. (Especially so for long
    sequences).

Which leads us into the complex plane. If we imagine the features as
complex numbers, we can rotate them, and we can calculate angles between
them. From the RoFormer paper:

Specifically, incorporating the relative position embedding is
straightforward: simply rotate the affine-transformed word embedding
vector by amount of angle multiples of its position index and thus
interprets the intuition behind Rotary Position Embedding

Expanding slightly: the rotation matrix is designed so that
subsequently, after rotating our q and k token sequence embeddings
the same way, the angle between token features is a function of the
relative distance between those tokens in the token sequence. The
relative angle between two tokens is invariant to the absolute
position of those tokens in the full sequence.
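
Here is a quick numerical illustration of that invariance in plain R, treating
a single feature pair of q and k as complex numbers (made-up values; in the
actual Attention, the real-valued dot product corresponds to the real part of
this complex product):

theta <- 0.1                                   # rotation rate for one feature pair
rotation <- function(pos) complex(modulus = 1, argument = pos * theta)

q <- complex(real = 1.0, imaginary = 0.5)      # a made-up query feature
k <- complex(real = -0.3, imaginary = 0.8)     # a made-up key feature

# score between a query at position m and a key at position n
score <- function(m, n) (q * rotation(m)) * Conj(k * rotation(n))

Arg(score(5, 2))      # the angle depends only on m - n ...
Arg(score(105, 102))  # ... so shifting both positions by 100 changes nothing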

In short, the rotation injects positional information. The meaning or
interpretability of that positional information, or how it is meant to
be used, or even extracted from the result of q %*% k, is left to the
model to learn.

Here is the code:

apply_rotary_embedding <- function(x) {
  c(., seqlen, ., head_size) %<-%
    tf$unstack(tf$shape(x))

  rotation_matrix <- compute_rotation_matrix(seqlen, head_size)

  x %>%
    view_as_complex() %>%
    { . * rotation_matrix } %>%
    view_as_real()

}

compute_rotation_matrix <-
  function(seqlen, feature_dim, theta = 10000) {
    # `feature_dim` here is going to be attention$head_size
    # `seqlen` is going to match the token sequence length.

    t <- tf$range(seqlen, dtype = tf$float32)
    freqs <- tf$range(start = 0, limit = 1, delta = 1 / (feature_dim %/% 2),
                      dtype = tf$float32)
    tf_assert(tf$size(freqs) == feature_dim %/% 2)
    freqs <- 1.0 / (theta ^ freqs)

    # outer product; (seqlen, head_size/2)
    freqs <- tf$einsum('a,b->ab', t, freqs)

    rot_mat <- tf$complex(tf$cos(freqs), tf$sin(freqs))

    # the positional embedding will be broadcast across batch and heads dim
    rot_mat[tf$newaxis, , tf$newaxis, ] #(1, seqlen, 1, headdim/2)
  }

view_as_complex <- function(x) {
  tf$complex(x[all_dims(), `::2`],
             x[all_dims(), `2::2`])
}

view_as_real <- function(x) {
  # xs = (..., f);  xs2 = (..., f*2)
  xs <- tf$shape(x)
  xs2 <- tf$concat(list(xs[1:(length(xs)-1)],
                        xs[length(xs), drop = FALSE] * 2L),
                   axis = 0L)

  x2 <- tf$stack(list(Re(x), Im(x)), axis = -1L)

  # (..., f, 2) -> (..., f*2)
  tf$reshape(x2, xs2)
}

As you can see, to imagine the embedding features as existing in the
complex plane, we simply treat adjacent pairs of floats in the
underlying array as the real and imaginary part of a complex number. We
rotate the embeddings in the complex plane, then go back to imagining
the features as existing in the real plane. Again, the job of
interpreting the meaning of the features after rotation is left to the
model to learn.
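
For instance, a single "token" with 4 features becomes 2 complex numbers (a
small sketch using the helper defined above):

x <- as_tensor(rbind(c(1, 2, 3, 4)), dtype = "float32")  # shape (1, 4)
view_as_complex(x)
tf.Tensor([[1.+2.j 3.+4.j]], shape=(1, 2), dtype=complex64)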

We can quickly confirm that the rotary embeddings only rotate features
and don't scale them:

near <- function(x, y, tol = 1e-6) abs(x - y) < tol
all(near(1, Mod(compute_rotation_matrix(2048L, 128L))))
tf.Tensor(True, shape=(), dtype=bool)

There is one more trick to observe before moving on: because of some of
the mathematical properties of the rotation matrix, it's possible to
avoid doing a full complex multiply operation and still arrive at the
same result. Also, since the rotation matrix never changes, it makes
sense to only compute it once and cache it, like so:

precomputed_rotation_matrix <- compute_rotation_matrix(
  seqlen = 2048L, # LLaMA max seqlen
  feature_dim = with(params, dim %/% n_heads)  # head_size
)

apply_rotary_embedding_faster <- function(x) {

  rotate_every_two <- function(x) {
    x1 <- x[all_dims(), `::2`]
    x2 <- x[all_dims(), `2::2`]
    x_ <- tf$stack(list(-x2, x1), axis = -1L)
    tf$reshape(x_, tf$shape(x))
  }

  repeat_each_twice <- function(x) {
    tf$`repeat`(x, 2L, axis = -1L)
  }

  seqlen <- tf$shape(x)[2]
  rot <- precomputed_rotation_matrix[, NA:seqlen, , ]

  cos <- Re(rot) |> repeat_each_twice()
  sin <- Im(rot) |> repeat_each_twice()

  (x * cos) + (rotate_every_two(x) * sin)
}
rand <- tf$random$uniform(shape(3, 8, params$n_heads, 128))
all(apply_rotary_embedding(rand) ==
    apply_rotary_embedding_faster(rand))
tf.Tensor(True, shape=(), dtype=bool)
apply_rotary_embedding <- apply_rotary_embedding_faster

Finally, note that the rotary positional embeddings are applied within
each Attention layer. This is different from the original Transformer
implementation, where a positional embedding was only added once at the
head of the model. Similar to residual connections, you can think of the
presence of these repeated injections of positional information as
relieving the remaining trainable layers from the burden of allocating
some of their weights to the task of "passing through" or "preserving"
the positional information for later layers.

Positional embeddings are a rich topic that also comes up in other
deep learning architectures, like denoising diffusion (Falbel and Keydana 2023),
so time spent understanding them better is time well
spent. For the purposes of this blog post we've covered the points
needed and we'll move on to tying all the pieces together. To go deeper and
develop a more mathematically informed understanding of RoPE, two excellent
starting points are:

  1. The original paper by Su et al. (2022)

  2. This blog post by
    Biderman et al. (2021)

Tying it all together

With Tokenizer, Embedding, TransformerBlock (RMSNorm,
Attention, FeedForward and apply_rotary_embedding) all covered,
it's time to tie all the pieces together into a Transformer model. We
could do this using %py_class% like with the other layers above, but
it's just as easy to move over to using the Keras functional API at this
point.

layer_transformer_block <- create_layer_wrapper(TransformerBlock)
layer_rms_norm <- create_layer_wrapper(RMSNorm)

# input to the model will be the output from the tokenizer
input <- layer_input(shape(NA)) #, dtype = "int32")

x <- input |>
  tok_embeddings()  # instantiated earlier in the blog post

for(block_id in seq_len0(params$n_layers)) {
  x <- x |>
    layer_transformer_block(attn_head_size = params$dim %/% params$n_heads,
                            attn_n_heads = params$n_heads,
                            norm_eps = params$norm_eps,
                            block_id = block_id)
}

# final output projection into logits of output tokens
x <- x |>
  layer_rms_norm(block_id = -1, eps = params$norm_eps) |>
  layer_dense(
    tokenizer$vocab_size(), use_bias = FALSE,
    kernel_initializer = \(...) np$load(weights_path("7B/output.weight.npy"))$`T`
  )

# slice out the logits for the last token
with_options(c(tensorflow.extract.warn_negatives_pythonic = FALSE), {
  output <- x[, -1, ]
})

llama <- keras_model(input, output) %>%
  compile(jit_compile = TRUE)

The input to the model is tokenized text and the output is the
(unnormalized) probabilities for each token in tokenizer$vocab_size()
being the next token in the sequence.

next_token_probs <- prompt %>%
  tokenizer$tokenize() %>%
  llama()

next_token_probs
tf.Tensor(
[[-2.4503722e+00 -3.4463339e+00  1.3200411e+01 ...  4.8804146e-01
  -1.3277926e+00  9.9985600e-03]], shape=(1, 32000), dtype=float32)

Sampling strategies for selecting a token from the token logits is a
rich topic, (also covered thoroughly in the Deep Learning with
R
book), but this blog post is long enough
already. So for now, let's just take the argmax().

sampler <- \(logits) tf$argmax(logits, axis = -1L, output_type = "int32")

(next_token <- sampler(next_token_probs))
tf.Tensor([304], shape=(1), dtype=int32)
tokenizer$detokenize(next_token) |> as.character()
[1] "to"
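
As one example of an alternative, here is a minimal sketch of a
temperature-scaled stochastic sampler (illustrative only; the rest of this
post sticks with the argmax() sampler defined above):

sampler_stochastic <- function(logits, temperature = 0.7) {
  scaled <- logits / temperature   # higher temperature -> more random choices
  tf$random$categorical(scaled, num_samples = 1L) |>
    tf$squeeze(axis = -1L) |>
    tf$cast(dtype = "int32")
}

sampler_stochastic(next_token_probs) |>
  tokenizer$detokenize() |> as.character()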

Let's run it for a few tokens and let LLaMA finish the sentence:

prompt_tokens <- tokenizer$tokenize("The best way to attract bees")

for (i in 1:20) {

  next_token_probs <- prompt_tokens |> llama()
  next_token <- sampler(next_token_probs)

  prompt_tokens %<>% { tf$concat(c(., next_token), axis = -1L) }

  # end of sentence
  if (as.logical(next_token == tokenizer$string_to_id(".")))
    break
}

prompt_tokens |>
  tokenizer$detokenize() |>
  as.character() |>
  strwrap(60) |> writeLines()
The best way to attract bees to your garden is to plant a
variety of flowers that bloom at different times.

Wrapping up

In this blog post we've walked through the LLaMA architecture
implemented in R TensorFlow, including how to load pretrained weights,
and then run the model to generate a sentence. Note, much of the code in
this blog post is tailored for didactic purposes. While the
implementation of the LLaMA architecture covered in this blog post is
appropriate for training, there are a few modifications you'll want to
make before doing a lot of text generation. Those include things like:

  • In the Attention layer, caching the k and v tensors. Then,
    after the first forward pass with the initial prompt, only feeding
    the model the single new token from the sampler(), rather than
    feeding the model all the tokens of the full prompt on each forward
    pass.

  • Only generating the causal mask make_mask() and rotary_matrix
    slices once per forward pass, instead of within each Attention
    call.

  • Updating the TransformerBlock to be cache-aware and to pass
    through the appropriate arguments to Attention()

  • Wrapping all of the additional book-keeping logic in a custom
    TransformerDecoder() class.

The changes required to implement these optimizations for inference
balloon the code size and are mostly about book-keeping, so we won't go
through them in this blog post. However, you can find a fuller
implementation of LLaMA in R TensorFlow, including a cache-aware
generate() method that only feeds the model one token at a time during
the main inference loop, (and compiles to XLA!),
here.

That's all for now. Thanks for reading and happy travels to all
exploring this exciting LLM terrain!

Photo by Sébastien Goldberg on Unsplash

Biderman, Stella, Sid Black, Charles Foster, Leo Gao, Eric Hallahan, Horace He, Ben Wang, and Phil Wang. 2021. “Rotary Embeddings: A Relative Revolution.” blog.eleuther.ai/rotary-embeddings/.
Falbel, Daniel, and Sigrid Keydana. 2023. “Posit AI Blog: De-Noising Diffusion with Torch.” https://blogs.rstudio.com/tensorflow/posts/2023-04-13-denoising-diffusion/.
Hoffmann, Jordan, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, et al. 2022. “Training Compute-Optimal Large Language Models.” https://arxiv.org/abs/2203.15556.
Shazeer, Noam. 2020. “GLU Variants Improve Transformer.” https://arxiv.org/abs/2002.05202.
Su, Jianlin, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. 2022. “RoFormer: Enhanced Transformer with Rotary Position Embedding.” https://arxiv.org/abs/2104.09864.
Touvron, Hugo, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, et al. 2023. “LLaMA: Open and Efficient Foundation Language Models.” https://doi.org/10.48550/ARXIV.2302.13971.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” https://arxiv.org/abs/1706.03762.
