State-of-the-art NLP models from R

Introduction

The Transformers repository from "Hugging Face" contains a lot of ready-to-use, state-of-the-art models, which are straightforward to download and fine-tune with TensorFlow & Keras.

For this purpose, users usually need to get:

  • The model itself (e.g. BERT, ALBERT, RoBERTa, GPT-2, etc.)
  • The tokenizer object
  • The weights of the model

In this post, we'll work on a classic binary classification task and train our dataset on 3 models: GPT-2, RoBERTa, and Electra.

However, readers should know that one can work with transformers on a variety of down-stream tasks, such as:

  1. feature extraction
  2. sentiment analysis
  3. text classification
  4. question answering
  5. summarization
  6. translation and many more.

Prerequisites

Our first job is to install the transformers package via reticulate.

reticulate::py_install('transformers', pip = TRUE)

Then, as usual, load the standard 'Keras', 'TensorFlow' >= 2.0 and some classic libraries from R.
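
The snippet below is a minimal sketch of the setup the rest of the post relies on: it loads keras, tensorflow, tfdatasets and a couple of helper packages, and imports the Python transformers module into an R object called transformer, which the later code refers to.

library(keras)
library(tensorflow)
library(tfdatasets)
library(dplyr)

# import the Python module; the code below refers to it as `transformer`
transformer = reticulate::import('transformers')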

Note that if running TensorFlow on GPU, one could specify the following parameters in order to avoid memory issues.

# allow memory growth on the (first) GPU and set the default float type
physical_devices = tf$config$list_physical_devices('GPU')
tf$config$experimental$set_memory_growth(physical_devices[[1]], TRUE)

tf$keras$backend$set_floatx('float32')

Template

We already mentioned that to train data on a specific model, users should download the model, its tokenizer object, and its weights. For example, to get a RoBERTa model one has to do the following:

# get Tokenizer
transformer$RobertaTokenizer$from_pretrained('roberta-base', do_lower_case=TRUE)

# get Model with weights
transformer$TFRobertaModel$from_pretrained('roberta-base')

Data preparation

A dataset for binary classification is provided in the text2vec package. Let's load the dataset and take a sample for fast model training.
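
As a minimal sketch (the renaming is an assumption, chosen to match the column names used in the training loop below), one can take the movie_review dataset that ships with text2vec:

library(text2vec)

# take a sample of the movie_review dataset for fast training and
# rename its columns to the names the loop below expects (assumption)
df = text2vec::movie_review %>% 
  dplyr::rename(comment_text = review, target = sentiment) %>% 
  dplyr::sample_n(2000)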

Split our data into 2 parts:

# draw 80% of the row indices at random for training
idx_train = sample.int(nrow(df), size = nrow(df) * 0.8)

train = df[idx_train, ]
test = df[-idx_train, ]

Data input for Keras

Until now, we have just covered data import and the train-test split. To feed input to the network, we have to turn our raw text into indices via the imported tokenizer, and then adapt the model to do binary classification by adding a dense layer with a single unit at the end.
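
As a quick illustration (a sketch, reusing the RoBERTa tokenizer from the Template section and a made-up sentence), encoding a single piece of text yields an integer vector of token indices, truncated to at most max_len tokens:

# hypothetical single-sentence example showing what the tokenizer produces
tokenizer = transformer$RobertaTokenizer$from_pretrained('roberta-base', do_lower_case=TRUE)
ids = tokenizer$encode("This movie was surprisingly good!", max_length = 50L, truncation = TRUE)
ids  # integer token indices, wrapped in the tokenizer's special tokens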

However, we want to train our data on 3 models: GPT-2, RoBERTa, and Electra. We need to write a loop for that.

Note: one model usually requires 500-700 MB.

# list of 3 models
ai_m = list(
  c('TFGPT2Model',       'GPT2Tokenizer',       'gpt2'),
  c('TFRobertaModel',    'RobertaTokenizer',    'roberta-base'),
  c('TFElectraModel',    'ElectraTokenizer',    'google/electra-small-generator')
)

# parameters
max_len = 50L
epochs = 2
batch_size = 10

# create a list for model results
gather_history = list()

for (i in 1:length(ai_m)) {
  
  # tokenizer
  tokenizer = glue::glue("transformer${ai_m[[i]][2]}$from_pretrained('{ai_m[[i]][3]}',
                         do_lower_case=TRUE)") %>% 
    rlang::parse_expr() %>% eval()
  
  # model
  model_ = glue::glue("transformer${ai_m[[i]][1]}$from_pretrained('{ai_m[[i]][3]}')") %>% 
    rlang::parse_expr() %>% eval()
  
  # inputs
  text = list()
  # outputs
  label = list()
  
  data_prep = function(data) {
    for (i in 1:nrow(data)) {
      
      txt = tokenizer$encode(data[['comment_text']][i], max_length = max_len, 
                             truncation = T) %>% 
        t() %>% 
        as.matrix() %>% list()
      lbl = data[['target']][i] %>% t()
      
      text = text %>% append(txt)
      label = label %>% append(lbl)
    }
    list(do.call(plyr::rbind.fill.matrix, text), do.call(plyr::rbind.fill.matrix, label))
  }
  
  train_ = data_prep(train)
  test_ = data_prep(test)
  
  # slice dataset
  tf_train = tensor_slices_dataset(list(train_[[1]], train_[[2]])) %>% 
    dataset_batch(batch_size = batch_size, drop_remainder = TRUE) %>% 
    dataset_shuffle(128) %>% dataset_repeat(epochs) %>% 
    dataset_prefetch(tf$data$experimental$AUTOTUNE)
  
  tf_test = tensor_slices_dataset(list(test_[[1]], test_[[2]])) %>% 
    dataset_batch(batch_size = batch_size)
  
  # create an input layer
  input = layer_input(shape = c(max_len), dtype = 'int32')
  hidden_mean = tf$reduce_mean(model_(input)[[1]], axis = 1L) %>% 
    layer_dense(64, activation = 'relu')
  # create an output layer for binary classification
  output = hidden_mean %>% layer_dense(units = 1, activation = 'sigmoid')
  model = keras_model(inputs = input, outputs = output)
  
  # compile with AUC score
  model %>% compile(optimizer = tf$keras$optimizers$Adam(learning_rate = 3e-5, epsilon = 1e-08, clipnorm = 1.0),
                    loss = tf$losses$BinaryCrossentropy(from_logits = F),
                    metrics = tf$metrics$AUC())
  
  print(glue::glue('{ai_m[[i]][1]}'))
  # train the model
  history = model %>% keras::fit(tf_train, epochs = epochs, #steps_per_epoch=len/batch_size,
                validation_data = tf_test)
  gather_history[[i]] <- history
  names(gather_history)[i] = ai_m[[i]][1]
}



Extract the results to see the benchmarks:
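
A minimal sketch of how one could pull the validation AUC per epoch out of the collected keras history objects (this is not the original post's code; the metric-name lookup is an assumption, since Keras may suffix the AUC metric name between runs):

# hypothetical helper: gather validation AUC per epoch for each model
results = purrr::imap_dfr(gather_history, function(history, model_name) {
  metrics = history$metrics
  # Keras may name the metric "val_auc", "val_auc_1", ... depending on the run
  val_auc_name = grep("^val_auc", names(metrics), value = TRUE)[1]
  tibble::tibble(
    model   = model_name,
    epoch   = seq_along(metrics[[val_auc_name]]),
    val_auc = metrics[[val_auc_name]]
  )
})

results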

Both the RoBERTa and Electra models show some additional improvement after 2 epochs of training, which cannot be said of GPT-2. In this case, it is clear that it can be enough to train a state-of-the-art model for even a single epoch.

Conclusion

In this post, we showed how to use state-of-the-art NLP models from R.
To understand how to apply them to more complex tasks, it is highly recommended to review the transformers tutorial.

We encourage readers to try out these models and share their results below in the comments section!

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/henry090/transformers, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from …".

Citation

For attribution, please cite this work as

Abdullayev (2020, July 30). RStudio AI Blog: State-of-the-art NLP models from R. Retrieved from https://blogs.rstudio.com/tensorflow/posts/2020-07-30-state-of-the-art-nlp-models-from-r/

BibTeX citation

@misc{abdullayev2020state-of-the-art,
  author = {Abdullayev, Turgut},
  title = {RStudio AI Blog: State-of-the-art NLP models from R},
  url = {https://blogs.rstudio.com/tensorflow/posts/2020-07-30-state-of-the-art-nlp-models-from-r/},
  year = {2020}
}
