Posit AI Blog: safetensors 0.1.0

safetensors is a new, simple, fast, and safe file format for storing tensors. The design of the file format and its original implementation are being led
by Hugging Face, and it’s being widely adopted in their popular ‘transformers’ framework. The safetensors R package is a pure-R implementation that can both read and write safetensors files.

The initial version (0.1.0) of safetensors is now on CRAN.

Motivation

The main motivation for safetensors in the Python community is security. As noted
in the official documentation:

The main rationale for this crate is to remove the need to use pickle on PyTorch which is used by default.

Pickle is considered an unsafe format, as the action of loading a Pickle file can
trigger the execution of arbitrary code. This has never been a concern for torch
for R users, since the Pickle parser that’s included in LibTorch only supports a subset
of the Pickle format, which doesn’t include executing code.

However, the file format has additional advantages over other commonly used formats, including:

  • Support for lazy loading: You can choose to read a subset of the tensors stored in the file.

  • Zero copy: Reading the file doesn’t require more memory than the file itself.
    (Technically the current R implementation does make a single copy, but that can
    be optimized out if we really need it at some point).

  • Simple: Implementing the file format is simple, and doesn’t require complex dependencies.
    This means that it’s a good format for exchanging tensors between ML frameworks and
    between different programming languages. For instance, you can write a safetensors file
    in R and load it in Python, and vice-versa.

There are additional advantages compared to other file formats common in this space, and
you can see a comparison table in the safetensors documentation.

Format

The safetensors format is described in the figure below. It’s basically a header
containing some metadata, followed by raw tensor buffers.

Diagram describing the safetensors file format.
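
As a rough illustration of this layout, the sketch below uses only base R to write a tiny safetensors-style file by hand and read its header back. It assumes the layout documented by the safetensors project: an 8-byte little-endian unsigned integer giving the header length, then a JSON header, then the raw buffers. The tensor name x and the values are made up for the example, and this is not how the safetensors package itself is implemented.

```r
# A minimal sketch of the on-disk layout, using base R only.
# Assumed layout: [8-byte little-endian header length][JSON header][raw buffers]
header <- '{"x":{"dtype":"F32","shape":[2],"data_offsets":[0,8]}}'
header_raw <- charToRaw(header)
path <- tempfile(fileext = ".safetensors")

con <- file(path, "wb")
# Write the 8-byte length as two 4-byte ints (valid while the length < 2^31).
writeBin(c(length(header_raw), 0L), con, size = 4, endian = "little")
writeBin(header_raw, con)
# Raw F32 buffer for the hypothetical tensor "x" (2 values, 8 bytes).
writeBin(c(1.5, -2), con, size = 4, endian = "little")
close(con)

con <- file(path, "rb")
# Low 4 bytes of the header length; the high 4 bytes are zero here.
header_len <- readBin(con, "integer", n = 2, size = 4, endian = "little")[1]
header_txt <- rawToChar(readBin(con, "raw", n = header_len))
# The buffers start right after the header; read the 2 floats of "x".
x <- readBin(con, "numeric", n = 2, size = 4, endian = "little")
close(con)

cat(header_txt, "\n")
print(x)
```

The header text could then be handed to a JSON parser such as jsonlite::fromJSON() to recover each tensor’s dtype, shape and offsets.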

Basic usage

safetensors can be installed from CRAN using:

install.packages("safetensors")

We can then write any named list of torch tensors:

library(torch)
library(safetensors)

tensors <- list(
  x = torch_randn(10, 10),
  y = torch_ones(10, 10)
)

str(tensors)
#> List of 2
#>  $ x:Float [1:10, 1:10]
#>  $ y:Float [1:10, 1:10]

tmp <- tempfile()
safe_save_file(tensors, tmp)

It’s possible to pass additional metadata to the saved file by providing a metadata
parameter containing a named list.

Reading safetensors files is handled by safe_load_file, and it returns the named
list of tensors along with the metadata attribute containing the parsed file header.

tensors <- safe_load_file(tmp)
str(tensors)
#> List of 2
#>  $ x:Float [1:10, 1:10]
#>  $ y:Float [1:10, 1:10]
#>  - attr(*, "metadata")=List of 2
#>   ..$ x:List of 3
#>   .. ..$ shape       : int [1:2] 10 10
#>   .. ..$ dtype       : chr "F32"
#>   .. ..$ data_offsets: int [1:2] 0 400
#>   ..$ y:List of 3
#>   .. ..$ shape       : int [1:2] 10 10
#>   .. ..$ dtype       : chr "F32"
#>   .. ..$ data_offsets: int [1:2] 400 800
#>  - attr(*, "max_offset")= int 929
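
The data_offsets values in this output follow from the shapes and dtypes: a 10 x 10 tensor of F32 values takes 10 * 10 * 4 = 400 bytes. The sketch below reproduces that arithmetic in base R. The dtype_bytes table and the tensor_nbytes helper are hypothetical names introduced for illustration, and the code assumes the buffers are packed back to back in list order.

```r
# Bytes per element for a few safetensors dtype codes (a subset, for illustration).
dtype_bytes <- c(F16 = 2, F32 = 4, F64 = 8, I32 = 4, I64 = 8)

# Hypothetical helper: size in bytes of one tensor's raw buffer.
tensor_nbytes <- function(shape, dtype) prod(shape) * dtype_bytes[[dtype]]

shapes <- list(x = c(10, 10), y = c(10, 10))
sizes <- vapply(shapes, tensor_nbytes, numeric(1), dtype = "F32")

# Assuming contiguous storage, each buffer starts where the previous one ends.
ends <- cumsum(sizes)
starts <- ends - sizes
offsets <- rbind(start = starts, end = ends)
print(offsets)
```

Under these assumptions x spans bytes 0 to 400 and y spans bytes 400 to 800 of the data section, which matches the header shown above.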

Currently, safetensors only supports writing torch tensors, but we plan to add
support for writing plain R arrays and tensorflow tensors in the future.

Future directions

The next version of torch will use safetensors as its serialization format,
meaning that when calling torch_save() on a model, list of tensors, or other
types of objects supported by torch_save, you will get a valid safetensors file.

This is an improvement over the previous implementation because:

  1. It’s much faster: more than 10x for medium sized models, and possibly even more for large files.
    This also improves the performance of parallel dataloaders by ~30%.

  2. It enhances cross-language and cross-framework compatibility. You can train your model
    in R and use it in Python (and vice-versa), or train your model in tensorflow and run it
    with torch.

If you want to try it out, you can install the development version of torch with:

remotes::install_github("mlverse/torch")

Photo by Nick Fewings on Unsplash

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don’t fall under this license and can be recognized by a note in their caption: “Figure from …”.

Citation

For attribution, please cite this work as

Falbel (2023, June 15). Posit AI Blog: safetensors 0.1.0. Retrieved from https://blogs.rstudio.com/tensorflow/posts/2023-06-15-safetensors/

BibTeX citation

@misc{safetensors,
  author = {Falbel, Daniel},
  title = {Posit AI Blog: safetensors 0.1.0},
  url = {https://blogs.rstudio.com/tensorflow/posts/2023-06-15-safetensors/},
  year = {2023}
}
