Video understanding is a challenging problem that requires reasoning about both spatial information (e.g., for objects in a scene, including their locations and relations) and temporal information for actions or events shown in a video. There are many video understanding applications and tasks, such as understanding the semantic content of web videos and robot perception. However, current works, such as ViViT and TimeSFormer, densely process the video and require significant compute, especially as model size, video length, and resolution increase.
In “Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning”, to be presented at CVPR 2023, we introduce a simple approach that turns a Vision Transformer (ViT) image encoder into an efficient video backbone using sparse video tubes (learnable visual representations of samples from the video) to reduce the model’s compute needs. This approach can seamlessly process both images and videos, which allows it to leverage both image and video data sources during training. This training further enables our sparse tubes ViT model to coalesce image and video backbones together to serve a dual role as either an image or video backbone (or both), depending on the input. We demonstrate that this model is scalable, can be adapted to large pre-trained ViTs without requiring full fine-tuning, and achieves state-of-the-art results across many video classification benchmarks.
Using sparse video tubes to sample a video, combined with a standard ViT encoder, leads to an efficient visual representation that can be seamlessly shared with image inputs.
Building a joint image-video backbone
Our sparse tube ViT uses a standard ViT backbone, consisting of a stack of Transformer layers, that processes video information. Previous methods, such as ViViT, densely tokenize the video and then apply factorized attention, i.e., the attention weights for each token are computed separately for the temporal and spatial dimensions. In the standard ViT architecture, self-attention is computed over the whole token sequence. When using videos as input, token sequences become quite long, which can make this computation slow. Instead, in the method we propose, the video is sparsely sampled using video tubes, which are 3D learnable visual representations of various shapes and sizes (described in more detail below) from the video. These tubes are used to sparsely sample the video using a large temporal stride, i.e., a tube kernel is only applied to a few locations in the video, rather than every pixel.
By sparsely sampling the video tubes, we can use the same global self-attention module, rather than factorized attention like ViViT. We experimentally show that adding factorized attention layers can hurt performance due to the uninitialized weights. This single stack of Transformer layers in the ViT backbone also enables better sharing of the weights and improves performance. Sparse video tube sampling is done by using a large spatial and temporal stride that selects tokens on a fixed grid. The large stride reduces the number of tokens in the full network, while still capturing both spatial and temporal information and enabling the efficient processing of all tokens.
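As a rough sketch (not the code used in this work), sparse tube sampling can be viewed as a 3D convolution applied with large strides: each tube kernel projects a small spatio-temporal cuboid into a token, and the large stride means only a sparse grid of locations produces tokens. The kernel and stride values below are illustrative assumptions.

```python
# Minimal sketch of sparse tube sampling as a strided 3D convolution.
import torch
import torch.nn as nn

video = torch.randn(1, 3, 32, 224, 224)  # (batch, channels, time, height, width)

# Hypothetical tube: an 8x8x8 kernel applied with strides of 16 frames and 32 pixels,
# so only a few locations in the video are sampled.
tube = nn.Conv3d(in_channels=3, out_channels=768,
                 kernel_size=(8, 8, 8), stride=(16, 32, 32))

features = tube(video)                        # (1, 768, T', H', W') on a sparse grid
tokens = features.flatten(2).transpose(1, 2)  # (1, N, 768) token sequence for the ViT
print(tokens.shape)  # far fewer tokens than dense per-frame patching
```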
Sparse video tubes
Video tubes are 3D grid-based cuboids that can have different shapes or categories and capture different information, with strides and starting locations that may overlap. In the model, we use three distinct tube shapes that capture: (1) only spatial information (resulting in a set of 2D image patches), (2) long temporal information (over a small spatial area), and (3) both spatial and temporal information equally. Tubes that capture only spatial information can be applied to both image and video inputs. Tubes that capture long temporal information or both temporal and spatial information equally are only applied to video inputs. Depending on the input video size, the three tube shapes are applied to the model multiple times to generate tokens, as in the sketch below.
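For concreteness, the three tube categories could be written down as a small configuration table. The specific kernel sizes, strides, and offsets below are illustrative placeholders, not the settings used in the paper.

```python
# Hypothetical tube configurations, each given as (time, height, width) shapes.
TUBES = [
    # (1) Spatial-only: a 1-frame kernel, equivalent to 2D image patches,
    #     so it can be applied to images as well as videos.
    {"kernel": (1, 16, 16), "stride": (32, 16, 16), "offset": (0, 0, 0)},
    # (2) Long temporal: many frames over a small spatial area (video only).
    {"kernel": (16, 4, 4), "stride": (6, 32, 32), "offset": (4, 8, 8)},
    # (3) Balanced spatio-temporal tube (video only).
    {"kernel": (8, 8, 8), "stride": (16, 32, 32), "offset": (0, 16, 16)},
]
```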
A fixed position embedding, which captures the global location of each tube (including any strides, offsets, etc.) relative to all the other tubes, is applied to the video tubes. Unlike previously used learned position embeddings, this fixed one better enables sparse, overlapping sampling. Capturing the global location of each tube helps the model know where each one came from, which is especially helpful when tubes overlap or are sampled from distant video locations. Next, the tube features are concatenated together to form a set of N tokens. These tokens are processed by a standard ViT encoder. Finally, we apply attention pooling to compress all the tokens into a single representation, which is input to a fully connected (FC) layer to make the classification (e.g., playing soccer, swimming, etc.).
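The sketch below illustrates this tail end of the pipeline under stated assumptions: a sinusoidal form for the fixed global position embedding and a single learned query for attention pooling. These are illustrative choices, not the paper's exact implementation.

```python
# Sketch: fixed global position embedding + attention pooling head.
import math
import torch
import torch.nn as nn

def fixed_pos_embedding(centers, dim=768):
    """centers: (N, 3) global (t, y, x) tube centers, normalized to [0, 1]."""
    d = dim // 6  # sin + cos per axis
    freqs = torch.exp(torch.arange(d) * (-math.log(10000.0) / d))
    parts = []
    for axis in range(3):
        angles = centers[:, axis:axis + 1] * freqs      # (N, d)
        parts += [torch.sin(angles), torch.cos(angles)]
    emb = torch.cat(parts, dim=-1)                      # (N, 6*d)
    return nn.functional.pad(emb, (0, dim - emb.shape[-1]))

class AttentionPoolClassifier(nn.Module):
    """Compress N tokens into one representation with a learned query, then classify."""
    def __init__(self, dim=768, num_classes=400):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, tokens):                    # tokens: (B, N, dim) from the ViT
        q = self.query.expand(tokens.shape[0], -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)  # (B, 1, dim)
        return self.fc(pooled.squeeze(1))         # (B, num_classes) logits

# Tube tokens + fixed_pos_embedding(centers) are fed to a standard ViT encoder,
# and the encoder outputs go to AttentionPoolClassifier for the final label.
```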
Scaling video ViTs
The process of building video backbones is computationally intensive, but our sparse tube ViT model enables computationally efficient scaling of video models by leveraging previously trained image backbones. Since image backbones can be adapted into a video backbone, large image backbones can be turned into large video backbones. More specifically, one can transfer the learned video feature representations from a small tube ViT to a large pre-trained image ViT and train the resulting model with video data for just a few steps, as opposed to a full training from scratch.
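A hedged sketch of this recipe, under assumed model interfaces (the attribute names such as `tubes` and the step count are placeholders, not real APIs): reuse the tube projections learned by a small tube ViT together with a large pre-trained image ViT, then train the combined model on video for only a few steps.

```python
# Sketch: bootstrap a large video model from a small tube ViT + a large image ViT.
import copy
import torch

def build_large_video_model(small_tube_vit, large_image_vit):
    large_video_model = copy.deepcopy(large_image_vit)
    # Reuse the sparse tube projections (3D conv kernels) already learned by the
    # small video model instead of learning them from scratch.
    large_video_model.tubes = copy.deepcopy(small_tube_vit.tubes)
    return large_video_model

def finetune_few_steps(model, video_loader, steps=1000, lr=1e-4):
    # Only a short fine-tuning run on video data, rather than full training.
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for step, (clips, labels) in zip(range(steps), video_loader):
        opt.zero_grad()
        loss = loss_fn(model(clips), labels)
        loss.backward()
        opt.step()
```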
Results
We evaluate our sparse tube ViT approach on the Kinetics-400 (shown below), Kinetics-600, and Kinetics-700 datasets and compare its performance to a long list of prior methods. We find that our approach outperforms all prior methods. Importantly, it outperforms all state-of-the-art methods trained jointly on image+video datasets.
Performance compared to several prior works on the popular Kinetics-400 video dataset. Our sparse tube ViT outperforms state-of-the-art methods.
Furthermore, we test our sparse tube ViT model on the Something-Something V2 dataset, which is commonly used to evaluate more dynamic activities, and also report that it outperforms all prior state-of-the-art approaches.
Performance on the Something-Something V2 video dataset.
Visualizing some learned kernels
It is interesting to understand what kind of rudimentary features are being learned by the proposed model. We visualize them below, showing both the 2D patches, which are shared for both images and videos, and the video tubes. These visualizations show the 2D or 3D information being captured by the projection layer. For example, in the 2D patches, various common features, like edges and colors, are detected, whereas the 3D tubes capture basic shapes and how they may change over time.
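One simple way to produce such a visualization (an assumed plotting layout, not the code used for the figures in this post) is to render slices of a tube's projection kernels as RGB images:

```python
# Sketch: visualize the learned 2D patch kernels of a tube projection layer.
import matplotlib.pyplot as plt

def show_patch_kernels(conv3d, n=16):
    # conv3d.weight has shape (out_channels, 3, t, h, w); take the first temporal
    # slice of each kernel and display it as an RGB image.
    w = conv3d.weight.detach()[:n, :, 0]            # (n, 3, h, w)
    w = (w - w.min()) / (w.max() - w.min() + 1e-8)  # normalize for display
    fig, axes = plt.subplots(1, n, figsize=(2 * n, 2))
    for ax, k in zip(axes, w):
        ax.imshow(k.permute(1, 2, 0).numpy())       # (h, w, 3)
        ax.axis("off")
    plt.show()
```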
Conclusions
We have presented a new sparse tube ViT, which can turn a ViT encoder into an efficient video model and can seamlessly work with both image and video inputs. We also showed that large video encoders can be bootstrapped from small video encoders and image-only ViTs. Our approach outperforms prior methods across several popular video understanding benchmarks. We believe that this simple representation can facilitate much more efficient learning with input videos, seamlessly incorporate either image or video inputs, and effectively eliminate the bifurcation of image and video models for future multimodal understanding.
Acknowledgements
This work is performed by AJ Piergiovanni, Weicheng Kuo and Anelia Angelova, who are now at Google DeepMind. We thank Abhijit Ogale, Luowei Zhou, Claire Cui and our colleagues in Google Research for their helpful discussions, comments, and support.