VideoLLM: Modeling Video Sequence with Large Language Models
Guo Chen1,2, Yin-Dong Zheng1, Jiahao Wang1, Jilan Xu2,3, Yifei Huang2, Junting Pan4
Yi Wang2, Yali Wang2, Yu Qiao2, Tong Lu1, Limin Wang1,2
1 Nanjing University  2 OpenGVLab, Shanghai AI Laboratory  3 Fudan University  4 The Chinese University of Hong Kong
https://github.com/cg1177/VideoLLM
Abstract
With the exponential growth of video data, there is an urgent need for automated
technology to analyze and comprehend video content. However, existing video
understanding models are often task-specific and lack a comprehensive capability
for handling diverse tasks. The success of large language models (LLMs) like GPT
has demonstrated their impressive abilities in sequence causal reasoning. Building
upon this insight, we propose a novel framework called VideoLLM that leverages
the sequence reasoning capabilities of pre-trained LLMs from natural language
processing (NLP) for video sequence understanding. VideoLLM incorporates
a carefully designed Modality Encoder and Semantic Translator, which convert
inputs from various modalities into a unified token sequence. This token sequence
is then fed into a decoder-only LLM. Subsequently, with the aid of a simple
task head, our VideoLLM yields an effective unified framework for different
kinds of video understanding tasks. To evaluate the efficacy of VideoLLM, we
conduct extensive experiments using multiple LLMs and fine-tuning methods. We
evaluate our VideoLLM on eight tasks sourced from four different datasets. The
experimental results demonstrate that the understanding and reasoning capabilities
of LLMs can be effectively transferred to video understanding tasks.
1 Introduction
The advent of phenomenon-level language applications such as ChatGPT [59] has showcased the
remarkable zero-shot capability of LLMs [61; 62; 7; 58; 64; 101; 75; 83] in addressing a wide range
of NLP and vision-centric tasks. The strong sequence modeling and reasoning abilities these models
exhibit stem from rigorous pre-training of models with substantial parameters on large-scale corpora.
Despite these achievements in processing language sequences, the understanding of video sequences,
which record the objective dynamics of the real world and can be regarded as long image sequences,
still falls far short of the level reached by present LLMs.
Video sequence understanding underpins various real-world applications, such as surveillance systems
[37], autonomous vehicles [70], robotics [66], and wearable devices [71]. Simply put, it requires
AI systems to process streams of visual information in real time, reason over them in the context
of long-term time series, and then provide responses. The vanilla paradigm for video sequence
understanding relies on task-specific designs [93; 103; 11; 98; 51; 100; 39] to encode or decode
video sequences, which achieves promising performance but incurs additional tailoring costs.
Compared with natural language, there is no scalable video sequence model that can be seamlessly
adapted to different video sequence tasks. This is primarily attributed to the challenges associated
with large-scale video self-supervision, which arise from the expensive nature of temporally dense
visual annotation as well as the time-consuming process of acquiring and processing extensive video
data. As a result, there is a pressing demand for an efficient method that can offer fundamental
modeling capabilities for tasks involving video sequence understanding.

Figure 1: Overview of our motivation and method. (a) An LLM taking words as input is pre-trained on large-scale
natural language composed of word sequences. (b) VideoLLM encodes a video stream into token sequences and
applies large-scale pre-trained LLMs to video sequence reasoning tasks.
In this work, we present a novel paradigm called VideoLLM, as shown in Figure 1, which aligns
video and language sequences and harnesses the reasoning and understanding capabilities of LLMs.
This paradigm enables videos to engage in reasoning about real-world events through the medium of
language. Specifically, it is composed of three core components: (1) a temporal-wise unitization
method that encodes the unit-wise data stream, (2) an appended semantic translator that transfers
visual semantics into language semantics, and (3) a decoder-only LLM serving as a generalist
reasoner for various video sequence understanding tasks. This design allows sequence tasks with
different modalities (e.g., visual and text) to be seamlessly integrated, as verified in our experiments
on vision-only tasks such as temporal action detection and action anticipation, as well as on
vision-language tasks such as temporal grounding and highlight detection. The unit-wise encoding
and decoder-only reasoning enable the system to run with minimal delay, satisfying the latency
requirements of real-time or interactive systems.
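To make these components concrete, below is a minimal PyTorch-style sketch of how a frozen decoder-only LLM could be combined with a per-unit visual feature stream, a linear semantic translator, and a simple task head. The module names, feature dimensions, choice of GPT-2, and the linear translator are our illustrative assumptions, not the exact implementation used in the paper.

```python
# Minimal sketch (not the authors' exact code): a frozen decoder-only LM consumes
# per-unit visual tokens produced by a short-term encoder and a linear "semantic
# translator"; a lightweight head then makes per-timestep predictions.
import torch
import torch.nn as nn
from transformers import GPT2Model  # any decoder-only LLM could be swapped in


class VideoLLMSketch(nn.Module):
    def __init__(self, visual_dim=768, num_classes=20):
        super().__init__()
        self.llm = GPT2Model.from_pretrained("gpt2")   # pre-trained LLM, kept frozen
        for p in self.llm.parameters():
            p.requires_grad = False
        hidden = self.llm.config.n_embd
        # Semantic translator: project visual units into the LLM token space.
        self.translator = nn.Linear(visual_dim, hidden)
        # Simple task head, e.g. per-timestep classification for online action detection.
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, unit_features):
        # unit_features: (B, T, visual_dim), one feature per short video unit
        tokens = self.translator(unit_features)        # (B, T, hidden)
        out = self.llm(inputs_embeds=tokens).last_hidden_state
        return self.head(out)                          # (B, T, num_classes)


# Usage with dummy per-unit features from any frozen short-term visual encoder.
model = VideoLLMSketch()
logits = model(torch.randn(2, 64, 768))  # two clips, 64 temporal units each
```

In this sketch only the translator and the task head are trainable, mirroring the goal of transferring a frozen LLM's sequence reasoning to video with few trainable parameters.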
In contrast to the long-term temporal post-fusion approach proposed in [3], our method emphasizes
learning short-term visual token representations for effective integration with frozen LLMs. This
adaptation is conducted within a well-pretrained LLM that possesses robust sequence processing and
causal reasoning abilities. Consequently, dedicated long-term video modeling can be dispensed with,
which effectively simplifies the system design. Compared to recent API-based or ensemble-based
visual understanding applications [12; 97; 68; 54; 45], we offer an end-to-end, system-level approach
to video understanding that bridges visual models and LLMs, enhancing the overall efficiency of the
long-term video sequence understanding pipeline. Moreover, our method achieves maximal decoupling
between short-term and long-term visual modeling, enabling the flexible adoption of heterogeneous
short-term visual encoding techniques while rapidly incorporating state-of-the-art LLMs.
Our contributions can be succinctly summarized as follows:
(1) We present VideoLLM, a novel framework that harnesses the sequence reasoning capabilities of
pre-trained LLMs to tackle video sequence understanding tasks through the medium of language. By
aligning videos with language, VideoLLM enables simultaneous reasoning about language logic and
the evolution of real-world states through unified modeling.
(2) We reexamine the characteristics and challenges associated with various video sequence
understanding tasks and develop a novel, plug-and-play adaptation scheme to adapt off-the-shelf
visual encoders and advanced LLMs effectively. This scheme is built upon a unified adaptation
principle, eliminating the need for task-specific customization.
(3) We conduct extensive experiments across four datasets, encompassing eight video sequence
understanding tasks. These tasks cover diverse settings, including data accessibility (causal or
non-causal), perceptual objectives (memory or anticipation), prediction granularity (segment-level
or frame-level), and modalities (vision-only or vision-language). The experiments employ a range
of LLMs, such as GPT-2, T5, and OPT. Comparative analyses against task-specific tailored models
demonstrate that our VideoLLM achieves state-of-the-art or comparable performance on these tasks
with comparable or fewer trainable parameters. These results establish LLMs as effective video
reasoners and validate the efficacy of our proposed VideoLLM framework for multiple video sequence
understanding tasks.
2 Related Work
2.1 Video Sequence Understanding
Video Sequence Understanding tasks can be categorized into two types based on the granularity of
predictions: timestamp-level tasks and segment-level tasks. Timestamp-level tasks aim to predict
closed-set properties at each time step or filter suitable time steps based on textual conditions. For
example, [25; 87; 93; 21; 98] implement online action detection or action segmentation tasks to
predict the category of each time step in a video stream. Similarly, [103; 26; 24; 67] implement action
anticipation tasks to predict the action category that occurs after a certain time gap. Additionally,
methods such as [39; 52] achieve text-based highlight detection. Segment-level tasks involve
predicting segment boundaries in a video sequence based on closed-set categories or open text. Related
tasks include moment query [50; 94; 102; 96; 99] and natural language query [100; 65; 92]. The
model proposed in this paper is tested on multiple video sequence understanding tasks to verify the
language models’ capability to reason about videos from different perspectives.
2.2 Vision Models
Vision Models, including image and video models, have developed rapidly in recent years, mainly
focusing on representing short-term visual information. They can be divided into convolutional,
transformer, and hybrid networks. Convolutional models learn spatial [32; 28; 90; 56; 95; 84] or
spatio-temporal [82; 9; 23; 77; 76; 81; 57] visual representations by aggregating neighborhood
information with 2D or 3D convolution operators. Following the great success of the Transformer [78]
in NLP, vision transformers have also been continuously developed; they model space
[18; 55; 86; 74; 5; 20] or space-time [19; 22; 6; 3; 73; 80] through attention mechanisms.
To address the data-hungry problem caused by the lack of inductive bias in transformer networks,
hybrid networks [85; 46; 2; 47; 88] that combine attention mechanisms and convolution operators
have been proposed to improve performance.
2.3 Large Language Models
Large Language Models have emerged in recent years in natural language processing. These models
usually contain billions to hundreds of billions of parameters and are trained on large text corpora
[61; 62; 89; 64; 30; 13; 75]. Their core architecture is based on the Transformer [78], while the
objective functions include masked language modeling [17; 53; 35], generative language modeling
[61; 62; 7], and permuted language modeling [14]. Among these works, generative language models
showed promising results [62; 7] on a wide range of natural language understanding benchmarks.
Beginning with the representative work GPT-3 [7], a series of works [69; 63; 30; 101; 13; 75] scaled
up the model and pre-training data and demonstrated strong few-shot and zero-shot performance.
Despite the promising results on natural language tasks, the capabilities of these models remain
less explored in the multimodal domain. In this paper, we explore the long-range modeling capacity
of LLMs for improving video understanding.
2.4 Multimodal Models
Multimodal Models aim to learn joint vision-language representations for multimodal downstream
tasks. The dominant works are VLP models trained end-to-end on large-scale image/video-text
pairs [60; 34; 44; 40; 79; 4; 49]. To relieve the high demand for computational resources, modular
vision-language models adopt frozen unimodal or multimodal pre-trained encoders together with
learnable modules [43; 42; 1]. These models leverage the strong representation ability of large
language models for alignment or generation tasks. BLIP-2 [42] trains a lightweight Transformer to
compress the visual tokens, building a bridge between the vision output and the language input.
Flamingo [1] injects visual features into the LLM by adding intermediate cross-attention Transformer
layers.
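As a rough illustration of the cross-attention injection mentioned above, the sketch below inserts a learnable cross-attention layer between frozen language-model blocks so that text tokens can attend to visual features. This is our own simplified example with assumed dimensions and module names; Flamingo's actual design additionally includes a perceiver resampler and tanh gating.

```python
# Simplified illustration of cross-attention injection between frozen LM blocks
# (our own sketch; not Flamingo's actual implementation).
import torch
import torch.nn as nn


class CrossAttentionAdapter(nn.Module):
    def __init__(self, dim=768, num_heads=12):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text_tokens, visual_tokens):
        # Text tokens (queries) attend to visual tokens (keys/values).
        q = self.norm(text_tokens)
        attended, _ = self.attn(q, visual_tokens, visual_tokens)
        return text_tokens + attended  # residual keeps the frozen LM path intact


# Usage: interleave the adapter with frozen language-model blocks.
text = torch.randn(2, 16, 768)    # (B, text_len, dim)
visual = torch.randn(2, 32, 768)  # (B, num_visual_tokens, dim)
fused = CrossAttentionAdapter()(text, visual)
```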