Vision Transformer (ViT) Explained By Google Engineer | MultiModal LLM | Diffusion
Vision Transformer Revolutionizes Image Processing: A New Engine for Multimodal LLMs
Tags
Media Details
- Upload Date
- 2025-06-15 21:05
- Source
- https://www.youtube.com/watch?v=BxQep0qdeWA
- Processing Status
- Completed
- Transcription Status
- Completed
- Latest LLM Model
- gemini-2.5-pro-preview-06-05
Transcript
speaker 1: Hello everyone. We all know the Transformer paper, "Attention Is All You Need," revolutionized natural language processing and started the current large language model era. Fewer people know that the same mechanism has also been tried in computer vision and succeeded. Vision Transformers, or ViTs, segment images into sequences of patches, apply linear embeddings, and then unleash the power of multi-headed self-attention to directly model global relationships between image patches. Moreover, ViT is shining even more in multimodal LLMs. Alright, I can't wait to share this intro with you, so let's dive into it. Let's start with some history. For a very long time, computer vision (CV) was gaining more and more attention as deep learning started to get popular. You have probably heard of AlexNet, ResNet, R-CNN, or YOLO. If you have been in the industry long enough, you might be surprised that YOLO was already at version eleven when I checked yesterday. Then a famous paper changed the landscape in 2017: "Attention Is All You Need." Since then, the Transformer has become the most popular architecture for many tasks, and this is the classical encoder-decoder Transformer architecture. Most people probably know that large language models are mostly based on the Transformer. However, it is also used in computer vision, and it is becoming more and more important as multimodal LLMs become what people naturally want. Similarly, you might assume the diffusion model is used only in image generation, but it's not; it can also be used in text-based large language models. Let's save that topic for some other time. Today, let's focus on the Vision Transformer, ViT. So what is the Vision Transformer? This animated GIF from the original authors is a great overview of what happens in a Vision Transformer. I'm going to pause a bit for you to take a look. Alright, there are several steps in that animated GIF. First, you break an entire image into a number of patches; in this case it uses nine patches, for example. Those patches are flattened into vectors, and each of them is like a word in a complete sentence. Afterwards, we combine the flattened patches with position encodings from one to nine. Along with those nine patches, there is an extra learnable class embedding, indexed zero. This is a special token that captures a lot of global information; I'm going to go into details later. Then we send those flattened encodings to the Transformer encoder. This is basically identical to what we went through in the Transformer deep dive; take a look if you're interested. In the end, we go through an MLP head and output the classification of the image. Keep in mind the original paper uses the Vision Transformer for an image classification task. To summarize, the Vision Transformer is a vision model similar to the Transformer architecture originally designed for text-based tasks. ViT represents an input image as a sequence of image patches, similar to the sequence of word embeddings used when applying a Transformer to text, and directly predicts class labels for the image. When used for image classification, ViT demonstrates excellent performance when trained on sufficient data, outperforming a comparable state-of-the-art CNN with four times fewer computational resources on popular classification tasks when it first came out in 2020. This is the comparison: the previous SOTA (state-of-the-art) CNN gets these scores on all these image-related tasks, and this is ViT; it surpasses the CNN on all of them.
Moreover, in the more recent multimodal world, ViT plays a pivotal role by enabling these models to seamlessly integrate and understand visual information alongside textual data. This integration is very, very important for tasks that require a holistic understanding of both modalities. Now let's do a walkthrough of ViT. The first topic I want to go through is image tokenization: treating images as a sequence of patches. If you have gone through my Transformer deep dive, or if you have some basic knowledge of text-based Transformers, you should know Transformer models process inputs as tokens. Can we afford to treat each pixel as a token when we use a Transformer on imagery? Recall that the computational complexity of calculating the attention matrix is N squared, where N is the sequence length. If we treat each pixel as a separate token, then assuming a very small image size of 100x100, the attention matrix will be of size 10,000 x 10,000. This is too expensive even for GPUs. So a reasonable alternative is to use a patch of some size, say 16x16, as one token. An RGB image of size W x H x 3 (as in RGB) is divided up into patches, each of size w x h x 3. Each patch is flattened and passed through a dense (fully connected, or feed-forward) layer without activation, and this dense embedding layer transforms the patch into a learned hidden representation of dimension D. This graph tries to demonstrate what I just went through. For example, this is a big image of size H by W. We break it down into different patches, and each patch has size w by h. We first flatten each patch into an h x w x 3 vector, use this as input to a dense layer without activation, and the output is the learned embedding that we want for subsequent steps. As described, the initial and crucial step in a ViT is to transform a 2D image into a sequence of 1D tokens, conceptually similar to words in a sentence. This is achieved by splitting the input image into a grid of fixed-size, typically non-overlapping square patches. For instance, a 224x224 pixel image might be divided into 16x16 pixel patches. Each of these 2D image patches is then flattened into a 1D vector of pixel values, effectively becoming a visual word or token in the sequence that a Transformer can process. Now, the next key point is patch embedding: a linear projection of the patches. After the patches are flattened, each patch vector undergoes a linear projection into a higher-dimensional embedding space. This linear transformation converts the raw pixel data of each patch into a dense vector representation, making it suitable for the Transformer encoder's operations. To facilitate image classification, a special learnable classification token (CLS) is typically prepended to this sequence of patch embeddings. This token is designed to aggregate global information from all image patches as it passes through the Transformer layers, and its final output state is then used for the image classification decision. This is the math formula for the linear projection of the image: the first term is the CLS token, the second is the linear transformation of the flattened image patches, and the last one is the position encoding. In the original paper's notation, that is z0 = [x_class; x_p^1·E; ...; x_p^N·E] + E_pos. I want to spend more time on the CLS token. Don't be misled by the name "classification token"; the token is not only useful in classification tasks.
The CLS token is a special learnable embedding that is prepended to the sequence of image patch embeddings before they are fed into the Transformer encoder. Conceptually, it serves as a representation of the entire input sequence or sentence. In a standard ViT for image classification, the CLS token's primary function is to aggregate global information from all the individual image patches. The CLS token learns to weigh the importance of different visual regions across the entire image, and its final output state is then typically passed to an MLP, a multilayer perceptron. Essentially, the CLS token becomes a condensed, holistic representation of the entire image, suitable for a single classification decision. If we move beyond classification tasks into multimodal LLMs, the CLS token acts as a powerful aggregator and an importance indicator for visual information within ViT, enabling efficient visual understanding in complex multimodal systems. It acts as a visual importance indicator: the CLS token can learn which visual tokens, or image patches, are most important for the model's overall understanding and for generating relevant textual responses. Its attention scores over the visual tokens can serve as a direct indicator of their importance. It can also be used for visual token compression: methods like visual token compression utilize the attention scores of the CLS token to prune redundant visual information. In high-resolution image processing, the attention patterns of the CLS token are also very useful: by aggregating CLS attention across multiple heads from the final layer as a feature importance score, the model can intelligently select the most informative visual tokens within an allocated budget. Lastly, it can also be used for semantic awareness: attention from the CLS token has been shown to correlate with visual content, helping to identify main objects and distinguish them from irrelevant backgrounds within an image. The next key part is position encoding. We have been spending a lot of time on this part, but I think it's worth it. Position encoding is used to retain spatial information. If you remember, a critical aspect of the Transformer architecture is its inherent permutation invariance; that means it doesn't care about position. It processes sequences without an intrinsic understanding of the order or spatial arrangement of its input tokens. In text-based Transformer LLMs, we use sinusoidal positional encoding or RoPE to encode positional information. Similarly, in ViT, to reintroduce this vital spatial context for images, positional encodings are added to the patch embeddings. These encodings provide the model with information about the original position of each patch within the image. This is the position encoding. The next part is the Transformer encoder block. The combined sequence of patch embeddings, positional encodings, and the CLS token is then fed into a standard Transformer encoder, which is composed of multiple identical layers; you can see the "Lx" there. The heart of each Transformer layer is the multi-head self-attention mechanism. It enables each patch to dynamically weigh the importance of all other patches in the image. This global connectivity allows the model to capture complex relationships and long-range dependencies across the entire visual input. This is the key difference between ViTs and CNNs. The multi-head aspect here means that the attention mechanism is performed multiple times in parallel, each head with distinct learning targets.
This parallel processing allows the model to learn different types of relationships and enrich the overall representation. Beyond the multi-head self-attention, each Transformer encoder block typically includes layer normalization and a multilayer perceptron (feed-forward) layer. Residual connections, also known as skip connections, are employed around each sublayer to facilitate gradient flow and stabilize training in deep networks. For image classification, the final output of the Transformer encoder, specifically that of the CLS token, is then passed to a simple classification head, usually another MLP, to predict the image category. Architectural innovations within these blocks are ongoing, such as replacing the standard MLP with KAN (Kolmogorov-Arnold Network) to potentially capture more complex nonlinear dependencies. Furthermore, efficient attention mechanisms like FlashAttention, which we have already gone through in the FlashAttention deep dive, are being integrated to optimize the computational efficiency of the attention process. By the way, let's save KAN for another episode; I'm not going to deep dive into it right now. Lastly, I want to build some intuition using attention maps. An attention map is a visualization that reveals which parts of an image the model focuses on when making a classification decision. It essentially highlights the importance of different image patches, or tokens, in the model's prediction. These attention maps are pulled from the original paper; there are more, so if you're interested, take a look. This is very interesting, and most of the time it makes sense. For example, this is the original image and this is the attention map: in order to classify this as a bird, you pretty much only care about this part, and the background is just irrelevant. Same with the human here and the plane here. Next, I want to do a quick comparison between ViT and CNN. There are several advantages of ViT when compared with CNN. The most important one I want to start with is fewer inductive biases. Inductive bias in machine learning refers to the assumptions a learning algorithm makes in order to generalize from observed training data to unseen data. That might sound fine, but it can be bad when the assumption is wrong. CNNs are inherently designed with strong inductive biases that reflect assumptions about the nature of image data. These include locality, where pixels are strongly correlated with their immediate neighbors, and translational equivariance, where a pattern recognized in one part of an image will still be recognized if it shifts to another. These biases are hard-wired into their convolutional kernels and pooling layers, making them highly efficient at extracting local features and patterns. Much of the time this works well for CNNs; however, it can be bad when the assumption is wrong. In contrast, ViTs assume minimal prior knowledge, or inductive bias, about the spatial structure of images. They treat images as a flat sequence of patches, relying solely on the self-attention mechanism to learn all relationships from scratch. The next advantage of ViT is the capability to dynamically compute filters for every input sequence. This allows the model to adapt its feature extraction to the particular context of the input data, unlike a CNN's static, pre-learned weights. The next one is that ViTs have better global context modeling. ViTs excel at capturing long-range dependencies and global relationships across an entire image, thanks to the multi-head self-attention mechanism.
It allows every patch to interact with and weigh the importance of every other patch, providing a holistic view of the image that CNNs struggle to achieve. Lastly, ViT has enhanced scalability and generalization when compared with CNNs. When pretrained on sufficiently large datasets, ViTs demonstrate remarkable scalability, flexibility, and better generalization capabilities; they usually outperform state-of-the-art CNNs on challenging benchmarks if trained on enough data. As a result, ViT is usually more robust for real-world applications. However, it also comes with a bunch of limitations. The first one is significant data hunger. Like other Transformer-based models, ViTs depend on very large datasets for pretraining to achieve competitive performance. Without that data, ViTs may underperform compared to CNNs, which can learn effectively from smaller datasets due to their stronger inductive biases. The next one is related: high computational and memory demands. The quadratic complexity of the self-attention mechanism with respect to the number of tokens, which directly correlates with image resolution, leads to substantial computational overhead and high memory consumption. This can be eased by, say, FlashAttention. Next, training ViTs can be more challenging and time-consuming than training CNNs. They often require more epochs to converge and are sensitive to optimization strategies. Also, similar to LLMs, it's hard to interpret ViTs. The intricate multi-head attention mechanism, particularly the complex blending of attention weights from each layer, makes ViTs less transparent than CNNs. The last one is about the fixed-size input tokens and embeddings. While ViTs process images as patches, the initial tokenization often assumes fixed-size patches, which can sometimes limit their flexibility. Now we know what a ViT is and how a ViT works, and we have compared it with CNNs; it comes with a bunch of advantages. Now I want to spend more time going through ViT's position in multimodal LLMs. This is a brief overview of multimodal LLMs' key components. I will do a deep dive in future episodes, but this overview will do for now. The key components are, first, modality encoders. Their primary function is to transform raw data from various modalities, say image, audio, or text, into numerical feature representations, usually embeddings. The next component is the input projector. It aligns encoded features from different modalities into a common space, typically compatible with the LLM backbone's input, and this is usually done using MLPs, cross-attention, Q-Formers, etc. The third key component is the LLM backbone. This serves as the central reasoning and language processing engine integrating the aligned multimodal information, usually a pretrained LLM, say GPT, Gemini, Llama, etc. The next key component is the output projector. It maps LLM outputs, for example signal tokens for generation, into features suitable for modality-specific generators, typically MLPs or Transformers. The last one is the modality generator. This produces output in non-textual modalities, say images, audio, or video, based on input from the output projector. This can be Stable Diffusion, AudioLDM, or Veo for video. Among all those key components in a multimodal LLM, ViT is pretty important in two of them. The first example is within the image generator. Recall that diffusion, whether it's DDPM, DDIM, or LDM, uses U-Net as its architecture, which contains convolutional layers to extract image features.
For more details, you can take a look at my diffusion deep dive. We already know ViT surpasses CNN in many aspects, so naturally, if we replace U-Net's convolutional layers with MLPs and attention, we get U-ViT. This improvement is used in recent generative diffusion models, and this graph is a comparison of the classical U-Net and U-ViT. The next key component that uses ViT is the image encoder. In order to go through this part, I have to introduce another important concept: CLIP, Contrastive Language-Image Pretraining. It is designed to learn visual concepts from natural language supervision and is an important framework for multimodal LLMs. I will probably also do another deep dive in a separate episode, but I will try to provide enough information in this intro. Unlike traditional image classification models that are trained on fixed categories, CLIP learns an open set of visual concepts by associating images with their natural language descriptions. This allows for remarkable zero-shot capabilities, meaning it can classify images or understand visual concepts it has never explicitly seen during training, simply by being given a textual description. This image from OpenAI is a good summary of what CLIP is doing. The most important part is the contrastive pretraining, which I'm going to go through later. After we have the contrastive pretraining results, we can create a dataset classifier from label text, and then we can use this classifier for zero-shot prediction. So there are several key components in CLIP. The first one is the image encoder. This neural network takes an image as input and transforms it into a numerical representation called an embedding, or feature vector. This embedding captures the salient visual features of the image. This is where ViT shines. The next key component is the text encoder. This neural network takes a piece of text as input and transforms it into a numerical representation that captures its semantic meaning. The last one is the shared embedding space. The crucial innovation of CLIP is that both the image encoder and the text encoder are trained to map their respective inputs into a shared high-dimensional embedding space. In this space, embeddings of semantically similar images and text are close together, while those of dissimilar pairs are far apart. The essence of CLIP's training is contrastive training. CLIP is trained on a massive dataset of image-text pairs, and the training process involves a contrastive learning objective. It learns robust and generalizable representations for both modalities, text and imagery; it effectively learns to tell what goes with what in the visual and linguistic worlds. First, it trains with the positive pairs. For a given batch of image-text pairs, the model considers the actual matching image and text as a positive pair, and it aims to maximize the cosine similarity between their embeddings in the shared space. Then it learns on the negative pairs. All other image-text combinations within the batch are treated as negative pairs, and the model aims to minimize the cosine similarity between their embeddings. Now, why is ViT great for CLIP? First, as we already went through, global context: the self-attention mechanism in ViT allows CLIP to learn global relationships and dependencies within an image. This is very important for understanding complex scenes and associating them with rich descriptive text in multimodal LLMs. The next one is scalability. ViTs benefit from scaling up model size and training data; larger ViT models trained on more data tend to yield better performance.
The last one is the unified architecture. ViT uses a Transformer-based architecture, the same as GPTs, Geminis, or Llamas. This facilitates the alignment of their embeddings in the shared space, as they share similar underlying computational principles. This is an example of ViT acting as an encoder. In a multimodal LLM, the input image goes through a vision encoder, which is ViT in this case, and the result is a set of encoded patch embeddings. In parallel, the text modality also goes through its own components, say a tokenizer and a Q-Former, and the output is a similar kind of embedding. Then we map those embeddings from different modalities into a shared space and feed them into the pretrained LLM. The pretrained LLM should then understand this request from the different modalities and act upon it, whether that's generating an image, a video, audio, or just some text. Alright, this is the last slide of the ViT intro. Hope this helps. If you like my video, please subscribe, comment, and like. I'll see you later. Bye.
Latest Summary (Detailed Summary)
Overview / Core Summary
The speaker introduces the Vision Transformer (ViT), an architecture that applies the Transformer mechanism, which originally revolutionized natural language processing (NLP), to computer vision (CV) tasks. ViT processes an image by splitting it into a sequence of patches, applying a linear embedding to each patch, and using multi-headed self-attention to directly model global relationships between the patches. When ViT was first proposed in 2020, it outperformed the then state-of-the-art convolutional neural network (CNN) models on image classification when trained on sufficiently large datasets, while using four times fewer computational resources.
Concretely, ViT converts a 2D image into a sequence of 1D tokens, analogous to word tokens in text. A special learnable classification token (CLS token) aggregates global information for classification, and positional encoding preserves spatial information. Compared with CNNs, ViT's advantages include fewer inductive biases, dynamically computed filters, superior global context modeling, and better scalability and generalization given massive training data. Its limitations are the need for large amounts of data, high computational demands, and weaker interpretability. ViT plays a key role in multimodal large language models (LLMs), particularly as an image encoder (e.g., in the CLIP framework) and as part of image generation architectures (e.g., U-ViT).
Introduction to the Vision Transformer (ViT)
The speaker begins by noting that the Transformer mechanism, made famous by the paper "Attention Is All You Need" and the starting point of the current large language model (LLM) era, has also been successfully applied to computer vision (CV).
* ViT definition: The Vision Transformer (ViT) splits an image into a sequence of patches, applies linear embeddings, and then uses multi-headed self-attention to directly model global relationships between the image patches.
* Role in multimodal LLMs: ViT is becoming increasingly important in multimodal LLMs, which require a holistic understanding of multimodal data.
Historical Background and Evolution
The speaker briefly reviews the background of the computer vision field:
* CNN dominance: For a long time, as deep learning became popular, computer vision received great attention, producing models such as AlexNet, ResNet, R-CNN, and YOLO (You Only Look Once). The speaker notes that, as of his last check, "YOLO is already at version eleven."
* The turning point: The 2017 paper "Attention Is All You Need" changed the landscape, and the Transformer became the most popular architecture for many tasks.
* Broad adoption of the Transformer: Although Transformer-based large language models are widely known, the architecture is also used in computer vision, and its importance keeps growing with the rising interest in multimodal LLMs.
* The speaker also mentions that diffusion models, usually associated with image generation, can be used for text-based large language models as well, a topic saved for the future.
How the Vision Transformer (ViT) Works
Using an animated GIF provided by the original ViT authors, the speaker outlines the ViT workflow, which consists of the following steps:
- Patching: The input image is split into a number of patches (e.g., 9 patches in the GIF example). These patches are then flattened into vectors. Each patch is treated "like a word in a complete sentence."
- Flattening and Positional Encoding: The flattened patches are combined with position encodings (e.g., numbered 1 to 9 for nine patches).
- CLS (Classification) Token: An "extra learnable class embedding," indexed zero (0), is added. This special token, known as the CLS token, is responsible for capturing global information.
- Transformer Encoder: The flattened, encoded patches are fed into a Transformer encoder whose structure is "basically identical to what was covered in the Transformer deep dive."
- MLP Head and Classification: Finally, the encoder output passes through an MLP (multilayer perceptron) head to produce the image classification. The original ViT paper uses this architecture for an image classification task.
ViT functionality summary:
* ViT represents an input image as a sequence of image patches, similar to the sequence of word embeddings used when applying a Transformer to text.
* ViT directly predicts the class label of the image.
* Performance: When first released in 2020, and trained on sufficient data, ViT showed "excellent performance, outperforming a comparable state-of-the-art CNN with four times fewer computational resources" on popular classification tasks.
* The speaker cites comparison numbers showing that ViT's scores "surpass the CNN" on all the image-related tasks shown.
1. Image Tokenization: Treating the Image as a Sequence of Patches
- Why tokenization: Transformer models process inputs as tokens. Could each pixel simply be treated as a token?
- Computational complexity: "The complexity of computing the attention matrix is N squared, where N is the sequence length. If we treat each pixel as a separate token, then even for a small 100x100 image the attention matrix would be 10,000 x 10,000. That is too expensive even for GPUs (graphics processing units)."
- Solution: Use a patch of a given size (e.g., 16x16 pixels) as one token.
- Processing flow (see the sketch after this list):
- An RGB (red-green-blue) image of size W x H x 3 is split into patches, each of size w x h x 3.
- Each patch is flattened.
- The flattened patch is passed through a fully connected layer (also called a dense or feed-forward layer) without an activation function, i.e., a linear transformation.
- This dense embedding layer converts the patch into a learned hidden representation of a chosen dimension.
- Example: "A 224x224 pixel image might be divided into 16x16 pixel patches." Each such 2D patch is flattened into a 1D vector and becomes a "visual word, or token."
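As a rough illustration of this patching step (not code from the video), here is a minimal PyTorch sketch; the 224x224 image size, 16x16 patch size, and tensor names are assumptions for the example:

```python
import torch

# Illustrative sizes: a 224x224 RGB image split into non-overlapping 16x16 patches.
B, C, H, W = 1, 3, 224, 224          # batch, channels, height, width
P = 16                                # patch size
image = torch.randn(B, C, H, W)

# unfold extracts non-overlapping P x P windows along the H and W dimensions.
patches = image.unfold(2, P, P).unfold(3, P, P)   # (B, C, H/P, W/P, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5)       # (B, H/P, W/P, C, P, P)
patches = patches.reshape(B, (H // P) * (W // P), C * P * P)

print(patches.shape)  # torch.Size([1, 196, 768]): 196 tokens, each 16*16*3 values
```

Each of the 196 rows is one flattened patch, ready to be projected by the dense embedding layer described above.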
2. Patch Embedding: Linear Projection of the Patches
- Linear projection: After flattening, each patch vector undergoes a linear projection into a higher-dimensional embedding space. This linear transformation converts each patch's raw pixel data into a dense vector representation.
- CLS (Classification) Token:
- To facilitate image classification, a "special learnable classification token" (the CLS token) is typically prepended to the sequence of patch embeddings.
- As it passes through the Transformer layers, this token is designed to "aggregate global information from all image patches."
- Its final output state is used for the image classification decision.
- The formula mentioned contains: the CLS token, the linear transformation of the flattened image patches, and the positional encoding (in the original paper's notation, z0 = [x_class; x_p^1·E; ...; x_p^N·E] + E_pos).
- More on the CLS token:
- The speaker stresses that this token is "not only useful in classification tasks."
- It serves as "a representation of the entire input sequence or sentence."
- In standard ViT image classification, its main function is to "aggregate global information from all the individual image patches." The CLS token "learns to weigh the importance of different visual regions across the entire image."
- The CLS token's final output is "typically passed to an MLP (multilayer perceptron)." The CLS token becomes "a condensed, holistic representation of the entire image."
- Beyond classification (in multimodal LLMs):
- Visual importance indicator: "The CLS token can learn which visual tokens, or image patches, are most important for the model's overall understanding." Its attention scores over the visual tokens serve as a direct indicator.
- Visual token compression: Methods such as visual token compression use the CLS token's attention scores to "prune redundant visual information."
- High-resolution image processing: The CLS token's attention patterns also help intelligently select the most informative visual tokens within a given budget.
- Semantic awareness: The CLS token's attention "has been shown to correlate with visual content, helping to identify main objects and distinguish them from irrelevant backgrounds in an image."
3. Positional Encoding: Retaining Spatial Information
- Why spatial information matters: The Transformer architecture is inherently "permutation invariant," meaning it has no built-in understanding of the order or spatial arrangement of its input tokens.
- The solution in ViT: Just as text-based LLMs use sinusoidal positional encoding or Rotary Position Embedding (RoPE), ViT adds positional encodings to the patch embeddings to "reintroduce this vital spatial context" for images.
- This gives the model information about each patch's original position in the image (a minimal sketch of the full embedding step, covering the linear projection, CLS token, and positional embedding, follows below).
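A minimal sketch of the embedding step described in sections 2 and 3, assuming 196 flattened patches of dimension 16*16*3 and an embedding size of 768; the class name, sizes, and the use of a learnable positional embedding are illustrative assumptions:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Flattened patches -> D-dim tokens, with a CLS token and positional embedding."""
    def __init__(self, num_patches=196, patch_dim=16 * 16 * 3, embed_dim=768):
        super().__init__()
        self.proj = nn.Linear(patch_dim, embed_dim)            # dense layer, no activation
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

    def forward(self, patches):                                 # patches: (B, N, patch_dim)
        x = self.proj(patches)                                  # (B, N, D)
        cls = self.cls_token.expand(x.shape[0], -1, -1)         # one CLS token per image
        x = torch.cat([cls, x], dim=1)                          # prepend CLS -> (B, N+1, D)
        return x + self.pos_embed                               # add positional information

tokens = PatchEmbedding()(torch.randn(2, 196, 768))
print(tokens.shape)  # torch.Size([2, 197, 768]): 196 patch tokens plus the CLS token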
4. Transformer Encoder Block: The Core Processing Unit
- Input: The combined sequence of patch embeddings, positional encodings, and the CLS token is fed into a standard Transformer encoder composed of multiple identical layers (the "Lx" notation denotes L layers).
- The heart of each Transformer layer: the multi-head self-attention mechanism.
- It lets every patch "dynamically weigh the importance of all other patches in the image."
- This global connectivity allows the model to "capture complex relationships and long-range dependencies across the entire visual input," which is "the key difference between ViT and CNNs."
- The "multi-head" aspect means the attention mechanism runs several times in parallel, each head with distinct learning targets, allowing the model to learn different types of relationships.
- Other components: each Transformer encoder block typically also includes (see the sketch after this list):
- Layer Normalization.
- A multilayer perceptron (MLP) feed-forward layer.
- Residual connections (also called skip connections) around each sublayer, to facilitate gradient flow and stabilize training of deep networks.
- For image classification: the final output of the Transformer encoder, specifically the output of the CLS token, is passed to a simple classification head (usually another MLP) to predict the image category.
- Architectural innovations:
- Replacing the standard MLP with "Khan" as transcribed (most likely KAN, the Kolmogorov-Arnold Network) to potentially capture more complex nonlinear dependencies; the speaker plans to cover it in another episode.
- Integrating efficient attention mechanisms such as FlashAttention to optimize the computational efficiency of the attention process.
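For intuition, here is a minimal sketch of one such encoder block (pre-norm multi-head self-attention plus an MLP, each wrapped in a residual connection); the width, head count, and MLP ratio are assumed values, not specifics from the video:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim)
        )

    def forward(self, x):                        # x: (B, N+1, D) token sequence
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)         # every token attends to every other token
        x = x + attn_out                         # residual (skip) connection
        x = x + self.mlp(self.norm2(x))          # residual around the feed-forward MLP
        return x

out = EncoderBlock()(torch.randn(2, 197, 768))
print(out.shape)  # torch.Size([2, 197, 768]); a full ViT stacks L such blocks
```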
Visualizing ViT: Attention Maps
- Definition: An attention map is a visualization that "reveals which parts of an image the model focuses on when making a classification decision." It highlights the importance of different image patches or tokens in the model's prediction (a sketch of how such a map can be extracted follows below).
- Examples from the original paper:
- Bird image: the model focuses mainly on the bird itself rather than the background.
- Person and airplane images: similarly intuitive attention patterns.
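As a hedged sketch of how such a map can be obtained from a trained ViT, one common approach is to take the final layer's attention from the CLS token to the patch tokens, average it over heads, and reshape it onto the patch grid; the tensor shapes and the 14x14 grid below are assumptions, and random data stands in for real attention weights:

```python
import torch

# Suppose attn_weights came from the final encoder layer of a trained ViT:
# shape (B, heads, N+1, N+1), with token 0 being the CLS token.
B, heads, N = 1, 12, 196
attn_weights = torch.rand(B, heads, N + 1, N + 1).softmax(dim=-1)  # placeholder values

cls_to_patches = attn_weights[:, :, 0, 1:]         # CLS row, drop the CLS->CLS entry
attention_map = cls_to_patches.mean(dim=1)         # average over heads -> (B, 196)
attention_map = attention_map.reshape(B, 14, 14)   # back onto the 14x14 patch grid
# Upsample to image resolution and overlay on the input to visualize the focus regions.
```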
Vision Transformer (ViT) vs. Convolutional Neural Networks (CNN)
The speaker compares ViT with CNNs, highlighting ViT's advantages and limitations.
Advantages of ViT
- Fewer Inductive Biases:
- CNN: Designed with strong inductive biases about the nature of image data, such as locality (pixels are strongly correlated with their immediate neighbors) and translational equivariance (a pattern recognized in one part of an image is still recognized if it shifts elsewhere). These biases are "hard-wired into the convolutional kernels and pooling layers."
- ViT: "Assumes minimal prior knowledge or inductive bias about the spatial structure of images." ViT treats an image as a flat sequence of patches and "relies solely on the self-attention mechanism to learn all relationships from scratch." This can be an advantage when the CNN assumptions do not hold.
- Ability to dynamically compute filters: ViT can adapt its feature extraction to the particular context of the input data, unlike a CNN's static, pre-learned weights.
- Better global context modeling: Thanks to multi-head self-attention, ViT excels at "capturing long-range dependencies and global relationships across the entire image."
- Enhanced scalability and generalization:
- When pretrained on sufficiently large datasets, ViT shows "remarkable scalability, flexibility, and better generalization capabilities."
- "With enough training data, ViT usually outperforms state-of-the-art CNNs on challenging benchmarks." As a result, ViT is generally more robust for real-world applications.
Limitations of ViT
- Significant data hunger: Like other Transformer-based models, ViT "depends on very large datasets for pretraining" to reach competitive performance. Without such data, ViT may underperform CNNs, which learn effectively from smaller datasets thanks to their stronger inductive biases.
- High computational and memory demands: The self-attention mechanism's complexity is quadratic in the number of tokens (which directly correlates with image resolution), leading to large computational overhead and high memory consumption. This can be mitigated with techniques such as FlashAttention.
- Harder to train: Training ViT can be "more challenging and time-consuming" than training CNNs, typically requiring more epochs and being sensitive to optimization strategies.
- Hard to interpret: The intricate multi-head attention mechanism, particularly the complex blending of attention weights across layers, makes ViT "less transparent" than CNNs.
- Fixed-size input tokens and embeddings: The initial tokenization usually assumes fixed-size patches, which can sometimes limit flexibility.
The Vision Transformer (ViT) in Multimodal Large Language Models (LLMs)
ViT plays an important role in multimodal large language model (LLM) architectures.
Key Components of Multimodal LLMs (Overview)
- Modality Encoders: Transform raw data from different modalities (e.g., image, audio, text) into numerical feature representations, usually embeddings.
- Input Projector: Aligns the encoded features from different modalities into a common space, typically compatible with the LLM backbone's input. MLPs, cross-attention, Q-Formers, and similar modules can be used (see the sketch after this list).
- LLM Backbone: Serves as the central reasoning and language processing engine that integrates the aligned multimodal information. Usually a pretrained LLM such as GPT, Gemini, or Llama.
- Output Projector: Maps LLM outputs (e.g., signal tokens for generation) into features suitable for modality-specific generators. Typically an MLP or a Transformer.
- Modality Generator: Produces outputs in non-textual modalities (images, audio, video) from the output projector's features. Examples include Stable Diffusion, AudioLDM, or a video model (transcribed as "vo for video," likely Veo).
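As an illustration of the input-projector idea only (not the design of any specific model), a simple MLP-style projector mapping ViT patch embeddings into an assumed LLM hidden size might look like this; all sizes and names are assumptions:

```python
import torch
import torch.nn as nn

vit_dim, llm_dim = 768, 4096                       # assumed ViT and LLM hidden sizes

input_projector = nn.Sequential(                   # simple MLP-style projector
    nn.Linear(vit_dim, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)

patch_embeddings = torch.randn(1, 196, vit_dim)    # output of the ViT image encoder
visual_tokens = input_projector(patch_embeddings)  # (1, 196, 4096), ready for the LLM
```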
ViT's Role in Image Generation (e.g., U-ViT)
- Diffusion models (e.g., DDPM, DDIM, LDM) use the U-Net architecture, which contains convolutional layers to extract image features.
- Since ViT surpasses CNNs in many aspects, "if we replace the U-Net's convolutional layers with MLPs and attention, we get U-ViT." This improvement has been adopted in recent generative diffusion models.
ViT's Role in Image Encoding (e.g., the CLIP Framework)
- CLIP (Contrastive Language-Image Pre-training):
- Designed to "learn visual concepts from natural language supervision."
- An important framework for multimodal LLMs.
- Unlike traditional image classification models trained on fixed categories, CLIP learns an "open set of visual concepts" by associating images with their natural language descriptions.
- This gives it "remarkable zero-shot capabilities": it can classify images or understand visual concepts it has never explicitly seen during training, simply from a textual description.
- Key components of CLIP:
- Image Encoder: A neural network that takes an image and converts it into an embedding, or feature vector. "This is where ViT shines."
- Text Encoder: A neural network that takes a piece of text and converts it into a numerical representation capturing its semantics.
- Shared Embedding Space: CLIP's key innovation is that both encoders are trained to map their inputs into the same high-dimensional shared embedding space, where embeddings of semantically similar images and text lie close together.
- Contrastive Training in CLIP (a sketch of the objective follows after this list):
- Trained on a massive dataset of image-text pairs.
- Positive pairs: Within a batch of image-text pairs, each actually matching image and text is treated as a positive pair, and the model maximizes the cosine similarity between their embeddings in the shared space.
- Negative pairs: All other image-text combinations in the batch are treated as negative pairs, and the model minimizes the cosine similarity between their embeddings.
- In this way, CLIP "effectively learns what goes with what in the visual and linguistic worlds."
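A minimal sketch of this symmetric contrastive objective, assuming the image and text embeddings have already been produced by the two encoders; the batch size, embedding size, and temperature are illustrative, not values from the video:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # Normalize so that dot products become cosine similarities.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(img_emb.shape[0])          # matching pairs lie on the diagonal
    loss_i = F.cross_entropy(logits, targets)         # image -> text direction
    loss_t = F.cross_entropy(logits.t(), targets)     # text -> image direction
    return (loss_i + loss_t) / 2                      # symmetric contrastive loss

loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```

Maximizing the diagonal similarities and minimizing the off-diagonal ones corresponds to the positive-pair and negative-pair behavior described above.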
- Why is ViT a good fit for CLIP?
- Global context: ViT's self-attention lets CLIP learn global relationships and dependencies within an image, which is critical for understanding complex scenes.
- Scalability: ViT benefits from scaling up model size and training data; larger ViT models trained on more data tend to perform better.
- Unified architecture: ViT uses the same Transformer-based architecture as text LLMs (GPT, Gemini, Llama), which "facilitates aligning their embeddings in the shared space, since they share similar underlying computational principles."
- Example of ViT as a multimodal LLM encoder:
- Input image -> vision encoder (ViT) -> encoded patch embeddings.
- Text modality -> tokenizer, Q-Former -> similar embeddings.
- Embeddings from the different modalities are mapped into a shared space -> fed into the pretrained LLM.
- The pretrained LLM then understands the request across modalities and acts on it (generating an image, video, audio, or text).
Conclusion
The speaker closes by hoping the Vision Transformer walkthrough was helpful. In summary, ViT brings the Transformer paradigm to vision, offering an approach to image processing with strong global context modeling and good scalability, and it has become a key building block of advanced multimodal AI systems.