speaker 1: Hello everyone. We all know the transformer paper, Attention Is All You Need, revolutionized natural language processing and started the current large language model era. Fewer people know that the same mechanism has also been tried in computer vision and succeeded: Vision Transformers, or ViTs, segment images into sequences, apply linear embeddings, and then unleash the power of multi-headed self-attention to directly model global relationships between image patches. Moreover, ViT is shining even more in multimodal LLMs. Alright, I can't wait to share this intro with you, so let's dive into it.

Let's start with some history. For a very long time, computer vision (CV) was gaining more and more attention as deep learning was starting to get popular. You have probably heard of AlexNet, ResNet, R-CNN, or YOLO. If you have been in the industry long enough, you might share my surprise that YOLO is already at version eleven when I checked yesterday. Then a famous paper changed the landscape in 2017: Attention Is All You Need. Since then, the transformer has become the most popular architecture for many tasks, and this is the classical encoder-decoder transformer architecture. Most people probably know that large language models are mostly based on the transformer. However, it is also used in computer vision, and it is becoming more and more important as multimodal LLMs become what people naturally want. Similarly, you would assume the diffusion model is used only in image generation, but it's not; it can also be used in text-based large language models. Let's save that topic for some other time. Today, let's focus on the Vision Transformer, ViT.

So what is the Vision Transformer? This animated GIF from the original authors is a great overview of what happens in a Vision Transformer. I'm going to pause a bit for you to take a look. Alright, there are several steps in that animated GIF. First, you break an entire image into a number of patches; in this case it uses nine patches as an example. Those patches are flattened into vectors, and each of them is like a word in a complete sentence. Afterwards, we do some flattening on those patches and combine them with the position encodings from one to nine. Along with those nine patches, there is an extra learnable class embedding at position zero. This is a special token that captures a lot of global information; I'm going to go into details later. Then we send those flattened encodings to the transformer encoder. This is basically identical to what we went through in the transformer deep dive; take a look if you're interested. In the end, we go through an MLP head and output the classification of the image. Keep in mind that the original paper is using the Vision Transformer for an image classification task.

To summarize, the Vision Transformer is a vision model similar to the transformer architecture originally designed for text-based tasks. ViT represents an input image as a sequence of image patches, similar to the sequence of word embeddings used when applying a transformer to text, and directly predicts class labels for the image. When used for image classification, ViT demonstrates excellent performance when trained on sufficient data, outperforming a comparable state-of-the-art CNN with four times fewer computational resources on popular classification tasks when it first came out in 2020. This is the comparison: the previous SOTA (state-of-the-art) CNN gets these scores on all these image-related tasks, and this is ViT; it surpasses the CNN on all of them.
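If you want to poke at a pretrained ViT classifier yourself, here is a minimal usage sketch. It assumes a recent torchvision (the multi-weight API), the pretrained weights download on first use, and the random tensor below just stands in for a real RGB image:

```python
import torch
from torchvision.models import vit_b_16, ViT_B_16_Weights

# ViT-B/16: the "Base" model with 16x16 patches, pretrained on ImageNet.
weights = ViT_B_16_Weights.DEFAULT
model = vit_b_16(weights=weights).eval()
preprocess = weights.transforms()              # resize / center-crop / normalize to 224x224

img = torch.rand(3, 300, 400)                  # stand-in for a real image tensor in [0, 1]
with torch.no_grad():
    logits = model(preprocess(img).unsqueeze(0))   # (1, 1000) ImageNet class scores
print(weights.meta["categories"][logits.argmax().item()])  # arbitrary here, since the input is random
```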
Moreover, in the more recent multimodal world, ViT plays a pivotal role by enabling these models to seamlessly integrate and understand visual information alongside textual data. This integration is very, very important for tasks that require a holistic understanding of both modalities.

Now let's do a walkthrough of ViT. The first topic I want to go through is image tokenization: treating images as a sequence of patches. If you have gone through my transformer deep dive, or if you have some basic knowledge of text-based transformers, you should know that transformer models process inputs as tokens. Can we afford treating each pixel as a token when we use transformers on imagery? Recall that the computational complexity of calculating the attention matrix is n squared, where n is the sequence length. If we treat each pixel as a separate token, then even assuming a very small image size of 100 x 100, the attention matrix will be of size 10,000 x 10,000. This is too expensive even for GPUs. So a reasonable alternative is to use a patch of some size, say 16 x 16, as one token. An RGB image of size W x H x 3 (the 3 being the RGB channels) is divided up into patches, each of size w x h x 3. Each patch is flattened and passed through a dense (fully connected, or feed-forward) network layer without activation, and this dense embedding layer transforms the patch into a learned hidden representation of dimension D.

This graph tries to demonstrate what I just went through. This is a big image of size H by W. We break it down into patches, and each patch has size w by h. We first flatten each patch into a w x h x 3 vector, use this as input to a dense layer without activation, and the output is a learned embedding that we use for subsequent steps. As described, the initial and crucial step in a ViT is to transform a 2D image into a sequence of 1D tokens, conceptually similar to words in a sentence. This is achieved by splitting the input image into a grid of fixed-size, typically non-overlapping square patches. For instance, a 224 x 224 pixel image might be divided into 16 x 16 pixel patches. Each of these 2D image patches is then flattened into a 1D vector of pixel values, effectively becoming a visual word or token in the sequence that a transformer can process.

Now, the next key point is patch embedding: a linear projection of the patches. After the patches are flattened, each patch vector undergoes a linear projection into a higher-dimensional embedding space. This linear transformation converts the raw pixel data of each patch into a dense vector representation, making it suitable for the transformer encoder's operations. To facilitate image classification, a special learnable classification token, the [CLS] token, is typically prepended to this sequence of patch embeddings. This token is designed to aggregate global information from all image patches as it passes through the transformer layers, and its final output state is then used for the image classification decision. This is the math formula of how we do the linear projection for the image: the first term is the [CLS] token, the second is the linear transformation of the flattened image patches, and the last one is the position encoding. I want to spend more time on the [CLS] token. Don't be misguided by the name classification token; the token is not only useful in classification tasks.
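For reference, the formula being described is equation 1 of the original ViT paper: z0 = [x_class; x_p1 E; x_p2 E; ...; x_pN E] + E_pos, where E is the shared linear projection applied to each flattened patch and E_pos holds the position embeddings. Below is a hedged PyTorch sketch of this whole step; the sizes (224 x 224 image, 16 x 16 patches, 768-dimensional embeddings) follow the ViT-Base convention and are only illustrative:

```python
import torch
import torch.nn as nn

B, C, H, W, P, D = 2, 3, 224, 224, 16, 768
N = (H // P) * (W // P)                                    # 196 patches per image
imgs = torch.randn(B, C, H, W)                             # stand-in image batch

# Split into non-overlapping patches and flatten each one: (B, N, P*P*C).
patches = imgs.unfold(2, P, P).unfold(3, P, P)             # (B, C, 14, 14, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, N, C * P * P)

# Dense layer without activation: the shared linear projection E.
proj = nn.Linear(C * P * P, D)
patch_tokens = proj(patches)                               # (B, 196, 768)

# Prepend the learnable [CLS] token and add learnable position embeddings E_pos.
cls_token = nn.Parameter(torch.zeros(1, 1, D))
pos_embed = nn.Parameter(torch.zeros(1, N + 1, D))
z0 = torch.cat([cls_token.expand(B, -1, -1), patch_tokens], dim=1) + pos_embed
print(z0.shape)                                            # torch.Size([2, 197, 768])
```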
The [CLS] token is a special learnable embedding that is prepended to the sequence of image patch embeddings before they are fed into the transformer encoder. Conceptually, it serves as a representation of the entire input sequence or sentence. In a standard ViT for image classification, the [CLS] token's primary function is to aggregate global information from all the individual image patches. The [CLS] token learns to weigh the importance of different visual regions across the entire image, and its final output state is then typically passed to an MLP, a multilayer perceptron. Essentially, the [CLS] token becomes a condensed, holistic representation of the entire image, suitable for a single classification decision.

If we move beyond classification tasks to multimodal LLMs, the [CLS] token acts as a powerful aggregator and an importance indicator for visual information within ViT, enabling efficient visual understanding in complex multimodal systems. First, it acts as a visual importance indicator: the [CLS] token can learn which visual tokens, or image patches, are most important for the model's overall understanding and for generating relevant textual responses. Its attention scores on the visual tokens can serve as a direct indicator of their importance. It can also be used for visual token compression: methods in this family utilize the attention scores of the [CLS] token to prune redundant visual information. In high-resolution image processing, the attention patterns of the [CLS] token are also very useful: by aggregating [CLS] attention across multiple heads from the final layer as a feature importance score, the model can intelligently select the most informative visual tokens within an allocated budget. Lastly, it can also be used for semantic awareness: attention of the [CLS] token has been shown to correlate with visual content, helping to identify main objects and distinguish them from irrelevant backgrounds within an image.

The next key part is position encoding. We have been spending a lot of time on this part, but I think it's worth it. Position encoding is used to retain spatial information. If you remember, a critical aspect of the transformer architecture is its inherent permutation invariance. That means it doesn't care about position: it processes sequences without an intrinsic understanding of the order or spatial arrangement of its input tokens. In text-based transformer LLMs, we use sinusoidal positional encoding or RoPE to encode positional information. Similarly, in ViT, to reintroduce this vital spatial context for images, positional encodings are added to the patch embeddings. These encodings provide the model with information about the original position of each patch within the image. This is the position encoding term.

The next part is the transformer encoder block. The combined sequence of patch embeddings, positional encodings, and the [CLS] token is then fed into a standard transformer encoder, which is composed of multiple identical layers; you can see the L x notation here. The heart of each transformer layer is the multi-head self-attention mechanism. It enables each patch to dynamically weigh the importance of all other patches in the image. This global connectivity allows the model to capture complex relationships and long-range dependencies across the entire visual input. This is the key difference between ViTs and CNNs. The multi-head aspect here means that the attention mechanism is performed multiple times in parallel, each head with its own learned projections.
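Here is a minimal, hedged PyTorch sketch of one such encoder block, the thing that gets stacked L times; the layer normalization, MLP, and residual connections it uses are described right after this. The sizes (768 dimensions, 12 heads, 4x MLP expansion) follow ViT-Base and are assumptions here:

```python
import torch
import torch.nn as nn

# One pre-norm ViT encoder block: LayerNorm -> multi-head self-attention -> residual,
# then LayerNorm -> MLP -> residual.
class EncoderBlock(nn.Module):
    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):                                    # x: (B, 197, 768)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]    # residual (skip) connection
        x = x + self.mlp(self.norm2(x))                      # residual (skip) connection
        return x

blocks = nn.Sequential(*[EncoderBlock() for _ in range(12)])  # the "L x" stack
print(blocks(torch.randn(2, 197, 768)).shape)                 # torch.Size([2, 197, 768])
```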
This parallel processing allows the model to learn different types of relationships and enrich the overall representation. Beyond the multi-head self-attention, each transformer encoder block typically includes layer normalization and a multilayer perceptron feed-forward layer. Residual connections, also known as skip connections, are employed around each sublayer to facilitate gradient flow and stabilize training in deep networks. For image classification, the final output of the transformer encoder, specifically the [CLS] token's output state, is then passed to a simple classification head, usually another MLP, to predict the image category. Architectural innovations within these blocks are ongoing, such as replacing the standard MLP with a KAN (Kolmogorov-Arnold Network) to potentially capture more complex nonlinear dependencies. Furthermore, efficient attention mechanisms like FlashAttention, which we have already gone through in the FlashAttention deep dive, are being integrated to optimize the computational efficiency of the attention process. By the way, let's save KAN for another episode; I'm not going to deep dive into it right now.

Lastly, I want to build some intuition using attention maps. An attention map is a visualization that reveals which parts of an image the model focuses on when making a classification decision. It essentially highlights the importance of different image patches or tokens in the model's prediction. These attention maps are pulled from the original paper; there are more, so if you're interested, take a look. I'll show a rough code sketch of this idea in a moment. This is very interesting, and most of the time it makes sense. For example, this is the original image and this is the attention map. In order to classify this as a bird, you pretty much only care about this part, and the background is just irrelevant. Same with the human being here and the plane here.

Next, I want to do a quick comparison between ViT and CNN. There are several advantages of ViT when comparing with CNN. The most important one I want to start with is fewer inductive biases. Inductive bias in machine learning refers to the assumptions a learning algorithm makes in order to generalize from observed training data to unseen data. That might sound alright, but it can be bad when the assumptions are wrong. CNNs are inherently designed with strong inductive biases that reflect assumptions about the nature of image data. These include locality, where pixels are strongly correlated with their immediate neighbors, and translational equivariance, where a pattern recognized in one part of an image will be recognized if it shifts to another. These biases are hard-wired into their convolutional kernels and pooling layers, making them highly efficient at extracting local features and patterns. A lot of the time this works for CNNs; however, it can be bad when the assumptions are wrong. In contrast, ViTs assume minimal prior knowledge, or inductive bias, about the spatial structure of images. They treat images as a flat sequence of patches, relying solely on the self-attention mechanism to learn all relationships from scratch. The next advantage of ViT is the capability to dynamically compute filters for every input sequence. This allows the model to adapt its feature extraction to the particular context of the input data, unlike a CNN's static pre-learned weights. The next one is that ViTs have better global context modeling. ViTs excel at capturing long-range dependencies and global relationships across an entire image, thanks to the multi-head self-attention mechanism, which allows every patch to interact with and weigh the importance of every other patch, providing a holistic view of the image that CNNs struggle to achieve.
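As promised, here is a rough, hedged sketch of the attention-map idea before we go through the remaining comparison points. The paper's figures use attention rollout, which combines attention across layers; this simpler version only looks at the last layer's [CLS] attention, and the random tensor stands in for weights you would pull from a trained ViT:

```python
import torch

# Take the last layer's attention weights, look at the [CLS] token's row (how much it
# attends to each patch), average over heads, and reshape to the 14 x 14 patch grid.
B, heads, tokens = 1, 12, 197                 # 1 [CLS] token + 196 patches (14 x 14)
attn = torch.rand(B, heads, tokens, tokens)
attn = attn / attn.sum(dim=-1, keepdim=True)  # normalize rows, like a softmax output

cls_to_patches = attn[:, :, 0, 1:]            # [CLS] row, excluding attention to itself
attn_map = cls_to_patches.mean(dim=1)         # average over heads -> (B, 196)
attn_map = attn_map.reshape(B, 14, 14)        # upsample this grid to overlay on the image
print(attn_map.shape)
```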
Lastly, ViT has enhanced scalability and generalization when compared with CNNs. When pretrained on sufficiently large datasets, ViTs demonstrate remarkable scalability, flexibility, and better generalization capabilities; they usually outperform state-of-the-art CNNs on challenging benchmarks if trained on enough data. As a result, ViT is usually more robust for real-world applications.

However, it also comes with a bunch of limits. The first one is significant data hunger. Like other transformer-based models, ViTs depend on very large datasets for pretraining to achieve competitive performance. Without that data, ViTs may underperform compared to CNNs, which can learn effectively from smaller datasets due to their stronger inductive biases. The next one is related: high computational and memory demands. The quadratic complexity of the self-attention mechanism with respect to the number of tokens, which directly correlates with image resolution, leads to substantial computational overhead and high memory consumption. This can be eased by, say, FlashAttention. Next, training ViTs can be more challenging and time-consuming than training CNNs; they often require more epochs to converge and are sensitive to optimization strategies. Also, similar to LLMs, it's hard to interpret ViTs: the intricate multi-head attention mechanism, particularly the complex blending of attention weights from each layer, makes ViTs less transparent than CNNs. The last one is about the fixed-size input tokens and embeddings: while ViTs process images as patches, the initial tokenization often assumes fixed-size patches, which can sometimes limit their flexibility.

Now we know what a ViT is and how a ViT works, and we have also compared it with CNNs; it comes with a bunch of advantages. Now I want to spend more time going through ViT's position in multimodal LLMs. This is a brief overview of a multimodal LLM's key components. I will do a deep dive in future episodes, but this overview will work for now. The first key component is the modality encoders. Their primary function is to transform raw data from various modalities, say image, audio, or text, into numerical feature representations, usually embeddings. The next component is the input projector. It aligns encoded features from different modalities into a common space, typically compatible with the LLM backbone's input; this is usually done using MLPs, cross-attention, Q-Formers, etc. The third key component is the LLM backbone. This serves as the central reasoning and language processing engine, integrating the aligned multimodal information; these are usually pretrained LLMs, say GPT, Gemini, Llama, etc. The next key component is the output projector. It maps LLM outputs, for example signal tokens for generation, into features suitable for modality-specific generators, typically MLPs or transformers. The last one is the modality generator. This produces output in non-textual modalities, say images, audio, or video, based on input from the output projector. This can be Stable Diffusion, AudioLDM, or Veo for video.
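To make the data flow between these five components concrete, here is a toy, hypothetical sketch. Every module below is a trivial stand-in chosen just to keep the example runnable; none of the names or interfaces come from a real multimodal framework:

```python
import torch
import torch.nn as nn

B, N, D_V, D_LM = 2, 196, 768, 1024

vision_encoder   = nn.Linear(768, D_V)         # stand-in for a ViT producing patch features
input_projector  = nn.Linear(D_V, D_LM)        # align visual features to the LLM space
llm_backbone     = nn.TransformerEncoder(      # stand-in for the pretrained LLM
    nn.TransformerEncoderLayer(D_LM, nhead=8, batch_first=True), num_layers=2)
output_projector = nn.Linear(D_LM, 512)        # map LLM states to generator features
modality_generator = nn.Linear(512, 3 * 64 * 64)  # stand-in for e.g. a diffusion decoder

patch_feats = torch.randn(B, N, 768)           # pretend ViT patch inputs
text_embeds = torch.randn(B, 32, D_LM)         # pretend tokenized and embedded text

vis = input_projector(vision_encoder(patch_feats))          # (B, 196, 1024)
seq = torch.cat([text_embeds, vis], dim=1)                  # both modalities in one sequence
hidden = llm_backbone(seq)                                  # joint reasoning over the sequence
out = modality_generator(output_projector(hidden[:, -1]))   # (B, 3*64*64) "generated" output
print(out.shape)
```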
Among all those key components in multimodal LLMs, ViT is pretty important in two of them. The first example is within the image generator. Recall that diffusion models, whether DDPM, DDIM, or LDM, use a U-Net as their architecture, which contains convolution layers to extract image features. For more details, you can take a look at my diffusion deep dive. We already know ViT surpasses CNN in many aspects, so naturally, if we replace the U-Net's convolution layers with MLP and attention, we get U-ViT. This improvement is used in recent generative diffusion models, and this graph is a comparison of the classical U-Net and U-ViT.

The next key component that uses ViT is the image encoder. In order to go through this part, I have to introduce another important concept: CLIP, Contrastive Language-Image Pretraining. It is designed to learn visual concepts from natural language supervision and is an important framework for multimodal LLMs. I will probably do another deep dive in a separate episode, but I will try to provide enough information in this intro. Unlike traditional image classification models that are trained on fixed categories, CLIP learns an open set of visual concepts by associating images with their natural language descriptions. This allows for remarkable zero-shot capabilities, meaning it can classify images or understand visual concepts it has never explicitly seen during training, simply by being given a textual description. This image from OpenAI is a good summary of what CLIP is doing. The most important part is the contrastive pretraining, which I'm going to go through later. After we have the contrastive pretraining results, we can create a dataset classifier from label text, and then we can use this classifier for zero-shot prediction.

So there are several key components to CLIP. The first one is the image encoder. This neural network takes an image as input and transforms it into a numerical representation called an embedding or feature vector. This embedding captures the salient visual features of the image. This is where ViT shines. The next key component is the text encoder. This neural network takes a piece of text as input and transforms it into a numerical representation that captures its semantic meaning. The last one is the shared embedding space. The crucial innovation of CLIP is that both the image encoder and the text encoder are trained to map their respective inputs into a shared high-dimensional embedding space. In this space, embeddings of semantically similar images and text are close together, while those of dissimilar pairs are far apart.

The essence of CLIP's training is contrastive training. CLIP is trained on a massive dataset of image-text pairs. The training process involves a contrastive learning objective: it learns robust and generalizable representations for both modalities, text and imagery, and it effectively learns to tell what goes with what in the visual and linguistic worlds. First, it trains with the positive pairs: for a given batch of image-text pairs, the model considers each actual matching image and text as a positive pair, and it aims to maximize the cosine similarity between their embeddings in the shared space. Then it learns from the negative pairs: all other image-text combinations within the batch are treated as negative pairs, and the model aims to minimize the cosine similarity between their embeddings. I'll put a small code sketch of this objective right after this part.

Now, why is ViT great for CLIP? First, we have already gone through global context: the self-attention mechanism in ViT allows CLIP to learn global relationships and dependencies within an image. This is very important for understanding complex scenes and associating them with rich descriptive text in multimodal LLMs. The next one is scalability: ViTs benefit from scaling up model size and training data, and larger ViT models trained on more data tend to yield better performance. The last one is a unified architecture: ViT uses a transformer-based architecture, the same as GPTs, Geminis, or Llamas. This facilitates the alignment of their embeddings in the shared space, as they share similar underlying computational principles.
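As mentioned, here is a minimal sketch of the CLIP-style contrastive objective. The random embeddings stand in for the outputs of the image encoder (for example a ViT) and the text encoder, and real CLIP learns the temperature rather than fixing it:

```python
import torch
import torch.nn.functional as F

B, D = 8, 512
image_emb = F.normalize(torch.randn(B, D), dim=-1)    # unit-norm image embeddings
text_emb  = F.normalize(torch.randn(B, D), dim=-1)    # unit-norm text embeddings
temperature = 0.07

# Cosine similarity between every image and every text in the batch.
logits = image_emb @ text_emb.t() / temperature        # (B, B)

# The diagonal entries are the positive (matching) pairs; everything else is a negative pair.
targets = torch.arange(B)
loss = (F.cross_entropy(logits, targets) +             # image -> matching text
        F.cross_entropy(logits.t(), targets)) / 2      # text  -> matching image
print(loss.item())
```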
This is an example of ViT acting as the encoder. In a multimodal LLM, the input image goes through a vision encoder, which is ViT in this case, and the result is a bunch of encoded patch embeddings. In parallel, the text modality also goes through a bunch of structures, say a tokenizer, a Q-Former, and so on, and the output is a similar kind of embedding. We then map those embeddings from different modalities into a shared space and feed them into the pretrained LLM. The pretrained LLM should then understand this request across the different modalities and act upon it, whether that's generating an image, video, audio, or just some text. Alright, this is the last slide of the ViT intro. Hope this helps. If you like my video, please subscribe, comment, and like. I'll see you later. Bye.