2023-09-20 | Stanford CS224N NLP with Deep Learning | Lecture 16 - Multimodal Deep Learning, Douwe Kiela
Multimodal Deep Learning: The Frontier of NLP and Image Fusion
Tags
Media details
- Upload date
- 2025-05-20 23:43
- Source
- https://www.youtube.com/watch?v=5vfIT5LOkR0
- Processing status
- Completed
- Transcription status
- Completed
- Latest LLM Model
- gemini-2.5-pro-exp-03-25
Transcript
speaker 1: So today, I'm delighted to introduce our first invited speaker, Douwe Kiela. As well as being invited, he's also in the Symbolic Systems program, has been an adjunct professor, and has been involved with some students in that role as well. But in his invited role: he's originally from the Netherlands, where he even learned some logic, among other things, back in the old days. In more recent times, he's been a prominent deep learning researcher for a number of years. He worked at Facebook, now Meta, in the FAIR unit, and was involved in various ideas, including retrieval-augmented generation. After that, he spent some time at Hugging Face. He's become interested in multimodal models, which is what he's going to be talking about today. Welcome, it's great to have you. Thank you very much. speaker 2: All right, that works. Yes. Yeah. Thanks everyone for coming. I understand that you get points for being here, so you're not really here for me, but thanks for coming anyway. So I'm going to talk about multimodal deep learning. It's going to have an NLP focus, of course, because that's what this course is about, but also because otherwise I would really be talking for many more hours than I have time for here. So I'll try to keep it focused on the things that I think will be most useful for you to learn. The first thing you should understand is that this whole concept of multimodality is kind of ill-defined, actually. If you go to the dictionary, you'll see that it means having or involving several modes, modalities, or maxima. What "mode" really means here can be a very generic sense, or it can be the very precise sense of the mode of a statistical distribution. So depending on the paper you're reading, in some cases people really mean the statistical sense. In other cases, people mean this vaguer concept of a modality, where it really means the type of information that you're getting. An example of a modality in that case is an image, or a speech signal, or audio in general, or even olfaction, so smell or things like that. In this lecture, we're going to focus mostly on text, because this is an NLP course, and we're going to focus on images as the other modality, to keep it simple. All right. So why does it matter? Why do we care about multimodality? There are a couple of really good reasons in general. The first one is about faithfulness. If you look at how we humans understand the world, how we make sense of what happens in the world, that is very multimodal, right? We perceive the world not just using vision or just audio; we synthesize information across all of these different modalities, and that's how we understand the world and each other. There's also a very practical argument for doing it, which is that the Internet is multimodal. If you go to, I don't know, Facebook or something like that, it rarely happens that a post is just text or just an image. There's usually a combination of multiple modalities. And then the final good reason, which we're just starting to hit now if you're really following where the field is going, is that we're kind of running out of text data for these large language models. So an interesting way to keep scaling on the data side is to make use of all of these other modalities.
So if you can have your language model also watch all of the videos of cats in the world, it's going to understand the concept of cat much better. And that's what we want to have in these models: we want them to understand the world in the same way that humans understand it. So right now, multimodality is really one of the main frontiers of this new foundation model drive that we're all in. There's a thing called the McGurk effect. Let's see if it loads up. What we'll see when this loads is this guy over here, and we'll have the same audio being played. So the audio is exactly the same, and this man is going to say something like "ba", and you're hearing a "ba" there, I think, if you look at my mouth, because that's what I said. But if you then change the video to one where he says "fa", with exactly the same audio, you're going to hear the other version. Unfortunately, I can't really swap in the different audio here, so you have to trust me on it. We might suddenly start hearing a guy saying "fa". All right, so multimodal applications. When we have multiple modalities, we can do all kinds of interesting things, and as I said, most of the use cases we have on the Internet are multimodal. There are some really obvious things we would be interested in if we have information from these different data sources, from different modalities. Obviously, we might want to do retrieval: maybe given a bit of text, we want to find the right image, or given some image, we want to find the right text for it, so we can match them up. We can also do this in a generative setting. Then we have image captioning, which you've probably heard of. We can do text-to-image generation, so that's image synthesis; Stable Diffusion, everybody in the audience here has probably seen that. Then we can do visual question answering, where we have an image and text and we need to generate some new text. We have multimodal classification, where we have an image and text and we need to produce a label, for example whether something is hate speech or not. And in general, we want to be able to have a richer understanding of information, which means that we combine images and text and then use them for downstream applications that require better understanding or better generation. So this field really is super hot right now. There's this nice paper title; I predict that this paper is going to do really well in terms of citations, just because it has such a citable title, though I think a lot of people are not actually going to read it. And I mean, I've been in this field for quite a while now, and people have been saying this for a really long time. I think Chris would agree with that. For decades, people have been saying that multimodal is the next big thing. But now it's really happening, I think. All right, so the outline for what we're going to be talking about: first I'm going to tell you a little bit about early models. Then we're going to do a bit of a deep dive on some of the specifics. Then we're going to go over a particular type of fusion, contrastive models, or late fusion. Then we're going to go through a little bit of the history of multimodal foundation models. Then we're going to talk a little bit about evaluation, a little bit about other modalities, and then I'll make some predictions for the future and hopefully give you some cool research ideas or things to talk or think about. All right.
So obviously, there's a lot of work that happened before deep learning, but if you want to start from the deep learning revolution and what was happening with images and text, then a good starting point is, for example, WSABIE or DeViSE. Richard Socher, who you've probably heard of, also did some really cool early work that pioneered a lot of these ideas. The basic gist is that we have a vision model on the one hand, and a language model on the other. The first lecture of this course, I think, was about word embeddings, so that's just your basic word embedding model. And now we need to figure out how to align them in the same multimodal space. The way you do that is you define some sort of similarity metric, a score function, or a kernel function if you're thinking about this from the support vector machine literature, and then with a max-margin or margin loss you decide how you want to align these two points in your embedding space. Things that are similar, you want to bring closer together; things that are not, you want to push further apart. And if you do that in this multimodal embedding space, that means you can do interesting cross-modal transfer, where you can take the word embedding for something like "auto" or "horse" and then find images that are close to that word in the embedding space. And now you've solved the retrieval problem. So this is a really nice early application, and a lot of the stuff that I'm going to talk about in these early slides, you're going to see come back over and over again. You're going to see it get reinvented with fancier models, but it's basically all the same stuff. So you can do cross-modal transfer, where you have images and text, but you can also combine them together so that you get a multimodal word embedding. And this just gives you a more accurate representation of how humans understand word meaning. Because when we think about the word moon or cat or something, we can go to Wikipedia and read that a cat is a small carnivorous mammal that people like to keep as pets, or we can just go and look at pictures of cats. Now we understand what a cat is, right? And I would argue, actually, that for a lot of people the picture of the cat is much closer to the meaning of the concept of cat. Some early work where people were trying to do this is from Bruni et al., who did multimodal distributional semantics using a very elegant approach called bag of visual words. Who here has heard of bag of visual words? Very few people. Okay. It's surprisingly simple, so I kind of like it; it's nicely elegant. You take a picture of the moon, in this case, I think you can see it in the back there. You use an algorithm like SIFT to find interesting keypoints, so spots where the difference between a pixel and the pixels next to it is big; those are the spots you want to be looking at. And for each of these keypoints, you get feature descriptors: relatively small vectors, something like 32-dimensional, depending on the implementation. What you can do with these feature descriptors is cluster them using k-means, assign each keypoint to a cluster, and then count how often each cluster occurs. So in this picture of the moon, there are three red dots, right? So that's why the count for the red visual word is three.
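To make that pipeline a little more concrete, here is a minimal sketch of the bag-of-visual-words idea just described: cluster local descriptors into a codebook of k "visual words" with k-means, then represent each image as a histogram of word counts. The descriptor-extraction step is assumed (it could come from something like OpenCV's SIFT); the dimensions and cluster count here are illustrative, not the original implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def bag_of_visual_words(descriptor_sets, k=100):
    """descriptor_sets: list of (num_keypoints, descriptor_dim) arrays, one per image.
    Returns one normalized k-dimensional histogram ("visual words" vector) per image."""
    all_desc = np.vstack(descriptor_sets)                 # pool descriptors from every image
    codebook = KMeans(n_clusters=k, n_init=10).fit(all_desc)  # the "visual vocabulary"
    histograms = []
    for desc in descriptor_sets:
        words = codebook.predict(desc)                    # nearest visual word for each keypoint
        counts = np.bincount(words, minlength=k).astype(float)
        histograms.append(counts / counts.sum())          # normalized count vector
    return np.stack(histograms)

# toy usage: three images with random 128-dimensional descriptors
hists = bag_of_visual_words([np.random.rand(50, 128) for _ in range(3)], k=10)
```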
So what that gives you is an idea of the visual words, very similar to the original bag-of-words model that you hopefully heard about, maybe in the first lecture. It's the visual equivalent of the textual thing. And if you do this and then concatenate, or you apply SVD to fuse the information, what you get is a word embedding that is much more representative of human meaning, as reflected in the datasets that people used to care about at the time. So after that, there were a couple of people, me included, who tried to take these ideas and really apply deep learning to them. Some of the very early versions of this used convolutional neural networks, so you can transfer the features from your convnet, take word embeddings, which you've seen in the first lecture, and concatenate them. Now you have a multimodal word vector. Or you can do something slightly fancier: you've seen the skip-gram model, and you can also try to do skip-gram prediction onto image features. So when you see a word like "cat" in some context, like "the cute little cat sat on the mat", then when you see "cat", you also want to predict cat pictures. Super easy ideas, but it turned out that this gives you much richer word representations. So that's kind of cool. But obviously, words are very limited. What we really care about is not words but sentences. So then people started really looking into sentence representations: how can we get compositional understanding into the sentence representations, and how do we align that with images? The loss here is very similar to what we saw with words and pictures, but now we have a sentence encoder. There are some pretty cool early papers from Andrej Karpathy, and Richard Socher also had some work here. The basic idea is just that instead of having these word embeddings, we now have an LSTM in these papers, or some other kind of recurrent neural network, or in the case of this one, a recursive neural network, and then we try to align the features together. And these three or four papers are actually very important. This one by me is less important, but it's still kind of interesting, because we showed here that grounded sentence representations work: if you just use this part here as a sentence encoder for NLP tasks, the ability to predict pictures from it already gives you a really good sentence representation. So just by predicting pictures, you can sort of imagine what things look like, and that gives you a really good meaning representation, which you can then transfer to, I don't know, sentiment classification or something else. And then of course, once we have sentence encoders, we also have decoders. So when the sequence-to-sequence architecture came out, which you've probably also heard about in this course, what you can do, instead of having a text encoder for your source language if you're doing machine translation, is plug in a convnet instead of an LSTM encoder, and now you can generate captions. That's exactly what people did. We used to have all of these fancy diagrams in our papers then, where we explained the LSTM and how that works. Probably people don't learn that anymore these days. I do. Yeah, very good. They might make a comeback. I think, you know, at some point, transformers are going to go away too.
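Since this margin-based alignment objective keeps coming back, from the early word-level models to the sentence encoders just mentioned, here is a minimal sketch of what a max-margin ranking loss over aligned image/text embedding pairs could look like. The normalization, margin value, and use of the rest of the batch as negatives are illustrative assumptions, not any specific paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def ranking_loss(img_emb, txt_emb, margin=0.2):
    """img_emb, txt_emb: (B, D) tensors; row i of each is an aligned image/text pair."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    scores = img @ txt.t()                   # (B, B) cosine similarities
    pos = scores.diag().unsqueeze(1)         # similarity of each aligned pair
    # hinge: every mismatched pair should score at least `margin` below the aligned one
    cost_txt = (margin + scores - pos).clamp(min=0)      # wrong captions for each image
    cost_img = (margin + scores - pos.t()).clamp(min=0)  # wrong images for each caption
    eye = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_txt = cost_txt.masked_fill(eye, 0)  # don't penalize the aligned pair itself
    cost_img = cost_img.masked_fill(eye, 0)
    return (cost_txt + cost_img).mean()
```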
And so one of the things that people figured out in machine translation very early on is that you can do alignment of words between your source language and your target language. And you can do the same thing with images. If you want to align a word in your generated sequence with something in your picture, you can use the same approach, and that approach, of course, is called attention. You've learned a lot about attention in this course. And yeah, that was one of the building blocks of these systems as well, where you can do very interesting things and really see that when it has to generate "stop" for the stop sign, it is actually looking at the stop sign. So there's some really cool alignment going on in these models. The final kind of early model we should talk about a little bit is GANs. Who here has heard of GANs? Okay, that's a lot more than bag of visual words. I guess that makes sense. The basic idea of a GAN is that you have this generator and discriminator, and you want the generator to generate images that the discriminator cannot distinguish from real ones, so it cannot tell fake and real images apart, right? And if you do that, you can condition it on a piece of text, and then you can generate images using a text prompt. The first versions of this, before Stable Diffusion, were doing things like that, and it's all a natural progression toward that kind of model. So those were the early models. Maybe, do people have any burning questions about this, or does this all make sense? All right, so let's do a bit of a deeper dive then, in particular on features and fusion. Those are really the core building blocks for all of this multimodal stuff. But before we go there, very briefly: if all of this multimodal stuff is cool and useful and doesn't look that difficult, why aren't we all doing multimodal things? Why do we focus on specific modalities? I think there are a couple of problems to be aware of. One is that modalities can sometimes dominate; text especially is much more dominant than vision or audio in many use cases. So you can have a model that just picks up on the text signal and basically learns to ignore the image completely, which actually happened, embarrassingly, for visual question answering; we'll get to that. You could do visual question answering without actually looking at the picture. The additional modalities can also add a lot of noise, which makes your machine learning problem more difficult. You don't always have full coverage, right? As I said, if you look at Facebook posts, sometimes you have text, sometimes you have pictures, sometimes you have both, but you don't have a guarantee that you always have both. So how do you deal with that? In many cases, we just really weren't ready; it was too complicated to implement this stuff. And in general, how to design your model to combine all the information is actually quite complicated. So, to maybe drive the point home a little bit: featurizing text, I guess we all know how to do that by now, especially in the age of transformers, and even before that with LSTMs. You just have your batch by your sequence: batch size by sequence length by embedding size, right?
So it's always a 3D tensor, and that's how you encode your textual information when you pump it through your neural net. With images, it's slightly trickier. You can just look at the patches, but then if you do convolutions, you're shifting over the image and aggregating, right? And in many cases, you don't really want to be this uniform; you want something that actually looks at the things in the picture. This is called region features, where you would use an object detector as a first step for processing your image, and then you would have a convnet backbone that encodes the features for that particular sub-image. Like this guy's skateboard or something: it has its own vector representation. And then in terms of dense features, we now also have vision transformers, so we'll very quickly go over that to make sure we're on the same page. There are all these models; YOLO is a really good one if you haven't heard of it yet. We're at YOLOv7 now, I think, or v8, I don't know; there's a new one coming out every other year or so. But the basic idea is that we get these bounding boxes for things in images, or actually segmentations, though the bounding boxes are what people tend to use, and they have labels, right? So this is labeled "backpack" or something. You can do this as a preprocessing step on your image to get a much richer representation of what is really in that image, which you can then pump into your system, as we'll see later. And for how you encode the information that is in these little bounding boxes, or in the image itself in general, we just use a standard convnet. This probably feels super obvious now, but in 2014, when people were starting to discover this, it was really very surprising that you could just use off-the-shelf convnet features to replace the entire computer vision pipeline. People used to do all of this very fancy, sophisticated stuff, and people spent decades trying to refine it, and then it was all thrown away and replaced by a convnet that does all of that stuff for free. And the cool thing you get there is that you can transfer very easily across different tasks. You can have a very generic convnet and then use it for all kinds of very specialized things, like spotting buildings in Paris, for example, or flowers or other stuff. And then of course, we've been in the age of transformers for quite a while already, and this is actually only the first transformer in the slide deck, so we're making good progress. Vision transformers are what we would use these days to encode images, where you have these flattened patches, and then you do the standard BERT architecture, as you would know it from this course, and then you do classification. So this is all a standard transformer, everything standard, except now your input is not words or tokens, it's patches of an image. And then you classify that. All right, so then we have a bunch of features. And now, how do we combine the information? Let's say we have two vectors, u and v. Sounds easy, right, to combine them? It turns out that there are actually very many ways to combine them. I don't think it's really useful to go over all the different ways here, but you can do very simple things, right?
So obviously, an inner product or similarity is what you would use if you want to do cross-modal things, if you want to embed things in the same vector space. But you can do fancier projections on top, or different combinations that are kind of linear, or you can do multiplicative things where you multiply the components element-wise, or you do some sort of gating over the different features. You can do attention, you can do fancier bilinear things, you can do very fancy compact bilinear things. So there's really a wealth of literature on all the different ways you can combine two vectors. This is called multimodal fusion, and most of the literature on multimodality is essentially about this question: what is the best way to do fusion? And that's it. Within that discussion, it's maybe useful to distinguish between different levels of fusion. You can do it very early, where basically you make sure you have the different features and then, in the modern sense of attention, you attend to everything in all the features from the beginning. You can first treat them separately and then combine them, or you can treat them as completely separate and only combine the final scores. The first is what we would call early fusion, my invention for calling the middle part would be middle fusion, and then you have late fusion, where you really just combine the scores or the logits, but you don't have any interaction between the information from the different modalities. You can do really fun stuff with multimodal fusion. This is a paper I really like, called FiLM, where you have this feature map, this F here, and it gets modulated by a multiplicative vector, this gamma, and an additive bias vector, this beta, and you have a different one for every layer of a ResNet, conditioned on some encoding of the thing you're after. In this case: are there more cubes than yellow things? So we have some vector representation for that question, and we use it to modulate the ResNet block, every layer of the convnet. So you can really do fun things where you're modulating one network with the other one and trying to have them learn as much as possible from that. All right, so let's talk about late fusion then. Late fusion is what we would now call contrastive models. The basic idea is that we have this similarity score; we process the modalities completely independently, and then at the very end we do some combination. The most famous instance of that these days is CLIP. Who has heard of CLIP? Okay. So CLIP, from OpenAI, uses exactly the same contrastive loss that we've seen in all these early approaches. It does kind of negative sampling, but in-batch. So you just have a batch, and you have two things that are aligned, right? Like this: the first piece of text and the first image are aligned, so this is the right answer, and I just want to make sure that I rank this thing higher than all the alternatives, and likewise in the other direction. So it's a very, very simple idea; nothing special about this architecture was really invented here. But what made this thing so cool was, first of all, it was transformers, and it was transformers all the way.
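To make a few of the fusion operators just listed concrete, here is a tiny sketch covering concatenation, element-wise products, gating, a FiLM-style modulation (a text vector producing a scale gamma and shift beta for the visual features), and a plain late-fusion similarity score. The layer choices and shapes are illustrative assumptions rather than any particular model's implementation; FiLM in the original paper modulates convolutional feature maps, while here the idea is shown on flat vectors.

```python
import torch
import torch.nn as nn

class FusionExamples(nn.Module):
    """A handful of ways to combine a text vector u and an image vector v, both (B, d)."""
    def __init__(self, d):
        super().__init__()
        self.gate = nn.Linear(2 * d, d)   # for gated fusion
        self.film = nn.Linear(d, 2 * d)   # text -> (gamma, beta) for FiLM-style modulation

    def forward(self, u, v):
        concat = torch.cat([u, v], dim=-1)            # simple fusion: concatenation
        product = u * v                               # multiplicative, element-wise fusion
        g = torch.sigmoid(self.gate(concat))          # gating: let the inputs decide the mix
        gated = g * v + (1 - g) * u
        gamma, beta = self.film(u).chunk(2, dim=-1)   # condition on the text vector
        film = gamma * v + beta                       # FiLM-style scale-and-shift of image features
        score = (u * v).sum(-1)                       # late fusion: just a similarity score
        return concat, product, gated, film, score
```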
So in CLIP, your text encoder would be a transformer, and your image encoder would be a ViT, so also a transformer. And it was trained on lots and lots of web data. Alec Radford is really a genius at creating very high-quality datasets, and he created, I think, 300 million image-text pairs for this dataset, trained a bigger model on it than people used to, and then we got this amazing model out of it. And moving away from single words to the sort of text that you would see on the Internet: the caption for an image on the web is not going to say "dog" or "cat", it's going to say "a photo of a cat doing something or other", right? That means you can do zero-shot label prediction, where you have "a photo of a ..." and then you need to figure out what the right label is for a given image using this kind of prompt. So the thing you probably all know about prompting large language models: you can prompt vision-and-language models in very much the same way and do zero-shot generalization. If you want a really, really good paper, I would recommend that you read this one. It's really one that's going to teach you how to write good papers; it's thorough, and it's worth a very close read, I think, if you're interested in this field. When it came out, on ImageNet itself it didn't really outperform ResNet, so you might think, oh, it's actually not all that special. But what really made it special was that it generalized much better to these other datasets. This ResNet here is pretty terrible at some of these adversarial versions of ImageNet, and CLIP is super robust to that. It's just a way better image encoder in general. Very quickly after CLIP, there was this paper from Google called ALIGN, which was basically exactly the same idea. You know, the field is not really that creative; it's often the same idea, but you just keep throwing more data and more compute at it, and it often works much better. That's what they found here too: 1.8 billion image-text pairs instead of 300 million gives you a better model, surprise. But still very cool. And what is really cool, I think, is that there's this organization called LAION, which started an open source collective to create really high-quality datasets. The initial LAION dataset, how many examples did it have? Anyone know? 400 million. And now there's a much bigger version of LAION that's even multilingual, and it has 5 billion examples. Stable Diffusion was trained on the English subset of this thing, and that's one of the reasons it's so awesome: it's just seen a ton of data, and that really makes your system a lot better. So if you're looking for the ultimate dataset to play around with with your own ideas, if you have enough compute, obviously, then you should really look at this dataset. All right, any questions up until this point? No. All right. So then we'll move on from late fusion to middle fusion and early fusion. This really is the core of what a lot of people in the field are doing right now; if you're interested in getting into this field, or if you're going to go into industry and use this stuff, this is what you should really understand. And again, the ideas sort of stack onto each other.
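To tie the CLIP discussion together, here is a minimal sketch of the symmetric in-batch contrastive objective and of the "a photo of a ..." zero-shot trick described above. The fixed temperature, the `encode_text` function, and the shapes are assumptions for illustration; in the actual model the temperature is a learned parameter and the encoders are full transformers.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Row i of img_emb and txt_emb is an aligned pair; everything else in the batch is a negative."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature                     # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # each image should pick out its own caption, and each caption its own image
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

def zero_shot_classify(image_emb, label_names, encode_text):
    """Zero-shot labeling: embed prompts like 'a photo of a {label}' and pick the closest one.
    `encode_text` stands in for whatever text encoder you have; it's an assumption here."""
    prompts = [f"a photo of a {name}" for name in label_names]
    txt = F.normalize(encode_text(prompts), dim=-1)          # (num_labels, D)
    img = F.normalize(image_emb, dim=-1)                     # (B, D)
    return (img @ txt.t()).argmax(dim=-1)                    # best label index per image
```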
So I've kind of sequenced the slides to give you an idea of how the scientists came up with the next step, and you can really see the architectures get slightly more and more advanced. But basically, a lot of it is just more data and more compute again. So, who knows how BERT works? Everybody should raise their hand in this case. Yeah. BERT is kind of canonical; I think everybody gets how BERT works, so I don't think we need a real refresher. But the reason I have this slide is because I want you to think about: if you have a BERT model and you have a bunch of images, how are you going to turn that BERT model into something multimodal? There are a bunch of obvious things you could do, given the kind of features I told you about and the fusion approaches. So how are you going to do that? Does anybody want to say something? "If you're doing classification, you could take the convnet features and just concatenate them with whatever encoder, maybe an RNN or whatever you're training on the data." Exactly, yeah. So you can take the convnet features and your classification token from BERT, concatenate them, and then classify for cat or something like that, whatever you're interested in. Yeah, so that's one thing. You could also take the convnet features and give them to the BERT model in lots of different ways; we can use the region features. So I think a lot of people working in vision and language processing when BERT came out were thinking exactly about this: okay, do we do middle fusion, late fusion, early fusion? How do we do the fusion? There were a lot of papers all coming out at around the same time where people were doing versions of this. BERT was really the innovation, and then everybody just plugged it into their own thing, because of Hugging Face Transformers and things like that. So the first one is VisualBERT. This was one of the very early ones, where you have an image and people would do object detection on it, so you get a hat and a racket and a shirt and things like that. You can take these region features and plug them into your transformer model, and then you try to recover the features. This really is probably the simplest way to do it. And this is what we call a single-stream architecture, where you concatenate the original input features and put them all through the same transformer. What you can also do, and that's something this model called ViLBERT did, is have two different streams. You essentially have these two parallel transformers, but at every layer you give them cross-attention, or co-attention, as they call it. Basically, you make sure you have an attention map that attends over both, and then you do your full normal transformer layer again. And this you can train just like your regular BERT: you have your masked language modeling here, and here you do some equivalent of that, and then you also have your next sentence prediction, which you probably remember from your BERT lecture, but instead here we're asking: is this image aligned with this piece of text or not? There's also LXMERT. I mean, I could go on forever; there are like 100 papers that came out that all did this at around the same time.
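As a rough illustration of the single-stream idea described above (VisualBERT-style), here is a sketch of how detector region features might be projected into the text embedding space, tagged with segment embeddings, and concatenated into one sequence for a single shared transformer. The dimensions, the linear projection, and the two-way segment embedding are illustrative assumptions, not the exact original implementation.

```python
import torch
import torch.nn as nn

class SingleStreamInputs(nn.Module):
    """Build a single-stream multimodal input: text token embeddings plus projected region features."""
    def __init__(self, d_model=768, region_dim=2048):
        super().__init__()
        self.visual_proj = nn.Linear(region_dim, d_model)  # map detector features into token space
        self.segment = nn.Embedding(2, d_model)            # 0 = text segment, 1 = vision segment

    def forward(self, token_embs, region_feats):
        # token_embs: (B, L_t, d_model) from the text embedding layer
        # region_feats: (B, L_v, region_dim) from an object detector / CNN backbone
        txt_ids = torch.zeros(token_embs.shape[:2], dtype=torch.long, device=token_embs.device)
        vis_ids = torch.ones(region_feats.shape[:2], dtype=torch.long, device=region_feats.device)
        txt = token_embs + self.segment(txt_ids)
        vis = self.visual_proj(region_feats) + self.segment(vis_ids)
        return torch.cat([txt, vis], dim=1)                # one sequence for one shared transformer
```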
So LXMERT had a different cross-modal output encoder, or a bunch of different ways of encoding the positional information. You could say: I just have a bunch of bounding boxes that are featurized, but I don't care about where they are in the image, so it's just a bag of bounding boxes. Or you could say: I found it here, these are the top-left and bottom-right coordinates, and that's what you featurize into your network. You can also do something even dumber, and I can say that because this is my paper, where you just take the image itself, put it through a ResNet, do a little bit of pooling on the final feature maps, and give those feature maps to BERT. You then need to distinguish between your text segment embeddings and your vision segment embeddings, but this actually works surprisingly well. You don't have to do any additional training: you can take BERT out of the box. Initially you freeze it, you learn to project into BERT token space, then you unfreeze your ResNet, and then finally you unfreeze your BERT. And now you have a very good multimodal classifier on the problem you care about. A lot of these other papers are doing what they call multimodal pretraining, where first you have a BERT model and a ResNet, so they're unimodally pretrained, and then you cobble them together and have a multimodal intermediate pretraining step before you fine-tune on the problem you care about. And what we showed here is that you don't really need that in many cases; it's a very strong baseline. You can also go to the pixel level completely. That's what they did in this other paper called Pixel-BERT, which is basically exactly MMBT, the previous supervised one, but here they do do the multimodal pretraining step, and they showed that, I think for VQA, it helps a little bit. So there are many of these BERTs doing visual things; people really tried everything. Here's another one called UNITER, where they added a bunch of different losses. We could really talk about this for a very long time; we're not going to do that. I'm just going to talk you through some of the more interesting ones. This one, I think, is quite interesting, ViLT, because this is really the first instance where we have completely moved away from convnet features. We don't do any preprocessing on the image: no region features, no backbone that featurizes the parts of the image we care about. We just have these patches of the image, in a grid; we flatten those patches and pump them into the transformer straight away. So this really is BERT and ViT together in one model, and it worked very well. That's been the trend. Here's a nice, very long list of all of these different models and what they do. And really, the distinctions are just in: what is the text encoder that you use, so do you use BERT or something fancier or better, like RoBERTa? What is your vision encoder? In many cases you have these region features, so you would do an R-CNN style model, or you could just do a ResNet or a ViT. You have different kinds of fusion, so either single-stream or dual-stream, as we talked about, so VisualBERT or ViLBERT. And different pretraining tasks: masked language modeling, image-text matching, and a bunch of funkier ones you can do.
And then finally, you can do multimodal pretraining on all of these different datasets that have aligned data. So you're probably wondering, okay, what is really the interesting difference between a lot of these? I have another recommended paper that, if you're interested in this space, you should really take a look at. It's a really well done paper where they unmask multimodal pretraining. Basically, they say: if you take all of these little model inventions and you train these different models on exactly the same data in exactly the same way, it turns out they're all basically the same. So that's a lot of wasted effort on the part of the field, because everybody is saying, wow, my model is better, but it's actually just because you trained it on different data, and there's no real model innovation going on in a lot of these things. I don't mean to sound discouraging or anything like that, but I think that's why this paper is really nice and really important: it shows us what really matters. So this is also work that I did myself, called FLAVA, with my team, where we wanted to take these ideas to the limit. A lot of the things that you've seen now, the VisualBERTs and the ViLBERTs and things like that, they're all about multimodal questions. So how can we do visual question answering or something like that, where we just have these two modalities? They only care about problems that always involve both modalities. Where we want to go, and this is kind of the basic premise, I think, of foundation models in general, is that we have one model to rule them all. This one model can consume data from all of these different modalities, synthesize across all of them, and then do useful things with that information. With FLAVA, that's exactly what we tried to build. We wanted one foundation model that is good at vision and language, at computer vision and at natural language processing, jointly pretrained on all of these different data sources. So it's trained on CC-News, Common Crawl, and BookCorpus, so it's very good at the sort of things you would expect BERT to be good at. It's trained on ImageNet for image data, so it's good at the things you would expect a basic image model to be good at. And then you have this PMD dataset that we created out of publicly available image-text pairs that we also trained on. This PMD dataset is really just all the datasets that were ever created that have publicly available image-text pairs. Unfortunately, the CLIP data and the Google ALIGN data and those datasets haven't been open sourced. This was before LAION, so now there's a good alternative. But this PMD dataset, if you combine all of these image-text pairs, you get 70 million of them, so that's still a pretty decent size. And then you can use all of this data to solve all of these problems that we know we care about in these different fields. So you can do multimodal reasoning, you can do language understanding, you can do visual recognition, all with exactly the same model. And that's a very powerful idea. If you work at a company like Facebook, you don't want to have different models for all kinds of different things. You want one model that you can use for everything; that's going to make your life a lot easier.
So the exact architecture here is that on the one hand we have this image encoder, where we take the image, encode it as patches, and do what we call masked image modeling, which is basically masked language modeling but on the image tokens. On the other side we have masked language modeling on the language, your regular BERT thing. And then we have a multimodal part where all of this information gets combined. There we have a masked multimodal modeling loss term, and you can also do image-text matching, which is like your BERT next-sentence-prediction thing. And then we also have the global contrastive loss, which is exactly like CLIP. So if you do all of this, it's transformers all the way down, and it's a very elegant way, I think, to combine a lot of this information. And when you do that, you get something that can really do a lot of things very well. We're not going to talk through all of that, it's just way too many numbers, but trust me, we were pretty thorough generating this table. Over 35 different tasks, if you compare FLAVA to all kinds of different ablations in terms of CLIP models, this is just a much better way to get at this information. So I think this is a nice example of where we're probably going to go with the field in the near future. The other trend that we see very obviously in the field right now is that everybody cares about generative models, right? Language models and image generative models: there's a trend where we want to be generative, we want to move away from this contrastive, discriminative stuff toward the more interesting, maybe richer representations that you get out of generating sequences or images. This SimVLM paper was one of the first ones where they really had this separate decoder that was trying to generate, or kind of complete, captions, which they showed gives you a lot richer representations. I think the current state of the art is now called CoCa. A lot of these models all look very similar again, but in this case we're starting to really see these text decoders. Initially with CLIP, I think that's also what they were trying to go for, OpenAI being a company that really likes generative models, but they couldn't really get it to work. So it took us a while as a field to figure out how to do this the right way. And right now, we're really in the age of language models, right? One of the interesting things you can do with language models is just keep them frozen and then learn how to project into the language model. The MMBT architecture I talked about, where we had this BERT model, kept it frozen, and learned to project into the BERT token space: you can do exactly the same thing, but with a much fancier model, or something like T5 even, where you have an encoder-decoder or some kind of generative part. You keep that thing frozen, and then you learn to project into the token space of that frozen language model, and then you can do lots of fun stuff, it turns out. What they show in this paper is that you then get few-shot learners. So all of the things you see with GPT-3, where you can give it some in-context examples and it's going to figure out the binding on the fly: it says, this is a dax and this is a blicket, so what is this? And then it gives you the answer that it's a dax.
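Here is a rough sketch of the frozen-language-model recipe just described: keep the pretrained language model's weights fixed and train only a small projection that maps image features into the LM's embedding space as a few "soft tokens". The `image_encoder`, `lm`, the dimensions, and the `inputs_embeds`-style call are placeholders and assumptions, not the actual Frozen or BLIP-2 code.

```python
import torch
import torch.nn as nn

class FrozenLMWithVision(nn.Module):
    """Project image features into a frozen language model's token space; only the projection trains."""
    def __init__(self, image_encoder, lm, img_dim, lm_dim, n_prefix_tokens=4):
        super().__init__()
        self.image_encoder = image_encoder
        self.lm = lm
        for p in self.lm.parameters():
            p.requires_grad_(False)        # frozen: gradients flow through, but weights never update
        self.proj = nn.Linear(img_dim, n_prefix_tokens * lm_dim)  # the only trainable piece here
        self.n = n_prefix_tokens
        self.lm_dim = lm_dim

    def forward(self, images, token_embs):
        feats = self.image_encoder(images)                        # (B, img_dim)
        prefix = self.proj(feats).view(-1, self.n, self.lm_dim)   # the image becomes a few soft tokens
        inputs = torch.cat([prefix, token_embs], dim=1)           # prepend them to the text embeddings
        return self.lm(inputs_embeds=inputs)                      # HF-style call; an assumption here
```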
So it really learns in context how to do the feature mapping, which is really kind of solving the grounding problem that a lot of this multimodal stuff started with. I think that's very cool. And then probably one of the coolest papers or models right now that you might have heard of, if you follow the field, is Flamingo out of DeepMind, where they take a Chinchilla language model, so a compute-optimal language model, and now you have this vision encoder that encodes multiple different images that you can then do reasoning over and kind of autocomplete. What this gets you is just a much more powerful model, because you're generative over lots of different images. And you can really see it stepwise, right? We started off with very simple transformers, and now we're at something that is starting to get pretty complicated, because we have these building blocks like a Perceiver Resampler, where we have a bunch of different images that we featurize, and now we need to compress the information, because sometimes we have three images, sometimes we have five. So we want to make sure we can compress it so that it's always ready for consumption by the next layer of the language model. And this paper, again, is a really good paper to read, because, and this is not me, this is not my code, this comes from the actual paper: they have the diagram together with the code, so that you can really understand what it's doing, which I think is really great. Once you have your Perceiver Resampler step, what you then do is gated cross-attention; this is how you implement it. And this gated cross-attention you do before your frozen language model layer. So you really just have a frozen Chinchilla language model, and you learn to modulate the information that goes into that language model. You propagate the gradients all the way back; you just don't update the language model. So you're really trying to figure out: how am I going to design my signal so that my language model can do the most with it? How am I going to combine the information? And you'll notice that now we do it before the layer, right? In a lot of this other stuff, you would do the attention after the layer, but here you do it before. So Karpathy, I think more than ten years ago, had this image, in which Barack Obama is setting his foot on the scale to make somebody think they're a lot heavier than they really are. This is obviously funny to us, but not to an AI system, I think, unless it really understands the scene. That's why Karpathy at the time said this would be a really good visual Turing test: if a system can figure this out, then it's actually really smart. So obviously, it's been a bit of a challenge for everybody working in the field to get something that actually works on this. And Flamingo, as it turns out, kind of gets the joke. But yeah, it's a bit unclear if it really gets the joke, because if you read this conversation, it's sort of getting steered in the right direction. But at least we're making progress, let's put it that way.
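Here is a small sketch of what a gated cross-attention block of this flavor might look like: the text hidden states attend over the visual tokens coming out of the resampler, and tanh gates initialized at zero mean the frozen language model starts out completely unaffected. The head count, feed-forward shape, and the omission of layer norms are simplifications and assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Gated cross-attention inserted before a frozen language-model layer."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
        self.attn_gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0, so the block is a no-op at init
        self.ff_gate = nn.Parameter(torch.zeros(1))

    def forward(self, x, visual_tokens):
        # x: (B, L, d) text hidden states; visual_tokens: (B, V, d) from the resampler
        attn_out, _ = self.attn(query=x, key=visual_tokens, value=visual_tokens)
        x = x + torch.tanh(self.attn_gate) * attn_out   # gated residual: how much vision to let in
        x = x + torch.tanh(self.ff_gate) * self.ff(x)
        return x   # then feed x into the frozen language-model layer as usual
```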
And then, so in Flamingo you still have a lot of moving parts, but you can take this almost to the full extreme, where you freeze almost everything and you just want to learn this mapping between your image encoder and your language model, or your image encoder and your encoder-decoder architecture. All you really do is a projection between the two. There's this nice model called BLIP-2, where they experiment with OPT for the language model and Flan-T5 for the encoder-decoder architecture. And this just gives you amazing results. It gives you really complex captions and things like that without any real direct supervision on the captions themselves, which is pretty impressive, I think. So that just shows you the power of language models in general. Here I had some examples: it can really do different things, from captioning to reasoning to visual question answering to location detection. You can have a long conversation with this system. This really is kind of the future of where we're going, right? We're going to have a ChatGPT, but it's also going to be able to see the world, in a way. And an interesting thing: you've probably heard of chain-of-thought prompting and things like that, where you ask the language model, "let's think step by step". You can tell a vision-and-language model to generate a rationale for why something might be the case, so you generate a potential explanation for what your answer might be, and then after that you ask it to answer your question. It turns out that if you do that sort of multimodal chain-of-thought prompting, the system gets much better. This is the new state of the art on ScienceQA, or a benchmark like that, just because it learns to unpack the information. So I think we're really, as a field, just starting to figure out what the potential of this is. And I think this is the paper where they also showed that multimodal chain-of-thought prompting gets you pretty amazing results, and they show very nice results on Raven's matrices and very complicated IQ-test sort of things that humans are supposed to be really good at. But you have to be a pretty smart human to really be good at this, and this system just nails it. So we're making super fast progress: we started off from a very simple BERT model that was able to look at some pictures, and now we're getting to these very sophisticated foundation models. So that was my short history of multimodal foundation models. How much time do I have left? speaker 1: All right. speaker 2: Okay, plenty of time. Yeah, please, questions. "I noticed a lot of the images looked like they were just boxes passed through, with no real sense of shape in them." Yeah. So I think the history of computer vision has been very similar to the history of natural language processing, where we thought we needed all of this structure and all of these different things, and it turns out you can just throw it all away and have a big transformer over the patches. Sorry, yes, CS231N. speaker 1: You mentioned "frozen" a couple of times. What does that mean? speaker 2: Yeah, sorry, I should have explained that better, maybe. It just means that we are not updating the weights. So if we go to this one here, I think it's a nice example: we have frozen self-attention. That just means that when we do a forward pass, we go all the way to whatever we want to predict.
We get some gradients, we take them all the way down, but we only update the non-frozen layers. So the gradients do get computed all the way through, but these frozen weights just never change. The reason you want to do that is because otherwise you're going to drift way too far: you're going to destroy all of the cool stuff your language model has learned, because you're just going to focus on the small dataset that you're training it on. So you want to preserve the abilities of the language model, but you want it to become good at the thing you care about. Other questions? speaker 1: With multimodal fusion, is there a benefit to doing really early fusion as opposed to late fusion? Yeah. speaker 2: So, I mean, we're going to talk about evaluation next, but it really depends on the task that you care about. I would say the earlier, the better, if you can afford it. CLIP is very efficient to train; it's very late fusion, right at the very end, so there's no interaction between the different modalities. That's really good if you want to be very efficient; for training, it's much nicer, right? But if you want a richer understanding of the multimodal signal, then you want to do earlier fusion. So yeah, it's always a trade-off. speaker 1: It seems like images are just a lot more data than text. So how much more difficult are these to train, and how much bigger does the image processing have to be compared to the language model? Yeah. speaker 2: So images are more complex in a way, but they're also higher-bandwidth representations, right? There are a lot of pixels that our brains just abstract away; it's really about the scene that you're seeing, and you're not really thinking too much about the pixels themselves. Yann LeCun likes to say that language is just a low-bandwidth proxy for a language of thought, which is much richer and much higher bandwidth, and he thinks that's probably visual; I'm not so sure. But I don't think there's necessarily a difference in the kind of scaling laws that you see in these systems, or at least we still have to figure that out. We'll talk about that towards the end as well. "Do they also have certain social and cultural biases, just like the natural language ones?" Oh yes, they have terrible biases. Some people are actually working on this who are in this very room. These models can be very racist in what they generate or the kind of predictions they make. So if you have an Asian basketball player standing sort of like this, with a basketball very obviously there, then the model will think that he's playing ping pong, because he's Asian. I'm not kidding. So these models are, yeah, just like all neural networks, right? This is really a big problem, and one of the most interesting problems that you should be working on, if you're a student and you want to make a difference, is how do we get these systems to be much better at these sorts of things? "The examples you showed have the model interpreting the content of an image. When we want to understand the content of a video, what extra challenges might there be along this path, and what improvements can we make?" Yeah. So you're asking about the attention maps, sort of. Yeah. So you can use the same idea for videos; you just look at the video.
And these systems are so good now, the object detectors are so good, that you can track objects basically in real time as they go through your video. So you can try to check how that aligns with the attention maps in your model. Videos, I think, are sort of interesting, but they're also not really that interesting, because you can very often just subsample images and solve the problem on the images rather than having to deal with the complexity of video. Maybe one more question, and then we'll go do some evaluation. "These multimodal models, when you only provide them with one modality, only text or only vision, how do they perform in that case? They're obviously more geared toward multimodal cases." Yeah. So that's one of the giant shortcomings of a lot of these models: they're really just built for multimodal stuff. What if I don't have an image, right? That's why we did FLAVA, because we want one model that can do all of that stuff, and that's why in MMBT, the supervised multimodal bitransformer, we actually have an analysis of how robust the model is to missing images or missing text. But I think a lot of the folks working on these early visual BERT models were kind of myopically focused on VQA, which is actually a great segue to what I want to talk about next. So it really depends on the task that you care about, as I said. And if I'm going to tell you about multimodality, I also have to tell you how you're going to check that the multimodal system is actually good at multimodal things. That's the topic of evaluation, which is a super important topic. A lot of people want to be cool and build big models, but I think it should be way cooler to do proper evaluation of these models, especially if you're in academia, because you only have limited GPUs anyway. So what can you do? Sorry, I don't want to rub it in. So how do you check? Well, there's this amazing project: ImageNet really changed the history of deep learning, I think, and this other dataset, COCO, also really changed especially vision and language, but also vision in general. It covers a bunch of the main multimodal tasks: these images are very richly annotated with all kinds of different things, so the segmentation of the objects, the bounding boxes, the labels of the bounding boxes, and they come at different pixel granularities. It's a huge dataset, very fine-grained, annotated in terms of the categories it has, and then you have five captions for each of these images. This really was the first dataset that unlocked a lot of vision and language processing at scale, because you had your picture and you had your caption, and now you need to figure out, okay, how do I give the right caption for this image? So that's image captioning. Or can I retrieve, given some piece of text, the right image, or given an image, the right piece of text? So there are a bunch of very impactful datasets that do this stuff; we already talked about LAION, but COCO really is still the main one, I think, that a lot of people use as the canonical instance of this dataset category. And then the other thing that people really care about in vision and language processing is visual question answering.
And there really are a bunch of academic groups who are, or have been, so focused on this task that they didn't really care about anything else, and that's why you see a lot of models that are really optimized just for multimodal tasks and nothing else. You can see that reflected in the citation counts, as of last night at 3:00 a.m., where the VQA dataset has way more citations than the image captioning datasets even, right? What you do here is you just have an image, and then people ask very simple questions. Annotators ask these simple questions, they give the answers, and now we want to be able to answer these questions with machines. And as I alluded to earlier, one of the kind of embarrassing backstories of this dataset was that in the initial version, the images turned out not to really matter at all. You could just look at the question, which could be something like "how many slices of pizza are there?", and, well, not in that particular case, but across almost all of the dataset, the right answer to a "how much" or "how many" question was "two". So if you just predicted "two" for every "how much" or "how many" question, you got something like 70% accuracy on the counting category. So careful dataset or evaluation benchmark design is really a skill, and you really need to think about what you're doing. You can't just set some data aside and evaluate on it; you have to really think about what you're doing. There's GQA, by Chris actually, which I think is just a better-designed version of this dataset, so you might want to use that these days. There are also very targeted datasets that try to measure one particular thing. One of the things we really want to get at with these models is what we would call compositionality: we want to be able to take the parts and reason about the whole and understand the relationships between the different concepts. So CLEVR was a very clever dataset that was designed to measure compositionality, both on the language side and on the vision side. You have to understand the relationships between all of these different objects in the images. That's been a pretty impactful dataset, I think, for really forcing people to think about compositionality. But a lot of these datasets had big problems. One of the problems is that they were too easy; VQA is sort of plateauing out, we could talk about that a little bit too. And they weren't really realistic: you could solve VQA, and maybe that's going to make some people's lives better. You're all trying to process the memes, I can see. Okay, let's get to the memes first then. So obviously, these memes are not actually in the dataset. I could put up some really hateful memes, about Hitler or something, which are in the dataset, but that would be less fun. These are mean meme examples to demonstrate how the dataset was constructed. One of the problems we had, as I said, is that in VQA the V didn't really matter. What we want, if we care about multimodality specifically, is a dataset that you can only get right if you are good at multimodal reasoning, and otherwise you're just going to screw it up. And this is what we came up with: if you have a meme like this one, "love the way you smell today", I mean, that's not very nice if you send it to your friends, right?
But it turns out that if you just swap out the background, now it's a very nice thing to say. And this one, you know, I don't know, you're maybe a bit weird if you like this, but there's nothing wrong with it, right? And it's the same for this one here: "look how many people love you", with the tumbleweed. That's really sad. And if you change just one word, suddenly it's a really nice thing to say, right? So if you want to solve this, if you want to classify this correctly for meanness, then you have to really understand multimodal reasoning. You have to understand the relationship between the image and the text in order to get to the right label, right? And so it was really constructed by design to do that. And how we did it exactly is we used some really highly trained annotators. And then one of the big problems with a lot of these datasets is that nobody really knows who owns the meme. For example, somebody makes this meme; now they technically own the copyright. And when I made this dataset, I was working at Facebook, and they were very afraid of copyright issues. So what we actually had to do is pay people to make new memes, though not from scratch, so we could show them the actual examples, and then they had to try to find images that roughly corresponded to the original source image and recreate the meme, but now with an image that we could buy from Getty. And so we gave a lot of money to Getty so that we could then release the dataset to the public, so that people could actually do research on this and understand whether their multimodal models are good or not. And so we really tried to make it so that we had these benign confounders. Sorry, in the startup world it's co-founders. So the confounder here is obviously that you have your original meme, and then you have a confounder where you swap out one of the modalities, and here you have the other one, right? So we had our annotators do that as well. And this led to a really nice dataset, I think, because it confirmed one of the intuitions that a lot of people in the field had, which is that multimodal pre-training doesn't really work. So multimodal pre-training doesn't really work, and all of this stuff that people have been doing with their fancy visual BERT models actually turned out to maybe not really be that useful anyway. So maybe it got you like one point extra, right, from a VisualBERT or a different visual BERT-like model, less than a point, just by doing that multimodal pre-training. So that means we still have to figure this stuff out, right? This dataset is far from solved, and we still have a long way to go despite all these fancy models and a new paper coming out every week that does something new. We're not there yet. And I think that's encouraging, especially for you, because you can go out and solve it. So what we did with this dataset is we organized a competition. We had 100k in prize money to try to see what people could come up with. And there was a lot of nice work coming out of that, and we really managed to crank the numbers up by quite a lot, but the solutions were slightly disappointing. So I don't know if you've ever used Kaggle, but if you really want to win on Kaggle, you just have to ensemble the hell out of all of the different models that are the current state of the art, and then you're very likely to win.
And so that's what happened here: there wasn't really the fundamental breakthrough we had maybe been hoping for. So that still needs to be built, I think. So this other dataset I just want to briefly talk about. The theme of this section is: if you make a dataset, think about it very carefully, because you can really be very creative with this and really measure the things you're trying to get at. So with this dataset, Winoground, we were trying to figure out, okay, how good is CLIP actually? It looks really amazing, and it's way better than things that were previously there. But does it understand compositional relationships in the same way that humans would understand them? Or is it just fitting the data distribution, where it can be very good at the head of the distribution but terrible at the tail? And you can probably already guess where this is going. So just to give you an illustration of what is in this dataset: you would have some plants surrounding a light bulb, or you would have a light bulb surrounding some plants. Notice that the words here are exactly the same words, but in a different order, and the visual depiction of those words is very, very different. So if your contrastive model is actually good at understanding the visual-semantic, or visual-linguistic, compositionality of these examples, then you can get it right. But again, if it's actually just overfitting on the data distribution it has seen, and it's just biased toward what it sees often, then it doesn't really get it right. And one paper that we used as a source of inspiration for this work is this paper here, "Order Word Matters Pre-training for Little". So we actually found that the order of words doesn't even matter all that much for general pre-training very often, which is also kind of a scary thing, right? This is deep learning for NLP; we think that language is really important, but these models can reason about language even if you shuffle all the words. And that's probably not what we want to have. And that doesn't tell you something about how great we are as researchers; it tells you something about how terrible our evaluation benchmarks are, and that's what we need to fix. So here are some other nice examples from this dataset: there's a mug in some grass, or there's some grass in a mug. These are very different pictures, right? And for us, these are trivial. You know, what's the difference between a truck fire and a fire truck? That's a pretty important distinction to get right, I think. So guess what: state-of-the-art models often perform below random. So, as I said, we still have a lot of work to do, which is good. And when this paper came out, I think the reaction was really nice. So when DALL-E 2 came out, which, you've probably heard of DALL-E 2, right? It's sort of like Stable Diffusion, but before Stable Diffusion, and this was really the first model that showed just how impressive these generative models can be when they're creating images. So this is "there's a mug in some grass". You do have to kind of cheat a little bit, because you have to add "digital art" here; if you don't add that, then it breaks down completely, right? So it's sort of prompt hacking, and I think we're sort of tuning on the test set, but okay, you know, this is pretty good, right?
So it definitely is better than I think a lot of people would have expected even a couple of years ago. But it's not perfect, because people on the internet like to take more pictures of spoons than forks. So if you say "there are fewer spoons than forks" or "there are fewer forks than spoons", it just really likes spoons more. Maybe it's like The Matrix or something, I don't know, but spoons are just nicer. So again, what you can see here is that these models really are just reflections of the data that they're trained on, right? And yeah, models are getting better, but if you've looked at Stable Diffusion, it still can't count fingers and things like that, right? So again, there's still a lot of cool work to be done. Any questions on evaluation? No? Okay. So let's talk about other modalities then, because we've really just been focused on images, and images are great. There are lots of images on the internet, and so that makes it sort of an obvious thing to focus on. It's also, I think, if you look at our brain, vision is a very dominant modality, right? So how we understand the world is very vision-driven. But it doesn't have to be the case, and there are all these other interesting problems that involve different modalities. The most obvious one is just speech or audio, right? So after vision comes hearing, and really we could do another lecture just like this one just on speech and audio; there's lots of interesting stuff to talk about. Obviously we don't have time, but I'll give you another nice example of how amazing Alec Radford is at creating datasets. So there's this Whisper model that came out of OpenAI not too long ago, which was trained on 680,000 hours of multilingual, multitask speech data, so speech with transcriptions. And they trained this very fancy thing on it, which actually is not very fancy at all: it's just the log-mel spectrogram, so, how you represent the audio signal, and then you feed that into a big Transformer. So this is sort of your encoder self-attention here, right, and then you have your decoder where you have your cross-attention, and then you just generate the sequence. So this is a basic encoder-decoder Transformer model, but your input is convolutions, one-dimensional convolutions, over the log-mel spectrogram. And there are lots of papers that do very similar things. There are models like wav2vec that try to turn the wave signal into vectors, or you can discretize it in lots of different ways, so there's a wealth of literature there. I think one of the funny observations, actually, is that you can just reduce audio to vision anyway. That's what you could sort of argue this log-mel spectrogram does. But, not to toot my own horn a bit, in 2017 I did this paper where we showed you can just take a raw audio sample, turn it into a spectrogram, really just a spectrogram, so, what does the spectrum of the audio file look like, feed that to a regular convnet, like an AlexNet even, and that gives you amazing auditory features. So now you can use this to distinguish between violins and guitars and things like that. So, you know, maybe you can just reduce all of this to vision. So one question maybe you could ask is, can we also reduce language to vision, or vision to language? That's sort of what people are thinking about. So we talked about video; there was a question about video. A lot of these ideas also extend pretty directly to video, but now you just have more data.
So Flamingo already had a bunch of different images in it; you can do Flamingo over videos. Probably a lot of the images are pretty useless for what you're trying to do with this video model, right? They're too similar; they don't really add all that much information. So you want to subsample the frames so that you get the most useful information out of your video. And so there are a bunch of approaches that take the keyframes and then just do a standard joint vision-and-language Transformer sort of thing on top of that. So this is, hopefully by now, a very familiar recipe, right? And so MERLOT is a nice architecture that does this. And then they came up with MERLOT Reserve, kind of a silly name, where they also added audio to the model. So this is now a tri-modal model, and we're going toward this foundation model that can consume all of these different modalities all in one go; that's really a clear trend in the field. Another very interesting direction, I think, where the field was very excited about this for a while, but I think it's sort of gone now because it's too difficult to create lots of high-quality data in this setting, is that you can have simulated environments. So this is a paper from DeepMind from 2017 where they had this agent walk around in a maze, and it could follow natural-language instructions. It could also generalize to things like daxes and blickets and different sorts of groundings and assignments that you could do in that environment. So this is a super interesting direction, I think, in the long term, because this is how humans learn language, right? We walk around in the world, we interact with our environment, we have all of these different perceptual observations, we synthesize them in our brain, we manipulate objects, we change our own viewpoint, and that's how we learn everything we know about the world. And so our language is very intricately connected to that world and how we observe it. So I think that might make a comeback at some point in the future. You can also do other stuff, especially with this kind of conditioning on text that we're seeing a lot of, right? So, you know, DALL-E 2 and Stable Diffusion and all of these different things, and the original GANs we talked about at the beginning: you can do the same thing, but now you're generating 3D point clouds, right? So this is a 3D corgi, generated from a corgi prompt. And this prompt can probably become much more complex over time, and you can do sort of AutoCAD design and just say, give me a house, and it's just going to design the whole house for you. So you can just tweak the prompt and things like that; that's all coming, or even already here in many cases. So the final modality I just briefly wanted to talk about is olfactory embeddings. Olfaction means smell, if you didn't know. So it turns out, so, my PhD thesis was about grounding semantics in different perceptual modalities. A lot of my work started in vision, and then, okay, audio is sort of the obvious next one, right? So you can learn the meaning of "violin", and then maybe you can learn what a violin looks like and what it sounds like, and that's going to give you a richer representation. But for a lot of these words, what's actually very primitive to their meaning is what they smell like.
Because in our brains, that's really one of the core areas, and one of the oldest areas, in your brain. So what you can try to do, if you want to complete all of your perceptual modalities, is build olfactory embeddings. It was kind of a joke paper I did, but the funny thing is it actually worked. So there's a catalog, the Sigma-Aldrich fine flavors and fragrances catalog, where you can look up words like "melon" and "pineapple", and it's going to give you all of the chemical compounds that produce that smell or taste. And if you do that, then you can count the occurrences, and then you can do SVD or something like that on it to get it to be a bit more of a real embedding model. So now you get smell embeddings, smell vectors, and then you can compute similarity judgments between these smells. So it turns out apple smells like pear, and chocolate and cocoa and sweet coffee are sort of related. So you get these clusters of different smells just based off of their chemical compounds; this bag-of-chemical-compounds model gives you a very rich representation. And so if you look at all of the words that are concrete enough to have a smell, so if you have a word like "democracy" in there, that doesn't really smell like anything, right? So you ignore "democracy" and you just focus on the things that smell, or that could smell, I guess. And then the really interesting thing to me is that this is much more correlated with human similarity judgments than the linguistic vectors we had at the time. So for a word like "apple", you can just get a word vector like you learned in your first lecture, so you can do skip-gram and things like that, but that thing is not going to be as correlated with human similarity judgments as this bag-of-chemical-compounds model. So that's pretty interesting, right? Even something like smell, where maybe we think, you know, this doesn't really matter, if you really want to understand how humans understand language, then maybe you want to include this in your foundation model too. But I would start with other modalities. All right. Okay. Yeah, sorry. So where to next? I think I've already said most of this actually. So, one foundation model is going to rule them all. I mean, there will be many of these, but a lot of them are going to have very similar traits. I think we're going to be looking at scaling laws and trying to understand really what the relationship is between the different modalities, which one do we want more of, that sort of stuff. We're going to have retrieval augmentation; this is going to be really huge. If you've heard of RAG, or if you haven't, you should look it up. All of these parts of these models can also be multimodal. We need way better evaluation and better measurement; we already talked about that too. And that's all I had. Thank you.
Latest Summary (Detailed Summary)
Overview / Executive Summary
This lecture, given by Douwe Kiela, adjunct professor in Stanford's Symbolic Systems program, is a deep dive into multimodal deep learning with a particular focus on combining natural language processing (NLP) with images. It stresses why multimodality matters: it is faithful to how humans experience the world, it is ubiquitous in internet applications, and it helps with data efficiency and availability (especially as high-quality text data becomes scarce). Kiela reviews the history of multimodal learning, from early models (bag-of-visual-words features, simple fusion of CNNs with word embeddings) to sentence-level alignment and generative models (image captioning, GANs). Core technical topics include feature extraction (e.g., ResNet and ViT features for images; word embeddings and Transformer features for text) and multimodal fusion (early, mid, and late fusion strategies such as concatenation, gating, and attention).
The lecture then highlights contrastive models, especially CLIP and its follow-ups (ALIGN, the LAION datasets), which learn a shared embedding space via contrastive pre-training on large-scale image-text pairs and achieve strong zero-shot capabilities. It moves on to multimodal foundation models such as VisualBERT, ViLBERT, FLAVA (work by Kiela's team that unifies vision-only, language-only, and multimodal tasks), Flamingo (visual reasoning on top of a frozen language model), and BLIP-2 (efficiently connecting an image encoder to a language model). Kiela also explores the potential of other modalities (audio with Whisper, video with MERLOT Reserve, 3D data, and even olfactory embeddings). Evaluation is a major challenge: the lecture covers benchmarks such as COCO and VQA and focuses on the Hateful Memes Challenge and Winoground datasets developed by Kiela's team, which aim to measure genuine multimodal understanding and compositional generalization and which expose the limits of current models' deeper understanding. Finally, Kiela looks ahead to unified foundation models, multimodal scaling laws, retrieval-augmented generation, multimodal generalization, and embodied intelligence.
Introduction and Why Multimodality Matters
Speaker 1 (Introducer):
* Introduced the speaker, Douwe Kiela, originally from the Netherlands, formerly at Facebook (Meta) FAIR and at Hugging Face; a prominent deep learning researcher who has recently focused on multimodal models and was involved in work such as retrieval-augmented generation.
Speaker 2 (Douwe Kiela):
* Defining multimodality: having or involving several modes, modalities, or maxima. In NLP it typically means combining text with one or more other modalities (images, speech, audio, even smell); this lecture focuses mainly on text and images.
* Why multimodality matters:
  * Faithfulness: human experience is inherently multimodal; we understand the world by integrating information across our senses.
  * Practicality: the internet and many applications are inherently multimodal (e.g., social media posts usually mix text and images).
  * Data efficiency and availability:
    * Multimodal data is richer and higher-bandwidth, which may make learning easier.
    * As high-quality text data runs out, other modalities become key to further scaling. Quoting LeCun: language is "an imperfect, incomplete, low-bandwidth serialization protocol for the internal data structures we call thoughts."
* Kiela notes that "multimodality is one of the main frontiers of the new foundation-model revolution."
* Evidence from the multimodal brain: the McGurk effect shows how visual information shapes auditory perception.
Multimodal Applications
The lecture lists several applications that combine text and images:
* Retrieval (image <> text): finding images from text, or text from images.
* Image captioning (image -> text): generating natural-language descriptions of images.
* Generation (text -> image): generating images from text prompts (e.g., Stable Diffusion).
* Visual question answering (VQA) (image + text -> text): answering a question about an image.
* Multimodal classification (image + text -> label): e.g., deciding whether content is hate speech.
* Multimodal chatbots: multi-turn dialogue grounded in an image.
* (From the slides: text-conditioned image-to-image translation, multimodal information retrieval, text-to-3D generation.)
* Kiela, on a paper title he expects to be cited heavily: "I predict that this paper is going to do really well in terms of citations. Just because it has such a citable title, I think a lot of people are not actually going to read it." (a nod to the hype around multimodality)
Early Multimodal Models
- Core idea: align the outputs of a vision model and a language model in a shared multimodal space.
  - Use a similarity measure (a score or kernel function) with a max-margin loss.
  - Enables cross-modal transfer, such as retrieving related images from word embeddings.
- Multimodal word embeddings:
  - Bruni et al. used a "bag of visual words" model: extract image keypoints with algorithms such as SIFT, compute feature descriptors, cluster them with k-means into a visual vocabulary, and count visual-word frequencies.
  - Fuse with textual features (e.g., word embeddings) via concatenation or SVD to obtain richer word representations.
- Early deep learning approaches:
  - Extract image features with a CNN and combine them with word embeddings (e.g., Word2Vec), either by concatenation or by having the skip-gram model predict image features, to build multimodal word vectors.
  - Kiela notes these were "super easy ideas, but it turned out that this gives you much richer word representations."
- Sentence-level representations and alignment:
  - Align the outputs of sentence encoders (RNNs, recursive neural networks) with images.
  - Kiela mentions his own work showing that good sentence representations can be learned just by predicting the associated image, and that these transfer to NLP tasks such as sentiment classification.
- Image captioning:
  - Sequence-to-sequence architecture: a CNN image encoder and an LSTM decoder that generates the caption.
  - Attention mechanisms: align generated words with specific image regions, e.g., the model attends to the stop sign when generating "stop sign".
- Generative adversarial networks (GANs):
  - Adversarial training of a generator and a discriminator can generate images from text prompts, an early precursor of models like Stable Diffusion.
Multimodal Feature Extraction and Fusion: Methods and Challenges
- Feature extraction:
  - Text: typically represented as a 3D tensor of shape [batch_size, sequence_length, embedding_size].
  - Images:
    - Region features: use an object detector (e.g., YOLO) to find objects in the image and extract CNN features for each region (e.g., a skateboard).
    - Dense features:
      - Convolutional neural networks (CNNs): e.g., ResNet, whose pre-trained features transfer directly to many vision tasks, replacing the older hand-engineered computer vision pipelines.
      - Vision Transformers (ViT): split the image into patches, flatten them, and feed them to a Transformer for classification or feature extraction (see the patch-embedding sketch at the end of this section).
- Multimodal fusion:
  - Core question: how to combine information coming from different modalities.
  - Fusion methods:
    - Simple: inner products, concatenation, element-wise multiplication or addition.
    - More complex: gating mechanisms, attention, bilinear models, compact bilinear models.
    - FiLM (Feature-wise Linear Modulation): one example, in which the output of one network (e.g., a text encoding) modulates the feature maps of another network (e.g., each layer of a ResNet) through a multiplicative factor (gamma) and an additive bias (beta); a FiLM sketch follows at the end of this section.
  - Where fusion happens:
    - Early fusion: combine features at an early stage of the model.
    - Mid fusion: let features interact in intermediate layers.
    - Late fusion: process each modality independently and fuse only the scores or logits at the end (as in contrastive models).
- Inherent challenges of multimodal learning:
  - Modality dominance: one modality (especially text) can dominate learning, so the model ignores the others (an early problem in VQA).
  - Added noise: extra modalities can introduce noise and make learning harder.
  - Incomplete coverage: not every example has data for every modality.
  - Implementation complexity and model design: designing effective multimodal models is itself challenging.
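To make the ViT bullet above concrete, here is a minimal patch-embedding sketch (illustrative only, not any particular library's implementation): it cuts an image into non-overlapping patches and projects each patch to an embedding vector, which is the token sequence a Vision Transformer consumes. The class name `PatchEmbed` and the sizes used are assumptions for the example.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and project each to an embedding.

    A Conv2d with stride == kernel_size is an efficient way to express
    "flatten each patch, then apply one shared linear projection".
    """
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                 # x: [batch, 3, H, W]
        x = self.proj(x)                  # [batch, embed_dim, H/ps, W/ps]
        return x.flatten(2).transpose(1, 2)  # [batch, num_patches, embed_dim]

# Example: a 224x224 image becomes a sequence of 14*14 = 196 patch embeddings.
patches = PatchEmbed()(torch.randn(2, 3, 224, 224))
print(patches.shape)  # torch.Size([2, 196, 768])
```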
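And here is a minimal sketch of the FiLM idea from the fusion bullets: a text encoding is mapped to per-channel gamma and beta values that scale and shift the image feature maps. Module and dimension names are illustrative, not the original FiLM code.

```python
import torch
import torch.nn as nn

class FiLMLayer(nn.Module):
    """Feature-wise linear modulation: condition image feature maps on a text vector."""
    def __init__(self, text_dim, num_channels):
        super().__init__()
        # One linear layer predicts both gamma (scale) and beta (shift) per channel.
        self.to_gamma_beta = nn.Linear(text_dim, 2 * num_channels)

    def forward(self, feature_maps, text_encoding):
        # feature_maps: [batch, C, H, W], text_encoding: [batch, text_dim]
        gamma, beta = self.to_gamma_beta(text_encoding).chunk(2, dim=-1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)   # [batch, C, 1, 1]
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return gamma * feature_maps + beta          # modulate each channel

# Example: modulate ResNet-style features (C=256) with a 512-d text encoding.
film = FiLMLayer(text_dim=512, num_channels=256)
out = film(torch.randn(4, 256, 14, 14), torch.randn(4, 512))
print(out.shape)  # torch.Size([4, 256, 14, 14])
```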
Contrastive Models (e.g., CLIP)
- A late-fusion strategy.
- Core idea: learn a shared embedding space in which similar cross-modal pairs (positive pairs, such as an image and its caption) end up close together and dissimilar pairs (negatives) end up far apart.
- CLIP (Contrastive Language-Image Pre-training), by OpenAI:
  - Architecture: a text encoder (Transformer) and an image encoder (ViT or ResNet).
  - Training: contrastive learning on large-scale image-text pairs (as Kiela describes it, within each batch the matching image-text pairs must score higher than the mismatched ones). Alec Radford contributed a high-quality dataset of roughly 300 million image-text pairs.
  - Loss function (the slides mention the InfoNCE loss; see the sketch at the end of this section): $L = - \log \frac{\exp(s(q, k_+)/\tau)}{\sum_{k \in K} \exp(s(q, k)/\tau)}$
  - Zero-shot capability: with prompts of the form "a photo of the [label]", CLIP can classify images without ever having seen training data for those specific classes.
  - Robustness: generalizes better than conventional ResNets across many datasets, especially on adversarial versions of ImageNet.
  - Kiela on the CLIP paper: "This is really one that's going to teach you how to write really good papers. It's thorough, and it's really worth a very close read."
- ALIGN (Google): similar to CLIP but trained on an even larger dataset (1.8 billion image-text pairs).
- LAION (Large-scale Artificial Intelligence Open Network): an open project that built large, high-quality image-text datasets.
  - LAION-400M (400 million samples)
  - LAION-5B (5 billion multilingual samples); Stable Diffusion was trained on its English subset.
- (From the slides: GLIP, Grounded Language-Image Pre-training; Florence, a multimodal foundation model.)
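As referenced in the loss-function bullet above, here is a minimal sketch of the in-batch symmetric contrastive (InfoNCE-style) objective that CLIP-type models optimize. The encoders are replaced by random tensors; in a real setup they would be the image and text encoders described above, and the temperature value here is just a common default.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over an in-batch similarity matrix.

    image_emb, text_emb: [N, d] embeddings for N matching image-caption pairs.
    The i-th image and i-th caption form the positive pair; all other pairings
    in the batch act as negatives.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # [N, N] cosine similarities
    targets = torch.arange(logits.size(0))            # positives on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)       # image -> matching text
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> matching image
    return (loss_i2t + loss_t2i) / 2

# Toy example with stand-in "encoder outputs" (random features).
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_contrastive_loss(img, txt))
```

Zero-shot classification reuses the same similarity: encode one prompt per label (e.g., "a photo of a dog") and pick the label whose text embedding is closest to the image embedding.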
Multimodal Foundation Models
- Trend: moving from late fusion toward earlier, deeper fusion, aiming at "one model to rule them all".
- Early explorations (BERT-based vision-language models):
  - VisualBERT: single-stream architecture; image region features and text tokens are concatenated and fed into a single Transformer.
  - ViLBERT: two-stream architecture; image and text go through separate Transformers that interact through co-attention between layers.
  - LXMERT: uses a different cross-modal encoder and positional encoding scheme.
  - MMBT (Multimodal Bitransformer), work Kiela was involved in: encode the image with a ResNet, then pool and project it into the token space of a pre-trained BERT. Notably, it needs no large-scale multimodal pre-training; fine-tuning proceeds by freezing and gradually unfreezing components.
  - PixelBERT: similar to MMBT, but with multimodal pre-training.
  - ViLT (Vision-and-Language Transformer): feeds image patches directly into the Transformer, without CNN region features, making it end to end.
  - The "unmasking multimodal pretraining" paper: shows that many supposedly novel models perform similarly under identical data and training setups, underscoring the importance of data and training methodology.
- FLAVA (work by Kiela's team):
  - Goal: a single foundation model that handles vision-only, language-only, and vision-language tasks.
  - Training data: text-only data (CC-News, BookCorpus), image-only data (ImageNet), and image-text pairs (PMD, a collection of 70 million publicly available image-text pairs).
  - Architecture: an image encoder (with MLM-style masked image modeling), a text encoder (MLM), and a multimodal module (masked multimodal modeling, image-text matching, and a CLIP-style global contrastive loss).
  - Shows strong performance across more than 35 different tasks.
- Generative multimodal models:
  - SimVLM: uses a separate decoder for image-caption completion.
  - CoCa (Contrastive Captioner): considered by Kiela to be among the current state of the art; includes a text decoder.
- Using frozen large language models (LLMs):
  - Project image features into the token space of a frozen LLM (e.g., T5) whose own parameters are never updated; this enables few-shot learning (a minimal sketch of this projection idea follows at the end of this section).
  - Flamingo (DeepMind):
    - Uses a frozen Chinchilla LLM together with a vision encoder that handles multiple images.
    - Includes a Perceiver Resampler component that compresses features from a variable number of images.
    - Injects visual information ahead of the frozen LLM layers via gated cross-attention.
    - Makes progress on Karpathy's "Obama on the scale" visual Turing test.
  - BLIP-2:
    - Simplifies further: nearly everything is frozen, and only a simple mapping (projection layers) between the image encoder and the LLM (e.g., OPT, Flan-T5) is learned.
    - Can generate complex image descriptions and hold multi-turn conversations.
- Multimodal chain-of-thought prompting:
  - Have the model first generate an explanation or reasoning step (a rationale) and then produce the final answer.
  - Substantially improves performance on complex reasoning tasks such as ScienceQA and Raven's matrices.
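The frozen-LLM bullets above share one core trick: keep the language model fixed and learn only a small mapping from image features into its token-embedding space, so that visual features act as "soft prompt" tokens. The sketch below shows just that mapping with made-up dimensions; it omits the Perceiver Resampler, gated cross-attention, and Q-Former machinery of the actual Flamingo and BLIP-2 systems.

```python
import torch
import torch.nn as nn

class VisualPrefix(nn.Module):
    """Map frozen image features to a frozen LLM's embedding space as prefix tokens."""
    def __init__(self, vision_dim=1024, llm_dim=2048, num_prefix_tokens=32):
        super().__init__()
        # Only this projection is trained; the vision encoder and the LLM stay frozen.
        self.proj = nn.Linear(vision_dim, llm_dim * num_prefix_tokens)
        self.num_prefix_tokens = num_prefix_tokens
        self.llm_dim = llm_dim

    def forward(self, image_features):       # [batch, vision_dim] pooled features
        prefix = self.proj(image_features)    # [batch, llm_dim * P]
        return prefix.view(-1, self.num_prefix_tokens, self.llm_dim)

# The resulting [batch, P, llm_dim] tensor would be concatenated in front of the
# text token embeddings before running the (frozen) language model's forward pass.
prefix = VisualPrefix()(torch.randn(2, 1024))
print(prefix.shape)  # torch.Size([2, 32, 2048])
```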
Exploring Other Modalities
- Audio / speech:
  - Whisper (OpenAI): a large speech recognition model trained on 680,000 hours of multilingual, multitask speech data. Architecture: log-mel spectrograms fed into a Transformer encoder-decoder.
  - Other approaches: wav2vec (turning the waveform into vectors).
  - Kiela's own 2017 work: convert audio into spectrograms and process them with a standard CNN (e.g., AlexNet) to get high-quality auditory features, useful for distinguishing instruments and the like; this hints that modalities can, to some extent, be "reduced" to one another (see the spectrogram sketch at the end of this section).
- Video:
  - Many image-based ideas extend directly to video, but with far more data to handle.
  - Typically subsample the video frames and extract keyframes.
  - MERLOT: a joint vision-and-language Transformer for video.
  - MERLOT Reserve: adds audio on top of MERLOT, making it a tri-modal model.
- Simulated environments and embodied agents:
  - Let agents act and interact in simulated environments (e.g., a maze) following natural-language instructions in order to learn language.
  - Kiela sees this as a very interesting long-term direction, because it is closer to how humans learn language in the real world through perception and interaction.
- 3D data:
  - Text-to-3D generation: e.g., generating a 3D point cloud of a corgi from the prompt "a 3D corgi".
- Olfaction (smell):
  - From Kiela's PhD thesis work on building olfactory embeddings.
  - Method: look up words such as "melon" and "pineapple" in a flavors-and-fragrances catalog (e.g., Sigma-Aldrich) to get the chemical compounds behind each smell or taste, build a "bag of chemical compounds" representation, and then reduce it with SVD or similar to obtain smell vectors.
  - Result: for nouns concrete enough to have a smell, these smell vectors correlate better with human similarity judgments than the purely linguistic vectors of the time, suggesting that even a modality like smell can contribute to understanding word meaning (see the sketch at the end of this section).
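To make the audio bullets concrete, here is a minimal sketch (assuming torchaudio is available) of the "reduce audio to vision" idea: turn a waveform into a log-mel spectrogram, the same kind of representation Whisper's encoder consumes, and feed it to an ordinary CNN as if it were a one-channel image. The tiny CNN stands in for the AlexNet-style feature extractor mentioned in the talk.

```python
import torch
import torch.nn as nn
import torchaudio

# 1) Waveform -> log-mel spectrogram (80 mel bins, as in Whisper's front end).
mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_fft=400,
                                            hop_length=160, n_mels=80)
to_db = torchaudio.transforms.AmplitudeToDB()

waveform = torch.randn(1, 16000 * 3)     # stand-in for 3 seconds of 16 kHz audio
log_mel = to_db(mel(waveform))           # [1, 80, ~301] "image" of the audio

# 2) Treat the spectrogram as a one-channel image and feed it to an ordinary CNN.
audio_cnn = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 10),                   # e.g., 10 instrument classes
)
logits = audio_cnn(log_mel.unsqueeze(0)) # add batch dim -> [1, 1, 80, T]
print(log_mel.shape, logits.shape)
```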
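And the olfactory-embedding recipe in the last bullet is essentially "bag of chemical compounds, then SVD, then cosine similarity", which fits in a few lines. The compound counts below are toy numbers invented for illustration; the real counts came from the Sigma-Aldrich catalog.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Toy "bag of chemical compounds" counts: rows = words, columns = compounds.
words = ["apple", "pear", "chocolate", "coffee"]
counts = np.array([
    [4, 2, 0, 1, 0],   # apple
    [3, 2, 0, 0, 0],   # pear
    [0, 0, 5, 1, 2],   # chocolate
    [0, 0, 2, 1, 4],   # coffee
], dtype=float)

# Reduce the count matrix to dense "smell vectors" (SVD, as described in the talk).
smell_vectors = TruncatedSVD(n_components=2).fit_transform(counts)

# Similarity judgments between smells: apple should land near pear, chocolate near coffee.
sims = cosine_similarity(smell_vectors)
for i, w in enumerate(words):
    print(w, {v: round(float(sims[i, j]), 2) for j, v in enumerate(words) if j != i})
```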
Evaluating Multimodal Models
- Why evaluation matters: Kiela stresses that evaluation is essential; especially with limited academic compute, doing evaluation well can be more valuable than simply building ever-bigger models.
- Common datasets and tasks:
  - COCO (Common Objects in Context): rich object segmentations, bounding boxes, labels, and five captions per image; the benchmark for image captioning, image-text retrieval, and related tasks.
  - VQA (Visual Question Answering):
    - Early versions were biased: a model could ignore the image and guess from statistical regularities in the questions alone (e.g., answering "2" to "how many" questions already gives high accuracy; see the baseline sketch at the end of this section).
    - GQA: an improved VQA-style dataset designed by Chris Manning and colleagues, with more emphasis on compositional reasoning.
  - CLEVR: designed specifically to measure compositional understanding of object relationships and attributes.
- Contributions from Kiela's team on evaluation:
  - Hateful Memes Challenge:
    - Motivation: build a dataset that can only be solved with genuine multimodal reasoning; either modality alone (text-only or image-only) is misleading.
    - Construction: because of copyright concerns, annotators were paid to recreate memes based on real ones using licensable stock images. The dataset includes "benign confounders": samples in which swapping out the image or the text flips the meaning.
    - Findings: the dataset revealed that much-hyped multimodal pre-training helps only marginally. The winning competition entries were mostly heavy ensembles rather than fundamental breakthroughs.
    - Kiela: "This data set is far from solved, and we still have a long way to go despite all these fancy models."
  - Winoground:
    - Motivation: test whether models like CLIP truly understand compositional relationships or are merely fitting the data distribution.
    - Design: paired image-text examples whose captions use the same words in a different order or structure, describing very different visual scenes (e.g., "a plant surrounding a light bulb" vs. "a light bulb surrounding some plants"; "a truck fire" vs. "a fire truck").
    - Findings: state-of-the-art models at the time often performed below random, revealing serious gaps in compositional generalization (see the scoring sketch at the end of this section). DALL-E 2 made progress on generating such scenes but is still shaped by data biases (e.g., a preference for spoons over forks).
- Core reflection: many current evaluation benchmarks are seriously flawed and fail to measure what models can really do; this must be fixed. As Kiela puts it: "That doesn't tell you something about how great we are as researchers. It tells you something about how terrible our evaluation benchmarks are. And that's what we need to fix."
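The VQA language-prior issue noted above is easy to quantify: measure how well a blind baseline that always answers "two" does on "how many" questions. A toy sketch with invented examples:

```python
# Toy illustration of the VQA language-prior baseline discussed above.
questions = [
    {"question": "How many slices of pizza are there?", "answer": "2"},
    {"question": "How many people are in the photo?",   "answer": "2"},
    {"question": "How many dogs are there?",            "answer": "3"},
    {"question": "What color is the bus?",              "answer": "red"},
]

counting = [q for q in questions if q["question"].lower().startswith("how many")]
hits = sum(q["answer"] == "2" for q in counting)
print(f"constant-'2' baseline on counting questions: {hits / len(counting):.0%}")
# If a blind baseline like this is already strong, the benchmark is not really
# measuring multimodal reasoning, which is what motivated datasets like GQA.
```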
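And the Winoground metric referenced above can be sketched generically: given two images and two captions built from the same words, a model gets an example right only if every pairwise comparison prefers the correct match. `score(image, caption)` below is any image-text matching score (e.g., a CLIP similarity); the lookup-table "model" at the end is a toy stand-in.

```python
def winoground_scores(score, image0, image1, caption0, caption1):
    """Text, image, and group scores for one Winoground-style example.

    `score(image, caption)` is any image-text matching score; higher means
    a better match. image0 pairs with caption0, image1 with caption1.
    """
    s = {(i, c): score(img, cap)
         for i, img in enumerate((image0, image1))
         for c, cap in enumerate((caption0, caption1))}

    # Text score: for each image, the correct caption beats the swapped caption.
    text_ok = s[(0, 0)] > s[(0, 1)] and s[(1, 1)] > s[(1, 0)]
    # Image score: for each caption, the correct image beats the swapped image.
    image_ok = s[(0, 0)] > s[(1, 0)] and s[(1, 1)] > s[(0, 1)]
    # Group score: both must hold.
    return text_ok, image_ok, text_ok and image_ok

# Toy check with a fake lookup-table "model" that happens to get this example right.
fake = {("mug", "a mug in some grass"): 0.9, ("mug", "some grass in a mug"): 0.2,
        ("grass", "some grass in a mug"): 0.8, ("grass", "a mug in some grass"): 0.1}
print(winoground_scores(lambda i, c: fake[(i, c)], "mug", "grass",
                        "a mug in some grass", "some grass in a mug"))
```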
Looking Ahead (Where to Next?)
- Unified foundation models: modality-agnostic models that can read and generate many modalities ("One foundation model is going to rule them all.").
- Multimodal scaling laws: we need to understand the relationships between modalities and how performance scales with data across them.
- Retrieval-augmented generation (RAG): every component of a RAG system can itself be multimodal; this will be a major direction.
- Better evaluation and measurement: keep improving evaluation methods and benchmarks.
- (From the slides: multimodal generalization to unseen modalities, multimodal data augmentation, multimodal embodied agents.)
Q&A Highlights
- Image patches vs. shape understanding: current ViT-style models that work directly on patches do very well, much as Transformers in NLP succeed without relying on explicit syntactic structure.
- Frozen models: keeping certain parts of a model fixed during training preserves their pre-trained capabilities and prevents catastrophic forgetting when fine-tuning on small datasets.
- Early vs. late fusion: early fusion usually gives richer multimodal understanding but is computationally expensive; late fusion (as in CLIP) trains efficiently but allows less interaction between modalities. The choice depends on the task and available resources.
- Complexity and bias in image data: images carry more bandwidth, but models also learn and amplify the social and cultural biases in their training data (e.g., racial and gender stereotypes), an important open problem.
- Video processing: subsample video frames and combine object tracking with attention analysis.
- Handling missing modalities: many models are designed only for complete multimodal inputs; work such as FLAVA and MMBT tries to handle unimodal inputs or partially missing modalities.