speaker 1: So today, I'm delighted to introduce our first invited speaker, who is Douwe Kiela. As well as being invited, he's also in the Symbolic Systems Program, has been an adjunct professor, and has been involved with some students in that role as well. But to tell his background: he's originally from the Netherlands, where he even learned some logic, among other things, back in the old days. In more recent times, he's been a prominent deep learning researcher for a number of years. He worked at Facebook, now Meta, in the FAIR unit, and was involved in various ideas, including retrieval-augmented generation. After that, he spent some time at Hugging Face. He's become interested in looking at multimodal models, which is what he's going to be talking about today. So welcome, Douwe. It's great to have you. Thank you very much.

speaker 2: All right, that works. Yes. Yeah. Thanks everyone for coming. I understand that you get points for being here, so you're not really here for me, but thanks for coming anyway. So I'm going to talk about multimodal deep learning. It's going to have an NLP focus, of course; that's fitting for this course, but it's also because otherwise I would really be talking for many more hours than I have time for here. So I'll try to keep it focused on the things that I think will be most useful for you to learn.

And so the first thing you should understand is that this whole concept of multimodality is actually kind of ill-defined. If you go to the dictionary, you'll see that it means having or involving several modes, modalities, or maxima. And what "mode" really means here could be mode in a very generic sense, or it could be the very precise sense of the mode of a statistical distribution. Depending on the paper you're reading, in some cases people really mean the statistical sense; in other cases, people mean this much vaguer concept of a modality, where it really means the type of information that you're getting. So an example of a modality in that case is an image, or a speech signal, or audio in general, or even olfaction, so smell, or things like that. In this lecture, we're going to focus mostly on text, because this is an NLP course, and we're going to focus on images as the other modality, to keep it simple.

All right. So why does it matter? Why do we care about multimodality? There are a couple of really good reasons in general. The first one is about faithfulness. If you look at how we humans understand the world, how we make sense of what happens in the world, that is very multimodal, right? We perceive the world not just using vision or just audio; we synthesize information across all of these different modalities, and that's how we understand the world and each other. There's also a very practical argument for doing it: the Internet is multimodal. If you go to, I don't know, Facebook or something like that, it rarely happens that it's just text or just an image; there's usually a combination of multiple modalities. And then the final good reason, which we're just starting to hit now if you're really following where the field is going: we're kind of running out of text data for these large language models. So an interesting way to keep scaling on the data side is to make use of all of these other modalities.
So if you can have your language model also watch all of the videos of cats in the world, it's going to understand the concept of a cat much better. And that's what we want to have in these models: we want them to understand the world in the same way that humans understand it. So right now, multimodality is really one of the main frontiers of this new foundation model drive that we're all in right now.

There's a thing called the McGurk effect. Let's see if it loads up. What we'll see when this loads is this guy over here, and we'll have the same audio being played. So the audio is exactly the same, and this man is going to say something like "ba", and so you're hearing a "ba" there, I think, if you look at my mouth, because that's what I said. But if you then change the video to one where he says "fa", with exactly the same audio, you're going to hear the other version. Unfortunately, I can't really swap in the different audio here, so you have to trust me on it. We might suddenly start hearing a guy saying "fa". All right.

So, multimodal applications. When we have multiple modalities, we can do all kinds of interesting things, and as I said, most of the use cases we have on the Internet are multimodal. There are some really obvious things we would be interested in if we have information from these different data sources, from different modalities. Obviously, we might want to do retrieval: maybe given a bit of text, we want to find the right image, or maybe given some image, we want to find the right text for it, so we can match them up. Obviously, we can also do this in a generative setting. Then we have image captioning, which you've probably heard of. We can do text-to-image generation, so that's image synthesis, and so Stable Diffusion; everybody in the audience here has probably seen that. Then we can do visual question answering, where we have an image and text and we need to generate some new text. We have multimodal classification, where we have an image and text and we need to produce a label, for example whether something is hate speech or not. And then in general, we want to be able to have a richer understanding of information, which means that we combine images and text and then use it for downstream applications that require better understanding or better generation.

So this field really is super hot right now. There's this nice paper title; I predict that this paper is going to do really well in terms of citations, just because it has such a citable title, even though I think a lot of people are not actually going to read it. And I mean, I've been in this field for quite a while now, and people have been saying this for a really long time. I think Chris would agree with that. For decades, people have been saying that multimodal is the next big thing, but now it's really happening, I think.

All right, so the outline for what we're going to be talking about. First I'm going to tell you a little bit about early models. Then we're going to do a bit of a deep dive on some of the specifics. Then we're going to go over a particular type of fusion: contrastive models, or late fusion. Then we're going to go through a little bit of the history of multimodal foundation models. Then we're going to talk a little bit about evaluation, a little bit about other modalities, and then I'll make some predictions for the future and hopefully maybe give you some cool research ideas or things to talk or think about. All right.
So obviously, there's a lot of work that happened before deep learning, but if you want to start from the deep learning revolution and what was happening in images and text, then a good starting point is, for example, WSABIE or DeViSE, or Richard Socher, who you've probably heard of, has done some really cool early work that really pioneered a lot of these ideas. The basic gist is that we have a vision model on the one hand, and on the other hand we have a language model. The first lecture of this course, I think, was about word embeddings, so that's just your basic word embedding model. And now we need to figure out how to align them in the same multimodal space. The way you do that is you get some sort of similarity metric, right? A score function, or a kernel function if you're thinking about this from a support vector machine literature perspective. Now you need to figure out, with a max-margin or margin loss, how you want to align these two points in your embedding space. Things that are similar, you want to bring closer together; things that are not, you want to push further apart. And if you do that in this multimodal embedding space, that means you can do interesting cross-modal transfer, where you can take the word embedding for something like "auto" or "horse", and then you can find images that are close to that thing in the embedding space. And now you've solved the retrieval problem. So this is a really nice early application, and a lot of the stuff that I'm going to talk about in the early slides, you're going to see this idea come back over and over again. You're going to see it get kind of reinvented with fancier models, but it's basically all the same stuff.

So you can do cross-modal transfer where you have images and text, but you can also combine them together so that you get a multimodal word embedding. And this just gives you a more accurate representation of how humans understand word meaning. Because when we think about the word moon or cat or something, we can go to Wikipedia and read that a cat is a small carnivorous mammal that people like to keep as pets, or we can just go and look at pictures of cats. So now we understand what a cat is, right? And I would actually argue that for a lot of people, the picture of the cat is much closer to the meaning of the concept of cat.

So some early work where people were trying to do this is from Bruni et al., where they did multimodal distributional semantics using this very elegant approach called bag of visual words. Who here has heard of bag of visual words? Very few people. Okay. So it's surprisingly simple, so I kind of like it; it's nicely elegant. You take a picture of the moon, in this case; I think you can see it in the back there, right? We use an algorithm like SIFT to find interesting keypoints. So where the difference between a pixel and the pixels next to it is big, those are the spots you want to be looking at. And for each of these keypoints, you get feature descriptors: relatively small vectors, like 32-dimensional, depending on the implementation. And what you can do with these feature descriptors is cluster them using k-means, and then you assign every one of these points to a cluster, so you can count how often each cluster occurs, right? So in this picture of the moon, the count is, oh yeah, there are three red dots, right? So that's why the count for the red-dot visual word is three.
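To make that bag-of-visual-words recipe concrete, here is a minimal sketch (not from the talk) using scikit-learn's k-means. The random arrays stand in for real SIFT descriptors, which you would normally extract per keypoint with something like OpenCV; the vocabulary size is made up.

```python
# Minimal bag-of-visual-words sketch: cluster local descriptors with k-means,
# then represent each image as a histogram of cluster ("visual word") counts.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-in for SIFT descriptors: in practice you'd extract one small vector
# per keypoint (e.g., 32- or 128-dimensional, depending on the implementation).
descriptors_per_image = [rng.normal(size=(n, 128)) for n in (40, 55, 30)]

# Build the visual vocabulary by clustering all descriptors from all images.
k = 10  # vocabulary size (number of "visual words"), chosen arbitrarily here
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
kmeans.fit(np.vstack(descriptors_per_image))

# Each image becomes a k-dimensional count vector: how often each visual word occurs.
def bovw_histogram(descriptors):
    words = kmeans.predict(descriptors)
    return np.bincount(words, minlength=k)

histograms = np.stack([bovw_histogram(d) for d in descriptors_per_image])
print(histograms.shape)  # (3, 10): one visual-word histogram per image
```

The resulting histograms play the same role for images that word-count vectors play for documents.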
So what that gives you is an idea of the visual words, very similar to the original bag-of-words model that you hopefully heard about in the first lecture. So that's the visual equivalent of the textual thing. And if you do this and you then concatenate, or you apply SVD to fuse the information, what you get is a word embedding that is much more representative of human meaning, as reflected in the datasets that people used to care about at the time.

So after that, there were a couple of people, me included, who tried to take these ideas and really apply deep learning to them. Some of the very early versions of this used convolutional neural networks, so you can transfer the features from your ConvNet, take word embeddings, which you've seen in the first lecture, and then concatenate them. Now you have a multimodal word vector. Or you can do something slightly fancier: you've seen the skip-gram model, and you can also try to do skip-gram predictions onto image features. So when you see a word like cat in some context, like "the cute little cat sat on the mat", then when you see cat, you also want to predict cat pictures. So super easy ideas, but it turned out that this gives you much richer word representations. So that's kind of cool.

But obviously, words are very limited. What we really care about is not words but sentences. So then people started really looking into sentence representations: how can we get compositional understanding into the sentence representations, and how do we align that with images? The loss here is very similar to what we saw with words and pictures, but now we just have a sentence encoder, right? There are some pretty cool early papers from Andrej Karpathy, and Richard Socher also had some work here. The basic idea is just that instead of having these word embeddings, we now have an LSTM in these papers, or some other kind of recurrent neural network, or in the case of this one, a recursive neural network, and then we try to align the features together. And so these three or four papers are actually very important. This one by me is less important, but it's still kind of interesting, because we showed here that grounded sentence representations work: if you just use this part here as a sentence encoder for NLP tasks, the ability to predict pictures from it already gives you a really good sentence representation. So just by predicting pictures, you can sort of imagine what things look like, and that gives you a really good meaning representation, which you can then transfer to, I don't know, sentiment classification or something else.

And then of course, once we have sentence encoders, we also want decoders. So when the sequence-to-sequence architecture came out, which you've probably also heard about in this course, what you can do, instead of having a text encoder for your source language if you're doing machine translation, is plug in a ConvNet instead of an LSTM encoder, and now you can generate captions. So that's exactly what people did. We used to have all of these fancy diagrams in our papers then where we explained the LSTM and how that works. Probably people don't learn that anymore these days. You do? Yeah, very good. They might make a comeback. I think, you know, at some point, transformers are going to go away too.
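The alignment objective mentioned above for matching sentences (or words) with images can be sketched in a few lines of PyTorch. This is an illustrative max-margin ranking loss in the spirit of DeViSE-style cross-modal retrieval, not any specific paper's exact recipe; the encoders themselves are assumed to exist and are replaced here by random embeddings.

```python
# Minimal sketch of a max-margin (hinge) ranking loss for aligning sentence and
# image embeddings in a shared space: pull matching pairs together, push every
# mismatched pair at least `margin` further apart.
import torch
import torch.nn.functional as F

def ranking_loss(sent_emb, img_emb, margin=0.2):
    # Cosine similarities between every sentence and every image in the batch.
    sim = sent_emb @ img_emb.t()                 # (B, B)
    pos = sim.diag().unsqueeze(1)                # matching pairs sit on the diagonal
    cost_s = F.relu(margin + sim - pos)          # penalty per sentence vs. wrong images
    cost_i = F.relu(margin + sim - pos.t())      # penalty per image vs. wrong sentences
    mask = torch.eye(sim.size(0), dtype=torch.bool)
    cost_s = cost_s.masked_fill(mask, 0)         # don't penalize the true pairs
    cost_i = cost_i.masked_fill(mask, 0)
    return cost_s.mean() + cost_i.mean()

B, D = 8, 256
sent_emb = F.normalize(torch.randn(B, D), dim=-1)  # stand-in for a sentence encoder
img_emb = F.normalize(torch.randn(B, D), dim=-1)   # stand-in for an image encoder
print(ranking_loss(sent_emb, img_emb))
```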
And so one of the things that people figured out in machine translation very early on is that you can do alignment of words between your source language and your target language. And you can do the same thing with images. So if you want to align a word in your generated sequence with something in your picture, then you can use the same approach for that. And that approach, of course, is called attention. You've probably learned a lot about attention in this course. And so yeah, that was one of the building blocks of these systems as well, where you can do very interesting things and really see that when it has to generate "stop" for the stop sign, it is really actually looking at the stop sign. So there's some really cool alignment going on in these models.

And the final kind of early model we should talk about a little bit is GANs. Who here has heard of GANs? Okay, that's a lot more than bag of visual words. I guess that makes sense. So yeah, the basic idea of a GAN is really that you have this generator and discriminator, and you want the generator to generate images that the discriminator cannot distinguish from real ones. So it cannot tell fake and real images apart, right? And you can actually condition that on a piece of text, and then you can generate images using some text prompt, right? The precursors of something like Stable Diffusion were doing things like this, and it's all a natural progression to that kind of model. So those were the early models. Do people have any burning questions about this, or does this all make sense?

All right, so let's do a bit of a deeper dive then, in particular on features and fusion. Those are really the core building blocks for all of this multimodal stuff. But before we go there, very briefly: if all of this multimodal stuff is cool and sort of useful and doesn't look that difficult, why aren't we all doing multimodal things? Why do we focus on specific modalities? I think there are a couple of problems just to be aware of. One is that modalities can sometimes dominate; especially text is much more dominant than vision or audio in many use cases. So you can have a model that just picks up on the text signal and basically learns to ignore the image completely, which actually happened, embarrassingly, for visual question answering; we'll get to that. So you could do visual question answering without actually looking at the picture. The additional modalities can also add a lot of noise, so they make your machine learning problem more difficult. You don't always have full coverage, right? As I said, if you look at Facebook posts, sometimes you have text, sometimes you have pictures, sometimes you have both, but you don't have a guarantee that you always have both. So how do you deal with that? In many cases, we just really weren't ready; it was too complicated to implement stuff. And also just in general, how to design your model to combine all the information is actually quite complicated.

So, to maybe drive the point home a little bit: featurizing text, I guess we all know how to do that by now, especially in the age of transformers. And before that, in the LSTM era, we just said you have your batch by your sequence. So batch size by sequence length by embedding size, right?
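As a tiny illustrative snippet (the region and patch counts below are made up), this is the shape just described for text, alongside the image featurizations that come up next:

```python
# Typical tensor shapes for the featurizations discussed here.
import torch

batch, seq_len, d = 8, 32, 768
text_feats = torch.randn(batch, seq_len, d)         # tokens: batch x sequence length x embedding size

num_regions, num_patches = 36, 196
region_feats = torch.randn(batch, num_regions, d)   # object-detector region features
patch_feats = torch.randn(batch, num_patches, d)    # ViT-style flattened image patches

print(text_feats.shape, region_feats.shape, patch_feats.shape)
```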
So it's always a 3D tensor, and that's how you encode your textual information when you pump it through your neural net. With images, it's slightly trickier. You can just look at patches, but then if you do convolutions, you're kind of shifting over the image and aggregating, right? And in many cases, you don't really want to be this uniform; you want something that actually looks at the things in the picture. So this is called region features, where you use an object detector as a first step for processing your image, and then you have a ConvNet backbone that encodes the features for that particular subimage. Like this guy's skateboard or something: it has its own vector representation, right? And then in terms of dense features, we now also have vision transformers, so we'll very quickly go over those to make sure we're on the same page.

So there are all these models; YOLO is a really good one, if you haven't heard of it yet. We're at YOLOv7 now, I think, or v8, I don't know; there's a new one coming out every other year or something. But the basic idea is that we get these bounding boxes for things in images, or actually segmentations, but bounding boxes are what people tend to use, and they have labels, right? So this one is labeled backpack or something. And you can do this as a preprocessing step on your image to get a much richer representation of what is really in that image, which you can then pump into your system, as we'll see later.

And then, how do you encode the information that is in these little bounding boxes, or in the image itself in general? We just use a standard ConvNet for that. This probably feels super obvious now, but in 2014, when people were starting to discover this, it was really very surprising that you could just use off-the-shelf ConvNet features to replace the entire computer vision pipeline. People used to do all of this very fancy, sophisticated stuff, and people had spent decades trying to refine it, and then it was all thrown away and replaced by a ConvNet that does all of that stuff for free. And the cool thing you get there is that you can transfer very easily across different tasks. So you can have a very generic ConvNet and then use it for all kinds of very specialized things, like spotting buildings in Paris, for example, or flowers or other stuff.

And then of course we're in the age of transformers, which we've been in for quite a while already. This is actually only the first transformer in the slide deck, so we're making good progress. Vision transformers are what we would use these days to encode images, where you have these flattened patches, and then you do kind of the standard BERT architecture, maybe as you would know it from this course, and then you do classification. So this is all a standard transformer, everything standard, except now your input here is not words or tokens, it's patches of an image, and then you classify that.

All right. So then we have a bunch of features. Now, how do we combine the information? Let's say we have two vectors, u and v. Sounds easy, right, to combine them? It turns out that there are actually very many ways to combine them. I don't think it's really useful to go over all the different ways here, but you can do very simple things, right?
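As a concrete preview, here is a small PyTorch sketch (dimensions and module names are made up for illustration) of a few of the simple combination options that come up again and again: concatenation, element-wise products, gating, a bilinear map, and a plain similarity score for late fusion.

```python
# A few simple ways to fuse a text vector u and an image vector v.
# Purely illustrative; real systems pick one of these (or something fancier)
# and learn the projection weights end to end.
import torch
import torch.nn as nn

d_text, d_img, d_out = 300, 2048, 512
u = torch.randn(1, d_text)   # text features
v = torch.randn(1, d_img)    # image features

proj_u = nn.Linear(d_text, d_out)   # project both modalities to a shared size
proj_v = nn.Linear(d_img, d_out)
pu, pv = proj_u(u), proj_v(v)

fused_concat = torch.cat([pu, pv], dim=-1)            # simple concatenation
fused_mult = pu * pv                                  # element-wise (multiplicative)
gate = torch.sigmoid(nn.Linear(2 * d_out, d_out)(torch.cat([pu, pv], dim=-1)))
fused_gated = gate * pu + (1 - gate) * pv             # gated mixture of the two
fused_bilinear = nn.Bilinear(d_out, d_out, d_out)(pu, pv)  # bilinear interaction
score_late = (pu * pv).sum(-1)                        # late fusion: just a similarity score
```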
So obviously, an inner product or cosine similarity is what you would use if you want to do cross-modal things, so if you want to embed things in the same vector space. But you can do fancier projections on top, or different combinations that are kind of linear, or you can do multiplicative things where you multiply the components element-wise, or you do some sort of gating over the different features. You can do attention, you can do fancier bilinear things, you can do very fancy compact bilinear things. So there's really a wealth of literature on all the different ways you can combine two vectors. And this is called multimodal fusion. Most of the literature on multimodality is essentially about this question: what is the best way to do fusion? And that's it.

So within that discussion, it's maybe useful to distinguish between different levels of fusion. You can do it very early, where basically you make sure you have the different features and then, in the modern sense of attention, you attend to everything in all the features from the beginning. You can first treat them separately and then combine them, or you can treat them as completely separate and only combine the final scores. The first is what we would call early fusion, then my invention for naming the middle part would be middle fusion, and then you have late fusion, where you really just combine the scores or the logits, but you don't have any interaction between the information from the different modalities.

So you can do really fun stuff with multimodal fusion. This is a paper I really like, called FiLM, where you have this feature map, this F here, and it gets modulated by a multiplicative factor, this gamma, and an additive bias vector, this beta, and you have a different one for every layer of a ResNet, conditioned on some encoding of the thing you're after. In this case: are there more cubes than yellow things? So we have some vector representation for that, and we use that vector representation to modulate the ResNet block, at every layer of the ConvNet. So you can really do very fun things where you're modulating one network with the other one and really trying to have them learn as much as possible from that.

All right, so let's talk about late fusion then. Late fusion is what we would now call contrastive models. The basic idea is that we have this similarity score: we process the two modalities completely independently, and then at the very end, we do some combination. The most famous instance of that these days is CLIP. Who has heard of CLIP? Okay. So CLIP, from OpenAI. It's again exactly the same contrastive loss that we've seen in all these early approaches. It does kind of negative sampling, but in-batch. So you just have a batch, and you have two things that are aligned, right? The first piece of text and the first image, they are aligned; this is the right answer, and I just want to make sure that I rank this thing higher than all the alternatives, and I want to make sure I rank this thing higher than all the alternatives in the other direction too. So it's a very, very simple idea; really nothing special about the architecture that was invented here. But what made this thing so cool was, first of all, it was transformers, and it was transformers all the way.
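The in-batch contrastive objective just described boils down to a symmetric cross-entropy over the batch similarity matrix. Here is a simplified sketch (CLIP actually learns the temperature and uses much larger batches), with random tensors standing in for the two encoders' outputs.

```python
# Minimal sketch of a CLIP-style in-batch contrastive loss: every image should
# rank its own caption above all other captions in the batch, and vice versa.
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))         # matching pairs are on the diagonal
    loss_i = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i + loss_t) / 2

B, D = 16, 512
print(clip_loss(torch.randn(B, D), torch.randn(B, D)))
```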
So your text encoder would be a transformer, and your image encoder would be a ViT image encoder, so also a transformer. And it was trained on lots and lots of web data. Alec Radford is really a genius at creating very high-quality datasets, and he created, I think, 300 million image-text pairs for this dataset; they trained a bigger model on it than people used to, and then we got this amazing model out of it. And moving away from single words to the sort of text that you would see on the Internet: the caption for an image on the web is not going to say "dog" or "cat", it's going to say "a photo of the cat doing something something", right? That means that you can do zero-shot label prediction, where you have "a photo of the ..." and then you need to figure out what the right label is for a given image using this kind of prompt, right? So you probably all know about prompting large language models; you can prompt vision-and-language models in very much the same way and do zero-shot generalization.

So if you want a really, really good paper, I would recommend that you read this one. This is really one that's going to teach you how to write good papers; it's very thorough, and it's really worth a very close read, I think, if you're interested in this field. And I think when it came out, on ImageNet itself it didn't really outperform ResNet, so you might think, oh yeah, it actually is not all that special. But what really made it special was that it generalized much better to these other datasets, right? This ResNet here is pretty terrible at some of these adversarial versions of ImageNet, and CLIP is super robust to that. So it's just a way better image encoder in general.

So very quickly after CLIP, there was this paper from Google called ALIGN, which was basically exactly the same idea. You know, the field is not really that creative; it's the same idea, but you just keep throwing more data and more compute at it, and it often works much better. So that's what they found here too: 1.8 billion image-text pairs instead of 300 million gives you a better model. Surprise. But still very cool. And what is really cool, I think, is that there's this organization called LAION, where they've started this open-source collective to create really high-quality datasets. The initial LAION dataset was, how many examples were in the initial LAION? 400 million. He knows; I knew he'd know. And now there's a much bigger version of LAION that's even multilingual, and it has 5 billion examples. Stable Diffusion was trained on the English subset of this thing, and that's one of the reasons that it's so awesome: it's just seen a ton of data, and that really makes your system a lot better. So if you're looking for the ultimate dataset to play around with with your own ideas, if you have enough compute, obviously, then you should really look at this dataset.

All right. Any questions up until this point? No. All right. So then we'll move on from late fusion to kind of middle fusion and early fusion. And this really is the core of what I think a lot of people in the field right now, or if you're interested in getting into this field, or if you're going to go into industry and you're going to be using this stuff, this is what you should really understand. And again, the ideas sort of stack onto each other.
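The zero-shot prompting trick mentioned a moment ago ("a photo of a ...") looks roughly like this with the Hugging Face transformers CLIP interface; the checkpoint name follows their standard example, and the image path is a placeholder.

```python
# Zero-shot label prediction with CLIP-style prompts. Assumes the Hugging Face
# `transformers` CLIP classes and a locally available image file.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["cat", "dog", "horse"]
prompts = [f"a photo of a {label}" for label in labels]   # the prompting trick

image = Image.open("some_image.jpg")                      # placeholder path
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)          # score of the image against each prompt
print(dict(zip(labels, probs[0].tolist())))
```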
So I've kind of sequenced the slides to give you an idea of how the scientists came up with the next step, and you can really see the architectures just get slightly more and more advanced. But basically a lot of it is just more data and more compute again. So, who knows how BERT works? Everybody should raise their hand in this case. Yeah. BERT is kind of canonical; I think everybody kind of gets how BERT works, so I don't think we need a real refresher. But the reason I have this slide is that I want you to think about: if you have a BERT model and you have a bunch of images, how are you going to turn that BERT model into something multimodal? There are a bunch of obvious things you could do, given the kind of features I told you about and the fusion options. So how are you going to do that? Does anybody want to say something? If you're doing classification, you could use a ConvNet and then just concatenate it to whatever encoder, maybe an RNN or whatever you're training on the data? Exactly, yeah. So you can take the ConvNet features and your classification token from BERT, concatenate them, and then classify for cat or something like that, or whatever the thing is you're interested in. Yeah. So that's one thing. You could also take the ConvNet features and give them to the BERT model in lots of different ways; we can use the region features. So I think a lot of people who were working in vision and language processing, when BERT came out, were thinking exactly about this: okay, do we do middle fusion, late fusion, early fusion? How do we do the fusion? And so there were a lot of papers all coming out at basically the same time where people were doing versions of this. BERT was really kind of the innovation, and then everybody just plugged it into their own thing, because of Hugging Face Transformers and things like that.

So the first one is VisualBERT. This was one of the very early ones, where you have this image and people would do object detection on it, so you get a hat and a racket and a shirt and things like that. You can just take these features and plug them into your transformer model, and then you try to recover the features. This really is probably the simplest way to do it, right? This is what we call a single-stream architecture, where you concatenate all of the original input features and then put them through the same transformer. What you can also do, and that's something that this model called ViLBERT did, is have two different streams. So you essentially have these two parallel transformers, but at every layer you give them cross-attention, or co-attention, as they call it. But it's basically: you just make sure you have an attention map that spans both, and then you do your full normal transformer layer again. And this you can train just like your regular BERT. So you have your masked language model here, and here you do some equivalent of that. And then you also have your next-sentence prediction, which you probably remember from your BERT lecture, but instead here we're saying: okay, is this image aligned with this piece of text or not?

There's also LXMERT. I mean, I could go on forever here; there are like 100 papers that came out that did this all at the same time.
So LXMERT had a different cross-modal output encoder, and a bunch of different ways of encoding the positional information. You could say: okay, I just have a bunch of bounding boxes that are featurized, but I don't care about where they are in the image, so it's just a bag of bounding boxes. Or you could say: I found it here, these are the particular top-left and bottom-right coordinates, and that's what you featurize into your network.

You can also do something even dumber, and I can say that because this is my paper, where you just take the image itself, you put it through a ResNet, you do a little bit of pooling on the final feature maps, and you just give those feature maps to BERT. You then need to distinguish between your text segment embeddings and your vision segment embeddings. But this actually works surprisingly well. You don't have to do any additional training; you can just take BERT out of the box. Initially you freeze it, you learn to project into BERT token space, then you unfreeze your ResNet, and then finally you unfreeze your BERT. And now you have a very good multimodal classifier on the problem you care about. So a lot of these other papers are doing what they call multimodal pre-training, where first you have a BERT model and a ResNet, so they're unimodally pre-trained, and then you cobble them together, and then you have a multimodal intermediary pre-training step before you fine-tune on the problem you care about. And what we showed here is that in many cases you don't really need that. So it's a very strong baseline.

You can also go to the pixel level completely. That's what they did in this other paper called Pixel-BERT, which is basically exactly MMBT, the previous supervised one, but here they do do the multimodal pre-training step, and they showed that, I think for VQA, it helps a little bit. So there are many of these BERTs doing visual things; people really tried everything. Here's another one called UNITER, where they added a bunch of different losses. We could really talk about this for a very long time; we're not going to do that. I'm just going to talk you through some of the more interesting ones.

So this one, I think, is quite interesting: ViLT. Because here, this is really the first instance where we've completely gotten rid of ConvNet features. We don't do any preprocessing on the image: no region features, no backbone that featurizes the parts of the image we care about. We just have these patches of the image. So really, just like in ViT, we flatten those patches and we pump them into the transformer straight away. So this really is BERT and ViT together in one model, and this worked very well. So that's been the trend.

So here's a nice, very long list of all of these different models and what they do. And really, the distinctions are just in: what is the text encoder that you use, so do you use BERT or something fancier or better, like RoBERTa? What is your vision encoder? In many cases, you have these region features, so you would use an R-CNN-style model, or you could just do a ResNet or a ViT. You have different kinds of fusion, so either single- or dual-stream, as we talked about, right? So VisualBERT or ViLBERT. Different pre-training tasks: masked language modeling, image-text matching; there's a bunch of funkier ones you can do.
And then finally, you can do multimodal pre-training on all of these different datasets that have aligned data. So you were probably wondering: okay, what is really the interesting difference between a lot of these? I have another recommended paper that, if you're interested in this space, you should really take a look at. It's a really well done paper where they unmask multimodal pre-training. Basically, they say: if you take all of these little model inventions and you train these different models on exactly the same data in exactly the same way, it turns out that they're all basically the same. So that's a lot of wasted effort on the part of the field, because everybody is saying, wow, my model is better, but it's actually just because you trained it on different data, and there's no real model innovation going on in a lot of these things. I don't mean to sound discouraging or anything like that, but I think that's why this paper is really nice and really important: it just shows us what really matters.

So this is also work that I did myself with my team, called FLAVA, where we wanted to take these ideas really to the limit. A lot of the things that you've seen now, the VisualBERTs and the ViLBERTs and things like that, they're all about multimodal questions. So how can we do visual question answering or something like that, where we just have these two modalities? We only care about problems that always involve both modalities. And where we want to go, and this is the basic premise, I think, of foundation models in general, is that we have one model to rule them all. This one model can consume data from all of these different modalities, and it can synthesize across all of these different modalities and then do useful things with that information.

So with FLAVA, that's exactly what we tried to build. We wanted to have one foundation model that is good at vision-and-language, at computer vision, and at natural language processing, and is jointly pre-trained on all of these different data sources. So it's also trained on just CC-News, Common Crawl, and BookCorpus, so it's very good at the sort of things you would expect BERT to be good at. It's trained on ImageNet for image data, so it's good at the things that you would expect a basic image model to be good at. And then you have this PMD dataset that we created out of publicly available image-text pairs that we also trained on. This PMD dataset is really just: you take all the datasets that were ever created that have image-text pairs and that are publicly available. Unfortunately, the CLIP data and the Google ALIGN data and all of those datasets haven't been open-sourced. This was before LAION, so now there's a good alternative to this. But this PMD dataset, if you combine all of these image-text pairs, you get 70 million of them, so that's still a pretty decent size. And then you can use all of this data to solve all of these problems that we know we care about in these different fields. You can do multimodal reasoning, you can do language understanding, you can do visual recognition, all with exactly the same model. And that's a very powerful idea. I think if you work at a company like Facebook, you don't want to have different models for all kinds of different things; you want to have one model that you can really use for everything. That's going to make your life a lot easier.
So the exact architecture here is that on the one hand, we have this image encoder, where we take the image, encode it as patches, and do what we call masked image modeling, which is basically masked language modeling, just on the image tokens, right? And then on the other side, we have masked language modeling on the language, so your regular BERT thing. And then we have a multimodal part where all of this information gets combined. So we have a masked multimodal modeling loss term, where you can also do image-text matching; this is like your BERT next-sentence prediction thing. And then we also have the global contrastive loss, which is exactly like CLIP. So if you do all of this, it's just transformers all the way down, and it's a very elegant way, I think, to combine a lot of this information. And when you do that, you get something that can really do a lot of things very well. We're not going to walk through this table; it's just way too many numbers. But just trust me, we were pretty thorough in generating the table here, and over 35 different tasks, if you compare FLAVA to all kinds of different ablations in terms of CLIP models, then this is just a much better way to get at this information. So I think this is a nice example of where we're probably going to go with the field in the near future.

So the other trend that we see very obviously in the field right now is that everybody cares about generative models, right? So language models and image generative models: there's just a trend where we want to be generative. We want to move away from this contrastive, discriminative stuff to the more interesting, maybe richer representations that you get out of generating sequences or images. So this SimVLM paper was one of the first ones where they really had this separate decoder that was trying to generate or complete captions, which they showed gives you a lot richer representations. I think the current state of the art is now called CoCa. A lot of these models all look very similar again, but in this case, now we're starting to really see these text decoders. So initially with CLIP, I think that's also what they were trying to go for, OpenAI being a company that really likes generative models, but they couldn't really get it to work. And I think it took us a while, as a field, to really figure out how to do this the right way.

And so right now, we're really in the age of language models, right? And one of the interesting things you can do with language models is just keep them frozen and then learn how to project into the language model. The MMBT architecture I talked about, where we had this BERT model, we kept it frozen, and we learned to project into the BERT token space: you can do exactly the same thing, but with a much fancier model, or with something like T5 even, where you have an encoder-decoder or some kind of generative part. You keep that thing frozen, and then you learn to project into the token space of that frozen language model, and then you can do lots of fun stuff, it turns out. So what they show in this paper is that you then get few-shot learners. So all of the things you see with GPT-3, where you can just give it some in-context examples and it's going to figure out the binding kind of on the fly: it says, this is a dax and this is a blicket, so what is this? And then it gives you the answer that it's a dax.
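The "keep the language model frozen, learn only a projection into its token space" recipe that both MMBT and these frozen few-shot models use can be sketched like this. The module names are invented for illustration, and how you feed the resulting prefix in depends on the language model (for Hugging Face models, typically via an inputs_embeds argument).

```python
# Sketch of a frozen language model with a learned visual prefix: only the
# projection from image features into the LM's embedding space is trained.
import torch
import torch.nn as nn

class VisualPrefix(nn.Module):
    """Maps one image feature vector to `prefix_len` pseudo-token embeddings."""
    def __init__(self, img_feat_dim=2048, d_model=768, prefix_len=2):
        super().__init__()
        self.proj = nn.Linear(img_feat_dim, prefix_len * d_model)
        self.prefix_len, self.d_model = prefix_len, d_model

    def forward(self, img_feats):                 # (B, img_feat_dim)
        out = self.proj(img_feats)
        return out.view(-1, self.prefix_len, self.d_model)

def freeze(module):
    for p in module.parameters():
        p.requires_grad = False                   # gradients still flow *through* these
                                                  # layers, but the weights never update

prefix = VisualPrefix()(torch.randn(4, 2048))
print(prefix.shape)                               # (4, 2, 768): visual "tokens" for the LM

# In practice you would concatenate `prefix` with the LM's token embeddings and
# train only VisualPrefix, e.g. lm(inputs_embeds=torch.cat([prefix, tok_emb], dim=1)).
```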
So it really learns in context how to decide the feature mappings, which is really kind of solving the grounding problem that a lot of this multimodal stuff started with. So I think that's very cool.

And then probably one of the coolest papers or models right now that you might have heard of, if you follow the field, is Flamingo out of DeepMind, where they take a Chinchilla language model, so a compute-optimal language model, and now you have this vision encoder that encodes multiple different images that you can then do reasoning over and then kind of autocomplete. What this gets you is just a much more powerful model, because you can be generative over lots of different images. And you can really see it stepwise, right? We started off with very simple transformers, and now we're at something that is starting to get pretty complicated, because we have these building blocks like a Perceiver Resampler, where we have a bunch of different images that we featurize, and now we need to compress the information, because sometimes we have three images and sometimes we have five images. So we want to make sure that we can compress it so that it's always ready for consumption by the next layer of the language model. And this paper, again, is a really good paper to read, because, so this is not me, this is not my code, this comes from the actual paper: they have the diagram together with the code so that you can really understand what it's doing, which I think is really great. And once you have your Perceiver Resampler step, what you then do is gated cross-attention; this is how you implement it. And this gated cross-attention, you do before your frozen language model layer. So you really just have a frozen Chinchilla language model, and you learn to modulate the information that goes into that language model. You propagate the gradients all the way back; you just don't update the language model. So you're really trying to figure out: how am I going to design my signal so that my language model can do the most with it, right? How am I going to combine the information? You'll notice that now we do it before the layer, right? In a lot of this other stuff, you would do the attention after the layer, but here you do it before.

So Karpathy, I think more than ten years ago, had this image with Barack Obama kind of setting his foot on the scale to make somebody think they're a lot heavier than they really are. This is obviously funny to us, but not to an AI system, I think, unless it really understands the scene. And that's why Karpathy at the time said this would be a really good visual Turing test: if a system can figure this out, then it's actually really smart. And so obviously, it's been a bit of a challenge for everybody working in the field since then to get something that actually works on this. And Flamingo, as it turns out, kind of gets the joke. But yeah, it's a bit unclear if it really gets the joke, because if you read this conversation, it's sort of getting steered in the right direction. But at least we're making progress, let's put it that way.
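Here is a heavily reduced sketch of the gated cross-attention idea: a trainable cross-attention block whose contribution is scaled by tanh of a gate initialized at zero, inserted before a frozen language-model layer. The real Flamingo block has more pieces (a gated feed-forward block, masking over images, and so on); this only shows the gating mechanism, with made-up dimensions.

```python
# Simplified Flamingo-style gated cross-attention: text tokens attend to visual
# tokens (e.g., the output of a Perceiver-Resampler-like module), and the result
# is added back through a tanh gate that starts at zero.
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))   # tanh(0) = 0: starts as a no-op

    def forward(self, text_hidden, visual_tokens):
        attended, _ = self.attn(text_hidden, visual_tokens, visual_tokens)
        return text_hidden + torch.tanh(self.gate) * attended

xattn = GatedCrossAttention()
text_hidden = torch.randn(2, 16, 512)     # (B, text_len, d_model)
visual_tokens = torch.randn(2, 8, 512)    # (B, num_visual_tokens, d_model)
out = xattn(text_hidden, visual_tokens)   # same shape as text_hidden; would then
print(out.shape)                          # feed into the frozen language-model layer
```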
And then, so in Flamingo you still have a lot of moving parts, but you can take this almost to the full extreme, where you freeze almost everything and you just want to learn the mapping between your image encoder and your language model, or your image encoder and your encoder-decoder architecture. And all you really do is learn a projection between the two. So there's this nice model called BLIP-2, where they experiment with OPT for the language model and Flan-T5 for the encoder-decoder architecture. And this just gives you amazing results. It gives you really complex captions and things like that without any real direct supervision on the captions themselves, which is pretty impressive, I think. So that just shows you the power of language models in general. So here I had some examples: it can really do different things, from captioning to reasoning to visual question answering to location detection. You can have a long conversation with this system. This really is the future of where we're going, right? We're going to have a ChatGPT, but it's also going to be able to see the world, in a way.

And then an interesting thing: you've probably heard of chain-of-thought prompting and things like that, where you ask the language model "let's think step by step". You can tell a vision-and-language model: generate a rationale for why something might be the case. So you generate a potential explanation for what your answer might be, and then after that, you ask it to answer your question. It turns out that if you do that sort of multimodal chain-of-thought prompting, the system gets much better. So this is the new state of the art on ScienceQA, or a benchmark like that, just because it learns to unpack the information. I think we're really, as a field, just starting to figure out what the potential of this is. And there's this paper where they also showed that multimodal chain-of-thought prompting gets you pretty amazing results; they show very nice results on Raven's progressive matrices and very complicated kinds of IQ-test sort of things that humans are supposed to be really good at. You have to be a pretty smart human to really be good at this, and this system just nails it. So we're making super fast progress: we started off with a very simple BERT model that was able to look at some pictures, and now we're getting to these very sophisticated foundation models. So that was my short history of multimodal foundation models. How much time do I have left? 

speaker 1: All right. 

speaker 2: Okay, plenty of time. Yeah, please, questions. So, a lot of the images looked like they were just boxes being passed through, with no sense of shape in them? Yeah, yeah. So I think the history of computer vision has been very similar to the history of natural language processing, where we thought we needed all of this structure and all of these different things, and it turns out you can just throw it all away and just have a big transformer over the patches. Sorry, yes, CS231N. 

speaker 1: In one minute's time. You mentioned "frozen" a couple of times. What does that mean? 

speaker 2: Yeah, sorry, I should have explained that better maybe. It just means that we are not updating the weights. So if we go to this slide, I think this is a nice example: we have frozen self-attention. That just means that when we do a forward pass, we go all the way to whatever we want to predict.
We get some gradients, we take them all the way back down, but we only update the non-frozen layers. So here the gradients actually do get computed, but these weights just never change. And the reason you want to do that is because otherwise you're going to drift way too far; you're going to destroy all of the cool stuff your language model has learned, because you're just going to focus on the small dataset that you're training it on. So you want to preserve the abilities of the language model, but you want it to become good at the thing you care about. Other questions? 

speaker 1: Regarding multimodal fusion, is there a benefit to doing really middle fusion as opposed to late fusion? 

speaker 2: Yeah. So, I mean, we're going to talk about evaluation next, but it really depends on the task that you care about. I would say the earlier, the better, if you can afford it. CLIP is very efficient to train; it's very late fusion, right at the very end, so there's no interaction between the different modalities. And that's really good if you want to be very efficient; for training it's much nicer, right? But if you want to have a richer understanding of the multimodal signal, then you want to do earlier fusion. So yeah, it's always a trade-off. 

speaker 1: It seems like images are just a lot more data than text. So how much more difficult are these to train, and how much bigger does the image processing part have to be compared to the language model? 

speaker 2: Yeah. So images are more complex in a way, but they're also kind of higher-bandwidth representations, right? There are a lot of pixels that our brains just abstract away; it's really about the scene that you're seeing, and you're not really thinking too much about the pixels themselves. Yann LeCun likes to say that language is just a kind of low-bandwidth proxy for a language of thought, which is much richer and much higher-bandwidth, and he thinks that's probably visual; I'm not so sure. But so, yeah, I don't think there's necessarily a difference in the kind of scaling laws that you see in these systems, or at least we still have to figure that out. We'll talk about that towards the end as well. 

Do they also have certain social and cultural biases, just like the natural language ones? Oh yes, they have terrible biases. Some people are actually working on this who are in this very room. These models can be very racist in what they generate or in the kind of predictions they make. So if you have an Asian basketball player standing like this, with a basketball very obviously there, then the model will think that he's playing ping pong because he's Asian, which is not okay. So these models, yeah, just like all neural networks, right? This is really a big problem, and one of the most interesting problems that you should be working on, if you're a student and you want to make a difference, is how do we get these systems to be much better at these sorts of things? 

In the examples you showed, the model interpreted the content of an image. When we want to understand the content of a video, what extra challenges might there be along this path, and what improvements can we make to broaden this? Yeah. So you're asking about the attention masks, sort of. So you can use the same idea for videos, and you just look at the video.
And these systems are so good now, the object detectors are so good, that you can really track objects in real time as they go through your video. So you can try to check how that aligns with the attention masks in your model. Videos, I think, are sort of interesting, but they're also not that interesting, because you can very often just subsample images and solve the task on the images rather than having to deal with the complexity of video. Maybe one more question, and then we'll go do some evaluation. 

These multimodal models, when you only provide them with one modality, only text or only vision, how do they perform in that case? They're obviously more geared toward multimodal cases. Yeah. So that's one of the giant shortcomings of a lot of these models: they're really just built for multimodal stuff. So what if I don't have an image, right? I mean, that's why we did FLAVA, because we want to have one model that can do all of that stuff. And that's why in MMBT, the supervised multimodal bitransformer, we actually have an analysis of how robust the model is to missing images or missing text. But I think a lot of folks working on these early visual BERT models were kind of myopically focused on VQA, which is actually a great segue to what I want to talk about next. So it really depends on the task that you care about, as I said.

And I think if I'm going to tell you about multimodality, I also have to tell you how you're going to check that a multimodal system is actually good at multimodal things. And so that's the topic of evaluation, which is actually a super important topic. A lot of people want to be cool and build big models, but I think it should be way cooler to do proper evaluation of these models, especially if you're in academia, because you only have limited GPUs anyway. So what can you do? Sorry, I don't want to rub it in.

So how do you check? Well, there's this amazing project. So ImageNet really changed the history of deep learning, I think, and this other dataset, COCO, also really changed especially vision and language, but also vision in general, where they have a bunch of the main sort of multimodal tasks. These images are very richly annotated with all kinds of different things: the segmentations of the objects, the bounding boxes, the labels of the bounding boxes, and they come at different pixel granularities. It's a huge dataset, very fine-grained, richly annotated in terms of the categories that it has. And then you have five captions for each of these images. This really was the first dataset that unlocked a lot of vision and language processing at scale, because you had your picture and you had your caption. And now you need to figure out: okay, how do I give the right caption for this image? So that's image captioning. Or can I retrieve, given some piece of text, the right image, or the image for the piece of text? So there's a bunch of very impactful datasets that do this stuff; we already talked about LAION, but COCO really is still the main one, I think, that a lot of people use as the canonical instance of this dataset category. And then the other thing that people really care about in vision and language processing is visual question answering.
And there really are a bunch of academic groups who are, or have been, so focused on this task that they didn't really care about anything else. That's why you see a lot of models that are really optimized just for multimodal tasks and nothing else, and you can see that reflected in the citation counts as of last night at 3:00 a.m., where the VQA dataset has way more citations than even the image captioning datasets, right? And so what you do here is you just have an image, and then people ask very simple questions. So annotators ask these simple questions, they give the answers, and now we want to be able to answer these questions with machines. And as I alluded to earlier, one of the kind of embarrassing backstories of this dataset is that the initial version was actually found to have the property that the images didn't really matter at all. So you could just look at the question, and it could be something like: how many slices of pizza are there? And, well, not in that particular case, but in almost all of the dataset, the right answer to a "how much" or "how many" question was "two". So if you just predicted "two" for every "how much" or "how many" question, you got like 70% accuracy on the counting category. So careful dataset or evaluation benchmark design is really a skill, and you really need to think about what you're doing. You can't just set some data aside and evaluate on it; you have to really think about what you're doing. And so there's GQA, by Chris actually, which is, I think, a better-designed version of this dataset, so you might want to use that these days.

There are also very targeted datasets that really try to measure one particular thing. And one of the things we really want to get at with these models is what we would call compositionality. We want to be able to take the parts and reason about the whole and understand the relationships between the different concepts. So CLEVR was a very clever dataset that was designed to measure compositionality, both on the language side and on the vision side: you have to understand the relationships between all of these different objects in the images. That's been a pretty impactful dataset, I think, for really forcing people to think about compositionality.

But a lot of these datasets really had big problems. One of the problems is, you know, they were too easy. So VQA is sort of plateauing out; we could talk about that a little bit too. And it wasn't really realistic: you could solve VQA, and it's not clear that's going to make anybody's life better. You're all trying to process the memes, I can see. Okay, let's get to the memes first then. So obviously, these memes are not actually in the dataset. I could put up some really hateful memes, about Hitler or something, which are in the dataset, but that would be less fun. So these are mean meme examples to demonstrate how the dataset was constructed. And one of the problems we had, as I said, is that in VQA the V didn't really matter. What we want to have, if we care about multimodality specifically, is a dataset that you can only get right if you are good at multimodal reasoning, and otherwise you're just going to screw it up. And so this is what we came up with: if you have a meme like this one, "love the way you smell today", I mean, that's not very nice if you send this to your friends, right?
But back to that meme: it turns out that if you just swap out the background, now it's a very nice thing to say. And this one, you know, maybe you're a bit weird if you like this, but there's nothing wrong with it, right? And it's the same for this one here: "look how many people love you", with the tumbleweed, that's really sad. And if you change just one word, suddenly it's a really nice thing to say, right? So if you want to solve this, if you want to classify these memes correctly for meanness, then you really have to understand multimodal reasoning. You have to understand the relationship between the image and the text in order to get to the right label, right? And so it was really constructed by design to do that. And how we did it exactly is we used some really highly trained annotators. One of the big problems with a lot of these datasets is that nobody really knows who owns a meme. Somebody makes a meme, and they technically own the copyright. And when I made this dataset, I was working at Facebook, and they were very afraid of copyright issues. So what we actually had to do is pay people to make new memes, though not from scratch: we could show them the actual examples, and then they had to try to find images that roughly corresponded to the original source image and recreate the meme, but now with an image that we could buy from Getty. And so we gave a lot of money to Getty so that we could release the dataset to the public, so that people could actually do research on this and understand whether their multimodal models are good or not. And so we really tried to make it so that we had these benign confounders. Sorry, in the startup world it's co-founders; here it's confounders. So the confounder here is that you have your original meme, and then you have a confounder where you swap out one of the modalities, and then one where you swap out the other, right? So we had our annotators do that as well. And so this led to a really nice dataset, I think, because it confirmed an intuition that a lot of people in the field had, which is that multimodal pre-training... speaker 1: doesn't really... speaker 2: ...doesn't really work, yeah. So multimodal pre-training doesn't really work. And so all of this stuff that people had been doing with their fancy visual BERT models actually turned out maybe not to be that useful after all. Maybe it got you like one extra point going from one VisualBERT to a different VisualBERT-like model, less than a point, just from doing that multimodal pre-training. So that means we still have to figure this stuff out, right? This dataset is far from solved, and we still have a long way to go, despite all these fancy models and a new paper coming out every week that does something new. We're not there yet. And I think that's encouraging, especially for you, because you can go out and solve it. So what we did with this dataset is we organized a competition. We had 100k in prize money to see what people could come up with. And there was a lot of nice work coming out of that, and we really managed to crank the numbers up quite a lot, but the solutions were slightly disappointing. So I don't know if you've ever used Kaggle, but if you really want to win on Kaggle, you just have to ensemble the hell out of all of the different models that are the current state of the art, and then you're very likely to win.
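To make the "image plus text to label" setup concrete, here is a minimal late-fusion classifier sketch in PyTorch. This is only an illustrative baseline under assumed settings (pre-computed 512-dimensional image and text embeddings, a small MLP head); it is not the architecture used in the challenge or by the winning entries.

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Concatenate pre-computed image and text embeddings, then classify.
    A deliberately simple baseline for tasks like hateful-memes detection."""
    def __init__(self, img_dim=512, txt_dim=512, hidden=256, n_classes=2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, img_emb, txt_emb):
        return self.mlp(torch.cat([img_emb, txt_emb], dim=-1))

model = LateFusionClassifier()
logits = model(torch.randn(4, 512), torch.randn(4, 512))  # a batch of 4 memes
print(logits.shape)  # torch.Size([4, 2])
```

A unimodal ablation (feeding zeros for one modality) is the kind of robustness check described for MMBT above; with benign confounders in the test set, such a text-only or image-only variant should do noticeably worse than the fused model.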
And ensembling is essentially what happened here: there wasn't really the fundamental breakthrough we had maybe been hoping for. So that still needs to be built, I think. So there's this other dataset I just want to briefly talk about. The theme of this section is: if you make a dataset, think about it very carefully, because you can be very creative with this and really measure the things you're trying to get at. So with this dataset, Winoground, we were trying to figure out, okay, how good is CLIP actually? It looks really amazing, and it's way better than what was there previously. But does it understand compositional relationships in the same way that humans would understand them? Or is it just fitting to the data distribution, so it can be very good at the head of the distribution but terrible at the tail? And you can probably already guess where this is going. So just to give you an illustration of what is in this dataset: you would have some plants surrounding a lightbulb, or you would have a lightbulb surrounding some plants. Notice that the words here are exactly the same words, but in a different order, and the visual depictions of the two captions are very, very different. So if your contrastive model is actually good at understanding the visual-linguistic compositionality of these examples, then it can get them right. But if it's actually just overfitting to the data distribution it has seen, and it's biased toward what it sees often, then it doesn't really get them right. And one paper that we used as a source of inspiration for this work is this paper here, "Order Word Matters Pre-training for Little". So we actually found that the order of words doesn't even matter that much for general pre-training very often, which is also kind of a scary thing, right? This is deep learning for NLP; we think that language is really important, but these models can reason about language even if you shuffle all the words. And that's probably not what we want to have. And that doesn't tell you something about how great we are as researchers; it tells you something about how terrible our evaluation benchmarks are, and that's what we need to fix. So here are some other nice examples from this dataset: there's a mug in some grass, or there's some grass in a mug. These are very different pictures, right? And for us, these distinctions are trivial. You know, what's the difference between a truck fire and a fire truck? It's pretty important, I think, to get that distinction right. So guess what: state-of-the-art models often perform below random on this. So, as I said, we still have a lot of work to do, which is good. And when this paper came out, I think the reaction was really nice. And so when DALL-E 2 came out, so you've probably heard of DALL-E 2, right? It's sort of like Stable Diffusion, but from before Stable Diffusion, and it was really the first model that showed just how good these generative models can be at creating images. So this is "there's a mug in some grass". You do have to cheat a little bit, because you have to add "digital art" to the prompt; if you don't add that, then it breaks down completely, right? So it's sort of prompt hacking, I think we're sort of tuning on the test set, but okay, you know, this is pretty good, right?
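Going back to the contrastive evaluation for a moment, here is a rough sketch of the paired scoring protocol that Winoground uses, written against a CLIP-style model from the Hugging Face transformers library. The checkpoint is a real public one, but the solid-color stand-in images and the captions are just placeholders; in the benchmark itself each example comes with two real photographs and two captions made of the same words in different order.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

captions = ["some plants surrounding a lightbulb",
            "a lightbulb surrounding some plants"]
# Stand-in images; the real benchmark pairs each caption with its own photo.
images = [Image.new("RGB", (224, 224), "green"),
          Image.new("RGB", (224, 224), "gray")]

inputs = processor(text=captions, images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    sims = model(**inputs).logits_per_image  # shape (2 images, 2 captions)

# Image i should match caption i better than the swapped pairing, in both directions.
text_score = bool((sims[0, 0] > sims[0, 1]) and (sims[1, 1] > sims[1, 0]))
image_score = bool((sims[0, 0] > sims[1, 0]) and (sims[1, 1] > sims[0, 1]))
group_score = text_score and image_score
print(text_score, image_score, group_score)
```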
So DALL-E 2 definitely is better than I think a lot of people would have expected even a couple of years ago. But it's not perfect, because people on the Internet like to take more pictures of spoons than of forks. So whether you say "there are fewer spoons than forks" or "there are fewer forks than spoons", it just really likes spoons more. Maybe it's like The Matrix or something, I don't know. Spoons are just nicer. So again, what you can see here is that these models really are just reflections of the data that they're trained on, right? And yeah, models are getting better, but if you've looked at Stable Diffusion, it still can't count fingers and things like that, right? So again, there's still a lot of cool work to be done. Any questions on evaluation? No? Okay. So let's talk about other modalities then, because we've really just been focused on images, and images are great. There are lots of images on the Internet, which makes them an obvious thing to focus on. It's also the case that if you look at our brain, vision is a very dominant modality, right? How we understand the world is very vision-driven. But it doesn't have to be that way. There are all these other interesting problems that involve different modalities. And the most obvious one is just speech or audio, right? So after vision comes hearing, and we could really do another lecture just like this one, just on speech and audio. There's lots of interesting stuff to talk about; obviously we don't have time, but I'll give you another nice example of how amazing Alec Radford is at creating datasets. So there's this Whisper model that came out of OpenAI not too long ago, which was trained on 680,000 hours of multilingual, multitask speech data, so speech with transcriptions. And they trained this very fancy thing on it, which actually is not very fancy at all: it's just the log-mel spectrogram, a representation of the audio signal, and then you feed that into a big transformer. So this is your encoder self-attention here, right? And then you have your decoder, where you have your cross-attention, and then you just generate the sequence. So this is a basic encoder-decoder transformer, but the input is one-dimensional convolutions over the log-mel spectrogram. And there are lots of papers that do very similar things. There are models like wav2vec that try to turn the wave signal into vectors, or you can discretize it in lots of different ways. So there's a wealth of literature there. I think one of the funny observations, actually, is that you can just reduce audio to vision anyway. That's sort of what you could argue this log-mel spectrogram does. But, not to toot my own horn a bit, in 2017 I did a paper where we showed that you can just take a raw audio sample, turn it into a spectrogram, really just a picture of what the audio file looks like, feed that to a regular convnet, even an AlexNet, and that gives you amazing auditory features. So now you can use this to distinguish between violins and guitars and things like that. So, you know, maybe you can just reduce all of this to vision. So one question you could ask is: can we also reduce language to vision, or vision to language? That's sort of what people are thinking about. So we talked about video; there was a question about video earlier. A lot of these ideas also extend pretty directly to video, but now you just have more data.
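To make the audio front end concrete before continuing with video: here is a minimal sketch of turning a waveform into a log-mel spectrogram and passing it through a 1D convolution, which is roughly the kind of input processing described for Whisper-style models. The window, hop, and channel sizes are illustrative choices, not Whisper's exact hyperparameters, and the random waveform stands in for a real recording.

```python
import torch
import torchaudio

# Fake 5 seconds of 16 kHz audio standing in for a real recording.
waveform = torch.randn(1, 16000 * 5)

# 80-bin log-mel spectrogram (parameters chosen for illustration only).
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=80)(waveform)
log_mel = torchaudio.transforms.AmplitudeToDB()(mel)   # (1, 80, time_frames)

# A 1D convolution over time turns the spectrogram into a sequence of
# vectors that a transformer encoder can consume.
conv = torch.nn.Conv1d(in_channels=80, out_channels=256, kernel_size=3, padding=1)
features = conv(log_mel).transpose(1, 2)               # (1, time_frames, 256)
print(features.shape)
```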
So, on video: Flamingo already had a bunch of different images in it, and you can do Flamingo over videos. But probably a lot of the frames are pretty useless for what you're trying to do with the video model, right? They're too similar; they don't really add that much information. So you want to subsample the frames so that you get the most useful information out of your video. And so there are a bunch of approaches that take the keyframes, and then you just do a standard joint vision-and-language transformer on top of that. So this is hopefully by now a very familiar recipe, right? And MERLOT is a nice architecture that does this, and then they came up with MERLOT Reserve, kind of a silly name, where they also added audio to the model. So this is now a trimodal model, and we're going toward this foundation model that can consume all of these different modalities in one go. That's really a clear trend in the field. Another very interesting direction, I think, one the field was very excited about for a while but which has sort of faded because it's too difficult to create lots of high-quality data in this setting, is simulated environments. So this is a paper from DeepMind from 2017 where they had an agent walk around in a maze, and it could follow natural language instructions. It could also generalize to novel words like "dax" and "blick" and to different groundings and assignments that you could set up in that environment. So this is a super interesting direction, I think, in the long term, because this is how humans learn language, right? We walk around in the world, we interact with our environment, we have all of these different perceptual observations, we synthesize them in our brain, we manipulate objects, we change our own viewpoint. And that's how we learn everything we know about the world. Our language is very intricately connected to that world and to how we observe it. So I think that might make a comeback at some point in the future. You can also do other stuff, especially with this kind of conditioning on text that we're seeing a lot of, right? So DALL-E 2 and Stable Diffusion and all of these different things, and the original ones we talked about at the beginning: you can do the same thing, but now you're generating 3D point clouds, right? So this is a 3D corgi, generated from a text prompt. And this prompt can probably become much more complex over time, so you can do sort of AutoCAD design and just say, give me a house, and it's going to design the whole house for you. You can tweak the prompt and things like that; that's all coming, or even already here in many cases. So the final modality I briefly wanted to talk about is olfactory embeddings. Olfaction means smell, if you didn't know. So it turns out, so my PhD thesis was about grounding semantics in different perceptual modalities. A lot of my work started in vision, and then audio is sort of the obvious next one, right? So you can learn the meaning of "violin", and then maybe you can learn what a violin looks like and what it sounds like, and that's going to give you a richer representation. But for a lot of these words, what's actually very primitive to their meaning is what they smell like.
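Going back to the frame-subsampling point for video a moment ago: below is a minimal sketch of uniformly subsampling frames from a decoded video tensor before handing them to an image-text model. Uniform sampling is the simplest stand-in for the smarter keyframe selection mentioned above; the tensor shapes and frame count are illustrative assumptions.

```python
import torch

def subsample_frames(video, num_frames=8):
    """Uniformly pick `num_frames` frames from a (T, C, H, W) video tensor.
    Real systems often prefer keyframes; uniform sampling is a common baseline."""
    total = video.shape[0]
    idx = torch.linspace(0, total - 1, steps=num_frames).long()
    return video[idx]

video = torch.rand(300, 3, 224, 224)   # ~10 s of 30 fps video, decoded elsewhere
frames = subsample_frames(video)       # (8, 3, 224, 224), ready for a vision-language model
print(frames.shape)
```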
So, smell: in our brains, that's really one of the core areas, and one of the oldest areas. So if you want to cover all of your perceptual modalities, you can try to build olfactory embeddings. It was kind of a joke paper I did, but the funny thing is it actually worked. So there's a catalog, the Sigma-Aldrich fine flavors and fragrances catalog, where you can look up words like "melon" and "pineapple", and it gives you all of the chemical compounds that produce that smell or taste. And so if you do that, you can count the occurrences, and then you can do SVD or something like that on top, to get a bit more of a real embedding model. So now you get smell embeddings, smell vectors, and then you can compute similarity judgments between these smells. So it turns out apple smells like pear, and chocolate and cocoa and sweet coffee are sort of related. You get these clusters of different smells just based on their chemical compounds. So this bag-of-chemical-compounds model gives you a very rich representation. And then you look at all of the words that are concrete enough to have a smell; so if you have a word like "democracy" in there, that doesn't really smell like anything, right? So you ignore "democracy" and just focus on the things that smell, or that could smell, I guess. And the really interesting thing to me is that this is much more correlated with human similarity judgments than the linguistic vectors we had at the time. So for a word like "apple", you can just get a word vector like you learned in your first lecture, with skip-gram and things like that. But that vector is not going to be as correlated with human similarity judgments as this bag-of-chemical-compounds model. So that's pretty interesting, right? Even something like smell, where maybe we think it doesn't really matter: if you really want to understand how humans understand language, then maybe you want to include this in your foundation model too. But I would start with the other modalities. All right. Okay. Yeah, sorry. Yeah. So, where to next? I think I've already said most of this, actually. So, one foundation model is going to rule them all. I mean, there will be many of these, but a lot of them are going to have very similar traits. I think we're going to be looking at scaling laws and trying to understand what the relationship between the different modalities really is, which ones we want more of, that sort of stuff. We're going to have retrieval augmentation; that's going to be really huge. If you've heard of RAG, or if you haven't, you should look it up. All of the parts of these models can also be multimodal. We need way better evaluation and better measurement; we already talked about that too. And that's all I had. Thank you.
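As a closing illustration of the bag-of-chemical-compounds idea described above (a toy sketch, not from the talk or the paper): build a word-by-compound count matrix, reduce it with truncated SVD, and compare words by cosine similarity. The word list, compound names, and counts here are made up purely for the example.

```python
import numpy as np

# Toy word-by-compound count matrix: rows are words, columns are chemical
# compounds listed for them in a flavors-and-fragrances catalog (made-up data).
words = ["apple", "pear", "chocolate", "coffee"]
compounds = ["ethyl butyrate", "hexyl acetate", "pyrazine", "furaneol"]
counts = np.array([[3, 2, 0, 1],
                   [2, 3, 0, 1],
                   [0, 0, 4, 2],
                   [0, 0, 3, 3]], dtype=float)

# Truncated SVD turns the raw counts into dense "smell embeddings".
U, S, Vt = np.linalg.svd(counts, full_matrices=False)
emb = U[:, :2] * S[:2]          # keep 2 latent dimensions

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print("apple~pear:", cosine(emb[0], emb[1]))
print("apple~chocolate:", cosine(emb[0], emb[2]))
# In the real setup, similarities like these are compared against human
# similarity judgments for concrete, smellable words.
```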