speaker 1: Hello, and thank you all for joining CS25 Transformers today. For today's talk, we have Ming Ding, a research scientist at Zhipu AI based in Beijing. He obtained his bachelor's and doctoral degrees at Tsinghua University, and he does research on multimodal generative models and pre-training technologies. He has led or participated in research on multimodal generative models such as CogView and CogVideo, and multimodal understanding models such as CogVLM and CogAgent. For today's attendance, the attendance form is up on the course website, and if you have any questions, ask them through Slido. For the code, you just have to input cs25. Thank you, Ming, for today's talk, and I'm going to pass it off to you.
speaker 2: Thank you to the instructors of CS25. I'm very happy to give a talk at Stanford University about multimodality pre-training. I have actually followed all the previous talks in CS25, and they cover really diverse topics. Some speakers shared intuitions from their research on pre-training; some shared recent work on, maybe, MoE and other techniques. I'm working at a large language model company in China, and our company works on pre-training across lots of different areas: large language models, multimodal models, generative models, diffusion, text-to-speech, and so on. I lead all the multimodal model research at Zhipu AI, so I will share lots of different topics in this talk. Some of them may not be very familiar to you, and that's okay, but you can get more information about different areas.
I will talk about several aspects of transformers, and I will generally follow the history of large language models. "Why are we here" is about the introduction and history of large language models. "How did we get here" is about some practical techniques for training large language models. "What are we working on" is about the vision language models of the last year and other techniques in the papers of the vision language model community. And finally, I will talk about some possible and valuable directions for research in multimodality.
Okay. I will share three moments that I think are the most important in the development of language models. The first is the BERT moment. I actually got into the area at this moment, and I'm honored to be among the first group of people who published papers in the year after BERT came out. At that time we didn't really know what language modeling would become, so nearly everyone was talking about how to get better self-supervised methods for NLP.
At that time, a common opinion was that the masked language model is good at understanding text, GPT, the autoregressive model, is better for text generation, and T5 can maybe do both but is redundant. But nowadays we would say that GPT has nearly become a silver bullet for all NLP problems. Things change, so we will look back from that point in time to see how language models changed and how we gained more and more knowledge about them. I was also among those who wanted to develop a new self-supervised learning method for NLP. We published a paper called GLM, where we wanted to unify BERT (the masked language model), the autoregressive model, and T5 in a decoder-only style. The method is actually very simple. We just select a part of the sequence and only do autoregressive modeling on that part. If we select the whole sequence as the masked area, it becomes GPT; if we select only part of it, it becomes BERT. We found this method very efficient: when, like BERT, about 15% of the sequence is masked, it performs better than BERT, and when we treat it as GPT, it performs the same as GPT. We were quite proud of that.
But the second moment, which I think is very important, is the GPT-3 moment. It tells us that the scaling law is very important. You can design different architectures, define different losses, different self-supervised tasks, and different methods to schedule different models, but the performance may have some upper bound. If you add more compute, though, you get a guaranteed performance improvement, and you can predict the resulting perplexity based on the fitted curve. So at that time, language modeling became more and more of an engineering problem. Once you have found a very good starting point, you train a language model.
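As a rough illustration of what "predict the perplexity based on the fitted curve" can mean, here is a minimal sketch that fits a pure power law, loss = a * compute^(-b), to a few small runs and then extrapolates to a larger budget. The functional form, the function names, and the data points are assumptions for illustration (the numbers are synthetic, not from the talk), and real scaling-law fits usually add an irreducible-loss term.

```python
import math


def fit_power_law(compute, loss):
    """Least-squares fit of log(loss) = log(a) - b * log(compute)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(v) for v in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)


def predict_loss(a, b, compute):
    """Extrapolate the fitted curve to a new compute budget."""
    return a * compute ** (-b)


# Synthetic "small runs" (assumed FLOP budgets and losses, for illustration only).
small_runs = [1e18, 4e18, 1.6e19, 6.4e19]
losses = [5.0 * c ** -0.05 for c in small_runs]

a, b = fit_power_law(small_runs, losses)
# Predicted loss if the budget is multiplied by four once more.
print(predict_loss(a, b, 4 * small_runs[-1]))
```

The point of the sketch is only that, once the curve is fitted on cheap runs, the loss of an expensive run becomes a prediction rather than a surprise.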
If you want to scale it and your boss gives you four times the money, you can buy four times the compute, and you just assign that compute to more parameters or to training on more tokens. This is called the scaling law, and it tells you how to allocate the different portions of your money. So at that time, language models didn't really need architectural or algorithmic innovation; it became an engineering thing.
The third moment, which I think is even more important, is the ChatGPT moment. It tells us a very important fact: task adaptation is cheap, and what is really important is the knowledge from pre-training. This is a very bitter lesson. As I told you, at that time we designed different losses and different architectures, and part of the aim of designing different losses was to perform different tasks. For example, the autoregressive model cannot fill in a blank in the sequence, but GLM and BERT can, so we used different training tasks. But now we know that task adaptation is very cheap: you just need to fine-tune your language model in the final stage. The only important thing is your loss.
The left figure is from InstructGPT, which is actually the paper behind ChatGPT, about how to align a pre-trained model into a chat model. It shows that alignment gives a very large improvement in human preference compared to the original pre-trained language model. The right figure is from a recent paper from our company. It tells us a very important fact, which may be intuitive: the performance on downstream tasks is related only to the pre-training loss.
And it is not directly related to the model size. That means that if a large model has a high loss because of a lack of training, and a small model is trained more to reach the same level of loss, they perform exactly the same on downstream tasks. So the so-called emergent abilities, and some other abilities people talk about, do not actually come from the number of parameters of the language model; they are related only to its loss. So language modeling has become a game of curve fitting; that is the current situation of language model research.
There are also some technical details of large language models. Even though we know it is now curve fitting, there are still a lot of important things. So we will go back to some basics and talk about the transformer architecture. An interesting thing is that the most important improvements nowadays still come from one of the authors of the transformer paper, Noam, and from his other papers. So the real innovation in the architecture is actually very small. I can summarize some common adaptations of the transformer currently.
The first is decoder-only. The original transformer is an encoder-decoder architecture, which is redundant, because the encoder and the decoder have to learn how to understand the text with separate parameters. Currently we only care about decoder-only architectures. The second is pre-norm. In the original transformer, the layer norm comes after the residual connection, which is called post-layer norm; currently we usually use pre-layer norm. Rotary position embedding is something special because it first appeared in a Chinese blog rather than in a paper, yet it has proven very effective. And grouped attention actually comes from another of Noam's papers; it can save inference memory.
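To make the rotary position embedding mentioned above a little more concrete, here is a minimal sketch. It assumes the common formulation in which adjacent pairs of dimensions are rotated by an angle proportional to the token position, with frequencies `base ** (-2k/d)` and `base = 10000`; the function names are made up for illustration, and production implementations often pair dimensions differently. The property worth noticing is that the dot product between a rotated query and a rotated key depends only on the distance between their positions.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate each adjacent pair (vec[i], vec[i+1]) by position * base**(-i/d)."""
    d = len(vec)
    assert d % 2 == 0, "RoPE needs an even number of dimensions"
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # i is already 2k, so this is base**(-2k/d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 2.0]
k = [1.1, 0.4, -0.5, 0.9]

# Both pairs of positions are 2 apart, so the attention scores should match.
score_near = dot(rope(q, 5), rope(k, 3))
score_far = dot(rope(q, 12), rope(k, 10))
print(score_near, score_far)
```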
The GLU variant is also from Noam; it is just a replacement for the MLP. Mixture of experts also comes from one of Noam's papers, and it lets you use the same FLOPs with more parameters to get better performance. So this is the architecture of the most advanced open-source language models, for example LLaMA.
Okay, we know the architecture, but how to train this transformer is also very important. We need to prepare a very powerful codebase to train a large language model. The first choice is DeepSpeed, a library from Microsoft, and some of the most important optimization methods come from the paper called ZeRO, from the DeepSpeed group. Several years ago, most of us did not really know how to train a very large model efficiently, but this paper gave us some advice. For example, we find that most of the memory consumption is actually the Adam optimizer states: you must keep them in full precision, as floats, along with the master weights, which are also floats. The parameters and gradients you can keep in half precision, so you get fast computation and save memory. ZeRO-1 scatters the master weights and optimizer states across all the data-parallel ranks, so if you have more ranks, more GPU cards, you use less GPU memory on each rank.
Another important technique is called activation checkpointing. It records only some intermediate states and recomputes the rest during the backward pass. So we don't need to record the whole computation graph; we just record some of the hidden states, which reduces the activation memory from many layers to one layer. There are other methods to reduce memory consumption, for example ZeRO-2, CPU offload, which means you can offload some GPU memory to the CPU, and ZeRO-3, also called fully sharded data parallel, where you shard your model across different cards.
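The memory argument behind the ZeRO stages described here can be written as simple arithmetic. The sketch below assumes mixed-precision Adam as characterized in the ZeRO paper: 2 bytes per parameter for half-precision weights, 2 for half-precision gradients, and 12 for the full-precision master weight plus the two Adam moments. The function name is hypothetical, and activations, buffers, and fragmentation are deliberately ignored.

```python
def model_state_bytes_per_gpu(num_params, dp_ranks, stage):
    """Approximate model-state memory per GPU for ZeRO stage 0, 1, 2 or 3."""
    param_bytes = 2 * num_params    # fp16/bf16 weights
    grad_bytes = 2 * num_params     # fp16/bf16 gradients
    optim_bytes = 12 * num_params   # fp32 master weights + Adam momentum + variance
    if stage >= 1:                  # ZeRO-1: shard optimizer states (incl. master weights)
        optim_bytes /= dp_ranks
    if stage >= 2:                  # ZeRO-2: also shard gradients
        grad_bytes /= dp_ranks
    if stage >= 3:                  # ZeRO-3 / FSDP: also shard the parameters
        param_bytes /= dp_ranks
    return param_bytes + grad_bytes + optim_bytes


# Assumed example: a 7B-parameter model on 64 data-parallel ranks.
for stage in (0, 1, 2, 3):
    gib = model_state_bytes_per_gpu(7_000_000_000, 64, stage) / 2**30
    print(f"ZeRO-{stage}: {gib:.1f} GiB of model states per GPU")
```

Even in this simplified accounting, stage 1 alone removes most of the per-GPU cost, which matches the remark that the optimizer states dominate memory.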
And when you use a parameter, you gather it from the other ranks. All these methods are complicated, but the DeepSpeed library already gives a very clean API for them, so currently it's not very hard to train a very large language model efficiently.
Megatron is another framework for training large language models, and it is also the best available framework for training super-large language models with more than 100 billion parameters. It uses another set of optimization methods. The first is called tensor parallel. Tensor parallel splits the hidden size and the heads across different ranks. It costs an additional all-reduce for attention and for the MLP, but it divides the parameter memory and the computation across the TP ranks. Pipeline parallel splits the layers across different ranks, and it introduces bubbles in the pipeline. There are some methods, for example interleaved or zero-bubble schedules, to reduce this cost. If you want to train a very large language model one day, you need to learn about all these systems things, because current large language model training is actually engineering work. NLP is not very important; what's important is MLSys.
Okay. Another very important thing is long context, actually lossless long context, which means we don't use sparse attention or other methods that change the full-attention behavior. The current infra for training long context is beyond the imagination of AI people four or five years ago. The left figure is from my paper published several years ago at NeurIPS. At that time there was no such thing as GPT-3; there was only BERT. So that paper used a very complicated schedule over different BERTs to mimic the retrieval, rehearsal, and forgetting processes in human working memory, to let the model understand a very long context step by step.
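The tensor-parallel split of the MLP described earlier in this passage can be checked on a toy example. The sketch below is a single-process simulation, not Megatron code: the first weight matrix is split by columns, the second by rows, each simulated rank computes a partial output, and the final sum plays the role of the all-reduce. The helper names, the ReLU activation, and the tiny shapes are assumptions for illustration.

```python
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(v, 0.0) for v in row] for row in m]


def split_columns(w, parts):
    step = len(w[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


def split_rows(w, parts):
    step = len(w) // parts
    return [w[i * step:(i + 1) * step] for i in range(parts)]


def mlp(x, w1, w2):
    """Reference single-device MLP: relu(x @ w1) @ w2."""
    return matmul(relu(matmul(x, w1)), w2)


def tensor_parallel_mlp(x, w1, w2, tp_size):
    """Each simulated rank holds a column shard of w1 and a row shard of w2."""
    partials = []
    for w1_shard, w2_shard in zip(split_columns(w1, tp_size), split_rows(w2, tp_size)):
        hidden = relu(matmul(x, w1_shard))       # local slice of the hidden activations
        partials.append(matmul(hidden, w2_shard))  # partial output on this rank
    # The all-reduce: element-wise sum of the partial outputs across ranks.
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


rng = random.Random(0)
x = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(2)]
w1 = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]
w2 = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(8)]

reference = mlp(x, w1, w2)
sharded = tensor_parallel_mlp(x, w1, w2, tp_size=4)
print(max(abs(a - b) for ra, rb in zip(reference, sharded) for a, b in zip(ra, rb)))
```

The column-then-row split works because the activation is element-wise, so only one communication step is needed per MLP.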
But now we can see that we can use system-level techniques to understand very, very long contexts, for example more than 100,000 tokens with full attention. So it's quite different from several years ago, and many things are greatly simplified because of this improvement. A key technique is called context parallel, which means we split the sequence across different ranks and use ring attention, Ulysses, or other techniques to finish the attention. There's a library called Transformer Engine, and all of these functions are wrapped in that library. We also need to handle the load balance of the attention so that every rank has the same amount of computation. This has changed a lot of research and applications in NLP. For example, several years ago we summarized and extracted facts from documents using BM25 or other retrieval methods, and now we can just use a transformer with full attention to get the information and understand it. It's quite an important improvement.
So using this very powerful infra, we can train very large language models. For alignment, the first stage is called SFT, supervised fine-tuning. It's actually very ordinary fine-tuning of the language model on high-quality data, and the high-quality data usually comes from human annotation. This human annotation is not just crowdsourcing. You need to hire experts from different domains to write high-quality answers to train the model. For example, if you want the model to write some code and explain the code in a well-formatted way, you need to hire a very experienced programmer to write examples to teach the language model. It's not just crowdsourcing; this is quite different from previous human annotation. We can also extract question-answer pairs from more powerful models like GPT-4 Turbo to train our model, but this is actually not allowed by OpenAI.
So you cannot use this method to develop a model that competes with them. But if it's for research, you don't need to worry about this. Does using that method mean you will never surpass GPT-4? I don't think so, because there's a paper called weak-to-strong generalization. And recall what I said just now: what's really important is your loss. If your loss is lower than your teacher model's, you can surpass your teacher model, even if you use SFT data from your teacher model.
Another stage of alignment is called RLHF, which uses reinforcement learning from human feedback to improve the model. But actually most open language models didn't use this method. The main reason is that PPO is very hard to implement. It can be very powerful if the reward model is good enough, but it's not easy to train. So there's an easier method that most open-source language models use. The DPO method comes from a paper from Stanford. We only need some preference pairs, and we use this formula to update the model. You don't really need a reward model; you just need some pairs. Maybe these pairs should be on-policy pairs, but it's much simpler and also very powerful.
So these are the basics of how to train language models currently. It seems like it has nothing to do with NLP; it's actually a party for the MLSys people. So what are the LLM people doing? Actually, the most important thing currently is data. Data cleaning, filtering, and synthesis are the most important thing at every large language model company, which is an open secret. The training infra is basically what I covered in the last several slides. Maybe there are some more advanced methods, but the improvement is maybe 20% or something like that.
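For reference, the DPO update mentioned above can be written in a few lines. The slide's formula is not reproduced in the transcript, so the sketch below assumes the standard per-pair loss from the DPO paper: the negative log-sigmoid of beta times the difference between the policy-versus-reference log-ratio of the chosen response and that of the rejected response. The function and argument names and the `beta=0.1` default are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss from sequence log-probabilities (no reward model needed)."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp        # log pi/pi_ref for the preferred answer
    rejected_ratio = policy_rejected_logp - ref_rejected_logp  # log pi/pi_ref for the dispreferred answer
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# When the policy equals the reference, the loss is log(2).
print(dpo_loss(-11.0, -11.0, -11.0, -11.0))
# Raising the chosen answer's likelihood and lowering the rejected one reduces the loss.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

The only inputs are log-probabilities of the two responses under the current model and a frozen reference model, which is why preference pairs alone are sufficient.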
But if you have better data, the improvement in your language model's performance is quite obvious. Things like the language model itself are what the media mostly talks about, but actually most of the ML engineers at large language model companies are cleaning data. So is this something a Stanford graduate student should do? Maybe someone thinks it's very low-level: "I want to design new algorithms and architectures; that is real ML research." But my opinion is that data, algorithms, and architectures can transform into each other. Data is the most general, but if you don't have enough compute, it can be very hard to fit that kind of data. Algorithms are hard to implement and not very general. And with architecture, it's hard to get what you want; designing a new architecture is very hard.
I will take a multi-hop question answering task as an example. The right figure is from CogQA, also one of my papers from when I was a PhD student. The task is that we have a very complex question and need to find the answer from several documents, but you need to find the reasoning chain between different documents to get the final answer. At that time I proposed a method involving BERT and graph neural networks. It was very complicated, and in the end I got very good performance, ten points better than the prior method. This is algorithm or architecture innovation: it's very fancy and got very high scores in ACL review. There was also concurrent work using MCTS (Monte Carlo tree search) with BERT and things like that, which looks like algorithm-level innovation to solve this problem. But now this problem can be easily solved by a very long-context GPT and chain-of-thought reasoning.
If you put nearly all the documents into your context, you don't need things like graph neural networks or MCTS to jump between the documents. You have all the context, and you can just finish the task using chain of thought. That's a data-level solution. The data-level solution is of course the simplest one, because you just add the data to your training corpus and you can finish this task without affecting other tasks. So data cleaning, filtering, and synthesizing is not easy work, and it's actually very important. We should transform our view of data, algorithms, and architectures to fit the current era.
So I have introduced some knowledge about language models, and I will jump into the second part, which is vision language models in the past year. In the past year we have seen vision language models jump from very silly ones to the current very powerful ones. I will start from BLIP-2, which I think is maybe the first work to bridge CLIP and a large language model, giving the language model the ability to understand images. If we have an image encoder from CLIP and a large language model from anywhere, you can insert a transformer called the Q-Former to extract important features from the image encoder and insert these features into the large language model. But the spaces of image features and text features are different, so the Q-Former is trainable: you need lots of text-image pairs to align the space of image features with the space of text features. That is what the Q-Former does. But there's a simpler method called LLaVA. It just uses a simple projection weight to transform the features from the image encoder into features at the input of the large language model. It quickly became the most popular architecture for vision language models.
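The LLaVA-style bridge described above is small enough to sketch directly. The example below assumes a single trainable linear layer that maps each image patch feature from the vision encoder's width to the language model's embedding width, after which the projected image tokens are simply concatenated in front of the text embeddings. The dimensions, the random features, and the helper name `linear` are toy assumptions, not values from any released model.

```python
import random


def linear(features, weight, bias):
    """Apply y = x @ W + b to every row of `features` (W has shape in_dim x out_dim)."""
    return [
        [sum(x * w for x, w in zip(row, col)) + b for col, b in zip(zip(*weight), bias)]
        for row in features
    ]


rng = random.Random(0)
VISION_DIM, LLM_DIM = 6, 8          # toy sizes, assumed for illustration
NUM_PATCHES, NUM_TEXT_TOKENS = 4, 3

# The only new trainable parameters in this scheme: one projection matrix and bias.
weight = [[rng.gauss(0, 0.02) for _ in range(LLM_DIM)] for _ in range(VISION_DIM)]
bias = [0.0] * LLM_DIM

# Stand-ins for the image encoder output and the text token embeddings.
patch_features = [[rng.gauss(0, 1) for _ in range(VISION_DIM)] for _ in range(NUM_PATCHES)]
text_embeddings = [[rng.gauss(0, 1) for _ in range(LLM_DIM)] for _ in range(NUM_TEXT_TOKENS)]

image_embeddings = linear(patch_features, weight, bias)
llm_input = image_embeddings + text_embeddings  # image tokens first, then text tokens
print(len(llm_input), len(llm_input[0]))
```

Compared with a Q-Former, nothing here compresses or selects image features; the language model sees one projected token per patch.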
CogVLM is a work from our group. The motivation of CogVLM is to keep all the language behavior while adding image understanding ability to the language model. For LLaVA and the previous methods, you can actually train the language model and get better performance on multimodal tasks, but the language ability of the model will be reduced if you train the language model during text-image alignment. So we use a vision expert to add new parameters in the backbone. The new vision expert only deals with the image features, and the original weights in the FFN layers and QKV matrices deal with the original text features. So the original behavior of the language model is kept, and we add lots of new parameters to train, which gives better performance for multimodal models. CogVLM achieves state-of-the-art performance on several benchmarks, including image captioning, grounding, VQA, and some other vision language model benchmarks, and it is also open source. You can download it from our GitHub. I found that it was downloaded more than 500,000 times in the last month, so I think it has already helped lots of people.
CogAgent is another work from our group. It uses a different architecture because we want high resolution, with cross attention. Why cross attention? Because we want high-resolution input, but we don't want the hidden size of the high-resolution branch to be the same as the language model's hidden size, which is very large. So we use cross attention to deal with the high-resolution channel. It's slightly complicated, but the performance is really good. This model is actually trained to be a web agent: it just takes a screenshot as input and performs different operations on the screenshot. For example, this is an example of searching for last year's best paper at CVPR.
So we ask the model this question, and it tells us that we need to type "best paper of CVPR 2023" in the box at this position. Step by step, we finally get the information. We can also use this method to book tickets or do some other tasks. This is also open source.
Some other popular architectures for vision language modeling use different vision features as input, which largely improves OCR performance. But what I want to stress is that in our most advanced vision language model, GLM-4V, we actually use a simpler architecture. It's a small adaptation of LLaVA: we just replace LLaVA's projection weight with a strided convolution to support high-resolution input while keeping the computation in the language model under control. Using this architecture, we can train the vision language model mixed with text, and in the end we get good performance. We can say that GLM-4V is on par with GPT-4V, Gemini, or Claude 3. It performs better on OCR benchmarks, for example document QA, and it performs much better at Chinese OCR. This is an example from our most advanced GLM-4V model. You can download our app from the chatglm.cn website. This is actually a draft that is very hard to read, but the model can analyze it very accurately and translate what is really written. You can try our model for free on this website.
Okay, we have had some introduction to vision language understanding. It's more about engineering, but it is multimodality. The other half of vision language research is image generation, which is also relevant to transformers, so I will also introduce the story of image generation. Three or four years ago, we already knew that GPT was very powerful, so we wanted to autoregressively model text-to-image generation using the GPT architecture. This is the work of CogView.
It's also my work, from 2021. It's a very simple framework. We know that GPT can only predict a multinomial distribution, so we need some method to represent the image in a discrete way. Around 2020 there was a paper called iGPT from OpenAI, which trained directly on pixel-level sequences, but the sequence is very long, so you cannot train on high-resolution images. So we first train an image tokenizer, which is actually a VQ-VAE, to discretize the image into tokens. Then you prepare a sequence of text and image tokens, with the text first and the image after, and you can use a GPT to train on this kind of sequence. Finally, during inference, you first input the text and then predict the image token by token, and you can generate an image. This is a very simple idea. A concurrent work called DALL-E and the most powerful work of this kind, called Parti, use the same idea.
Okay, so we know that we can generate images using GPT. A very natural idea is: can we achieve universal modeling for vision language tasks? If we just tokenize the image like the text, we can generate text from an image, generate an image from text, or generate only text. I also did this maybe two years ago. The algorithm is also very simple: in the sequence, you just change the positions of the text and image parts. If the text comes first, then the image, and you mask all the text, it's text-to-image generation. If the image comes first, then the text, it's image captioning. You can also use other formats, like a masked autoencoder or something like that. But the problem is that when you compare this universal modeling system to diffusion models or vision language models, you find that its image generation is worse than diffusion and very slow compared to diffusion.
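The "change the order of text and image in the sequence" idea above can be sketched as a small data-preparation function. This is an illustration under assumptions, not the actual training code: it assumes the image has already been turned into discrete token ids by a tokenizer, that "mask" means excluding the conditioning part from the next-token loss, and that two boundary markers (`BOI`, `EOI`, hypothetical names) delimit the image tokens.

```python
BOI, EOI = "[BOI]", "[EOI]"  # hypothetical begin/end-of-image marker tokens


def build_example(text_tokens, image_tokens, task):
    """Return (tokens, loss_mask); loss_mask[i] == 1 where the model is trained to predict."""
    text_part = list(text_tokens)
    image_part = [BOI] + list(image_tokens) + [EOI]
    if task == "text_to_image":
        # Text first as the condition, image after as the prediction target.
        tokens = text_part + image_part
        loss_mask = [0] * len(text_part) + [1] * len(image_part)
    elif task == "captioning":
        # Image first as the condition, text after as the prediction target.
        tokens = image_part + text_part
        loss_mask = [0] * len(image_part) + [1] * len(text_part)
    else:
        raise ValueError(f"unknown task: {task}")
    return tokens, loss_mask


caption = ["a", "cat", "on", "a", "sofa"]
image_ids = [17, 4, 256, 91]  # stand-ins for tokenizer codebook indices

print(build_example(caption, image_ids, "text_to_image"))
print(build_example(caption, image_ids, "captioning"))
```

One model and one next-token objective cover both directions; only the ordering and the loss mask change.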
For image understanding, it performs worse than vision language models, because when you transform your image into discrete tokens, lots of information is lost in the process. So using this method you can achieve universal modeling, but you only achieve universal modeling; you cannot achieve the best performance on any task. So the diffusion method actually won the game of image generation, not the autoregressive method. Although in the NLP domain the autoregressive method is dominant, in image generation the winner is diffusion.
So what is diffusion? Diffusion is a totally different self-supervised learning method compared to the autoregressive method. You can also think of it as autoregressive in the Fourier domain or something like that. DDPM, the original paper on diffusion models, is still the most popular framework for diffusion modeling. We define a lot of steps in which we gradually add noise to a clean image, so we get different intermediate states, and we train a model to predict the noise, the original image, or something like v, the velocity, given the noisy image. So it's totally different. The biggest advantage of diffusion models over autoregressive models is that during sampling we can fully utilize the GPUs. In an autoregressive model, when we decode one token at a time, we actually waste the power of the GPU; GPU utilization is very low if the batch size is small, say equal to one. But a diffusion model takes the whole image as input, so it can utilize the GPU and sample much faster than an autoregressive model. Okay.
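The "gradually add noise and train a model to predict it" recipe above has a compact closed form in DDPM: a noisy state at step t can be sampled directly as sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise. The sketch below assumes the linear beta schedule commonly used with DDPM (1e-4 to 0.02 over 1000 steps) and treats an "image" as a flat list of floats; the function names are illustrative, and the denoising network itself is left out.

```python
import math
import random


def linear_betas(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, increasing linearly."""
    return [beta_start + (beta_end - beta_start) * i / (num_steps - 1) for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s): how much signal survives at step t."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def add_noise(x0, t, alpha_bars, rng):
    """Sample the noisy state x_t directly from the clean image x0."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, noise)]
    return xt, noise  # `noise` is the regression target for a noise-prediction model


alpha_bars = cumulative_alphas(linear_betas())
rng = random.Random(0)
x0 = [0.5, -0.25, 1.0, 0.0]   # toy "image"
xt, target = add_noise(x0, 500, alpha_bars, rng)
print(alpha_bars[0], alpha_bars[500], alpha_bars[-1])
```

A training step would feed `xt` and `t` to the network and regress its output onto `target`; predicting the clean image or a velocity instead only changes that regression target.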
Relay Diffusion is our recent work. It solves a problem in diffusion with the noise schedule across different resolutions. First, you can see in the left image that there are actually three images. A and B are two images with different resolutions and the same noise level, but A actually looks more blurred to us when we observe it. The problem is that we add independent noise, while the original signal, the image, is not independent across space. So if we want to transfer a noise schedule from low resolution to high resolution, we need to use block noise to find the equivalent on the high-resolution image. In the end we can keep the SNR in the frequency domain the same. Using this method, we can disentangle the noise schedule from the network we use for diffusion: we use one noise schedule and don't care about the resolution. We just use block noise when we want to continue diffusion at a higher resolution. The speed also improves, because in the high-resolution phase we don't need to regenerate the image from scratch conditioned on the low-resolution image.
We also scaled up Relay Diffusion to CogView3 after the paper. CogView3 is a large diffusion model, and after distillation it can be very fast because of the efficiency of Relay Diffusion.
Okay, finally we get to something relevant to our topic, transformers. The previous works on diffusion were based on UNet, and using a transformer in diffusion is not trivial. The first work that I think is solid enough is DiT from Meta. The author of this paper is also an author of Sora.
The most important difference between the original transformer and DiT is adaLN, which predicts different scales and shifts conditioned on the timestep. It actually needs a very large number of parameters: six times the hidden size, nearly equal to the QKV width per layer, but the input is only one integer. That is actually very strange, because the input is only one integer and you need millions of parameters to transform it. So there are some methods to reduce this, in our practice.
Stable Diffusion 3, released recently, uses another architecture called MMDiT. Stable Diffusion 3 first used our released CogVLM to recaption the images, and then trained a latent diffusion model using this new architecture. The new architecture looks very complicated, but the most important thing is that they use vision and text experts, like CogVLM, instead of cross attention to T5 features like the previous ones.
Finally, we will talk briefly about video generation, because Sora is currently very popular. We published a video generation work several years ago, CogVideo, which was published at ICLR. It's maybe the first open-source large model for text-to-video generation, but it performs much worse than the current Sora because it's autoregressive. Using diffusion, we can do better, and we are currently also working on a replication of Sora-like models. We can summarize that most of Sora's improvements come from these aspects. First, there's no flickering in the videos, and it can generate high-quality frames. The first one, flickering, can be solved by a 3D latent encoder-decoder, and if you train a diffusion decoder it could be even better. The high quality should be attributed to scaling up, and it requires very high resolution. This is related to the long-context and context-parallel techniques in the language model infra, which I introduced at the beginning of this lecture.
So the most important thing is that they brought the infra from language model training into diffusion, which makes it very easy to scale up, and to scale up much larger than other companies. And finally, the most important thing is data coverage. It needs very heavy data engineering and video recaptioning.
Okay. So I have introduced many topics in current multimodality pre-training and some problems in the transformer community. Here are some trends that I think will happen in the multimodality area in the next one or two years. We will be able to easily recognize and ground all the common things, attributes, human expressions, and lots of other high-level vision concepts, and all of this will be very cheap and basically solved. At that time, the long-tail problem of autonomous driving could be, if not solved, largely alleviated.
The second prediction is that video understanding will become very important in the next one or two years, because it's very useful. We have lots of video on the Internet and in our everyday lives, but it's very hard, and currently we cannot understand video well. The most powerful video understanding model currently is Gemini 1.5, but it still has lots of hallucinations, wrong counting, and other weaknesses, so there's a lot of room to improve. Another thing is that we now have enough compute to deal with video, especially in the next one or two years, because of the next generation of NVIDIA GPUs and the demand from large language models.
Another important thing is embodied AI. Embodied AI will be more and more important in research, and it will be very closely related to multimodality research, although it cannot impact our real lives within a few years. We now have planning ability with large language models.
We can recognize all the things with vision language models, and there will be some chances to get some new abilities and very astonishing demos of this embodied AI. But these may be very expensive and cannot yet be used in everyday life.
So what should we do at that time? For me, a researcher in a large language model company, we have enough compute resources. But for others, I think if you are a senior researcher, just follow your heart. If you want to quickly gain some citations, papers, and impact, I think maybe you can consider video understanding models, datasets, and benchmarks. Datasets and benchmarks especially are very important and in great need in the video understanding community. Yeah. And for multimodality, there's another topic I haven't talked about in this lecture, which is speech or audio. I recently learned some knowledge about audio, and I lead a speech AI group at Zhipu AI. So I'm not a researcher in audio, but I can see that speech AI is underestimated. It's actually very important for user needs and applications, but there are not enough GPUs and researchers put into this area compared with language models. Yeah.
Finally, if you want to do some really useful and impactful AI research, which is very risky, my advice is that you need to make friends with some systems PhD students at once, because the best algorithm must fully utilize the current GPUs and other hardware. Yeah, so you just need to know some systems PhD students. Another direction, more difficult but influential, is that there's actually some room for new architectures, self-supervised learning methods, and optimizers, because the next generation of hardware will be totally different. So maybe the transformer will have some competitors, and so will the autoregressive modeling method. So there's some room, but it's very hard and needs some computational resources.
And finally, new ways to transform compute into high-quality data are very important, because the high-quality web data has actually been crawled by almost every large language model company, and it's currently not enough. So we need to find some new ways to transform compute into high-quality data. For example, how to synthesize new data using code execution results, maybe using MCTS, reinforcement learning, or some other methods. This will be a very big area in the next few years. Yeah, I think I will end this lecture here. Thank you to the instructors and the audience. Thank you very much. If you have any questions, you can send an email to this address and I will answer all the questions. Thank you very much.
speaker 3: Thank you very much, Ming, for the amazing talk and all the useful advice. So we have some questions. I got one through Zoom, and there are also several on Slido. So Emily, are there any in-person questions from your end?
speaker 2: Okay, if anyone has questions, you can type them in the Zoom chat if you are using Zoom.
speaker 3: Let me see. Okay, yeah, here are some questions on Slido that I'll ask. The first is that the success of long context windows must come at a cost. What is this cost?
speaker 2: Because it consumes a very long time. You need to run your inference engine for a very long time. Actually, the current inference systems for large language models can be split into two phases. One is prefilling: you need to input a very long context into your engine. The other is decoding, where you generate token by token. In most use cases, the model does not generate a very long text; it needs to understand a long text and then generate very few tokens about the question. So we can bear maybe one minute to let the language model run the long-context understanding and then begin to answer your question. So this is the cost: you need to wait for maybe several seconds or one minute. Yes.
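The prefill and decode split described in this answer can be sketched with a toy cost model. The code below only counts query-key attention pairs under causal attention with a KV cache; it ignores the feed-forward layers, batching, and hardware effects, and the function names and token counts are illustrative assumptions rather than measurements from any real system.

```python
def prefill_attention_pairs(prompt_len: int) -> int:
    """Causal attention over the prompt: the i-th token attends to i positions."""
    return prompt_len * (prompt_len + 1) // 2


def decode_attention_pairs(prompt_len: int, new_tokens: int) -> int:
    """With a KV cache, the i-th generated token attends to prompt_len + i positions."""
    return new_tokens * prompt_len + new_tokens * (new_tokens + 1) // 2


if __name__ == "__main__":
    prompt_len, new_tokens = 100_000, 100  # long document, short answer (assumed)
    prefill = prefill_attention_pairs(prompt_len)
    decode = decode_attention_pairs(prompt_len, new_tokens)
    print("prefill pairs:", prefill)
    print("decode pairs: ", decode)
    print("prefill share of attention work:", prefill / (prefill + decode))
```

In this simplified picture, a long prompt with a short answer spends almost all of its attention work in the prefill phase, which matches the waiting time before the first answer token that the speaker describes.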
speaker 3: Great. Oops, I was muted, but yes, thanks. That makes sense. So there are two questions, which are pretty similar, both upvoted on Slido, about the quality of data. Recently, folks have been saying that the quality of data is what really determines final model performance, compared to anything else. Do you agree? And related to this, do you think there's still a lot of work to do on improving model architectures, or has attention shifted to focus on data?
speaker 2: Yeah, yeah. I think this is very reasonable; actually, what the whole community is doing is improving the data. I talked about this opinion in the lecture: the architecture, the algorithm, and the data can transform into each other. If you have some idea, you can inject the inductive bias into the architecture, you can design a new algorithm, or you can prepare some data to tell your model to act like that. So in many of the very special cases, you can use data to solve the problem, and so high-quality data is more important than architecture updates for many tasks. But I think if you can find a general update to the transformer, it's very valuable. If it just increases the power of the model to fit the data, it is very, very valuable. Yes.
speaker 3: All right, great. Here's a question. Why is the autoregressive architecture inferior to diffusion in image generation?
speaker 2: Yeah, it's very complicated. This question is actually very complicated. Diffusion is totally different from autoregressive modeling in some respects. But the most important thing, which I talked about in the lecture, is the speed of generation. For an autoregressive model, if you use a very large model and train for a very long time, I believe we can get a very good result. We can also generate high-quality images using autoregressive methods. This is okay, but the time to generate an image is very, very long, because we need to predict token by token, and a high-resolution image may be thousands and thousands of tokens. For diffusion, we use several steps of feed-forward passes over the whole image; we don't need token-by-token prediction. It will be thousands of times faster than an autoregressive model if you are generating high-resolution images. So this is a very obvious advantage. As for modeling power, I think the most important thing is that some spatial relations are actually not modeled well by the autoregressive model, because the top-left pixel and the bottom-right pixel are very far apart in an autoregressive model, but in a diffusion model they can see each other, so it's not a problem. The autoregressive model has this position problem, so it's not easy for it to model very complicated 2D spatial problems. This is also a possible reason. But I cannot give a very good answer to this question. There should be more research about that. Yeah, thank you.
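The two points in this answer, sequential generation cost and raster-order distance, can be illustrated with a small counting sketch. It compares only the number of sequential model calls, not the cost of each call, and the image size, the 8x tokenizer downsampling factor, and the 50 sampling steps are assumed values chosen for illustration.

```python
def grid_tokens(height: int, width: int, downsample: int) -> int:
    """Number of latent tokens for an image under an assumed tokenizer stride."""
    return (height // downsample) * (width // downsample)


def sequential_pass_ratio(num_tokens: int, diffusion_steps: int) -> float:
    """Autoregressive decoding needs one sequential pass per token;
    diffusion needs one pass over all tokens per sampling step."""
    return num_tokens / diffusion_steps


def raster_distance(row_a, col_a, row_b, col_b, grid_width):
    """Distance between two grid cells once the image is flattened row by row."""
    return abs((row_b * grid_width + col_b) - (row_a * grid_width + col_a))


if __name__ == "__main__":
    tokens = grid_tokens(1024, 1024, downsample=8)  # 128 x 128 latent grid
    print("tokens per image:", tokens)
    print("sequential pass ratio:", sequential_pass_ratio(tokens, diffusion_steps=50))
    # Top-left and bottom-right cells end up at opposite ends of the sequence.
    print("corner distance:", raster_distance(0, 0, 127, 127, grid_width=128))
```

The ratio grows with resolution because the token count scales with image area while the number of diffusion steps stays fixed, and the corner distance shows how two-dimensional neighbors and opposites are stretched along a one-dimensional sequence.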
speaker 3: Right. Great. Thanks for that detailed answer. So someone is asking: how is the CogAgent model different from the CogVLM model?
speaker 2: Oh yeah, the CogAgent model is actually fine-tuned from the CogVLM model, but CogAgent deals with high-resolution and web screen cases. Our motivation is that high-resolution input for web pages is very important, because there are many words, many icons, and other very small things, and you need a very high-resolution model to deal with them. But if you just extend the input resolution of CogVLM, the compute consumption is very high. So we add a cross-attention module to CogVLM to get CogAgent. The cross-attention module is much more lightweight, so we can deal with high resolution more efficiently. Yes.
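A back-of-the-envelope sketch can show why routing high-resolution features through cross-attention is cheaper than lengthening the main sequence. It counts attention score pairs only and ignores hidden sizes and the image encoder itself. The resolutions (1120 and 224), the patch size of 14, and the 512 text tokens are illustrative assumptions, not figures stated in the talk.

```python
def patch_tokens(side: int, patch: int) -> int:
    """Number of patch tokens for a square image of the given side length."""
    return (side // patch) ** 2


def self_attention_pairs(seq_len: int) -> int:
    """Full self-attention compares every position with every position."""
    return seq_len * seq_len


def cross_attention_pairs(query_len: int, key_len: int) -> int:
    """Cross-attention compares each query position with each key position."""
    return query_len * key_len


if __name__ == "__main__":
    text = 512                      # assumed text length
    hi_res = patch_tokens(1120, 14)  # high-resolution image tokens (assumed)
    lo_res = patch_tokens(224, 14)   # low-resolution image tokens (assumed)

    # Option A: put all high-resolution tokens into the main sequence.
    concat = self_attention_pairs(text + hi_res)

    # Option B: keep a short main sequence and cross-attend to high-res features.
    side_branch = (self_attention_pairs(text + lo_res)
                   + cross_attention_pairs(text + lo_res, hi_res))

    print("concatenated self-attention pairs:", concat)
    print("short sequence + cross-attention: ", side_branch)
```

With these assumed sizes the cross-attention variant needs several times fewer attention comparisons per layer, since the quadratic term applies only to the short main sequence.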
speaker 3: Great. Here's a question about video. How do you think video understanding will aid AI's ability to have a stronger physical understanding of the world?
speaker 2: Okay, that is a very good question. My answer is yes, but it is actually a dialectical problem, because if you don't have some data source which contains the physical rules, you cannot train a good video understanding model with the current vision language model training methods. We need text-image or text-video pairs to train, and we actually did not use any self-supervised learning on images or video, so we cannot learn any knowledge from pure video or images. We are actually dealing with annotated data from the human side. So if we want to better understand the physical world using unannotated videos, we need to find some new methods for self-supervised learning or training. Yeah, this is a very good question. Thank you.
speaker 3: Right. Okay, a couple more questions. Someone is asking: are there VQA tasks that involve multiple turns of conversation in a tree structure, similar to a tree-of-thoughts or beam-search style?
speaker 2: Okay, okay. Yeah, maybe, but I still think they are different, and the tree-of-thoughts way could be better because it is aware of other path information, for example, the wrong paths, the failed cases, something like that. In my experience, if you can include all the context in your input, you always get better results. So yeah, whether it is tree of thoughts or some other procedure and other information, you just include them in the context. The language model will learn how to deal with them and will understand better than beam search, which is actually a hard-coded method that compares probabilities. It should be better if you do it right. Yes.
speaker 3: Right, thanks. That's all the time we have for questions. So thanks again to Ming for the great talk and the detailed answers to all the questions.