2024-05-30 | Stanford CS25: V4 I From Large Language Models to Large Multimodal Models
The lecture was given by Ming Ding, a research scientist at Zhipu AI, who systematically reviewed the development of large language models and the latest research progress in multimodal models. Starting from the early exploration of language models based on self-supervised methods, he covered early attempts to unify masked and autoregressive training objectives, then the GPT-3 era in which performance improved steadily through large-scale compute and parameter scaling, emphasizing the role of scaling compute in turning model development into an engineering discipline. He then focused on the ChatGPT-era observation that task adaptation is cheap and pretraining knowledge is what matters, noting the decisive influence of training loss on downstream performance. The lecture also covered updates to the technical details of the Transformer architecture, such as pre-normalization, rotary position embeddings, and grouped attention, as well as the use of DeepSpeed with zero-redundancy optimization, activation checkpointing, and other techniques in large-scale model training, and pointed toward directions for multimodal systems and future research.
Tags
Media details
- Upload date
- 2025-05-18 15:43
- Source
- https://www.youtube.com/watch?v=cYfKQ6YG9Qo
- Processing status
- Completed
- Transcription status
- Completed
- Latest LLM Model
- gemini-2.5-pro-exp-03-25
Transcript
speaker 1: Hello, thank you all for joining CS25: Transformers today. For today's talk, we have Ming Ding, a research scientist at Zhipu AI based in Beijing. He obtained his bachelor's and doctoral degrees at Tsinghua University, and he does research on multimodal generative models and pretraining technologies. He has led or participated in research on multimodal generative models such as CogView and CogVideo, and multimodal understanding models such as CogVLM and CogAgent. For today's attendance, the attendance form is up on the course website, and if you have any questions, ask them through Slido; the code is CS25. Thank you, Ming, for today's talk, and I'm going to pass it off to you. speaker 2: Thank you to the instructors of CS25. I'm very happy to give a talk at Stanford about multimodal pretraining. I have followed the previous talks in CS25, and they cover really diverse topics: some share intuitions from their research about training, some share recent work about MoE and other techniques. I work at a large language model company in China; our company works on training across many different areas, from large language models and multimodal models to generative diffusion models and text-to-speech. I lead the multimodal model research at Zhipu AI, so I will share several different topics in this talk. Some of them may not be very familiar to you, but that's okay; you can get a broader view of the different areas. I will talk about several aspects of transformers, generally following the history of large language models. First, "why are we here": an introduction to large language models and their history. Second, "how did we get here": some practical techniques for training language models. Third, "what are we working on": the last one year of vision-language models and other techniques from the community. And finally, I will talk about some possible and valuable directions for research in multimodality. Okay. I will share three moments that I think are the most important in the development of language models. The first is what I call the birth moment. I actually got into the area at this moment; I'm honored to be among the first group of people who published papers in this area in the year after BERT came out. At that time, we did not really know what language modeling would become, so nearly everyone was talking about how to design better self-supervised objectives for NLP. A common opinion was that the masked language model (BERT) is good at understanding text, the autoregressive model (GPT) is better for text generation, and T5 can maybe do both but is redundant. Nowadays we would say that GPT has become nearly a silver bullet for all NLP problems. Things change, and looking back from that point we can see how language models evolved and how we accumulated more and more knowledge about them. At that time I was also one of the people trying to develop new self-supervised learning methods for NLP. We published a paper called GLM, and we wanted to unify the masked language model (BERT) and the autoregressive model (GPT) in a decoder-only style. The method is very simple.
We just select a part of the sequence and only do autoregressive modeling within that span. If the selected mask area covers the whole sequence, it becomes a GPT; if it covers only part of it, it behaves like BERT. We found the method very efficient: when used like BERT, masking about 15% of the tokens, it performs better than BERT, and when treated as a GPT, it performs the same as GPT, which we were quite proud of. The second moment I think is very important is the GPT-3 moment. It tells us that the scaling law is very important. You can design different architectures, different losses, different self-supervised tasks and different training schedules, but the performance may have some upper bound. If you add more compute, you get a guaranteed performance improvement, and you can predict the resulting perplexity from the fitted curve. At that time, language modeling became more and more of an engineering discipline. If you have a good recipe and your boss gives you four times the money, you can buy four times the compute, and you just decide how to allocate that compute between more parameters and more training tokens. This is what the scaling law tells you: how to allocate your budget. So language models stopped really needing architectural or algorithmic innovation; it became an engineering matter. The third moment, which I think is even more important, is the ChatGPT moment. It tells us a very important fact: task adaptation is cheap, and what really matters is the knowledge from pretraining. This is a bitter lesson. As I said, back then we designed different losses and architectures, and part of the aim was to support different tasks; for example, the autoregressive model cannot fill in blanks in a sequence, but GLM and BERT can, so we used different training objectives. Now we know that task adaptation is very cheap: you just fine-tune your language model as the final step, and the only important thing is your loss. The left figure is from InstructGPT, which is essentially the paper behind ChatGPT and how to align a model into a chat model. It shows that alignment gives a huge improvement in human preference compared to the original pretrained language model. The right figure is from a recent paper from our company. It tells us a very important, maybe intuitive fact: the performance on downstream tasks is only related to the training loss, not directly to the model size. That means a large model with a very high loss because of insufficient training, and a small model trained longer to reach the same loss, perform exactly the same on downstream tasks. So the so-called emergent abilities and other seemingly strange phenomena do not come from the number of parameters of the language model; they are only related to its loss. Language modeling has largely become a game of curve fitting; that is the current situation of language model research. But there are still some technical details of large language models.
Even though it looks like curve fitting, there are a lot of important things, so let's go back to basics and talk about the Transformer architecture. A very interesting observation is that the most important improvements used today still come from one of the authors of the original Transformer paper, Noam Shazeer, and from his later papers; the real innovation in the architecture since then has been quite small. I can summarize the common adaptations to the Transformer currently in use. First, decoder-only: the original Transformer is an encoder-decoder architecture, which is redundant, because the encoder and decoder each learn how to understand text with separate parameters; currently we mostly care about decoder-only architectures. The second is pre-norm: in the original Transformer layer the normalization is applied after the residual connection, called post-LayerNorm, and currently we usually use pre-LayerNorm. Rotary position embedding is something special because it was not first published in a paper but in a Chinese blog post, yet it has proven very effective. Grouped-query attention comes from another group, and it saves inference memory. GLU variants such as SwiGLU are also from Noam, just a replacement of the MLP, and mixture of experts is also from Noam's papers: you can use the same FLOPs with more parameters to get better performance. This is the architecture of the most advanced open-source language models today, for example Llama. Okay, we know the architecture, but how to train this Transformer is also very important: we need a very capable codebase to train a large language model. The first choice is DeepSpeed, a library from Microsoft, and some of its most important optimization methods come from the paper called ZeRO from the DeepSpeed group. Several years ago some of us did not really know how to train a very large model efficiently, but that paper gives concrete advice. For example, we find that most of the memory consumption is actually the Adam optimizer states, which you must keep in full precision (float32), and the master weights, which are also float32; the parameters and gradients you can keep in half precision, so you get fast computation and save memory. ZeRO stage 1 can scatter the master weights and optimizer states across all the data-parallel ranks, so if you have more ranks, more GPU cards, each rank uses less GPU memory. Another important technique is called activation checkpointing: you record only some intermediate states and recompute the rest during the backward pass. So we do not need to record the whole computation graph, we just record some of the hidden states, reducing the stored activations from many layers to roughly one layer's worth. And there are other methods to reduce memory consumption, for example ZeRO stage 2 and CPU offload, which moves some GPU memory to the CPU, and ZeRO stage 3, also called fully sharded data parallel, where you shard your model across different cards and, when you use a parameter, you gather it from the other ranks. All these methods are very complicated, but the DeepSpeed library already gives a very clean API to use them.
So currently it is not very hard to train a very large language model efficiently. Megatron is another framework to train large language models, and it is also the most mature framework for training super large language models of more than 100 billion parameters. It uses another set of optimization methods. The first is called tensor parallelism, which splits the hidden size and the attention heads across different ranks; it costs an additional all-reduce for attention and the MLP, but it divides the parameter memory and the compute across the tensor-parallel ranks. Pipeline parallelism splits the layers across different ranks, and it introduces bubbles in the pipeline; there are methods, for example interleaved schedules, to reduce that overhead. If you want to train a very large language model, one day you will need to learn about all these systems topics, because current large-scale training is really an engineering effort; the NLP part is not the hard part, the ML systems part is. Okay. Another very important thing is long context, specifically lossless long context, which means we do not use sparse attention or other methods that change the full-attention behavior. The current infrastructure for long context is beyond what AI researchers could imagine four or five years ago. The left figure is from my own paper published several years ago at NeurIPS; at that time there was nothing like GPT-3, only BERT, so those papers had to schedule several BERTs in a very complicated way to mimic the retrieval, rehearsal and forgetting processes of human working memory, letting the model understand very long text step by step. But today we can use system-level techniques to handle very, very long contexts, for example more than 100,000 tokens with full attention. It is simply a different world from several years ago, and many things have been drastically simplified because of this improvement. A key technique is called context parallelism, which means we split the sequence across different ranks and use ring attention, Ulysses, or other techniques to finish the attention computation. There is a library called Transformer Engine, and all of this functionality is wrapped in that library. We also need to handle the load balance of the attention so that every rank does the same amount of computation. This actually changes a lot of research and applications in NLP. For example, several years ago we would summarize and extract facts from documents using BM25 or other retrieval methods; now we can just use a Transformer with full attention to take in the information and understand it. That is quite an important improvement. So, using this very powerful infrastructure, we can train very large language models. For alignment, the first stage is called SFT, supervised fine-tuning. It is actually ordinary fine-tuning of the language model on high-quality data, and the high-quality data usually comes from human annotation. This human annotation is not just crowdsourcing: you need to hire experts from different domains to write high-quality answers to train the model. For example, if you want the model to write code and explain it in a well-formatted way, you need to hire a very experienced programmer to write examples to teach the language model. This is quite different from the usual kind of human annotation.
We can also extract question-answer pairs from more powerful models like GPT-4 Turbo to train our model, but this is actually not allowed by OpenAI: you cannot use this method to develop a model that competes with them, although for research you do not need to worry about it. Will you never surpass GPT-4 using this method? Not necessarily. There is a paper called weak-to-strong generalization, and recall what I said just now: what really matters is your loss. If your loss is lower than your teacher model's, you can surpass the teacher model even if you trained on data extracted from it. Another stage of alignment is called RLHF, using reinforcement learning from human feedback to improve the model. But actually most common open language models did not use this method. The main reason is that PPO is very hard to implement; it can be very powerful if the reward model is good enough, but it is not easy to train. So there are simpler methods that most open-source language models use. DPO comes from a paper from Stanford. You only need some preference pairs and use its formula to update your model; you do not really need a reward model, just pairs. Ideally these pairs should be somewhat on-policy, but it is much simpler and also very powerful. So these are the basics of how to train a language model currently, and it seems like none of it is really about NLP; it is mostly the territory of ML systems people. So what are the NLP people doing? Actually, the most important thing today is data: data cleaning, filtering and synthesis. It is the most important work at every large language model company, which is an open secret. The training infrastructure is basically what I described in the last several slides; maybe there are more advanced methods, but the improvement is maybe 20% or so, whereas if you have better data, the improvement in your language model is very obvious. Whatever the media likes to talk about, most of the ML engineers in large language model companies are actually cleaning data. So, is this something a Stanford graduate student should do? Maybe some think it is very low-level: "I want to design new algorithms and architectures, that is real ML research." But my opinion is that data, algorithms, and architectures can transform into each other. Data is the most general form, though sometimes, if you do not have enough compute, it can be hard to fit that data. Algorithms can be very hard to implement and are often not very general. With architectures, it is hard to make a new architecture do exactly what you want. I will take a multi-hop question answering task as an example. The right figure is from CogQA, also one of my papers when I was a PhD student. The task is: we have a very complex question and we need to find the answer across several documents, but you need to find the chain of reasoning between different documents to get the final answer. At that time I proposed a method involving BERT and a graph neural network; it is very complicated, and finally I got very good performance, ten points better than the prior methods. This is algorithm- or architecture-level innovation.
It is very fancy and got very high scores in the ACL reviews, and there was concurrent work using MCTS (Monte Carlo tree search) with BERT and similar ideas; this all looks like algorithm-level innovation to solve the problem. But today this problem can be easily solved by a very long context GPT with chain-of-thought reasoning. If you include nearly all the documents in your context, you do not need anything like graph neural networks or MCTS to jump between documents; you have all the context and you can just finish the task with chain-of-thought. That is a data-level solution. The data-level solution is of course the simplest one, because you just add the data to your training corpus and you can solve this task without affecting other tasks. So data cleaning, filtering and synthesis is not easy work, and it is actually very important; we should transform our view of data and algorithms to fit the current era. Okay, I have introduced some background on language models, so I will jump into the second part, which is vision-language models in the past one year. Over the past year we have seen these models jump from nearly useless to very powerful. I will start from BLIP-2, which I think was the first work to bridge CLIP and a large language model to give the language model the ability to understand images. If you have an image encoder from CLIP and a large language model from anywhere, you can insert a transformer called the Q-Former to extract the important features from the image encoder and feed those features into the large language model. But the spaces of image features and text features are different, so the Q-Former is trainable: you need lots of image-text pairs to align the image feature space with the text feature space. There is an even simpler method called LLaVA, which just uses a simple projection weight to transform the features from the vision encoder into the feature space of the language model input. It quickly became the most popular architecture for vision-language models. CogVLM is a work from our group. The motivation of CogVLM is to keep all the language behavior while adding image understanding ability to the language model. For LLaVA and the previous methods, you can train the language model and get better performance on multimodal tasks, but the language ability of the model will be reduced if you train the language model during the text-image alignment. So we first use a vision expert to add new parameters in the backbone: the vision expert only deals with the image features, and the original weights, the FFN layers and QKV matrices, deal with the original text features. The original behavior of the language model is kept, and we add lots of new parameters to train and get better performance for the multimodal model. CogVLM achieves state-of-the-art performance on several benchmarks including image captioning, grounding and VQA and some other vision-language benchmarks, and it is also open source; you can download it from our GitHub. Last month I found that CogVLM was downloaded more than 500,000 times, so I think it has already helped a lot of people.
CogAgent is another work from our group, and it uses a different architecture because we want high resolution with cross attention. Why cross attention? Because we only want a high-resolution input; we do not want its hidden size to be the same as the language model's hidden size, which is very large. So we use cross attention to handle a low-hidden-size, high-resolution channel. It is slightly more complicated, but the performance is really good. This model is actually trained to be a web agent: it takes a screenshot as input and performs different operations on the screenshot. For example, here is an example of searching for last year's best paper at CVPR. We ask the model this question; it tells me to type "best paper of CVPR 2023" in the box at this position, and step by step we gather the information. We can also use this method to book tickets and do some other tasks. This is also open source. Some other popular architectures for vision-language modeling include Fuyu, which takes image features at varying resolutions as input and largely improves OCR performance. But what I want to stress is that in our most recent vision-language model, GLM-4V, we actually use a simpler architecture. It is a small adaptation of LLaVA: we just replace the projection weight of LLaVA with a strided convolution to support high-resolution input while keeping the computation inside the language model the same. Using this architecture, we can train the vision-language model mixed with text, and finally we get good performance: GLM-4V is comparable to GPT-4V, Gemini or Claude 3, performs better on OCR benchmarks, for example document QA, and performs much better on Chinese OCR. This is an example from our latest GLM-4V model; you can download our app from the chatglm.cn website. This draft is very hard to recognize, but the model can analyze it very accurately and translate what is really written, so you can try our model for free on that website. Okay, that was a quick look at vision-language understanding; it is multimodal, but mostly about engineering. The other half of vision-language research is about image generation, which is also relevant to transformers. So I will also tell the story of image generation, starting three or four years ago. We already knew GPT was very powerful, so we wanted to autoregressively model image generation using the GPT architecture. This is the work of CogView, which is my work from 2020 and 2021. It is a very simple framework: we know that GPT can only predict a multinomial distribution over discrete tokens, so we need some method to represent the image in a discretized way. Around 2020 there was a paper called iGPT from OpenAI, trained directly at the pixel level, but the sequence becomes very long, so you cannot train on very high resolution images. Instead, we first train an image tokenizer, actually a VQ-VAE, to discretize an image into several tokens. Then you prepare the sequence as text first and image tokens after, and you can train a GPT on this kind of sequence.
Finally, during inference, you first input the text and then predict the image tokens one by one, and you can generate an image. It is a very simple idea; a concurrent work called DALL-E and the most powerful work of this line, Parti, use the same idea. Okay, so we know we can generate images using GPT. A very natural next idea is: can we achieve universal modeling for vision-language tasks? If we just tokenize images like text, we can generate images, generate text from images, generate images from text, and generate text alone. This is a very natural idea, and I also did this about two years ago. The algorithm is also very simple: in the sequence, you change the positions of the text and image tokens. If text comes first and then the image, and you mask everything after the text, that is text-to-image generation; if the image comes first and then text, it is image captioning; and you can also construct other formats, like a masked autoencoder. But the problem is, when you compare this universal modeling system to diffusion or to vision-language models, you find the image generation is worse than diffusion and very slow compared to diffusion, and the image understanding performs worse than vision-language models, because when you transform your image into discrete tokens, a lot of information is lost in that process. So using this method you can achieve universal modeling, but that is all you achieve; you cannot get the best performance on any single task. The diffusion method actually won the game of image generation, and it is not autoregressive. Although in the NLP domain the autoregressive method is dominant, in image generation the winner is diffusion. So what is diffusion? Diffusion is a totally different self-supervised learning method compared to the autoregressive method. You can also think of it as something like autoregression in a frequency domain, but in practice DDPM, the original paper on diffusion models, is still the most popular framework for diffusion modeling. We define a lot of steps where we gradually add noise to a clean image, obtaining different intermediate states, and we train a model that, given a noisy image, predicts the noise, or the original image, or something like v, the velocity. The biggest advantage of diffusion models over autoregressive models is that during sampling we can fully utilize the GPU. In an autoregressive model, when we decode token by token, the utilization of the GPU is very low if the batch size is small, say equal to one. But for a diffusion model, we input the whole image into the model, so it can saturate the GPU and generate much faster than an autoregressive model. Okay. Relay Diffusion is our recent work, and it solves a problem in diffusion about the noise schedule across different resolutions. In the figure there are images with the same noise: A and B are two images at different resolutions with the same noise level, but A actually looks more corrupted when we observe it.
The problem is that we add independent noise, but the original signal, the image, is not at all independent across space. What we need to do is, if we want to transfer a noise schedule from a low resolution to a high resolution, we need to use a block noise to find its equivalent on the high-resolution image, so that the SNR in the frequency domain stays the same. Using this method, we can disentangle the noise schedule from the resolution, and the network we use for diffusion does not need to care about the resolution. We just use block noise when we want to continue the diffusion at a higher resolution, so the speed improves, because we do not need to regenerate the image from scratch at high resolution conditioned on the low-resolution image. Okay. We also scaled up relay diffusion to CogView3 after that paper. CogView3 is a large diffusion model, and after distillation it can be very fast because of the effectiveness of relay diffusion. Finally we get to something directly relevant to our topic, transformers. The previous works on diffusion were based on U-Net, and using a transformer is not trivial in diffusion. The first work I think is solid enough is DiT from Meta; an author of this paper is also an author of Sora. The most important difference between the original transformer and DiT is adaLN: it predicts scales and shifts conditioned on the timestep. It actually needs a huge number of parameters, six times the hidden size, nearly equal to a QKV weight matrix per layer, but the input is only one integer. That is very strange, because the input is only one integer, the timestep, and you need millions of parameters to transform it, so some methods can reduce this. Stable Diffusion 3, released recently, uses another architecture called MMDiT. Stable Diffusion 3 first used our released CogVLM to recaption the images, and trained a large diffusion model using this new architecture. The new architecture seems complicated, but the most important thing is that they use vision and text experts, like CogVLM, instead of cross attention to T5 features as in the previous works. Finally, let me talk briefly about video generation, because Sora is currently very popular. We published a video generation work several years ago, CogVideo, at ICLR. It was maybe the first open-source large model for text-to-video generation, but its performance is much worse than the current Sora because it is autoregressive; using diffusion we can do better. We are currently also working on a replication of Sora-like models, and we can summarize that most of the improvements of Sora come from these aspects. First, there is no flickering in the videos, and it can generate high-quality frames. The flickering can be solved by a 3D latent encoder-decoder, and if you train a diffusion decoder it could be even better. The high quality should be attributed to scaling up, and it requires very high resolution, which is related to the long-context video conditioning and context parallel techniques in the language model infrastructure that I introduced at the beginning of this lecture.
So the most important thing is that they bring the infrastructure of language model training into diffusion, making it very easy to scale up, and they scale up much larger than other companies. And finally, the most important thing is data coverage: it needs very heavy design, engineering and video recaptioning. Okay. So I have introduced many topics of current multimodality pretraining and some problems in this transformer community. Here are some trends, what I think will happen in the multimodality area in the next one or two years. We will be able to recognize grounding of all the common things, attributes, human expressions, and lots of other high-level vision tasks; all of these will be very cheap and basically solved. This will happen in one or two years, and at that time the long-tail problem of autonomous driving may not be fully solved, but it will be largely alleviated. The second prediction is that video understanding will become very important in the next one or two years, because it is very useful: we have lots of video on the Internet and in our everyday life, but it is very hard, and currently we cannot understand video well. The most powerful video understanding model currently is Gemini 1.5, but it still has lots of hallucinations, wrong counting and other weaknesses, so there is a very large room to improve. Another thing is that we will have enough compute to deal with video, especially in the next one or two years, because of the next generation of NVIDIA GPUs and the requirements coming from large language models. Another important thing is embodied AI. Embodied AI will become more and more important in research, and it will be very closely related to multimodality research, although it may not impact our real lives for a few years. We now have planning ability from large language models, we can recognize things with vision-language models, and there will be chances to get new abilities and very astonishing demos of this embodied AI, but it may be very expensive and cannot yet be used in everyday life. So what should we do? For someone like me, a researcher in a large language model company, we have enough compute. But for others: if you are a senior researcher, just follow your heart and the economy. If you want to quickly gain citations, papers and impact, I think you can consider video understanding models, datasets and benchmarks; datasets and benchmarks especially are very important and in great need in the video understanding community. And for multimodality, there is another topic I have not talked about in this lecture, which is speech and audio. I recently learned some knowledge about audio, and I lead a speech AI group at Zhipu AI. I am not an audio researcher, but I can see that speech AI is underestimated: it is actually very important for user needs and applications, but not enough GPUs and researchers are put into this area compared with language models. Finally, if you want to do some really useful and impactful AI research, which is very risky, my advice is to make some systems PhD student friends at once, because the best algorithms must fit the current GPUs and other hardware.
So you just need to know some systems PhD students. Another direction, more difficult but influential, is that there is actually some room for new architectures, self-supervised learning methods and optimizers, because the next generation of hardware will be totally different. So maybe the transformer will have some competitors, and so will the autoregressive modeling method. There is some room there, but it is very hard and needs computational resources. And finally, new ways to transform compute into high-quality data are very important, because the high-quality web data has essentially already been crawled by almost every large language model company, and it is no longer enough. So we need to find new ways to transform compute into high-quality data, for example how to synthesize new data using code execution results, MCTS, reinforcement learning or other methods. It is a very big area for the next few years. I think I will end this lecture here. Thank you to the instructors and the audience, thank you very much. If you have some questions, you can send an email to this address and I will answer all of them. Thank you very much. speaker 3: Thank you very much, Ming, for the amazing talk and all the useful advice. So we have some questions: I got one through Zoom and there are several on Slido. Emily, are there any in-person questions from your end? speaker 2: Okay, if someone has questions, you can type them in the chat in Zoom if you are using Zoom. speaker 3: Let me see. Okay, here are some questions on Slido that I'll ask. The first is: the success of long context windows must come at a cost. What is this cost? speaker 2: Because it takes a very long time; you need to run your inference engine for a very long time. The current inference system of a large language model can be split into two phases. One is prefill: you input the very long context into your engine. Then comes decode, where you generate token by token. In most user cases, people do not generate a very long output; they understand a long text and generate very few tokens to answer the question. So we can bear maybe one minute to let the language model run the long-context understanding and then begin to answer. That is the cost: you need to wait for maybe several seconds or a minute. Yes.
So for many of the very special cases, you can use data to solve the problem, so the high-quality data is more important than architecture updates for many tasks. But I think if you can find a general update of the transformer, it is very valuable; anything that just increases the power of the model to fit the data is very, very valuable. Yes. speaker 3: All right, great. Here's a question: why is the autoregressive architecture inferior to diffusion in image generation? speaker 2: This question is actually very complicated. Diffusion is totally different from autoregression. But the most important thing, which I talked about in the lecture, is the speed of generation for autoregressive models. If you use a very large model and train it for a very long time, I believe you can get a very good result; we can also generate high-quality images using autoregressive methods. That is okay, but the time to generate an image is very, very long, because we need to predict token by token, and a high-resolution image may be thousands of tokens. For diffusion, we use several forward passes over the whole image; we do not need token-by-token prediction, so it can be thousands of times faster than an autoregressive model if you are generating high-resolution images. This is a very obvious advantage. As for modeling power, I think the most important thing is that the relation between spatial positions may not be modeled well by the autoregressive model, because the top-left pixel and the bottom-right pixel are very far apart in the autoregressive order, while in a diffusion model they can see each other, so it is not a problem; the autoregressive model has this position problem, so it is not easy for it to model very complicated 2D spatial structure. This is also a possible reason. But I cannot give a definitive answer to this question; there should be more research about that. Yeah, thank you. speaker 3: Right. Great, thanks for that detailed answer. So someone is asking: how is the CogAgent model different from the CogVLM model? speaker 2: The CogAgent model is actually fine-tuned from the CogVLM model, but CogAgent deals with high-resolution and web-screenshot cases, because our motivation was that high-resolution input for web pages is very important: there are many small words and icons, and you need a very high-resolution model to deal with them. But if you just extend the input resolution of CogVLM, the compute consumption becomes very high. So we add a cross-attention module to CogVLM to get CogAgent. This extra module is much more lightweight, so we can deal with the high resolution more efficiently. Yes. speaker 3: Great. Here's a question about video: how do you think video understanding will aid AI's ability to have a stronger physical understanding of the world? speaker 2: Okay, that is a very good question. My answer is yes, but it is actually a two-way problem, because if you do not have a data source that contains the physical rules, you cannot train a good video understanding model. With the current vision-language model training method, we need text-image or text-video pairs to train, and we actually did not use any self-supervised learning on the images or videos.
So we cannot learn any knowledge from pure video or images; we are actually only dealing with data annotated by humans. If we want to understand the physical world better using unannotated videos, we need to find some new methods for self-supervised learning or new training methods. This is a very good question, thank you. speaker 3: Right. Okay, a couple more questions. Someone is asking: are there VQA tasks that involve multiple turns of conversation in a tree structure, similar to a tree-of-thoughts or beam-search style? speaker 2: Okay, maybe, but I still think it is different, and the tree-of-thought way could be better, because it is aware of other branch information, for example the wrong paths and other failure cases. In my experience, if you can include all the context in your input, you always get better results. So whether it is tree-of-thought or some other procedure with other information, just include it in the context; the language model will learn how to deal with it and will understand better than beam search, which is a hard-coded method that just compares probabilities. It should be better if you do it right. Yes, right. speaker 3: Thanks, that's all the time we have for questions. So thanks again to Ming for the great talk and the detailed answers to all the questions.
Latest Summary (Detailed Summary)
Overview / Executive Summary
This content summarizes the talk given in May 2024 by Ming Ding, research scientist at Zhipu AI, for Stanford's CS25 course, on going from large language models (LLMs) to large multimodal models (LMMs). Ming Ding first reviewed three key moments in LLM development: the BERT moment (exploration of self-supervised learning methods, such as the GLM model he worked on), the GPT-3 moment (which revealed the importance of scaling laws, i.e., the direct relationship between compute investment and performance gains), and the ChatGPT moment (which showed that task adaptation is cheap, pretraining knowledge is crucial, and model performance correlates directly with pretraining loss).
The talk discussed the technical details of LLM training, including common Transformer architecture improvements (decoder-only, Pre-Norm, RoPE), training frameworks (DeepSpeed, Megatron-LM), long-context techniques (such as context parallelism), and alignment methods (SFT and RLHF, in particular the rise of DPO). Ming Ding emphasized that in current LLM research, data (cleaning, filtering, synthesis) is the core competitive advantage, and that algorithms, architectures, and data can to some extent be transformed into one another.
The talk then covered LMM progress over the past year, including models such as BLIP-2 and LLaVA, and introduced Zhipu AI's open-source models CogVLM (adding image understanding while preserving language ability via a vision expert module) and CogAgent (targeting GUI and OCR scenarios with high-resolution input and a cross-attention module). He also presented the latest GLM-4V model, which performs strongly on multimodal benchmarks. The talk further covered image generation, contrasting autoregressive models (e.g., CogView) with diffusion models (e.g., DDPM, the team's Relay Diffusion, CogView3), and analyzed the key factors behind the success of the video generation model Sora.
Finally, Ming Ding looked ahead to multimodality trends over the next one to two years, predicting that basic visual tasks will be largely solved and that video understanding and embodied AI will become increasingly important. He advised researchers to focus on datasets and benchmarks for video understanding and on the underrated field of speech AI, and stressed the importance of systems-level optimization, exploration of new architectures, and new methods for generating high-quality data.
Talk Overview
- Speaker: Ming Ding (Research Scientist, Zhipu AI)
- Topic: From Large Language Models (LLMs) to Large Multimodal Models (LMMs)
- Core content: a review of LLM development, academic attempts at and architectural updates to LMMs, a focus on the CogVLM and CogAgent models, and a discussion of applications and future research directions for multimodal models.
- Talk structure:
- Introduction to and history of LLMs (why are we here)
- Practical techniques for LLM training (how did we get here)
- LMMs and related techniques over the past year (what are we working on)
- Valuable research directions in multimodality (outlook)
Development of Large Language Models (LLMs) and Key Milestones
Ming Ding identifies the three most important moments in LLM development:
- The BERT moment (the "birth" moment)
- Core idea: exploring better self-supervised learning methods for NLP.
- Common view at the time: masked language models (e.g., BERT) are good at understanding text, autoregressive models (e.g., GPT) are better at text generation, and T5 tries to do both but was considered redundant.
- Ming Ding's contribution: co-authored GLM (General Language Model), which aims to unify BERT and GPT in a decoder-only architecture. By selectively applying autoregressive modeling to part of the sequence, it reproduces BERT-like behavior (partial masking) and GPT-like behavior (masking the whole sequence).
- Reflection: things change; today the GPT family has essentially unified NLP.
- The GPT-3 moment
- Core idea: the importance of the scaling law.
- View: adding compute (more parameters, more training data) yields predictable performance gains (e.g., lower perplexity), making LLM research more of an engineering discipline; the marginal benefit of architecture or algorithm innovation may not match that of scaling.
- Impact: LLM development became mainly about how to allocate compute effectively.
- The ChatGPT moment
- Core idea: "task adaptation is cheap"; pretraining knowledge is what matters.
- Lesson: losses used to be designed per task (e.g., GLM and BERT can fill in blanks, autoregressive models cannot), but it turns out task adaptation only requires light fine-tuning on top of a pretrained model.
- Insight from InstructGPT: alignment yields a large improvement in human preference.
- Recent Zhipu AI research: downstream performance correlates only with pretraining loss, not directly with model size. An under-trained (high-loss) large model can be worse than a fully trained (low-loss) small model. So-called "emergent abilities" also correlate with loss rather than parameter count.
- Conclusion: LLM research has largely become a "game of curve fitting" (a toy curve-fitting sketch follows this list).
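To make the "curve fitting" view concrete, here is a minimal sketch (not from the talk; the compute/loss numbers are invented for illustration) that fits a power-law scaling curve, loss ≈ a·C⁻ᵇ + irreducible, to a few measurements and extrapolates it to a larger, not-yet-trained run:

```python
import numpy as np

# Hypothetical (training FLOPs, validation loss) pairs from small runs (illustrative numbers).
compute = np.array([1e18, 3e18, 1e19, 3e19, 1e20])
loss    = np.array([3.10, 2.85, 2.62, 2.45, 2.31])

IRREDUCIBLE = 1.7  # assumed irreducible loss term (entropy of text); a modelling choice

# Fit log(loss - irreducible) = log(a) - b * log(compute), i.e. loss ≈ a * C^(-b) + irreducible.
slope, intercept = np.polyfit(np.log(compute), np.log(loss - IRREDUCIBLE), deg=1)
a, b = np.exp(intercept), -slope

def predicted_loss(c):
    return a * c ** (-b) + IRREDUCIBLE

# "Predict the perplexity from the fitted curve" for a much larger run.
print(f"a={a:.3g}, b={b:.3g}")
print(f"predicted loss at 1e21 FLOPs: {predicted_loss(1e21):.3f}")
```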
Technical Details of LLM Training
Although LLM development has become increasingly engineering-driven, many technical details still matter:
- Common Transformer architecture improvements (many originating from follow-up work by the original Transformer authors):
- Decoder-only architecture: replaces the original encoder-decoder architecture; simpler and more efficient.
- Pre-Norm: layer normalization is applied before the residual connection, rather than the original Post-LN.
- Rotary Position Embedding (RoPE): an effective positional encoding, originally published as a blog post rather than a paper.
- Grouped Query Attention (GQA): saves inference memory.
- SwiGLU (a GLU variant): replaces the activation in the MLP.
- Mixture of Experts (MoE): better performance at the same FLOPs by activating only a subset of a larger parameter pool.
- Example: the Llama models adopt several of these techniques (a minimal sketch of such a decoder block follows below).
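A minimal sketch of a pre-norm, decoder-only block with RoPE and a SwiGLU MLP, assuming LayerNorm in place of Llama's RMSNorm and the GPT-NeoX-style RoPE rotation; the dimensions and expansion factor are illustrative, not taken from any specific model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def apply_rope(x, base=10000.0):
    """Rotary position embedding for x of shape (batch, heads, seq, head_dim)."""
    b, h, s, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(0, half, dtype=x.dtype, device=x.device) / half)
    angles = torch.arange(s, dtype=x.dtype, device=x.device)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class PreNormDecoderBlock(nn.Module):
    """Pre-norm decoder-only Transformer block with RoPE and a SwiGLU MLP."""
    def __init__(self, hidden=512, n_heads=8):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, hidden // n_heads
        self.norm1 = nn.LayerNorm(hidden)   # Llama uses RMSNorm; LayerNorm keeps the sketch short
        self.norm2 = nn.LayerNorm(hidden)
        self.qkv = nn.Linear(hidden, 3 * hidden, bias=False)
        self.out = nn.Linear(hidden, hidden, bias=False)
        self.w_gate = nn.Linear(hidden, 4 * hidden, bias=False)  # SwiGLU gate branch
        self.w_up = nn.Linear(hidden, 4 * hidden, bias=False)    # SwiGLU value branch
        self.w_down = nn.Linear(4 * hidden, hidden, bias=False)

    def forward(self, x):                      # x: (batch, seq, hidden)
        b, s, _ = x.shape
        q, k, v = self.qkv(self.norm1(x)).chunk(3, dim=-1)
        shape = (b, s, self.n_heads, self.head_dim)
        q, k, v = (t.view(shape).transpose(1, 2) for t in (q, k, v))
        q, k = apply_rope(q), apply_rope(k)     # positions enter only through q and k
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.out(attn.transpose(1, 2).reshape(b, s, -1))        # residual 1
        h = self.norm2(x)
        x = x + self.w_down(F.silu(self.w_gate(h)) * self.w_up(h))      # residual 2
        return x

block = PreNormDecoderBlock()
print(block(torch.randn(2, 16, 512)).shape)    # torch.Size([2, 16, 512])
```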
- LLM training frameworks and optimizations:
- DeepSpeed (Microsoft):
- ZeRO (Zero Redundancy Optimizer):
- ZeRO-1: shards the optimizer states (Adam states) and master weights across data-parallel ranks, reducing per-GPU memory.
- ZeRO-2: additionally shards the gradients.
- ZeRO-3 (Fully Sharded Data Parallel): shards the model parameters as well, gathering them when needed.
- Activation Checkpointing (gradient checkpointing): stores only some intermediate activations and recomputes the rest during the backward pass, greatly reducing memory.
- CPU Offload: offloads part of the data in GPU memory to CPU memory.
- Megatron-LM (NVIDIA): suited for training very large models (e.g., >100B parameters).
- Tensor Parallelism: splits the hidden size and attention heads across ranks, at the cost of extra communication (all-reduce).
- Pipeline Parallelism: splits the model's layers across ranks, introducing "bubbles" that can be reduced with interleaving and similar schedules.
- Takeaway: LLM training is now highly engineered, with mature libraries and APIs that hide much of the complexity of large-scale training (an activation-checkpointing sketch follows below).
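DeepSpeed wraps these techniques behind its own config and launcher; as a self-contained stand-in, the sketch below shows the activation-checkpointing idea with plain PyTorch (`torch.utils.checkpoint`): only each segment's inputs are stored, and the activations inside a segment are recomputed during the backward pass. Layer sizes are arbitrary.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedMLPStack(nn.Module):
    """A stack of layers where each layer is a checkpointed segment."""
    def __init__(self, hidden=1024, n_layers=24):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(), nn.Linear(4 * hidden, hidden))
            for _ in range(n_layers)
        )

    def forward(self, x):
        for layer in self.layers:
            # use_reentrant=False is the recommended checkpointing mode in recent PyTorch.
            x = checkpoint(layer, x, use_reentrant=False)
        return x

model = CheckpointedMLPStack()
x = torch.randn(8, 512, 1024, requires_grad=True)
model(x).sum().backward()   # inner activations are recomputed here instead of being stored
print(x.grad.shape)
```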
- Long-context handling:
- Trend: current long-context handling goes far beyond what was imaginable a few years ago, supporting full attention over very long sequences (e.g., more than 100k tokens).
- Contrast: a few years ago, long texts were handled with complex retrieval and step-by-step processing (e.g., Ming Ding's early paper simulating working memory).
- Key technique: context parallelism, splitting the sequence across ranks and completing the attention computation with Ring Attention, Ulysses, or similar techniques.
- Library: Transformer Engine provides the relevant functionality.
- Challenge: balancing the attention computation load across ranks (see the sketch below).
- Impact: simplifies many NLP tasks such as document summarization and fact extraction, where long documents can now be fed directly into the model.
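A toy illustration (not any particular library's scheme) of why causal attention needs load balancing under context parallelism: with a causal mask, token t attends to t+1 keys, so a naive contiguous split gives the later ranks far more work. One common balancing idea, sketched below, pairs an early chunk with a late chunk on each rank:

```python
SEQ_LEN, RANKS = 128_000, 8
chunk = SEQ_LEN // (2 * RANKS)

def work(tokens):
    """Proxy for attention cost of a set of query positions under a causal mask."""
    return sum(t + 1 for t in tokens)

# Naive contiguous split: rank r gets tokens [r*S/P, (r+1)*S/P).
naive = [work(range(r * SEQ_LEN // RANKS, (r + 1) * SEQ_LEN // RANKS)) for r in range(RANKS)]

# Balanced split: rank r gets one early chunk and one mirrored late chunk.
balanced = [
    work(range(r * chunk, (r + 1) * chunk)) +
    work(range((2 * RANKS - 1 - r) * chunk, (2 * RANKS - r) * chunk))
    for r in range(RANKS)
]
print("naive    max/min work ratio:", max(naive) / min(naive))       # ~15x imbalance
print("balanced max/min work ratio:", max(balanced) / min(balanced)) # ~1x
```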
- Alignment:
- Supervised Fine-Tuning (SFT):
- Fine-tuning on high-quality human-annotated data; domain experts, not just crowdworkers, are needed to write high-quality answers.
- Question-answer pairs can be extracted from stronger models (e.g., GPT-4 Turbo), though OpenAI forbids using this to build competing models.
- View: even when training on a teacher model's data, a student whose loss is lower can surpass the teacher (the speaker cites weak-to-strong generalization).
- Reinforcement Learning from Human Feedback (RLHF):
- PPO (Proximal Policy Optimization): powerful but hard to implement and train.
- DPO (Direct Preference Optimization): a simpler method that updates the model directly from preference pairs, without an explicit reward model; adopted by most open-source models (a sketch of the DPO loss follows below).
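A short sketch of the DPO objective as described above; the log-probabilities in the toy batch are made-up values, and `beta` is the usual temperature-like hyperparameter:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss.

    Each argument is the summed log-probability of a whole response under the
    policy or the frozen reference model; no reward model is needed.
    L = -log sigmoid( beta * [(logpi_w - logref_w) - (logpi_l - logref_l)] )
    """
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy batch of 4 preference pairs (numbers are invented for illustration).
loss = dpo_loss(torch.tensor([-12.0, -9.5, -14.2, -8.1]),
                torch.tensor([-13.5, -11.0, -13.8, -10.0]),
                torch.tensor([-12.5, -10.0, -14.0, -8.5]),
                torch.tensor([-13.0, -10.5, -14.1, -9.6]))
print(loss)
```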
The Central Role of Data in LLMs
- An open secret: data cleaning, filtering, and synthesis are the core work of today's LLM companies.
- Training infrastructure: important, but further optimization may yield smaller gains (roughly 20%) than high-quality data does.
- Interchangeability of data, algorithms, and architectures:
- The three can transform into one another; a problem can be solved by improving the architecture, designing a new algorithm, or preparing targeted data.
- Example: multi-hop QA:
- Ming Ding's earlier work CogQA solved it with a complex BERT + graph neural network (GNN) architecture.
- Concurrent work used algorithms such as MCTS.
- Today: a long-context GPT with chain-of-thought (CoT) reasoning, given all relevant documents in the context, solves it well; this is a data-level solution.
- Data-level solutions: usually the simplest; just add the data to training without affecting other tasks.
- Conclusion: data work may look mundane, but it is critical to current AI progress, and attitudes toward it should change.
Progress in Large Multimodal Models (LMMs) over the Past Year
- BLIP-2: connects CLIP's image encoder to an LLM through a Q-Former (a Transformer module), giving the LLM image understanding. The Q-Former is trained to align the image and text feature spaces.
- LLaVA (Large Language and Vision Assistant): a simpler approach using a single projection layer to map vision encoder features into the LLM's input space; quickly became a popular LMM architecture.
- CogVLM (Zhipu AI):
- Motivation: add image understanding while preserving the original language ability, avoiding the language degradation seen when earlier methods train the LLM during multimodal alignment.
- Method: a "vision expert" module adds new parameters in the backbone to process image features, while the original weights continue to process text features.
- Performance: SOTA on several benchmarks (image captioning, visual grounding, VQA, and other LMM benchmarks).
- Open source: downloaded more than 500,000 times in the past month.
- CogAgent (Zhipu AI):
- Target scenarios: GUI understanding and OCR, often used to build web agents.
- Architecture: supports high-resolution image input, with a cross-attention module handling the high-resolution channel so details can be processed without significantly increasing the LLM's compute.
- Example applications: takes a screenshot as input and performs operations such as searching or booking tickets.
- Open source.
- Other popular LMM architectures:
- Fuyu-8B: takes image features at varying resolutions as input, significantly improving OCR performance.
- GLM-4V (Zhipu AI's latest model):
- Architecture: a simplified adaptation of LLaVA, replacing the projection layer with a strided convolution to support high-resolution input while keeping the language model's compute unchanged (a toy comparison of the two connectors follows below).
- Training: mixed with text data.
- Performance: comparable to GPT-4V, Gemini, and Claude 3 on multimodal tasks; particularly strong on OCR benchmarks (e.g., DocVQA) and on Chinese OCR.
- Try it: available via the chatglm.cn website.
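A toy comparison of the two connectors mentioned above: a LLaVA-style per-patch projection versus a strided convolution that lets a higher-resolution grid enter the LLM with the same number of visual tokens. The dimensions, kernel size, and stride are assumptions for illustration, not GLM-4V's actual configuration:

```python
import torch
import torch.nn as nn

vit_dim, llm_dim = 1024, 4096              # assumed vision-encoder and LLM hidden sizes
feats = torch.randn(1, vit_dim, 32, 32)    # patch features from a hypothetical ViT grid

# LLaVA-style connector: a linear projection applied to every patch token.
proj = nn.Linear(vit_dim, llm_dim)
llava_tokens = proj(feats.flatten(2).transpose(1, 2))        # (1, 1024 tokens, llm_dim)

# Strided-convolution connector: a 4x larger grid is downsampled before entering
# the LLM, keeping the number of visual tokens (and thus LLM compute) unchanged.
conv = nn.Conv2d(vit_dim, llm_dim, kernel_size=2, stride=2)
hires_feats = torch.randn(1, vit_dim, 64, 64)                # features from a larger image
glm4v_tokens = conv(hires_feats).flatten(2).transpose(1, 2)  # still (1, 1024 tokens, llm_dim)

print(llava_tokens.shape, glm4v_tokens.shape)
```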
Image Generation
- Autoregressive models:
- CogView (Ming Ding's earlier work, 2021):
- Idea: apply GPT-style autoregression to image generation.
- Method: first train an image tokenizer (typically a VQ-VAE) to discretize an image into a token sequence; concatenate text tokens and image tokens and train a GPT autoregressively. At inference time, input the text and predict the image tokens one by one (a decoding-loop sketch follows this section).
- Concurrent work: DALL-E (OpenAI), Parti (Google).
- Universal multimodal modeling:
- Idea: tokenize images like text and handle image-to-text, text-to-image, and pure text generation with one unified model.
- Ming Ding's attempt (a follow-up to CogView, roughly two years ago): realize different tasks by rearranging the positions and masking of text and image tokens in the sequence.
- Problems:
- Image generation quality and speed are worse than diffusion.
- Image understanding is worse than dedicated vision-language models, because discretizing images loses information.
- Conclusion: universal modeling is achievable, but not best-in-class on any individual task.
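A sketch of the CogView-style decoding loop described above. `lm` (a decoder-only Transformer over a joint text/image-code vocabulary) and `vqvae` (with a `decode` method) are hypothetical stand-ins, not real library objects:

```python
import torch
import torch.nn.functional as F

IMAGE_TOKENS = 32 * 32   # a 32x32 grid of discrete codes, e.g. for a 256x256 image

@torch.no_grad()
def generate_image(lm, vqvae, text_ids, temperature=1.0):
    """Autoregressive text-to-image sampling: text prompt first, image codes after."""
    seq = text_ids                                     # (1, text_len) prompt tokens
    for _ in range(IMAGE_TOKENS):                      # decode image codes one by one
        logits = lm(seq)[:, -1, :]                     # next-token distribution
        probs = F.softmax(logits / temperature, dim=-1)
        next_code = torch.multinomial(probs, num_samples=1)
        seq = torch.cat([seq, next_code], dim=1)
    image_codes = seq[:, text_ids.shape[1]:]           # keep only the image part
    return vqvae.decode(image_codes.view(1, 32, 32))   # codes -> pixels
```

The token-by-token loop is exactly why this approach is slow compared with diffusion, as discussed below.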
- Diffusion models:
- Core: a self-supervised learning method completely different from autoregression.
- DDPM (Denoising Diffusion Probabilistic Models): the original paper and still the mainstream framework; noise is gradually added to a clean image, and a model is trained to predict the noise (or the original image, or the velocity) from the noisy intermediate state.
- Advantage: sampling fully exploits GPU parallelism, so generation is much faster than autoregression (especially for high-resolution images).
- Relay Diffusion (Zhipu AI's recent work):
- Problem addressed: consistency of the noise schedule across resolutions; adding i.i.d. noise at the same level corrupts images of different resolutions to visually different degrees.
- Method: use "block noise" to map a low-resolution noise schedule onto the high-resolution image equivalently, keeping the frequency-domain SNR consistent across resolutions.
- Effect: decouples the noise schedule from the resolution and speeds up continuing diffusion at higher resolutions.
- CogView3 (Zhipu AI): a large diffusion model scaled up from Relay Diffusion; very fast after distillation.
- Transformers in diffusion models:
- DiT (Diffusion Transformer; by an author of Sora): injects timestep information via adaLN (adaptive layer norm), predicting scale and shift parameters; this conditioning module has a very large parameter count (an adaLN sketch follows below).
- Stable Diffusion 3: adopts the new MMDiT (Multi-Modal Diffusion Transformer) architecture, using vision and text experts similar to CogVLM instead of cross attention to T5 features as in earlier work; its training images were recaptioned with Zhipu AI's open-sourced CogVLM.
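A sketch of the adaLN-style conditioning discussed for DiT, assuming 1,000 discrete timesteps and showing only the modulation path (shift/scale/gate for the attention and MLP branches); the exact DiT implementation differs in details:

```python
import torch
import torch.nn as nn

class AdaLNModulation(nn.Module):
    """DiT-style adaLN conditioning, sketched: the integer timestep is embedded and
    mapped to 6 * hidden values per block -- roughly the size of a QKV projection
    per layer, as noted in the talk -- even though the input is a single integer."""
    def __init__(self, hidden=1024):
        super().__init__()
        self.t_embed = nn.Sequential(nn.Embedding(1000, hidden), nn.SiLU(), nn.Linear(hidden, hidden))
        self.to_modulation = nn.Linear(hidden, 6 * hidden)   # ~6 * hidden^2 params per block
        self.norm = nn.LayerNorm(hidden, elementwise_affine=False)

    def forward(self, x, t):              # x: (batch, tokens, hidden), t: (batch,) int timesteps
        c = self.to_modulation(self.t_embed(t))                     # (batch, 6*hidden)
        shift_a, scale_a, gate_a, shift_m, scale_m, gate_m = c.chunk(6, dim=-1)
        # Modulated input that would feed the attention sub-block:
        h_attn = self.norm(x) * (1 + scale_a.unsqueeze(1)) + shift_a.unsqueeze(1)
        return h_attn, gate_a, (shift_m, scale_m, gate_m)           # gates scale the residuals

mod = AdaLNModulation()
h, gate, _ = mod(torch.randn(2, 256, 1024), torch.randint(0, 1000, (2,)))
print(h.shape, gate.shape)
```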
Video Generation
- CogVideo (Zhipu AI, several years ago): an autoregressive text-to-video model, open source, but far weaker than today's Sora.
- Sora (OpenAI):
- Analysis of its key improvements:
- No flickering: likely addressed by a 3D latent encoder-decoder, possibly improved further with a trained diffusion decoder.
- High-quality frames: attributable to scaling up and high-resolution processing.
- Long-context video conditioning: borrows long-context handling and context parallelism from LLM training infrastructure, allowing longer video sequences and more complex prompts.
- Data coverage: heavy data engineering and video recaptioning are needed for high-quality training data.
- Core: applying the mature infrastructure and experience of LLM training to scaling up diffusion models.
Outlook and Research Directions
- Predicted trends for the next 1-2 years:
- Common vision tasks largely solved: object recognition, grounding, attributes, human expressions, and other high-level vision tasks will become cheap and essentially solved.
- Autonomous-driving long tail alleviated: not fully solved, but significantly improved.
- Video understanding grows in importance:
- Highly useful: huge amounts of video exist on the Internet and in daily life.
- Hard: even the strongest current models (e.g., Gemini 1.5) still hallucinate and miscount, leaving large room for improvement.
- Sufficient compute: the next generation of GPUs and demand from LLMs will provide enough compute.
- Embodied AI rises in research importance:
- Combining LLM planning with LMM perception may produce striking demos.
- Unlikely to affect daily life in the short term because of its high cost.
- Advice for researchers:
- Senior researchers: follow your heart (and the economics).
- For quick impact: work on video understanding models, datasets, and benchmarks; datasets and benchmarks in particular are urgently needed by the community.
- Other modalities: speech/audio AI is underrated, with strong user demand and applications but too few GPUs and researchers devoted to it.
- High-risk, high-impact research:
- Collaborate with systems PhD students: the best algorithms must exploit current hardware (GPUs, etc.).
- Explore new architectures, self-supervised learning methods, and optimizers: next-generation hardware may be completely different, challenging Transformers and autoregressive modeling.
- New ways to turn compute into high-quality data: high-quality web data is nearly exhausted; explore generating and filtering new data with code execution results, MCTS, reinforcement learning, and similar methods.
Q&A
- The success of long context windows must come at a cost. What is this cost?
- Ming Ding: mainly time. Inference splits into a prefill phase (processing the long input) and a decode phase (generating token by token). Most user scenarios involve understanding a long text and then generating only a few tokens, so users may wait from a few seconds up to a minute for the model to process the long context (a toy prefill/decode sketch follows below).
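A toy sketch of that prefill/decode split; `model` here is a hypothetical decoder that accepts and returns a KV cache, not a specific framework API:

```python
import torch

@torch.no_grad()
def generate(model, prompt_ids, max_new_tokens=32):
    # Prefill: the long context is processed in a single pass and fills the KV cache
    # (this is what the user waits for with a very long input).
    logits, kv_cache = model(prompt_ids, kv_cache=None)
    next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
    output = [next_token]
    # Decode: each step feeds only the newest token plus the cache, so the cost per
    # generated token is small compared with the prefill.
    for _ in range(max_new_tokens - 1):
        logits, kv_cache = model(next_token, kv_cache=kv_cache)
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
        output.append(next_token)
    return torch.cat(output, dim=1)
```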
- Does data quality determine final model performance more than anything else (e.g., architecture)? Is there still much work to do on model architectures?
- Ming Ding: yes; improving data is what the whole community is working on. Architecture, algorithms, and data can transform into one another, and solving a specific problem with data is usually the most direct route, so high-quality data matters more. That said, a general improvement to the Transformer architecture remains very valuable.
- Why is the autoregressive architecture inferior to diffusion for image generation?
- Ming Ding: it is a complicated question.
- Generation speed: autoregressive models generate token by token, and a high-resolution image can be thousands of tokens, which is very slow. Diffusion generates the whole image in a handful of forward passes and can be orders of magnitude faster.
- Modeling power: autoregressive models may struggle to capture long-range spatial dependencies between pixels (e.g., the top-left and bottom-right pixels are far apart in the sequence), whereas diffusion effectively sees all pixels at once. These are only possible explanations; more research is needed.
- How does the CogAgent model differ from CogVLM?
- Ming Ding: CogAgent is fine-tuned from CogVLM for high-resolution scenarios such as web screenshots, which contain a lot of small text and small icons. CogAgent adds a lightweight cross-attention module to handle the high-resolution input, avoiding the heavy compute of simply increasing CogVLM's input resolution.
- How will video understanding help AI build a stronger physical understanding of the world?
- Ming Ding: it will, but it is a two-way problem. If the training data contains no physical rules, the model cannot learn them. Current vision-language models rely on supervised text-image/video pairs and do not exploit self-supervised signals in raw images or video; to learn about the physical world from unannotated video, new self-supervised learning or training methods are needed.
- Are there VQA tasks involving multi-turn conversation in a tree structure, similar to tree-of-thoughts or beam search?
- Ming Ding: possibly, but he believes tree-of-thought-style methods can do better because they see more context (e.g., wrong paths, other failure cases). In his experience, including all relevant context in the input lets the model learn how to use it and outperform hard-coded beam search, which only compares probabilities.