2025-04-01 | Stanford CS25: V5 | Overview of Transformers

This lecture systematically introduces the fundamentals and development of transformers, covering the evolution from word vectors to contextual embeddings, how queries, keys, and values work in self-attention, the roles of positional encodings and multi-head attention, and the resulting model architecture. The speakers explain key strategies for large-scale language model pre-training, such as data mixtures, two-phase training, and insights drawn from comparing against the amount of data human language learners receive, and discuss follow-up optimization methods including fine-tuning, chain-of-thought reasoning, and reinforcement learning from human feedback. The lecture also surveys the broad applications of transformers in natural language processing, computer vision, speech, bioinformatics, robotics, and other fields, looks ahead to model interpretability, lifelong learning, on-device deployment, and self-improving agents, and raises the challenges of high computational cost and environmental impact.

Video | Technology
Artificial Intelligence (AI), Transformer, Large Language Models (LLM), Pre-training, Post-training Strategies, Chain of Thought (CoT), Reinforcement Learning (RLHF), AI Agents, Data Strategies, Model Interpretability, Continual Learning, Artificial General Intelligence (AGI)

Media Details

Upload Date
2025-05-18 15:29
Source
https://www.youtube.com/watch?v=JKbtWimlzAE
Processing Status
Completed
Transcription Status
Completed
LLM Provider/Model
openai/gemini-2.5-pro-exp-03-25

Transcript

speaker 1: Welcome to the fifth iteration of our CS 25 Transformers class. So Div and I kind of started this class a long time ago, after seeing how transformers and machine learning in general and AI became such a prevalent thing, and how we predicted it would become an even bigger part of our lives going forward, which does seem to be the case. So as large language models and AI in general take over the world, whether it's through things like ChatGPT, image generation models, or video generation models like Sora and so forth, we felt that having a class where people are able to come and learn about transformers and how they work, and especially hear from leading experts in industry and academia working on state-of-the-art research in this area, would be very beneficial to everybody's learning and help us progress further within AI and technology in general. So welcome to our class. How our class works is typically each week we invite a leading researcher from either industry or academia to come speak about some state-of-the-art topic they're working on in transformers. So we have an exciting lineup of speakers prepared for you guys this quarter. This first lecture will be delivered by us, where we'll go through the basics of transformers. We divided this lecture a bit differently from the previous iterations, in that we have a section on pre-training and data strategies and then a section focused more on post-training, which has become a very popular topic these days. We'll also touch briefly on some applications of transformers and some remaining weaknesses or challenges that we should hopefully address to be able to further improve the state of AI and our machine learning models. So we'll start with some instructor introductions. We have a very good team of co-instructors. My name is Stephen.
I'm a current third-year CS PhD here. I previously did my undergrad at Waterloo in Canada, and I've done some research in industry as well, at Amazon and NVIDIA. In general, my research hovers around natural language processing, so machine learning for language and text, looking at things like: can we improve the controllability and reasoning abilities of large language models? And more recently, cognitive science and psychology-inspired work, especially bridging the data gap and the learning-efficiency gap between machine learning models and how humans learn, how human children learn, and how our brains are able to learn so efficiently. I've also done some work with multimodality as well as computer vision, so things like diffusion models and image generation. And just for fun, I also run the piano club here with Kiran, and we have an upcoming concert on April eleventh, in case you guys are interested.
speaker 2: Hi everyone. I'm Kiran, a second-year electrical engineering PhD student. I did my undergrad at Cal Poly San Luis Obispo, after which I was a research scientist here before starting my PhD. I'm a little bit more on the medical imaging and computer vision side, so a lot of my current work is at the intersection of computer vision and neuroscience, working with things like fMRI and ultrasound. And I currently work in the SAI lab, a new lab under Dr. Ehsan Adeli.
speaker 3: Hi everyone. I'm Chelsea. I'm a first-year master's student in Symbolic Systems, and my general research interests are in multi-agentic frameworks, self-improving AI agents, and overall just kind of improving the interpretability and explainability of models. Previously, I studied applied math and neuroscience, and I did a bunch of interdisciplinary research in computer vision, robotics, cognitive science, things of that sort. Currently I'm working part-time at a VC firm, and over the summer I'll be interning at a conversational AI startup as a machine learning engineer. So I'm very interested in exploring the startup scene here at Stanford, so feel free to reach out.
speaker 4: Hi everyone. I'm Jenny. I'm a current student majoring in Symbolic Systems as well as a sociology co-term here at Stanford. My background is primarily in technology ethics and policy, so if you have any questions or want to talk about that, I'd love to have a conversation. In the past, I've worked doing product at D. E. Shaw and also research in the tech ethics and policy space. And this summer I'll be working at Daydream, which is an AI fashion tech startup in New York.
speaker 1: And so, yeah, Div was unable to join us today, but he's working on his new agent startup called AI Inc, currently on leave from his CS PhD here. He's passionate about robotics, AI agents, and so forth, and later in the quarter he'll likely be giving a lecture on everything to do with AI agents. So if you're interested in that, definitely look forward to it. Previously, he's worked at NVIDIA, Google, and so forth, and he's the one who, you know, started this class in the first place.
speaker 4: All right, so I'll go over some of the course logistics. The first announcement is that we have a new website up, cs25.stanford.edu. All of our updates as well as the speaker lineup will be posted there in the coming weeks. There will also be a link to share our Zoom with people who are not Stanford-affiliated, are on the waitlist, or have not been able to gain admission into the class. So we encourage everyone to share this class with their network and ensure that anyone can access it from Zoom. Some takeaways from the course include: a better understanding of transformers and the underlying architecture of many of our large language models; guest speakers talking about applications in language, vision, biology, robotics, and more; exposure to new research, especially from leading researchers all across the country; innovative methods that are driving the next generation of models; as well as key limitations, open problems, and the future of AI.
speaker 2: Okay, next I'll give a really brief intro about transformers and how attention works. So the first step for language is word embeddings. Words aren't numbers, so we obviously can't just pass them into a model as is. The first step is converting them into dense vectors in a high-dimensional space. This is done through various methods, but the goal is to capture semantic similarity: essentially, that cat and dog are more similar than cat and car, even though the latter is more similar from a character standpoint. Doing so enables visualization, learning with transformer models, or arithmetic. Like I've shown, king minus man plus woman would approximately be queen in some embedding space. Classical methods for this are word2vec, fastText, and many more these days. But static embeddings have limitations, for instance giving the word "bank" the same meaning in a financial "bank" as in "riverbank". Therefore, the current standard is contextual embeddings, which take into account the context and the sentence that the word is in. Self-attention can be applied to this to learn what to focus on for a given token. To do this, you learn three matrices: a query, key, and value, which together comprise the attention process. A quick analogy: imagine you're in a library looking for a book on a certain topic. This would be your query. Now, let's say each book has some summary associated with it, a key. You can match your query and key and get access to the book you're looking for. The information inside the book would be your value. So in attention, we do a soft match over the values to get info from, say, multiple books. And this comprises the attention operation. As you can see in this visualization, when you apply this to language, across different layers of the model, different words have connections to the rest of the words in the sentence.
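The query/key/value mechanism described above can be sketched in a few lines of NumPy. This is a minimal single-head illustration only; real transformers use multiple heads, masking, and projections trained end to end, and the random matrices here are stand-ins for learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # project tokens to queries/keys/values
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # how well each query matches each key
    weights = softmax(scores, axis=-1)           # soft match over all positions
    return weights @ V                           # weighted sum of values per token

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                      # 4 tokens, embedding dim 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)              # (4, 8) contextual embeddings
```

Each output row is a context-aware mixture of all the value vectors, which is exactly the "soft match over multiple books" in the library analogy.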
The next component is positional encodings, or embeddings, which add order to the sequence. Without these, since you have just linear multiplications here, the model would not know what the first or last word in the sentence is. Therefore, you add some notion of order through, say, sinusoids. Or in the simplest form, you could think that the first word would get a 0, the second a 1, and so on. From there, it's basically just scaling through multiple layers and multi-head attention. More heads to attend to different parts of the sentence, and more parameters, mean that you can capture more diverse relationships from your sequences. And this gives you the final transformer. Transformers today have overtaken pretty much every field, from LLMs like GPT-4o, o3, and DeepSeek, to vision, with models that are getting increasingly better at segmentation, and also speech, biology, and video. You'll see a lot of these applications throughout the quarter. Large language models are essentially just scaled-up versions of attention and the transformer architecture: you essentially just throw a large amount of data, general text data derived from the web, at these models, and they can learn to model language very well through a next-token prediction objective. And as you scale up, we've seen emergent abilities pop up. So while at a smaller scale you might not be able to do a certain task, once you get to a certain scale, you see a sudden jump in the ability to do that task. Some disadvantages, though, are that these models have very high computational costs, and therefore also raise concerns about climate and the carbon emissions they may produce. And like I was mentioning, larger models are very good at generalizing to many abilities or tasks, and they're essentially plug-and-play with few-shot or zero-shot learning.
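The sinusoidal option mentioned above can be sketched following the original "Attention Is All You Need" recipe: each position gets a fixed pattern of sines and cosines at different frequencies, which is added to the token embeddings so the model can distinguish word order.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]            # positions 0..seq_len-1
    i = np.arange(d_model // 2)[None, :]         # frequency index per dim pair
    angles = pos / (10000 ** (2 * i / d_model))  # lower dims oscillate faster
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions: cosine
    return pe

pe = positional_encoding(50, 16)  # one row per position, added to token embeddings
```

Because each position's sine/cosine pattern is unique and smoothly varying, nearby positions get similar encodings, which gives the otherwise order-blind attention layers a usable notion of sequence order.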
speaker 1: All right, so now I'll talk a bit more about pre-training. Kiran explained how the transformer works, but with a language model, especially a large language model, you typically divide training into two stages. The first is a pre-training stage, where you train the neural network from scratch, from randomly initialized weights, to give it more general capabilities. A big portion of this is the data itself. The data is the fundamental fuel that allows your model to learn, because that's what the model is learning from. So your goal with pre-training, again, is to train on a large amount of data to obtain some general level of capabilities and overall knowledge or intelligence. And this is arguably the most important aspect of training, especially pre-training, because LLMs learn based on statistical distributions, predicting the next token given previous tokens. To learn this effectively, you typically need a large amount of data. So, given its importance, how do we maximally leverage it? Smart data strategies for pre-training are definitely one of the most important topics these days. I'll briefly touch upon two of the top projects I recently worked on, at two different scales. The first looks at what makes small, childlike datasets potentially effective for language learning, on the smaller scale. The second looks at smart data strategies for training large models on billions or trillions of tokens, on the much larger scale. So why are humans able to learn so efficiently? This looks at how human children learn and interact with an environment and learn language, compared to a model like ChatGPT; it's a bit analogous to how the human brain learns language, and learns in general, compared to something like a neural network.
So some potential key differences are that humans learn continuously. We're continually learning; we don't just pre-train. We don't just sit in a chair, have someone read the whole Internet to us, and then stop learning from there. That's unlike a lot of current models, which are more single-pass pre-training models. Further, we have more goal-based approaches to learning and interaction with the environment; that's a major reason we learn. Whereas, again, these models are typically just pre-trained on large amounts of data using next-token prediction or autoregression. Further, we learn through continuous multimodal or multi-sensory data. It's not just text only; we're subconsciously exposed to probably hundreds of senses that guide the way we learn and approach our daily lives. Further, I believe our brains are fundamentally different in that we probably learn in more structured or hierarchical manners, for example through compositionality, rather than, again, simply next-token prediction. The focus of this project in particular is on the data differences. So again, humans are exposed to dialogue from people we talk to, and storybooks, especially as children, compared to large amounts of data on the Internet. So this is a work that was published. Why do we care about small models and training on small amounts of data? Well, this would greatly improve the efficiency of training and using large language models, and it opens the door to potential new use cases, for example models that can run on your phone, that you can run locally, and so forth. Smaller models trained on less data are also more interpretable and easier to control or align, whether it's for safety purposes, to reduce bias, and so forth, to ensure people are using them for safe reasons and you have appropriate guardrails in place.
This will also enhance open-source availability, allowing research on and usage of these models by more people around the world, rather than simply companies with large amounts of compute. And in general, this might even allow us to better understand the other direction, which is how humans are able to learn so effectively and efficiently. So this work is titled "Is Child-Directed Speech Effective Training Data for Language Models?", which I presented at EMNLP in Miami last November. Again, the hypothesis here is that children probably learn fundamentally differently from LLMs. This is why we're able to learn on several orders of magnitude less language data than many of these large language models these days, which require trillions of tokens. Now, there are several hypotheses. One is that the data we receive as humans is fundamentally different from what LLMs get, right? Rather than just training on Internet data, we actually interact with people, we talk to people, we hear stories that our parents and teachers tell us, and so forth. Another is that maybe the human brain just fundamentally learns differently, so our learning algorithm is just different from large language models'. And another is that maybe it's the way, or the structure in which, we receive this data. Any data we receive is somewhat curricularized: we start off with simple data, simple language as a child, and then learn more complex grammars; we hear more complex speech from our parents, coworkers, and so forth. Anything we do, whether it's learning math, we start simple and then move on to more difficult problems. Whereas with language models, you typically don't care too much about ordering or curriculum. So there are multiple different hypotheses here. In order to test some of these, what we did was train some small GPT-2 and RoBERTa models on five different datasets.
One is CHILDES, which is natural conversation data with children; this is transcribed. Then we collected a synthetic version called TinyDialogues, which I'll discuss more later. Then BabyLM, which is a diverse mixture of different types of data, including Reddit data, Wikipedia data, and so forth, so this is closer to your typical large language model pre-training data. And then we also did a bit of testing with Wikipedia as well as OpenSubtitles, so movie and TV transcriptions. So we collected TinyDialogues. This was inspired by the fact that, as I said, a lot of our learning as children is through conversations with other people. And conversations naturally lead to learning, right? We talk to someone, they give us feedback, we reflect on how the conversation went. So it's both peer feedback and self-reflection. Furthermore, conversations lead not only to learning of knowledge, but of other things like ethics and morals, for example parents or teachers telling us as children what's right or wrong to do. And there are many different types of conversations you can have with many different types of people, leading to a lot of diversity in learning. So what we did was collect a fully grammatical and curricularized conversation dataset with a limited, childlike, restricted vocabulary using GPT-4. We collected different examples that differ by child age, the different participants in the conversation, and so forth. Here are just some examples of data points in our collected dataset. You'll see that as the age goes up, the utterances or conversations become more complex and longer, and the participants also differ appropriately by age. We also ran a curriculum experiment, where we ordered either by ascending age order, so the model will first see two-year-old conversations, then five-year-old conversations, then ten-year-old and so forth, versus descending order.
Maybe it's possible a language model might actually learn better somehow from more complex examples first. And then, of course, there's the typical baseline of randomly shuffling all your data examples. We have some basic evaluation metrics targeted at fundamental capabilities: one is basic grammatical and syntactic knowledge, and another is a free word-association metric called word similarity for assessing more semantic knowledge. You see here, across the different datasets, that training on childlike data actually seems worse than a heterogeneous mixture of plain Internet data like BabyLM. Both metrics degrade quite substantially, especially on CHILDES, the more natural conversation dataset between children and their caregivers. And you'll see, in terms of curriculum, we don't see many substantial differences no matter what order you provide the examples to the model in, which is, again, surprising, because as humans we go from simple to more difficult. Looking more closely at convergence behavior, or loss curves, you'll see here that the training loss has these cyclical patterns depending on the buckets you use for curriculum. But the validation loss, which is what you really care about, so generalization and learning, has the exact same trend no matter what order you feed the examples in, which is, again, a very interesting finding. So overall, we see that diverse data sources like BabyLM seem to provide a better learning signal for language models than purely child-directed speech. We do see, however, that our TinyDialogues dataset noticeably outperforms the natural conversation dataset, likely because that dataset is very noisy, whereas ours is, again, synthetically collected with GPT-4. And again, global developmental ordering using curriculum learning seems to have negligible impact on performance.
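The three curriculum conditions described above (ascending age, descending age, random shuffle) amount to a simple data-ordering step before training. A hypothetical sketch; the `age` field and record layout are illustrative, not the actual dataset schema:

```python
import random

def order_examples(examples, curriculum="random", seed=0):
    # Order the training examples according to the chosen curriculum.
    if curriculum == "ascending":
        return sorted(examples, key=lambda ex: ex["age"])       # simplest dialogues first
    if curriculum == "descending":
        return sorted(examples, key=lambda ex: ex["age"], reverse=True)
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)                       # random-order baseline
    return shuffled

data = [{"age": 10, "text": "..."}, {"age": 2, "text": "..."}, {"age": 5, "text": "..."}]
ascending = order_examples(data, "ascending")    # two-year-old dialogues first
descending = order_examples(data, "descending")  # most complex dialogues first
```

The finding above is that, at this small scale, all three orderings produced essentially the same validation loss.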
So overall, we can conclude that it's possible that other aspects of children's learning, not simply the data they're exposed to, are responsible for their efficient language learning: for example, learning from other types of information like multimodal information, or the fact that the learning algorithm in our brain is just fundamentally different and more data-efficient than language modeling techniques. If you wish to learn more, we have our datasets released on Hugging Face as well as GitHub, and the paper is up on arXiv as well. So now let's go bigger scale. We were just investigating small models trained on small amounts of data, similar to a human child. Now what about current large models, billions of parameters trained on trillions of tokens? During my last summer internship, I worked on a project with NVIDIA titled "Maximize Your Data's Potential: Enhancing LLM Accuracy with Two-Phase Pretraining". This is about optimized data selection as well as training strategies in large-scale pre-training. A lot of works like Llama highlighted the effectiveness of different sorts of data mixtures, but didn't really shed light on the exact mixtures and how those decisions were made, whereas we now know data blending and ordering is crucial to effective LLM pre-training. So can we shed more light on this? That's what our work does. Firstly, we formalize and systematically evaluate this concept of two-phase pre-training, and we show that, empirically, it improves over continuous training, which is typically what's done with LLM training, where you just feed in all the data rather than separating it into particular buckets or a different schedule. We also do a fine-grained analysis of data blending for these two pre-training phases. And we have this notion of prototyping blends on smaller token counts before scaling up.
So this two-phase pre-training approach is inspired by how pre-training and post-training work. The first phase is on more general, more diverse data, to learn more broadly. The second shifts to more high-quality, domain-specific data such as math and so forth. However, it's important to balance quality and diversity in both phases, as upsampling any dataset too much can lead to overfitting. So firstly, does two-phase training actually help? We found that all of our two-phase pre-training experiments outperformed the baseline of simply continuing training in a single phase. And this is noticeably better than a randomized mixture of both phases, as well as the natural data distribution, compared to our upsampled data distribution for phase two. We also showed that this scales with both model scale and data scale: if you blow up the token counts as well as the model size, performance further improves with our two-phase pre-training compared to a single phase. This also highlights the effectiveness of prototyping on smaller data blends before scaling up. Furthermore, we investigated the duration of phase two. Should we train on diverse data for a little bit and immediately switch to highly specialized data like math, or should we wait longer? What we found is that performance improves up to a point, around 40% of training, until there are diminishing returns, likely from overfitting, because specialized data is, well, more specialized: there's typically a smaller amount of it, and it's less diverse compared to things like web crawl data. So too much of it can lead to detrimental or diminishing returns.
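The schedule described above can be sketched as a step-dependent choice of data blend. The source names and mixture weights below are made up for illustration; only the shape of the schedule follows the discussion: a broad general blend first, then an upsampled high-quality/domain blend for roughly the last 40% of training.

```python
def blend_for_step(step, total_steps, phase2_fraction=0.4):
    """Return the sampling weights over data sources at a given training step."""
    phase1_general = {"web_crawl": 0.7, "books": 0.2, "code": 0.1}           # diverse
    phase2_quality = {"math": 0.4, "wiki": 0.3, "code": 0.2, "web_crawl": 0.1}  # upsampled quality
    if step < (1 - phase2_fraction) * total_steps:
        return phase1_general   # phase 1: learn broadly from diverse data
    return phase2_quality       # phase 2: shift toward high-quality, domain-specific data

early = blend_for_step(10_000, 100_000)  # still in the general phase
late = blend_for_step(80_000, 100_000)   # switched to the quality phase
```

Making `phase2_fraction` much larger than about 0.4 is where the diminishing returns mentioned above would show up, since the smaller, less diverse specialized sources get repeated too often.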
So overall, we see that a well-structured two-phase pre-training approach with careful data selection and management is essential for optimizing LLM performance while maintaining scalability and robustness across different downstream tasks. And in case you're interested, this paper is also up as a preprint on arXiv. So the overall takeaway from these two projects, and what I wanted to get at, is that data effectiveness, especially for pre-training, is not just about the amount of data, but about the quality of the data, the ordering and structure of the data, and how exactly you use it. For our first project, we saw the negligible impact of global ordering in small-scale training, but we saw that phase-based training at larger scales is highly effective. In general, smart data decisions are essential for models to generalize across tasks. So the takeaway is that our research underscores that effective language modeling isn't just about amassing data, but about smarter data organization that harnesses its structure, quality, and characteristics. And by continuing to refine data-centric approaches, the future of LLM training promises smarter, more efficient, and highly adaptable models. So now we'll move to the second stage after pre-training, which is post-training, which Chelsea will talk about.
speaker 3: All right, so we have a pre-trained model. Now what? How do we adapt it to specific tasks and different domains? Some major strategies include fine-tuning, for instance reinforcement learning with human feedback, prompt-based methods, or some sort of RAG architecture and retrieval-based methods. One major approach is called chain-of-thought reasoning; I'm sure you've all heard of it by now. It's essentially a prompting technique to get the model to think step by step, where showing the intermediate steps provides guidance. This is similar to the way humans think: we typically break down a problem into subsequent steps to help us better understand the problem itself. Another benefit of chain of thought is that it provides an interpretable window into the behavior of the model. And it suggests that there is more knowledge embedded in the model's weights than plain prompting elicits. This here is an example of chain of thought. On the left, we have the model solve a problem in a one-shot manner, which turns out to arrive at the wrong answer. On the right, it produces a sequence of reasoning chains, and ultimately it arrives at the correct answer. Naturally, this brings up an extension of chain of thought called tree of thought. This is another prompting technique, but instead of producing a single reasoning path as chain of thought does, it considers multiple reasoning trajectories and then uses some sort of self-evaluation process to decide on the final output, such as majority voting. In the picture, you can see that tree of thought generates different reasoning paths and selects the best one at the end. Another way is through program of thought, which generates code as the intermediate reasoning steps.
And overall, what this does is offload some of the problem solving to a code interpreter. So it formalizes language into programs to arrive at more precise answers. We have seen that this sort of problem decomposition seems helpful for different tasks. One way is through Socratic questioning, which is basically using a self-questioning module to propose subproblems related to the original and solve those in a recursive manner. For instance, if the question is "what fills the balloons", this leads to the next subquestion, "what can make a balloon float". By decomposing the original problem into subsequent subproblems, the model can solve it better at the end. Finally, another problem-decomposition method is through computational graphs. This basically formulates compositional tasks as computation graphs by breaking the reasoning down into different subprocedures and nodes. The key takeaway here is that transformers can solve compositional tasks by reducing reasoning to subgraphs, and this is without developing some sort of systematic problem-solving skill, right?
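The contrast drawn above between one-shot answering and chain-of-thought prompting comes down to what the few-shot exemplars show the model. A minimal sketch of a chain-of-thought prompt; the exemplar wording is illustrative:

```python
# A few-shot exemplar whose answer spells out the intermediate steps,
# so the model imitates step-by-step reasoning on the new question.
chain_of_thought_prompt = """\
Q: Roger has 5 tennis balls. He buys 2 cans with 3 balls each. How many balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. They used 20 and bought 6 more. How many do they have?
A:"""

# A one-shot (non-CoT) prompt would instead show only "The answer is 11."
# as the exemplar answer, giving the model no intermediate steps to imitate.
```

Tree of thought and program of thought change what comes after the `A:` (multiple sampled reasoning paths plus self-evaluation, or executable code), but the prompting scaffold is the same.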
speaker 1: So Chelsea touched on chain of thought and everything that expands upon or improves it, which is mainly a prompting-based method for inference time. Next I'll talk more about reinforcement learning and feedback mechanisms, which are typically used for things like further fine-tuning a pre-trained model. The most popular is reinforcement learning with human feedback, or RLHF. This trains a reward model directly from human feedback. What you do is take your pre-trained model, get it to generate several responses, and then you typically take a pair of responses and have humans rate which one they prefer. You can train a reward model based on this, and then optimize the policy using a reinforcement learning algorithm such as PPO. Now there's an improvement to PPO-based RLHF called DPO, or direct preference optimization. This more directly trains the model to prefer outputs that humans rank higher, rather than using a separate reward model, which is much more efficient. Basically, you can think of it as tying the reward more closely into the loss function itself, by helping the LLM maximize the likelihood of generating preferred responses and minimize the likelihood of the responses that humans did not prefer. And there's an extension to RLHF called RLAIF. This simply replaces the human with an AI: you typically have a pretty good LLM that's able to provide accurate preference judgments of which response it prefers, and this is less costly than human annotators. Then you do the same thing: you train a reward model based on the LLM's preferences instead. And they found that human evaluators rated RLAIF-tuned outputs as around similar to RLHF's, showing that this is a more scalable and cost-efficient approach compared to human feedback.
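The DPO idea described above, raising the likelihood of the preferred response relative to a frozen reference model instead of training a separate reward model, can be sketched for a single preference pair. The log-probabilities here are stand-in scalars (one summed log-prob per response), and `beta` is the usual strength hyperparameter:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Implicit reward of a response: how much more the policy likes it
    # than the frozen reference model does. The loss is -log(sigmoid(margin)).
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Loss is lower when the policy shifts probability toward the preferred
# (chosen) response relative to the reference model, and higher otherwise.
loss_aligned = dpo_loss(-1.0, -5.0, -2.0, -2.0)     # policy prefers chosen
loss_misaligned = dpo_loss(-5.0, -1.0, -2.0, -2.0)  # policy prefers rejected
```

Minimizing this over a dataset of preference pairs is what replaces the separate reward model plus PPO loop in standard RLHF.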
But there's one disadvantage here, which is that it really depends on the capabilities, the accuracy of judgment, of the LLM you're using to provide your preferences. If you're using one that is incapable or very noisy, that's going to hurt your post-training. The next is something very hot right now, which was used in DeepSeek-R1 as well as some of their other models, like their math ones. This is called group relative policy optimization, or GRPO. It's a variant of the PPO optimization algorithm, but rather than ranking simple pairs of responses, it scores a whole group of responses relative to one another. This provides richer feedback, which is more fine-grained and much more efficient than simply ranking pairs of outputs. This helps stabilize training, which is one reason DeepSeek is much more data- and compute-efficient. They also saw that it improves things like LLM reasoning, especially on math. There have also been other variations of RLHF and so forth. One is called Kahneman-Tversky Optimization, or KTO. This modifies the standard loss function typically used in post-training to account for human biases such as loss aversion. As humans, we typically care more about avoiding disastrous or negative outcomes than achieving positive ones; we're more risk-averse in most cases, although it's very dependent on the person. So they encourage the AI to behave in a similar manner by avoiding negative outcomes, and this basically adjusts the training process to reflect that. They showed that this is able to improve performance on different tasks, although it depends on the task; overall, it shows more human-like behavior on particular tasks. And these are just a subset of the RLHF-style reinforcement learning and feedback-based algorithms.
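The group-relative scoring at the heart of GRPO can be sketched as follows: sample a group of responses to the same prompt, score each one, and normalize each reward against the group's mean and standard deviation, so no separate learned value function is needed. The reward values below are illustrative:

```python
import statistics

def group_relative_advantages(rewards):
    # Advantage of each response = its reward standardized within the group.
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0 for _ in rewards]   # all responses tied: no learning signal
    return [(r - mean) / std for r in rewards]

# Four sampled responses to one prompt, scored by some reward function.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Responses scored above the group average get positive advantage (reinforced) and those below get negative advantage, which is the fine-grained signal the transcript contrasts with pairwise ranking.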
One I want to touch upon before I finish off is this thing called personalizing RLHF with variational preference learning. So the authors sort of saw that for different demographics, you have different preferences. So typical RLHF sort of averages everything together. So what the authors do is they introduce a latent variable for every user preference profile, for example, a different demographic like children, adults and so forth, and train reward models conditioned on these sort of latent vectors or factors. So this leads to something they call pluralistic alignment, which is improving the reward accuracy for these particular demographics or subgroups. So it enables a single model to sort of adapt its behavior to different preference profiles and different demographics or groups of people. And now I'll hand it back to Chelsea to talk about self-improvement.
speaker 3: All right. So Yeah, let's talk a little bit about self-improving AI agents. So what exactly is an AI agent? So it's essentially a system that perceives the environment, makes decisions and takes actions towards achieving some sort of specific goal. And usually this goal is given by the human. So for instance, like gameplaying, task solving or like research assistance. And there's several components of an AI agent. So one, it's goal-directed. Two, it can make its own sort of decisions. Three, it can act iteratively. Four, there's usually some sort of memory component to it and like state tracking component to it. Five, there's some agents that can use tools such as like API calls or like function calling. And finally, it can learn and adapt on its own.
speaker 1: Okay, Yeah. So self-
speaker 3: improvement, basically models can reflect on their own outputs, leading to iterative improvements over time. So this typically consists of several steps. There's you know some sort of reflection of its own internal states. There's an explanation of its own reasoning process. It can evaluate the quality of its own outputs. And finally, it can also simulate multi step reasoning chains. So one technique is refinement. So this is where you have some sort of iterative prompting technique, where an llm critiques and improves its own outputs. So it generates some sort of initial response and then refines it over time. And this kind of uses feedback loops to sort of enhance the overall performance. So an example would be like it generates some answer, and then it evaluates itself for weaknesses and inconsistencies. And finally, it refines the response based on the own self critique method. Another technique is called a self reflection. So this is where a model learns from past mistakes and adjusts future responses based on past failures. So there usually is some sort of like long term memory component to this. And an example would be like the model first detects some sort of like weak response from its own outputs, and then it kind of reflects on its own mistakes and generates some sort of improved answer to it. And over multiple iterations, accuracy and reasoning should improve over time. Another technique is called react, which is essentially just combining reasoning with external actions such as you know api calls or like retrievals from a database. And this is basically some model that can interact dynamically with its environment. So it gets feedback from taking multiple action sequences and kind of incorporating that into its outputs. So for instance, the model will generate a reasoning plan and then it will call some sort of external tool, such as like web search or some api call. 
And then this model incorporates the retrieved data into its final response. And finally, this leads us to a framework called language agent tree search, or LATS. So basically what LATS does is it extends the ReAct framework to incorporate multiple planning pathways. So you can kind of think of this as analogous to chain of thought versus tree of thoughts. It kind of gathers feedback from every path to improve the future search process, which is kind of like a verbal reinforcement learning inspired technique. And it uses Monte Carlo tree search to optimize planning trajectories, where in the tree structure, every node represents a state and every edge represents an action that the agent can take. So an example would be like it generates the n best new action sequences, and then it will just execute them all in parallel. Then it will use some sort of like self-reflection technique to score each one. And then overall, just continue exploring from the best state and update the probabilities of the past nodes. And Yeah, all right.
speaker 1: Next I'll be .
speaker 2: talking about a few other applications of transformers outside of language. I'll start with vision transformers, which have taken vision by storm. The dohere is that. So as I talked about, transformers taken sequences, right? But images aren't sequences. However, what the authors of the vit paper came up with was to split an image up into patches, which can then be embedded to form a sequence, passing this through a simple transformer yielded very good results. For instance, on classification, just by adding an mlp head to the end, you might ask, why apply transforces from when cnns are such a mainstay in the field? The main reason is that when you have a very large data set saying the tens of millions of examples, transformers bring in less inductive biases, cnn's assume locality, and that pixels are grouped together. Whereas with transformers and treating your images as sequences, you can see better results when you have enough data to train them. One common architecture that was impacted by the Swiss clip, which uses vits for its image encoders. This is the basis of models like GPT -4 zero or other vision language models, and essentially works through contrasted learning. So you take a data set of paired images and text pairs, and you train your model to align the encoded representations of both. So if you have an image of a cat and the word cat, then you can learn to align those embeddings. And like I mentioned, these have been applied to vision language models like GPT -4 or four o. The way these are trained is you can atenate your encoded image and text, and you can train in different stages such that your model learns to take both to account for its responses. And these have done very well on benchmarks and tasks, for instance, like test questions like I've shown here. Next, I'll talk about a bit of my work and neuroscience, which applies vits to other kinds of data. So a mainstay in my field is functional magnetic resonance imaging, or fMRI. 
Essentially, this captures the amount of oxygen that each voxel, or part of your brain, is using at a given point. And this provides a very detailed proxy for the activity going on in your brain. It can be used to diagnose diseases and capture various amounts of data for a better cognitive understanding. However, this is very high dimensional; you might have like a million or so voxels, or 100,000 in the brain. So the first step to using this data with transformer models is usually averaging across well-known regions, or just grouping together voxels. And this gives you a more computationally tractable number of parcels that you can train on. A traditional tool in this field was just to use linear pairwise correlation maps. And just these were enough to get pretty good diagnoses of things like Parkinson's. However, with the advent of tons of computer vision techniques, we can apply larger and more sophisticated models to these tasks. One cool, large body of work in this area is divvying up the brain into different functional networks. So let's say like your vision system, or your daydreaming network, or control, etcetera. And I'll get into how we use this to sort of guide our work. So like I mentioned, early ML models just took like linear correlation maps, so making lots of assumptions about the data, and just applied typical neural networks to the task for regression or classification tasks, or in some cases, graph-based analyses to try to get a deeper understanding of how different parts of the brain interact with each other. With computer vision, we can take our raw data and just throw that at a transformer model, and that does very well as a pre-training objective. So what we do is, let's say we have some number of ROIs across time. We can just mask out some portion of that data, pass the rest of the data through a transformer model and have it predict this portion. You repeat this across a large data set and all of your ROIs.
And this provides a very good self-supervised training objective for this task. So self-supervised essentially means that there is no paired label data here. We are essentially just using our raw data and posing our objective such that we can learn directly off of it. Once you've trained this sort of model, you have these dense representations inside the model that can be applied downstream to various tasks, like predicting patient attributes or the risk of disease. And you can also look at the weights that your model has learned to do analyses of brain
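A minimal sketch of that masking setup (illustrative only; the actual parcellation, masking scheme, and model details in the work differ):

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_rois(activity, mask_frac=0.2):
    """Masked-prediction setup sketched for fMRI parcel time series.

    `activity` is (num_rois, num_timesteps). We hide a random subset of
    ROIs; the transformer sees the rest and is trained to reconstruct
    the hidden rows, giving a self-supervised objective with no labels.
    """
    n_rois = activity.shape[0]
    n_masked = max(1, int(mask_frac * n_rois))
    masked_idx = rng.choice(n_rois, size=n_masked, replace=False)
    visible = activity.copy()
    visible[masked_idx] = 0.0          # zero out the hidden ROIs
    targets = activity[masked_idx]     # what the model must reconstruct
    return visible, masked_idx, targets

activity = rng.standard_normal((100, 200))  # 100 parcels, 200 timesteps
visible, idx, targets = mask_rois(activity)
assert np.all(visible[idx] == 0.0)
assert targets.shape == (20, 200)
```

Training then minimizes, say, the mean squared error between the model's prediction for the masked rows and `targets`, exactly as in masked language modeling but over brain parcels instead of tokens.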
speaker 1: networks. So in brief.
speaker 2: our approach essentially consists of taking the activity in the entire brain, partitioning out some small region. Let's say it's your vision system. You PaaS the unmaportion into a transformer model, which learns to predict the MaaS portion, and you can compare this to your ground truth to provide your training objective. One key thing we use here is cross attention. So what we talked about before with language, with self attention, wherein you are attending to the current sequence you're looking at, in cross attention, you have two different sequences, let's say in machine translation, you have one in English and in French. And essentially you apply attention between the two sequences instead of just on a single sequence. So our most basic architecture takes advantage of this through just a singular cross attention decoder. Having a very small model makes for better interpretability. And like I mentioned, this model just learns to predict MaaS brain regions from unmasked ones. Once we've done this, we can analyze again the attention weights to gain a deeper understanding of networks and also apply this to downstream tasks. So some modeling results here I've plotted, like the brain activity from different patients, and you can see that the model does pretty well in matching the ground truth for two networks that I've shown here. The salience network, which is involved in your senses and decision making. And the default mode network, or dmwhich, is responsible for like daydreaming or just recapitulating your brain information when you're not doing a certain task. On the bottom, we have the attention weights for this model, which I've split up by all of the other networks. So for instance, on the left, when predicting the salience network, we can see from our model that it is heavily dependent on the default mode and control networks. 
So this gives us a better understanding of how different brain networks are connected to each other, or how they might share information inside the brain. For other networks, though, like vision, which are more singular, we can't predict them very well. Or subcortical regions, say those involved in like memory, we also cannot predict very well. So this is all well and cool. We can predict brain activity, but what can we do with this model? If we simply replace one component of the model with a learnable token, which corresponds to predicting Parkinson's disease, then we can use this model to predict that ailment. So if you look on the right, after some fine-tuning on a labeled data set, we can see some clustering in the model's embedding, which corresponds to getting close to 70% accuracy in predicting this disease, which is much higher than using the correlation-based methods or linear assumptions that I talked about earlier.
speaker 4: All right, so now that we have some background on these transformer models and a couple of their applications, let's talk about the future and what's next. So overall, these transformer models can enable a lot more applications across every industry and sector. This includes generalist agents, longer video understanding and generation, applications across the finance and business sector, and domain-specific foundation models; for example, one could imagine a doctor GPT or a lawyer GPT, or an insert-field-here GPT. As well as potential real-world impacts like personalized education and tutoring systems, advanced healthcare diagnostics, environmental monitoring and protection, real-time multilingual communication, as well as interactive environments and gaming, for example, non-playable characters. What is missing, though? What information might we need, and what can we develop in the future? Currently, we're missing: reduced computational complexity, enhanced human controllability, alignment with the language models of the human brain, adaptive learning and generalization across different domains, and finally, multi-sensory multimodal embodiment, like intuitive physics and common sense. So one might consider these barriers to developing artificial general intelligence. And these are some of the limitations of current transformer models. Some other things that are missing include infinite and external memory like neural Turing machines, and infinite self-improvement capabilities like continual or lifelong learning. This is another central tenet of human learning that we're not able to replicate at the moment. Also complete autonomy, including curiosity, desires and goals, and long-horizon decision making, as well as emotional intelligence, social understanding, and of course, ethical reasoning and value alignment.
speaker 1: Right. So there's still a sort of plethora of remaining weaknesses or challenges around transformers, large language models and AI in general these days. So I'll touch upon a couple of them briefly. The first is, like I mentioned earlier, efficiency: being able to minify or sort of have tiny LLMs or models that you can run on your phone, on your smartwatch, etcetera. So that's a big trend these days: using LLMs for everyday applications and purposes. And again, you want to be able to run them quickly and easily on smaller devices. Right now, there is more and more work on smaller and more efficient open-source models, things like DeepSeek, Llama and Mistral, but they're still somewhat large and a bit expensive, especially if you're looking to fine-tune. They're still not accessible to everybody, especially on smaller devices. So in the future, again, we wanna aim to have the ability to sort of fine-tune or run these models locally on whatever device you want. The second is, as our LLMs scale up to trillions of parameters, trained on trillions of tokens across the Internet, this makes them a huge black box that is difficult to understand or interpret. It's hard to know what exactly is going on behind the scenes: when you ask it to solve XYZ and it comes up with answers ABC, how exactly did it get there? Why did it choose those answers, and so forth? So more work on interpretability for LLMs will give us a better idea of what or how to improve them, as well as ways of controlling them and better alignment, for example, being able to prevent them from producing certain outputs that might be unsafe or unethical. So there's this area which has gotten even more popular recently called mechanistic interpretability, which is trying to understand how individual components or operations, even sometimes down to the individual node level, so very granular in a model, contribute to its overall decision-making process.
With the goal, again, of sort of unpacking this black box for clear insight on how exactly they work behind the scenes. Next, I feel like we're approaching, or we're already seeing, diminishing returns from simply scaling up. So larger models on more data does not seem to be the end-all solution. So one-size-fits-all and frozen pre-trained models have already started leading to diminishing returns. So pre-training performance, the first sort of half of training LLMs, is likely saturating. Hence, there's been more focus on post-training methods: everything we've talked about, feedback and RL mechanisms, and prompting methods like chain of thought, self-improvement and refinement and so forth. However, all of these post-training mechanisms are going to be fundamentally limited by the overall performance or capabilities of the base model. So you can argue that the pre-training is fundamentally what gives the basis or the foundational knowledge and capabilities to the model. So we should not just stop investigating pre-training just because we're hitting scaling limits. Furthermore, too much post-training can actually lead to an issue called catastrophic forgetting, where the model forgets stuff it's learned beforehand, for example during pre-training, because you're overloading it with tons of new information in a new domain or a new task during post-training. So how do we break through this sort of scaling law limit? Some potential things to investigate would be new architectures. There are different things like Mamba, state space models, those sorts of architectures. And it would be good to see more investigation of even non-transformer architectures, which is a bit ironic in this class, as it's Transformers United. But we always encourage more diversity and thinking outside the box.
Also, again, everything I've talked about: high-quality data and smart data ordering and structuring strategies, and overall improved training procedures, improved algorithms, loss functions, optimization algorithms and so forth. Another goal, as we've mentioned several times, is to be able to bring these advanced capabilities to smaller models. Furthermore, we would still encourage more theoretical and interpretability research, including things like cognitive science and neuroscience inspired work, some of which Karan and I have talked about and done recently. And so the next step will be models that are not just larger, but ones that are smarter and more adaptable. So again, there's this one major thing or major weakness that I think still bridges the gap between AI and humans, which is continual or lifelong learning. So AI systems that can continuously improve by learning after deployment, after being pre-trained, using implicit feedback, real-world experience and so forth. So essentially, this is infinite and permanent sort of fundamental self-improvement. We're not just talking about RAG or retrieval, like putting knowledge in a retrieval database that you can retrieve at test time, but updating the brain or the weights of the model continuously. So this is similar to us, right? So we're learning every day. I'm learning right now by talking to you. I learn every time I talk to somebody else as I'm going through my daily life. But these models, after they're frozen or pre-trained, that doesn't really happen. The only way they truly learn, or their brain or weights are updated, is through fine-tuning. And again, we don't do that, right? We don't sit in a chair every three months and have someone read the Internet to us or something like that. So again, this is almost wasted work, right? So currently during inference, the models are not actually learning and updating their weights when they're talking.
When ChatGPT is talking to you, it's not truly updating its brain or weights. So this is a very challenging problem. But in my opinion, it's likely one of the keys potentially to AGI or truly human-like AI systems. So there's different current work that tries to tackle this. There's things like fine-tuning a smaller model based on traces from a larger model, things like model distillation, related to a lot of things like self-improvement and so forth. But this is again not naturally continual learning. So some questions are: what mechanisms could potentially truly enable real lifelong learning? Will this be gradient updates, so actually updating the brain? Will it be things like targeting particular nodes in the architecture? Will it be having things like particular memory architectures, or different parts of the neural network solely focused on continuous updates and learning, or even things like meta-learning and looking more at the broader scope of things? So one line of work which has seen a bit of traction is model editing. So this is related to work on mechanistic interpretability. So instead of updating the whole model, if we're given a new fact or a new data point, can we target specific nodes or neurons in the model that we should update? So one work called rank-one model editing, or ROME, tries to do this through causal intervention mechanisms to determine you know which neuron activations most correspond to particular factual predictions, and then updating them appropriately. But as you can possibly suspect, this has a lot of weaknesses. So firstly, this works mainly for knowledge-based things or simple facts. What if we want to update the actual skills or capabilities of a model? We want it to be better at math in general. We want it to be better at advanced and logical reasoning like humans. Then something like model editing based on factual predictions doesn't seem like it'll work.
The second is these are targeting one fact at a time, so it's not easy to propagate these changes to other nodes based on related facts. For example, let's say we want to update a fact about someone's mother. Then we should also update the corresponding fact about that person's brother, because you know they have the same mother, but this sort of approach only updates the fact for the original person in question, and not any of the relatives. So this is just one example. So there's a lot of other works which have sprung up recently in continual learning, which is good, that this sort of area has seen more work. So I will very briefly describe some of these. One is a follow-up directly related to what I just said about ROME, but it does mass editing of factual knowledge instead of a single fact or memory at a time. It's able to simultaneously modify thousands of facts at once, which might be related to each other, like I said, which is useful. There's things like continually evolving from mistakes, so it actually identifies the LLM's mistakes, somewhat similar to the self-improvement Chelsea was talking about, but incrementally updates the model to self-improve. There's things like lifelong mixture of experts. So what it does, instead of having a simple fixed mixture-of-experts architecture, it continually adds new experts for different domains over time, while freezing potentially past experts which are no longer useful or don't need to be updated, to avoid things like catastrophic forgetting. So this is a very smart sort of approach. Another one enables continual task learning using only prompting, without updating model weights, by summarizing past knowledge into a compressed prompt memory. However, not a criticism of this work, but again, this is not technically updating the brain or the fundamental capabilities of the model. So this is more of a prompt-only approach.
And another one of these is called progressive prompts, which again learns a soft prompt vector for each task and progressively compresses and composes them together, allowing LLMs to continually learn without weight updates or catastrophic forgetting. But again, my opinion is continual learning should update the brain or the weights of the model in some way. So thanks. That's mainly our lecture. So you know we gave a brief overview of transformers, how they work, talked about pre-training and especially how data is important for that, various post-training techniques, feedback-based mechanisms, prompting mechanisms like chain of thought, self-improvement, some applications to neuroscience, vision, and so forth, and some remaining weaknesses, like the lack of continual learning and data efficiency, being able to scale down and run these models on our phones. So before we send you guys off, I know we ended a bit early. So this class, going forwards, every week, in case you haven't attended before, we'll have a speaker, typically from industry or academia, come in to talk about the state-of-the-art work they're doing. And we have a cool lineup of speakers prepared for you guys for the remainder of the quarter. And some more logistical things: we'll be posting updates about lectures and so forth on our website, through the mailing list, Discord and so forth. So please join those if you haven't already. Thank you, guys. Hope you enjoyed the first lecture. And if anybody has any questions, feel free to come up and stay around. Thanks.

Latest Summary (Detailed Summary)

Generated 2025-05-18 15:51

Overview / Core Summary

This content summarizes the overview lecture on Transformers from the fifth iteration of Stanford's CS25 course. The lecture responds to the growing importance of Transformers and AI by giving students a platform to learn how they work and to hear about cutting-edge research. The core instructor team includes Steven Feng, Karan Singh, Chelsea Zou, and Jenny Duan, who introduced their respective research backgrounds in natural language processing, cognitive science, computer vision, neuroscience, multi-agent frameworks, technology ethics, and related areas.

The lecture dissects the core techniques and development of the Transformer. It first reviews the fundamental components, such as word embeddings, self-attention, positional encoding, and multi-head attention. It then focuses on data strategies for the pre-training stage, emphasizing the critical influence of data quality, structure, and usage on model performance, and contrasting the learning efficiency of humans and language models. Next, it details post-training strategies, including chain-of-thought (CoT) and its many extensions (tree-of-thoughts, program-of-thought, least-to-most prompting, self-notes, latent-space chain-of-thought, etc.) for enhancing reasoning, as well as a range of feedback-based reinforcement learning mechanisms (RLHF, DPO, RLAIF, etc.) for model optimization. It also introduces the concept of self-improving AI agents and paths to realizing them.

The lecture further showcases broad applications of Transformers in vision (ViT, VLMs), neuroscience (fMRI analysis), and other fields. Finally, it looks ahead to the future of Transformers and AI, discussing potential applications across industries, the challenges on the way to artificial general intelligence (AGI) (such as computational complexity, controllability, multimodal learning, and continual learning), the trend toward smaller, on-device LLMs, the importance of model interpretability, and the bottlenecks of scaling. The lecture stresses that the future of AI hinges on making models more intelligent, adaptable, efficient, and controllable, rather than simply larger.

Course Introduction and Goals

  • Background: Initiated by Div and Steven Feng in response to the growing importance of Transformers and artificial intelligence (AI) and their central place in our future lives.
  • Goals:
    • Help students understand how Transformers work.
    • Invite leading experts from industry and academia to share their cutting-edge research.
    • Advance learning and progress in AI and technology.
  • Format: Each week, a researcher from industry or academia presents their latest work on Transformers.
  • Structure of this lecture:
    1. Transformer fundamentals.
    2. Pre-training and data strategies.
    3. Post-training strategies (a recent hot topic).
    4. A brief look at Transformer applications.
    5. Current challenges and weaknesses.

Instructor Introductions

  • Steven Feng:
    • Third-year computer science PhD student at Stanford.
    • Undergraduate degree from the University of Waterloo in Canada.
    • Previously did research at Amazon and Nvidia.
    • Research: natural language processing (NLP), improving the controllability and reasoning of large language models (LLMs), cognitive science and psychology inspired work (bridging the gap between machine learning models and human learning efficiency), multimodal learning, and computer vision (e.g., diffusion models, image generation).
    • Interests: runs the piano club with Karan Singh, and mentioned an upcoming concert.
  • Karan Singh:
    • Second-year electrical engineering PhD student at Stanford.
    • Undergraduate degree from Cal Poly San Luis Obispo.
    • Research: medical imaging, computer vision, neuroscience (fMRI, ultrasound); currently working in Dr. Adeli's SAI lab.
  • Chelsea Zou:
    • First-year master's student in Symbolic Systems at Stanford.
    • Research interests: multi-agent frameworks, self-improving AI agents, and model interpretability and understandability.
    • Background: applied mathematics and neuroscience, with interdisciplinary research experience (computer vision, robotics, cognitive science).
    • Experience: currently works part-time at a venture capital firm, and will be a machine learning engineer at a conversational AI startup this summer.
  • Jenny Duan:
    • Undergraduate in Symbolic Systems at Stanford, with a co-term in Sociology.
    • Background: technology ethics and policy.
    • Experience: previously did product work at D.E. Shaw and research in technology ethics and policy. Will work this summer at Daydream, an AI fashion-tech startup in New York.
  • Div (not present; introduced by Steven Feng):
    • Currently on leave from the Stanford computer science PhD program to found AI Inc, an AI agent startup.
    • Research interests: robotics, AI agents.
    • May give a lecture on AI agents later in this course.
    • Previously worked at Nvidia and Google.
    • The original initiator of this course.

Course Logistics and Resources

  • Website: cs25.stanford.edu, where updates, the speaker lineup, and other information will be posted.
  • Zoom link: shared via the website for non-Stanford affiliates, waitlisted students, or those unable to enroll.
  • What you will gain:
    • A deep understanding of the architecture underlying Transformers and large language models (LLMs).
    • Guest lectures on applications in language, vision, biology, robotics, and more.
    • Exposure to new research from top researchers across the country.
    • Innovative approaches driving the next generation of models.
    • An understanding of AI's key limitations, open problems, and future directions.

Review of Core Transformer Concepts (Karan Singh)

  • Word embeddings:
    • Turn words into dense vectors in a high-dimensional space, since words themselves are not numbers and cannot be fed directly into a model.
    • The goal is to capture semantic similarity (e.g., "cat" and "dog" are more similar than "cat" and "car", even though the latter pair may be more similar character-wise).
    • Uses: visualization, learning in Transformer models, arithmetic (e.g., king - man + woman ≈ queen).
    • Classic methods: Word2Vec, FastText, etc.
    • The limitations of static embeddings (e.g., the polysemy of "bank") motivated contextual embeddings, which take a word's context in the sentence into account.
  • Self-attention:
    • Learns which other tokens in the sequence each token should attend to.
    • Implemented by learning three matrices: Query (Q), Key (K), Value (V).
    • QKV analogy: imagine searching a library for books on a specific topic (Query). Each book has a summary (Key) that helps identify its content. When Query and Key match, you retrieve the book's detailed content (Value). Attention performs a "soft match" across multiple Values, pulling information from several books.
    • Visualizations show how, in different layers of the model, different words connect to other words in the sentence.
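The Q/K/V computation summarized above can be sketched as a single attention head (illustrative NumPy, without the multi-head split or output projection):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over a token sequence X (n, d_model).
    Each token forms a query, key, and value; attention weights are a
    softmax over query-key dot products, scaled by sqrt(d_k).
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # rows sum to 1
    return weights @ V  # each token's output mixes all tokens' values

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))  # 5 tokens, d_model = 8
Wq, Wk, Wv = (rng.standard_normal((8, 4)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
assert out.shape == (5, 4)
```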
  • Positional encoding/embeddings:
    • Adds order information to the sequence, since the linear algebra alone carries no notion of position.
    • Without positional encoding, the model cannot distinguish the order of words in a sentence.
    • Implementations: e.g., sinusoidal functions, or simply labeling the first word 0, the second 1, and so on.
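For example, the sinusoidal scheme from the original Transformer paper can be written as:

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    """Sinusoidal positional encodings from the original Transformer
    paper: each position gets a unique pattern of sines and cosines at
    geometrically spaced frequencies, added to the token embeddings.
    """
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dims: sine
    pe[:, 1::2] = np.cos(angles)  # odd dims: cosine
    return pe

pe = sinusoidal_pe(50, 16)
assert pe.shape == (50, 16)
```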
  • Multi-head attention:
    • More heads and more parameters mean more diverse relationships in the sequence can be captured.
  • The final Transformer architecture: composed of the components above.
  • Transformers today:
    • Dominant in almost every domain: LLMs (GPT-4, GPT-3 [the transcript says "003"], DeepSeek), vision, speech, biology, video. Many of these applications will be covered this quarter.
  • Large language models (LLMs):
    • Essentially scaled-up Transformer architectures with enormous parameter counts.
    • Typically pre-trained on massive general-purpose text corpora (e.g., web data).
    • The training objective is usually next-token prediction.
    • As scale grows, new capabilities appear (emergent abilities): abilities absent in small models suddenly emerge past a certain scale.
    • Drawbacks: high computational cost, raising climate and carbon-emission concerns.
    • Generalization: large models generalize well and can be used "out of the box" via few- or zero-shot learning.

Pre-training Strategies and Data Research (Steven Feng)

  • Pre-training overview:
    • Training is typically divided into two stages: pre-training and post-training.
    • Pre-training trains a neural network from scratch (randomly initialized weights), endowing it with general capabilities.
    • Data is the fundamental "fuel"; models learn from data.
    • The goal is to train on large amounts of data to acquire general capabilities, knowledge, or intelligence.
    • Data is the most critical aspect of training (especially pre-training), because LLMs learn statistical distributions (predicting the next token from context) and need large amounts of data to learn effectively.
    • Core question: how do we get the most out of our data? Smart data strategies are among the most important topics today.
  • Steven Feng's two related projects:
    1. The effectiveness of small-scale child-directed datasets for language learning (small scale).
    2. Smart data strategies for training large models on billions/trillions of tokens (large scale).
  • Differences between human and LLM learning:
    • Continual learning vs. one-shot training: humans learn continuously, while many current models are pre-trained once.
    • Goal-driven, interactive learning vs. autoregressive learning: humans learn with explicit goals and interact with their environment; models usually learn autoregressively via next-token prediction.
    • Continuous multimodal/multisensory data vs. text-only or text+images: humans learn subconsciously through many senses; model data modalities are limited.
    • Structured/hierarchical/compositional learning vs. next-token prediction: Steven Feng argues the human brain may learn in a more structured way, not by simple statistical learning.
    • Data differences: children learn from conversations with people and storybooks (about 100 million words); LLMs learn from massive Internet data (trillions of tokens).
  • Why study small models and small data:
    • Greatly improve the efficiency of LLM training and use.
    • Open up new possibilities and use cases (e.g., models that run locally on a phone).
    • Improve interpretability, making models easier to control and align (safety, bias reduction).
    • Increase open-source accessibility for more researchers and users.
    • May help us better understand how humans learn so efficiently.
  • Project 1: Is Child-Directed Speech Effective Training Data for Language Models?
    • Background: children need far less data than LLMs to learn language.
    • Core hypotheses:
      1. Humans receive different data than LLMs.
      2. The brain's learning algorithm differs from LLMs'.
      3. The way/structure in which humans receive data (curriculum learning, simple to complex) differs from LLMs.
    • Experiments:
      • Models: small GPT-2 and RoBERTa.
      • Datasets:
        • CHILDES: transcripts of natural conversations with children.
        • TinyDialogues: a synthetic child-directed conversation dataset generated with GPT-4; grammatical, curriculum-structured, with restricted vocabulary, varied by child age, participants, etc.
        • BabyLM: a mixture of many data types (including Reddit, Wikipedia, etc.), closer to typical LLM pre-training data.
        • Wikipedia.
        • OpenSubtitles: movie and TV transcripts.
      • Design rationale for TinyDialogues: conversation promotes learning (feedback, reflection) and teaches knowledge, ethics, and morality.
      • Curriculum experiments: feeding data to the model in ascending age order, descending order, or randomly shuffled.
    • Evaluation metrics: grammatical and syntactic knowledge; word similarity (semantic knowledge).
    • Results and conclusions:
      • Diverse data sources (BabyLM) provide better learning for language models than purely child-directed speech.
      • Synthetic child-directed speech (TinyDialogues) is more effective than natural speech (CHILDES), possibly because the latter is noisier.
      • Global developmental ordering (curriculum learning) has no significant effect on model performance. Training loss curves show periodicity aligned with curriculum grouping, but validation loss (generalization) trends are the same.
      • Children's efficient language learning may come from other factors, such as learning from multimodal information, or brain learning algorithms that are inherently more efficient than current language modeling techniques.
    • Resources: the datasets are released on Hugging Face and GitHub; the paper is on arXiv.
  • Project 2: Maximizing Data's Potential: Enhancing LLM Accuracy with Two-Phase Pre-training (Nvidia internship project)
    • Background: optimizing data selection and training strategies for large-scale pre-training. Prior work (e.g., LLaMA) highlights the effectiveness of data mixtures but lacks concrete details. Data mixing and ordering are crucial for LLM pre-training.
    • Contributions:
      • Formalizes and systematically evaluates two-phase pre-training.
      • Empirically validates its superiority over continuous training (feeding all data at once).
      • Provides fine-grained analysis of the data mixtures for both pre-training phases.
      • Proposes prototyping at smaller token counts and then scaling up.
    • The two-phase pre-training approach:
      • Phase 1: use more diverse general data to build a broad foundation of language understanding.
      • Phase 2: shift to higher-quality, domain-specific data (e.g., math). Quality and diversity must be balanced to avoid overfitting.
    • Key results:
      • Effectiveness: all two-phase pre-training experiments beat the single-phase baseline, and clearly beat random mixing or the natural data distribution.
      • Scalability: the method scales effectively with model size and data size, with performance improving accordingly.
      • Phase-2 duration: performance improves as the phase-2 share grows, peaking around 40%, after which returns diminish (likely overfitting due to the small quantity and low diversity of specialized data).
    • Conclusion: carefully constructed two-phase pre-training, with careful data selection and curation, is essential for optimizing LLM performance while maintaining scalability and robustness across downstream tasks. The paper is on arXiv.
  • Summary of pre-training data strategies:
    • Data effectiveness is not just about quantity, but also quality, ordering, and structure.
    • Project 1 shows global ordering has negligible effect at small scale; Project 2 shows phase-based training is efficient for large-scale learning.
    • Smart data decisions are critical for models to generalize across tasks.
    • Future LLM training needs smarter data organization that exploits structure, quality, and other properties to build more intelligent, efficient, and adaptable models.

Post-Training Strategies

  • Goal: adapt a pre-trained general-purpose model to specific tasks, settings, users, domains, etc.
  • Main strategies: fine-tuning (e.g., RLHF), prompt-based methods, RAG / retrieval-based methods, and so on.

Chain-of-Thought (CoT) and Its Extensions (Chelsea Zou)

  • Chain-of-Thought (CoT):
    • A prompting technique that guides the model to "think step by step."
    • Shows intermediate steps as guidance, similar to how humans decompose problems.
    • Offers an interpretable window into model behavior, suggesting the model's weights may hold more knowledge than direct-prompt responses reveal.
    • Example: contrasting a single-shot answer (wrong) with a CoT answer (correct).
  • Tree-of-Thoughts (ToT):
    • An extension of CoT that considers multiple reasoning paths rather than a single one.
    • Uses self-evaluation (e.g., majority voting) to decide the final output.
  • Program-of-Thought (PoT) / Program-aided Language models (PAL):
    • Generate code as the intermediate reasoning steps.
    • Hand problem solving to a code interpreter, formalizing language into programs for more precise answers.
  • Problem Decomposition: Socratic Questioning:
    • A self-questioning module has the LLM pose "sub-questions" related to the original question.
    • The original question is solved by recursively answering the sub-questions.
    • Example: "What fills a balloon?" -> "What makes a balloon float?"
  • Least-to-Most Prompting:
    • Decomposes a complex problem into a sequence of simpler sub-questions ordered from easy to hard, solved in turn. Each sub-answer informs the next sub-question, letting the model handle problems harder than the prompt exemplars and improving generalization.
  • Problem Decomposition: Computation Graphs:
    • Formulates compositional tasks as computation graphs, breaking reasoning into sub-procedures and nodes.
    • Transformers can solve compositional tasks by reducing reasoning to subgraph matching, without developing systematic problem-solving skills.
  • Self-Notes:
    • Introduces a self-notes mechanism that lets the LLM take scratch notes while processing input. Rather than reasoning only after seeing the whole problem (as in CoT), the model can pause while reading the context and write down intermediate thoughts. This on-the-fly reasoning gives the model a working memory, enabling better multi-step reasoning and state tracking.
  • Chain-of-Thought over Latent Space (Coconut):
    • The model reasons in its latent hidden-state space rather than in natural language. By feeding the model's hidden states back as input, it can explore multiple reasoning paths in parallel (a breadth-first search over latent space), improving logical-reasoning tasks that require backtracking while using fewer tokens.
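The basic CoT prompting pattern above can be sketched as a prompt-builder. The `build_cot_prompt` helper and the exemplar text are illustrative; in practice the resulting string would be sent to an LLM:

```python
# Sketch of few-shot chain-of-thought prompting: worked exemplars with
# visible intermediate steps, followed by the new question and a
# "think step by step" cue.

COT_PREFIX = "Q: {question}\nA: Let's think step by step.\n"

def build_cot_prompt(question, exemplars=()):
    """exemplars: (question, reasoning_steps, answer) triples shown first."""
    parts = [f"Q: {q}\nA: {steps} The answer is {a}.\n" for q, steps, a in exemplars]
    parts.append(COT_PREFIX.format(question=question))
    return "\n".join(parts)

exemplars = [
    ("If I have 3 apples and buy 2 more, how many do I have?",
     "Start with 3, add 2, and 3 + 2 = 5.", "5"),
]
prompt = build_cot_prompt("If I have 5 pens and give away 2, how many remain?", exemplars)
print(prompt)
```

Zero-shot CoT is the degenerate case with no exemplars, relying only on the "Let's think step by step" cue.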

Reinforcement Learning (RL) and Feedback Mechanisms (Steven Feng)

  • Reinforcement Learning with Human Feedback (RLHF):
    • Trains a "reward model" directly from human feedback: humans rank pairs of model-generated responses (which is better).
    • Uses that reward model as the reward function and optimizes the agent's policy with an algorithm such as PPO.
  • Direct Preference Optimization (DPO):
    • An improvement over classic RLHF that skips training a separate reward model and instead optimizes the language model directly on preference data, which is more efficient.
    • Fine-tunes the LLM by maximizing the likelihood of preferred completions and minimizing the likelihood of dispreferred ones.
  • Reinforcement Learning with AI Feedback (RLAIF):
    • Replaces expensive human preference labeling with preference judgments from a capable AI (an LLM).
    • Trains the reward model on these AI-generated preference labels and fine-tunes the policy with RL.
    • Human evaluators rate RLAIF-tuned outputs as comparable to RLHF outputs, suggesting a more scalable, cost-effective approach.
    • Drawback: effectiveness depends on the capability and accuracy of the LLM providing the judgments.
  • Group Relative Policy Optimization (GRPO):
    • A PPO variant used in models such as DeepSeek Math.
    • Ranks multiple responses within a group (rather than only pairwise), providing richer, finer-grained feedback.
    • Helps stabilize training and improves LLM reasoning (especially on tasks like math), more efficiently.
  • Kahneman-Tversky Optimization (KTO):
    • Modifies the standard loss function to account for human biases such as loss aversion.
    • Encourages the AI to avoid negative outcomes (minimizing catastrophic errors) rather than merely chasing positive ones, matching human risk aversion on certain tasks.
  • Personalizing RLHF with Variational Preference Learning:
    • Different populations may hold different preferences, which vanilla RLHF averages away.
    • Introduces a latent variable for a user's preference profile (e.g., age group) and conditions the reward model and policy on it.
    • Enables "pluralistic alignment," improving reward accuracy for specific subgroups so a single model can adapt to diverse preferences.
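The DPO objective described above reduces, for a single preference pair, to a logistic loss on the margin between the policy's and the frozen reference model's log-probability gaps. A minimal numeric sketch (the log-probabilities here are toy values, not real model outputs):

```python
# Sketch of the per-pair DPO loss:
#   L = -log sigmoid(beta * [(logpi(y_w) - logref(y_w)) - (logpi(y_l) - logref(y_l))])
# where y_w is the chosen (preferred) completion and y_l the rejected one.
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Positive margin (policy prefers the chosen completion more strongly than
# the reference does) -> loss below log(2); negative margin -> above log(2).
print(dpo_loss(-5.0, -9.0, -6.0, -7.0))
print(dpo_loss(-9.0, -5.0, -7.0, -6.0))
```

Minimizing this loss pushes probability mass toward preferred completions while `beta` and the reference terms keep the policy from drifting too far from the pretrained model.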

Self-Improving AI Agents (Chelsea Zou)

  • AI agent definition: a system that perceives its environment, makes decisions, and takes actions to achieve a specific goal (usually set by a human). Examples: games, task solving, research assistants.
  • AI agent components:
    • Goal-directed.
    • Autonomous decision-making.
    • Iterative action.
    • Memory and state tracking.
    • Tool use (e.g., API calls, function calling).
    • Ability to learn and adapt.
  • Self-improvement: a model reflects on its own outputs to improve itself iteratively.
    • Includes: reflecting on internal state, explaining its reasoning, evaluating its own output quality, and simulating multi-step reasoning chains.
  • Self-improvement techniques:
    • Refinement - Self-Refine: an iterative prompting method in which the LLM critiques and improves its own output.
      • Flow: generate an initial response -> assess weaknesses and inconsistencies -> refine the response based on the self-critique.
    • Refinement - Reflexion: the model learns from past mistakes and adjusts future responses based on prior failures; usually includes a long-term memory component.
      • Flow: the model detects an incorrect or weak response -> reflects on the error and generates an improved response -> accuracy and reasoning improve over multiple iterations.
    • Refinement - ReAct (Reasoning + Acting): combines reasoning with external actions (e.g., API calls, database retrieval). The model interacts dynamically with its environment and gets feedback from those interactions.
      • Flow: the model generates a reasoning plan -> calls an external tool (e.g., web search) -> integrates the retrieved data into its final response.
    • Refinement - Language Agent Tree Search (LATS):
      • Extends ReAct to multiple planning paths (analogous to how ToT extends CoT).
      • Collects feedback from each path to improve future searches (inspired by reinforcement learning).
      • Uses Monte Carlo Tree Search (MCTS) for optimized planning (nodes = states, edges = actions).
      • Flow: generate N "best" new action sequences -> execute them in parallel -> score each sequence (using self-reflection) -> continue exploring from the best state, updating the probabilities of past nodes.
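The generate -> critique -> refine loop of Self-Refine can be sketched as follows. All three inner functions stand in for LLM calls and are mocked here so the control flow itself is runnable:

```python
# Sketch of the Self-Refine loop; `generate`, `critique`, and `refine`
# are stand-ins for separate LLM prompts.

def generate(task):
    return "draft: " + task

def critique(response):
    # A real critic would be another LLM call; this mock asks for a
    # conclusion until the draft ends with one (marked "!").
    return None if response.endswith("!") else "add a conclusion"

def refine(response, feedback):
    return response + " (revised: " + feedback + ")!"

def self_refine(task, max_iters=3):
    response = generate(task)
    for _ in range(max_iters):
        feedback = critique(response)
        if feedback is None:   # critic is satisfied -> stop iterating
            break
        response = refine(response, feedback)
    return response

print(self_refine("summarize the lecture"))
```

Reflexion follows the same skeleton but persists the critiques into a long-term memory consulted on later tasks, and ReAct swaps the critique step for real tool calls.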

Applications of Transformers (Karan Singh)

Vision Transformers (ViT)

  • Core idea: split the image into small patches, linearly embed the patches, add position embeddings, and feed them into a standard Transformer encoder.
  • Applications: strong results on tasks such as image classification (with an MLP head on top).
  • Versus CNNs: given very large datasets (tens of millions of samples), Transformers outperform CNNs thanks to their weaker inductive biases (CNNs assume locality).
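The patchification step above can be sketched without any deep-learning framework; the linear embedding and position embeddings that follow in a real ViT are omitted here:

```python
# Sketch of ViT-style patchification: an H x W image is cut into
# non-overlapping P x P patches, each flattened into a vector that a
# linear projection would then embed.

def patchify(image, patch):
    """image: list of rows (each a list of pixel values).
    Returns the flattened patches in raster order."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    patches = []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            flat = [image[i + di][j + dj]
                    for di in range(patch) for dj in range(patch)]
            patches.append(flat)
    return patches

img = [[r * 4 + c for c in range(4)] for r in range(4)]  # toy 4x4 "image"
patches = patchify(img, 2)
print(len(patches), len(patches[0]))  # 4 patches, 4 values each
```

For a 224x224 image with 16x16 patches, this yields the 196-token sequence (plus a class token) that the standard ViT encoder consumes.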

CLIP (Contrastive Language-Image Pre-training)

  • The image encoder is often a ViT.
  • The foundation of vision-language models (VLMs) such as GPT-4o.
  • Trained with contrastive learning: the encoded representations of matched image-text pairs are aligned in a shared embedding space.
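The contrastive objective can be sketched on toy 2-D embeddings: matched pairs sit on the diagonal of the similarity matrix, and each row (image-to-text) and column (text-to-image) is scored with softmax cross-entropy. The embeddings and temperature below are illustrative:

```python
# Sketch of CLIP's symmetric contrastive loss on a tiny batch.
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def clip_loss(img_embs, txt_embs, temperature=0.07):
    imgs = [normalize(v) for v in img_embs]
    txts = [normalize(v) for v in txt_embs]
    # Cosine-similarity matrix scaled by temperature.
    sims = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
            for i in imgs]
    def xent(row, target):  # numerically stable softmax cross-entropy
        z = max(row)
        logsum = z + math.log(sum(math.exp(s - z) for s in row))
        return logsum - row[target]
    n = len(sims)
    img2txt = sum(xent(sims[k], k) for k in range(n)) / n
    txt2img = sum(xent([sims[r][k] for r in range(n)], k) for k in range(n)) / n
    return (img2txt + txt2img) / 2

aligned = clip_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
shuffled = clip_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
print(aligned < shuffled)  # aligned pairs give the lower loss
```

Training pulls each image toward its own caption and pushes it away from every other caption in the batch, which is what makes the shared embedding space useful downstream.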

ViT for Vision Language Models (VLMs)

  • Models such as GPT-4o concatenate the encoded image and text tokens.
  • Trained in stages so the model learns to respond while attending to both modalities.
  • Excellent performance on benchmarks and tasks such as answering questions about an image.

Transformers + Neuroscience (Karan Singh's work)

  • Functional Magnetic Resonance Imaging (fMRI):
    • Detects brain activity by measuring blood-oxygen levels in brain regions (a proxy via the BOLD signal).
    • Useful for diagnosing disease and understanding cognition. The data is extremely high-dimensional.
    • Processing: brain regions are typically averaged, or voxels grouped, to reduce computational complexity.
    • Traditional tool: linear pairwise correlation maps (effective for diagnosing conditions such as Parkinson's).
  • Brain functional networks: the brain is partitioned into functional networks (e.g., the visual network, default mode network, control network).
  • Early machine-learning approaches: regression or classification with conventional neural networks on linear correlation maps, or graph-based analysis.
  • Current computer-vision-style approach:
    • Feed raw fMRI data directly into a Transformer.
    • Pretraining objective (self-supervised learning): mask part of the brain-region time series and have the Transformer predict the masked portion. No paired labels required.
    • Downstream tasks: use the learned dense representations to predict patient attributes or disease risk, or analyze the model weights to understand brain networks.
  • Karan Singh's approach:
    • Parcellate brain activity, feed the unmasked portion into a Transformer, predict the masked portion, and compare against ground truth as the training objective.
    • Uses cross-attention: attention between two different sequences (as between the source and target languages in machine translation). The base architecture is a single cross-attention decoder, making the model small and easy to interpret.
    • Results:
      • The model predicts brain activity well for some networks (e.g., the Salience Network, involved in perception and decision-making, and the Default Mode Network, active during mind-wandering and information reorganization).
      • Attention-weight analysis reveals dependencies between brain networks (e.g., the Salience Network depends on the Default Mode and Control networks).
      • Prediction is weaker for the visual network and subcortical regions (e.g., memory-related areas).
    • Application: replacing part of the model with learnable Parkinson's-related tokens and fine-tuning yields disease prediction with accuracy close to 70%, far above traditional correlation-based methods.
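The masked-prediction objective above can be sketched on a single toy time series. The trivial last-value "model" below stands in for the cross-attention Transformer, and the signal values are invented:

```python
# Sketch of masked time-series pretraining: hide a slice of one region's
# BOLD signal, predict it, and score against the ground truth with MSE.

def mask_series(series, start, length):
    """Split into the visible context and the hidden target slice."""
    visible = series[:start] + series[start + length:]
    target = series[start:start + length]
    return visible, target

def predict_masked(visible_prefix, length):
    # Stand-in for the Transformer: repeat the last visible value.
    return [visible_prefix[-1]] * length

def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

series = [0.1, 0.2, 0.3, 0.3, 0.2, 0.1]     # toy BOLD signal for one region
visible, target = mask_series(series, 2, 2)  # hide timesteps 2-3
pred = predict_masked(series[:2], 2)
print(round(mse(pred, target), 4))
```

In the real setup the prediction is conditioned (via cross-attention) on the unmasked activity of all other parcels, which is why the attention weights reveal inter-network dependencies.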

Future Outlook and Challenges

Potential Applications and Missing Pieces (Jenny Duan)

  • Potential applications:
    • Generalist agents.
    • Longer video understanding and generation; finance and business applications.
    • Domain-specific foundation models (e.g., DoctorGPT, LawyerGPT).
    • Real-world impact: personalized education and tutoring, advanced medical diagnosis, environmental monitoring and protection, real-time multilingual communication, interactive entertainment and gaming (e.g., NPCs).
  • Missing pieces (obstacles to AGI / current limitations):
    • Reducing computational complexity.
    • Enhancing human controllability.
    • Alignment with the human brain's language model.
    • Adaptive learning and generalization across domains.
    • Multisensory, multimodal embodiment (e.g., intuitive physics and common sense).
    • Infinite/external memory (e.g., Neural Turing Machines).
    • Unbounded self-improvement (i.e., continual or lifelong learning, a core feature of human learning that machines cannot yet replicate).
    • Full autonomy (with its own curiosity, desires, and goals) and long-horizon decision-making.
    • Emotional intelligence and social understanding.
    • Ethical reasoning and value alignment.

Core Challenges and Future Directions (Steven Feng)

  • Efficiency: minified LLMs and on-device LLMs:
    • A major trend for everyday LLM use: models should run quickly and easily on small devices such as phones and smartwatches.
    • Current small open-source models (e.g., DeepSeek, LLaMA, Mistral) are still relatively large and costly to fine-tune.
    • Future goal: fine-tune and run models entirely on local devices.
  • Interpretability of LLMs:
    • LLMs are hard-to-understand "black boxes" because of their enormous parameter counts and training data.
    • Better interpretability helps us improve models, control them more easily, and achieve better alignment/safety (e.g., preventing unsafe or unethical outputs).
    • Mechanistic interpretability: understanding how individual components and operations (down to the single-node level) contribute to the model's overall decision process, aiming to open the black box.
  • Limits of scaling?:
    • Simply scaling model size and data appears to be hitting diminishing returns; pretraining performance may be saturating.
    • Hence the shift of attention toward post-training methods, though these are bounded by the base model's overall capability. Pretraining remains the foundation.
    • Excessive post-training can cause "catastrophic forgetting," where the model loses knowledge acquired during pretraining.
    • Paths beyond the scaling laws:
      • New architectures (e.g., Mamba state-space models, even non-Transformer architectures).
      • Higher-quality data and smart data-organization strategies.
      • Improved training procedures, algorithms, loss functions, and optimizers.
      • Bringing advanced capabilities to smaller models.
      • Stronger theory and interpretability research, plus work inspired by cognitive science/neuroscience.
      • Next step: models that are not just bigger, but smarter and more adaptive.

Continual & Lifelong Learning (Steven Feng)

  • Core question: after deployment (post-pretraining), can an AI system keep improving by learning from implicit feedback and real-world experience, achieving unbounded, permanent, foundational self-improvement? This is not merely RAG or retrieval; it means updating the model's "brain," i.e., its weights.
  • Contrast with human learning: humans learn from interactions every day, whereas models typically stop learning once pretraining is frozen (fine-tuning aside, and that differs from how humans learn). Current models do not update their weights at inference time, which is wasteful.
  • Importance: extremely challenging, but likely one of the keys to AGI or truly human-like AI systems.
  • Current state of research:
    • Some work (e.g., fine-tuning small models on trajectories from better models, or model distillation) is closer to retraining than to true "continual learning."
    • Core mechanisms to explore: gradient updates? memory architectures? meta-learning?
  • Model editing for continual learning?:
    • Related to mechanistic interpretability: when a new fact or data point arrives, update specific nodes or neurons rather than the whole model.
    • Rank-One Model Editing (ROME): uses causal interventions to trace the neuron activations critical to the model's factual predictions, then updates them accordingly.
    • Weaknesses:
      • Works mainly for simple knowledge-based facts; hard to update actual skills or capabilities (e.g., math, logical reasoning).
      • Hard to propagate changes to related or dependent facts (e.g., updating information about someone's mother should also update information about their siblings, but such methods usually update only the original target).
  • Other continual-learning work (briefly):
    • REMIX: mixes in general data during updates to mitigate forgetting when integrating new facts into an LLM, enabling continual factual updates without a replay buffer.
    • MEMIT (Mass-Editing Memory in a Transformer): efficient batch editing of factual knowledge in Transformers, modifying thousands of (possibly interrelated) model "memories" in one step without retraining.
    • CEM (Continue Evolving from Mistakes): identifies LLM errors, retrieves corrective data, and incrementally updates the model for self-improvement while avoiding forgetting.
    • Lifelong-MoE (Lifelong Mixture-of-Experts): a growing mixture-of-experts architecture that adds new experts for new domains or time periods and freezes past experts to avoid forgetting.
    • CLOB: continual task learning using only prompts (no weight updates), summarizing past knowledge into a compressed prompt memory. (Steven Feng's comment: this does not truly update the model's "brain" or base capabilities.)
    • Progressive Prompts: learns a soft prompt vector per task and progressively concatenates them, letting the LLM learn continually without weight updates or forgetting. (Steven Feng's comment: same caveat; in his view, continual learning should involve updating model weights.)
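The freeze-old-experts idea behind Lifelong-MoE can be sketched as a registry that only ever grows; the class, domain keys, and lambda "experts" below are toy stand-ins, not the paper's architecture:

```python
# Toy sketch of the Lifelong-MoE principle: add a new expert per
# domain/time period; existing experts are frozen, so prior
# capabilities cannot be overwritten (no catastrophic forgetting).

class LifelongMoE:
    def __init__(self):
        self.experts = {}  # domain -> expert function, frozen once added

    def add_expert(self, domain, fn):
        if domain in self.experts:
            raise ValueError("existing experts are frozen; add a new domain instead")
        self.experts[domain] = fn

    def route(self, domain, x):
        # Real systems learn the router; here routing is by explicit key.
        return self.experts[domain](x)

moe = LifelongMoE()
moe.add_expert("news_2023", lambda x: x + " [2023 expert]")
moe.add_expert("news_2024", lambda x: x + " [2024 expert]")  # 2023 expert untouched
print(moe.route("news_2023", "query"))
```

Growth in parameters is the price paid for this isolation, which is why the other methods above (REMIX, MEMIT, CEM) instead try to edit shared weights safely.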

Wrap-Up and Logistics (Steven Feng)

  • Lecture recap: a brief overview of how Transformers work; pretraining (the importance of data); various post-training techniques (feedback mechanisms, chain-of-thought, self-improvement); applications in neuroscience, vision, and other areas; and current weaknesses (e.g., lack of continual learning, data efficiency, model miniaturization).
  • Upcoming schedule: each week, an invited speaker from industry or academia will present their cutting-edge research.
  • Communication channels: course updates will be posted via the website, mailing list, and Discord.

Speakers' Stated Viewpoints

  • Steven Feng:
    • Emphasizes that data quality, structure, and smart usage strategies are critical to pretraining.
    • Believes human brain learning differs fundamentally from current LLMs in data efficiency and learning mechanism.
    • Defines "true" continual learning as involving model weight updates, not merely prompt-based or external-memory adaptation.
    • Argues the future of AI lies in models that are smarter and more adaptive, not merely bigger.
  • Karan Singh:
    • Emphasizes the advantages of contextual word vectors over static word vectors.
    • Demonstrates the strong application potential of Transformers in his research areas (vision, neuroscience/fMRI analysis).
  • Chelsea Zou:
    • Focuses on methods that extend CoT and on frameworks for self-improving AI agents.
  • Jenny Duan:
    • Takes a broader view of Transformers' application prospects and of the ethical, social, and technical challenges on the road to AGI.
