Stanford CS25: V2 I Introduction to Transformers w/ Andrej Karpathy
This transcript provides an overview of Stanford's CS25 course, "Transformers United V2." The course focuses on Transformers, the deep learning model that has revolutionized natural language processing (NLP) since its introduction in 2017 and has since been applied broadly to computer vision, reinforcement learning, biology, and other fields. The course aims to explain how Transformers work, the different types of Transformer models, and their applications, with invited talks from experts in the field.
The course instructors introduce themselves and lay out the course goals: helping students understand how Transformers work, where they are applied, and current research directions.
The lecture then reviews the history of Transformers: the 2017 paper "Attention Is All You Need" that marked the beginning of the Transformer era, the rapid adoption in NLP, the expansion into computer vision, bioinformatics, and other areas during 2018-2020, and the explosion of generative models (GPT, DALL-E, ChatGPT) from 2021 onward. The instructors emphasize Transformers' advantages in handling long sequences and understanding context, surpassing earlier RNN and LSTM models.
Today, Transformers show strong capabilities in zero-shot generalization, multimodal tasks, audio and art generation, code generation, and early forms of logical reasoning, and reinforcement learning from human feedback (RLHF) has improved their interaction with and alignment to humans.
Looking ahead, Transformers are expected to make breakthroughs in video understanding and generation, finance, and business, and may even be used for creative writing. Directions include general-purpose agents, domain-specific models (such as a medical GPT or a legal GPT), and mixture-of-experts systems. Realizing this vision, however, faces several challenges:
1. Memory and interaction: current models lack long-term memory and the ability to learn continuously.
2. Computational complexity: the quadratic cost of the attention mechanism needs to be reduced.
3. Controllability: finer control over model outputs is needed.
4. Alignment with human cognition: more research is needed on making these models work more like the human brain.
Finally, one instructor (likely Andrej Karpathy) briefly reviews the history of the AI field to motivate why a course on Transformers exists.
Tags
Media Details
- Upload Date
- 2025-05-16 20:54
- Source
- https://www.youtube.com/watch?v=XfpMkf4rD6E
- Processing Status
- Completed
- Transcription Status
- Completed
- Latest LLM Model
- gemini-2.5-pro-preview-06-05
Transcript
speaker 1: Hi everyone, welcome to CS25 Transformers United V2. This is a course that was held at Stanford in winter 2023. This course is not about robots that can transform into cars, as this picture might suggest. Rather, it's about deep learning models that have taken the world by storm and have revolutionized the field of AI and beyond, starting from natural language processing. Transformers have been applied all over, from computer vision to reinforcement learning, biology, robotics, and so on. We have an exciting set of videos lined up for you, with some truly fascinating speakers giving talks on how they're applying transformers to their research in different fields and areas. We hope you'll enjoy and learn from these videos. So without any further ado, let's get started. This is a purely introductory lecture, and we'll go into the building blocks of transformers. So first, let's start with introducing . speaker 2: the instructors. speaker 1: So for me, I'm currently on a temporary leave from the PhD program. I'm leading AI at a robotics startup, Collaborative Robotics, working on some general-purpose robots. I'm very passionate about robotics and building fantastic learning algorithms. My research started in reinforcement learning and reward modeling, and I have a bunch of publications in robotics, autonomous driving, and other areas. speaker 3: So I'm Steven, currently a first-year CS PhD student. I did my master's at CMU, and my undergrad before that. I'm mainly into NLP research, anything involving language and text, but more recently I've been getting more into computer vision as well as multimodal work. Some stuff I do for fun: a lot of music, mainly piano; some self-promo, I post a lot on my Instagram, YouTube, and TikTok, so if you guys want to check it out. My friends and I are also starting a Stanford piano club, so if anybody is interested, feel free to email for details. Other than that: martial arts, bodybuilding, a huge fan of dramas and anime, and an occasional gamer. Okay, cool. Yeah. So my name is Ryan. Since we're talking about ourselves, I just want to very briefly say that I'm super excited to teach this class. I think the last time it was offered, I had a bunch of fun. I thought we brought in a really great group of speakers last time. I'm super excited for this offering. And yeah, I'm thankful that you're all here, and I'm looking forward to a really . speaker 1: fun quarter. Thank you. Yeah, a fun fact: I was the most outspoken student last year, so if someone wants to become an instructor next year, you know what to do. speaker 2: Okay, cool. speaker 1: So what we hope you will learn in this class is, first of all, how do transformers work? Second, how are they applied beyond NLP? Nowadays we are pretty much using them everywhere in AI and machine learning. And third, what are some new directions of research on these topics? speaker 3: Cool. So this class . speaker 1: is introductory. We will just be talking about the basics of transformers, introducing them, and talking about the self-attention mechanism on which they are founded, and we will do a deeper dive into models like BERT and GPT. Okay, so let me start by presenting the attention timeline. Attention all started with this one paper, "Attention Is All You Need" by Vaswani et al. in 2017. That was the beginning of transformers.
Before that, we had the prehistoric era, where we had models like RNNs and LSTMs and simpler attention mechanisms that didn't work at scale. Then, starting in 2017, we saw this explosion of transformers into NLP, where people started using them for everything. I even heard this quote from Google, something like "our performance increases every time we fire a linguist." From 2018 to 2020, we saw this explosion of transformers into other fields, like vision, question answering, a bunch of other stuff, and even biology, with AlphaFold. And last year, 2021, was the start of the generative era, where a lot of generative modeling started, with models like Codex, GPT, DALL-E, stable diffusion; a lot of things happening in generative modeling. And we started scaling up. And now the present: this is 2022 going into 2023, and now we have ChatGPT, Whisper, and a bunch of others, and we are scaling onwards without stopping. So that's great. So that's the future. Going more into this: before, we had sequence-to-sequence models, LSTMs, GRUs. What worked here was that they were good at encoding history, but what did not work was handling long sequences, and they were bad at encoding context. So consider this example: speaker 2: Consider trying to . speaker 1: predict the last word in the text "I grew up in France ... I speak fluent ___." Here you need to understand the context for it to predict "French," and the attention mechanism is very good at that, whereas if you're just using LSTMs, it doesn't work as well. Another thing transformers are good at, based on content and context, is attention maps: if I have a word like "it," what noun does it refer to? And we can give a probability distribution of attention over the possible words, and this works much better than the earlier mechanisms. So where were we in 2021? We were on the verge of takeoff. We were starting to realize the potential of transformers in different fields. We solved a lot of long-sequence problems, like protein folding with AlphaFold, and offline RL. We started to see few-shot and zero-shot generalization. We saw multimodal tasks and applications, like generating images from language. speaker 3: Yeah, and this is where we were . speaker 1: going from 2021 to 2022: we have gone from being on the verge of taking off to actually taking off. And now we are seeing unique applications in audio generation, art, and music. We are starting to see reasoning capabilities: common sense, logical reasoning, mathematical reasoning. We are also now able to get human alignment and interaction; we're able to use reinforcement learning with human feedback, and that's how ChatGPT is trained to perform really well. We have a lot of mechanisms now for controlling toxicity, bias, and ethics, and also a lot of developments in other areas, like diffusion models. Cool. So the future is a spaceship, and we are all excited about it. There are a lot more applications that we can enable, and it would be great if transformers also work there. One big example is video understanding and generation; that is something that everyone is interested in, and I'm hoping we'll see a lot of models in this area this year. Also finance and business. I'd be very excited to see GPT author a novel. But we need to solve very long sequence modeling.
And most transformer models are still limited to something like 4,000 tokens, so we need to make them generalize much better to long sequences. We also want to have generalized agents that can do a lot of multi-task, multi-input predictions, like Gato, and I think we will see more of that too. And finally, we also want domain-specific models. So you might want a GPT model that's good at medicine, like a doctor-GPT model; you might have a law-GPT model that's trained only on law data. Currently we have GPT models that are trained on everything, but we might start to see more niche models that are good at one task. And we could have a mixture of experts, the way you normally consult an expert in society: you'll have expert AI models, and you can go to different AI models for your different needs. There are still a lot of missing ingredients to make this all successful. The first is external memory. We are already starting to see this with models like ChatGPT, where the interactions are short-lived: there's no long-term memory, and they don't have the ability to remember or store conversations long term, and this is something we want to fix. Second is reducing the computational complexity. The attention mechanism is quadratic in the sequence length, which is slow, and we want to reduce that, ideally to something closer to linear. Another thing we want to do is enhance the controllability of these models: a lot of these models can be stochastic, and we want to be able to control what sort of outputs we get. You might have experienced with ChatGPT that if you refresh, you get a different output each time, but you might want mechanisms to control what sort of things you get. And finally, we want to align our state-of-the-art language models with how the human brain works, and we are seeing research on this, but we still need more work on how they can be made more similar. Okay, thank you. speaker 3: Great. Hi, yes, I'm excited to be here. I live very nearby, so I got invited to come to class and I was like, okay, I'll just walk over. But then I spent like ten hours on these slides, so it wasn't as simple. So yeah, I want to talk about transformers. I'm going to skip the first two items over there; we're not going to talk about those. We'll talk about that one, just to simplify the lecture, since we don't have time. Okay. So I wanted to provide a little bit of context on why this transformers class even exists, so a little bit of historical context. I feel like Bilbo over there, telling you guys about the old days; I don't know if you guys have seen Lord of the Rings. Basically, I joined AI in roughly 2012, in full force, so maybe a decade ago. And back then, you wouldn't even say that you joined "AI," by the way; that was like a dirty word. Now it's okay to talk about. Back then it was not even "deep learning," it was "machine learning"; that was the term you used if you were serious. But now AI is okay to use, I think. So basically, do you even realize how lucky you are, potentially entering this area in roughly 2023? Back then, in 2011 or so, when I was working specifically on computer vision, your pipelines looked like this. You wanted to classify some images; you would go to a paper, and I think this is representative: you would have three pages in the paper describing all kinds of a zoo, a kitchen sink, of different kinds of features and descriptors.
And you would go to a poster session at a computer vision conference, and everyone would have their favorite feature descriptors that they were proposing, and it was totally ridiculous. And you would take notes on which ones you should incorporate into your pipeline, because you would extract all of them and then you would put an SVM on top. So that's what you would do. So there's two pages: make sure you get your sparse SIFT histograms, your SSIMs, your color histograms, textons, tiny images, and don't forget the geometry-specific histograms. All of them basically had complicated code of their own, so you're collecting code from everywhere and running it, and it was a total nightmare. On top of that, it also didn't work. This, I think, would be a representative prediction from that time: you would just get predictions like this once in a while, and you'd just shrug your shoulders, like, that just happens once in a while. Today you would be looking for a bug. And worse than that, every single field, every single chunk of AI, had its own completely separate vocabulary that it worked with. So if you go to NLP papers, those papers would be completely different. You're reading an NLP paper and you're like, what is this part-of-speech tagging, morphological analysis, syntactic parsing, coreference resolution? What is all this notation? The vocabulary and everything was completely different, and you couldn't read papers, I would say, across different areas. Now, that changed a little bit starting in 2012, when Krizhevsky and colleagues basically demonstrated that if you scale a large neural network on a large dataset, you can get very strong performance. Up until then, there had been a lot of focus on algorithms, but this showed that neural nets actually scale very well, so you now need to worry about compute and data, and if you scale it up, it works pretty well. And then that recipe did copy-paste across many areas of AI, so we started to see neural networks pop up everywhere since 2012: we saw them in computer vision and NLP and speech and translation and RL and so on. Everyone started to use the same kind of deep learning modeling framework. And now when you go to NLP and you start reading papers there, in machine translation for example (this is the sequence-to-sequence paper, which we'll come back to in a bit), you start to read those papers and you're like, okay, I can recognize these words: there's a neural network, there's some parameters, there's an optimizer, and it starts to read like things that you know of. So that decreased tremendously the barrier to entry across the different areas. And then I think the big deal is that when the transformer came out in 2017, it's not even just that the toolkits and the neural networks were similar; it's that literally the architectures converged to one architecture that you copy-paste across everything, seemingly. So this was kind of an unassuming machine translation paper at the time, proposing the transformer architecture, but what we found since then is that you can just basically copy-paste this architecture and use it everywhere, and what changes are the details of the data, the chunking of the data, and how you feed it in. And you know, that's a caricature, but it's kind of a correct first-order statement. And so now papers are even more similar-looking, because everyone is just using the transformer.
And so this convergence was remarkable to watch as it unfolded over the last decade, and it's crazy to me. What I find kind of interesting is that I think this is some kind of a hint that we're maybe converging to something that maybe the brain is doing, because the brain is very homogeneous and uniform across the entire sheet of your cortex. Okay, maybe some of the details are changing, but those feel like hyperparameters of a transformer; your auditory cortex and your visual cortex and everything else look very similar. And so maybe we're converging to some kind of a uniform, powerful learning algorithm here. Something like that, I think, is kind of interesting and exciting. Okay, so I want to talk about where the transformer came from, briefly, historically. I want to start in 2003. I like this paper quite a bit. It was the first sort of popular application of neural networks to the problem of language modeling: predicting, in this case, the next word in a sequence, which allows you to build generative models over text. And in this case, they were using a multilayer perceptron, so a very simple neural net: the neural net took three words and predicted the probability distribution for the fourth word in the sequence. So this was well and good at that point. Now, over time, people started to apply this to machine translation, and that brings us to the sequence-to-sequence paper from 2014, which was pretty influential. And the big problem here was: okay, we don't just want to take three words and predict the fourth; we want to predict how to go from an English sentence to a French sentence. And the key problem was that you can have an arbitrary number of words in English and an arbitrary number of words in French, so how do you get an architecture that can process this variably sized input? So here they used an LSTM, and there are basically two chunks of it, which are covered by this slide: you have an encoder LSTM on the left, which just consumes one word at a time and builds up a context of what it has read, and then that acts as a conditioning vector for the decoder RNN or LSTM, which basically goes chunk, chunk, chunk, producing the next word in the sequence, translating the English to French, or something like that. Now, the big problem with this, which people identified very quickly and tried to resolve, is what's called the encoder bottleneck. This entire English sentence that we are trying to condition on is packed into a single vector that goes from the encoder to the decoder, and that is just too much information to try to maintain in a single vector, and that didn't seem correct. So people were looking for ways to alleviate the encoder bottleneck, as it was called at the time. And that brings us to this paper, "Neural Machine Translation by Jointly Learning to Align and Translate." Quoting from the abstract: "we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly." So this was a way to look back at the words coming from the encoder, and it was achieved using this soft search.
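What that soft search amounts to, written compactly, is the attention weighting below. This is a hedged reconstruction in the notation of the Bahdanau et al. paper; the symbols (decoder state $s_{t-1}$, encoder hidden states $h_j$, learned compatibility function $a$) are my labeling, not taken from the transcript.

$$
e_{tj} = a(s_{t-1}, h_j), \qquad
\alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k}\exp(e_{tk})}, \qquad
c_t = \sum_{j} \alpha_{tj}\, h_j
$$

The context vector $c_t$ is a weighted sum of the encoder hidden states, with softmax weights given by the compatibility scores.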
speaker 3: So as you are decoding the words here, you are allowed to look back at the words from the encoder via this soft attention mechanism proposed in this paper. And this paper, I think, is the first time that I saw basically attention: your context vector that comes from the encoder is a weighted sum of the hidden states of the words in the encoding, and the weights of this sum come from a softmax that is based on these compatibilities between the current state, as you're decoding, and the hidden states generated by the encoder. And so this is the first place where you really start to see the current, modern equations of attention, and I think this was the first paper I saw it in, and the first time the word "attention" was used, as far as I know, to describe this mechanism. So I actually tried to dig into the details of the history of attention. The first author here, Dzmitry, I had an email correspondence with him, and I basically sent him an email like, "Dzmitry, this is really interesting; transformers have taken over. Where did you come up with the soft attention mechanism that ends up being the heart of the transformer?" And to my surprise, he wrote me back this massive email, which was really fascinating. This is an excerpt from that email. Basically, he talks about how he was looking for a way to avoid this bottleneck between the encoder and the decoder. He had some ideas about cursors that traverse the sequences, which didn't quite work out. And then: "one day I had this thought that it would be nice to enable the decoder RNN to learn to search where to put the cursor in the source sequence. This was sort of inspired by translation exercises that learning English in my middle school involved: your gaze shifts back and forth between the source and target sequence as you translate." So literally, and I thought this was kind of interesting, he's not a native English speaker, and here that gave him an edge in machine translation, which led to attention, and then led to the transformer. That's really fascinating. "I expressed the soft search as a softmax and then a weighted averaging of the BiRNN states. And to my great excitement, this worked from the very first try." So really, I think, an interesting piece of history. And as it later turned out, the name "RNNSearch" was kind of lame, so the better name, "attention," came from Yoshua Bengio in one of the final passes as they went over the paper. So maybe "Attention Is All You Need" would have been called "RNNSearch is all you need," but we have Yoshua Bengio to thank for a somewhat better name, I would say. So apparently that's the history of it. That was interesting. Okay, so that brings us to 2017, which is "Attention Is All You Need." This attention component, which in Dzmitry's paper was just one small segment, with all this bidirectional RNN, encoder, and decoder around it, and this paper is saying: okay, you can actually delete everything. What's making this work very well is just the attention by itself. So delete everything, keep attention. And what's remarkable about this paper is that usually you see papers that are very incremental: they add one thing and they show that it's better. But I feel like "Attention Is All You Need" was a mix of multiple things at the same time, combined in a very unique way, and it also achieved a very good local minimum in the architecture space.
And so, to me, this is really a landmark paper that is quite remarkable, and I think it has quite a lot of work behind the scenes. So: delete all the RNNs, just keep attention. Because attention operates over sets (and I'm going to go into this in a second), you now need to positionally encode your inputs, because attention by itself doesn't have a notion of space, so you have to be careful about that. They adopted the residual network structure from ResNets. They interspersed attention with multilayer perceptrons. They used layer norms, which came from a different paper. They introduced the concept of multiple heads of attention applied in parallel. And they gave us, I think, a fairly good set of hyperparameters that are used to this day: the expansion factor in the multilayer perceptron goes up by 4x (we'll go into a bit more detail), and that 4x has stuck around. And I believe there are a number of papers that try to play with all kinds of little details of the transformer, and nothing really sticks, because this is actually quite good. The only change, to my knowledge, that did stick is the reshuffling of the layer norms to go into the pre-norm version: here you see the layer norms are after the multi-headed attention; they just put them before instead. So just a reshuffling of layer norms. But otherwise, the GPTs and everything else that you're seeing today are basically the 2017 architecture from five years ago, and even though everyone is working on it, it's proven remarkably resilient, which I think is really interesting. There are innovations that have been adopted, also in positional encodings: it's more common now to use rotary and relative positional encodings and so on. So there have been changes, but for the most part it's proven very resilient. So really quite an interesting paper. Now I wanted to go into the attention mechanism, and the way I interpret it is not similar to the way I've seen it presented before, so let me try a different way of explaining how I see it. Basically, to me, attention is kind of like the communication phase of the transformer, and the transformer interleaves two phases: the communication phase, which is the multi-headed attention, and the computation phase, which is this multilayer perceptron, or MLP. In the communication phase, it's really just a data-dependent message passing on directed graphs. And you can think of it as: okay, forget everything about machine translation. We just have directed graphs, and at each node you are storing a vector. And then let me talk about the communication phase, how these vectors talk to each other in this directed graph; the compute phase later is just the multilayer perceptron, which then basically acts on every node individually. But how do these nodes talk to each other in this directed graph? So I wrote some simple Python, basically, to express one round of communication using attention as the message-passing scheme. So here, a node has this private data vector, which you can think of as private information to this node, and it can also emit a key, a query, and a value, and that's simply done by a linear transformation from this node. So the query is: what are the things that I am looking for. The key is: what are the things that I have. And the value is: what are the things that I will communicate.
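Below is a minimal sketch of that single round of communication, in the spirit of the simple Python being described; it is not the lecture's code, and the class and function names are my own.

```python
import numpy as np

class Node:
    """A node in the directed graph, holding a private data vector."""
    def __init__(self, dim):
        self.data = np.random.randn(dim)       # private information stored at this node
        # linear maps that turn the private data into a key, query, and value
        self.wk = np.random.randn(dim, dim)
        self.wq = np.random.randn(dim, dim)
        self.wv = np.random.randn(dim, dim)

    def key(self):   return self.wk @ self.data   # "what do I have"
    def query(self): return self.wq @ self.data   # "what am I looking for"
    def value(self): return self.wv @ self.data   # "what will I communicate"

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_round(nodes, edges):
    """One round of message passing. edges[i] lists the nodes that point at node i."""
    updates = []
    for i, node in enumerate(nodes):
        q = node.query()
        inputs = [nodes[j] for j in edges[i]]
        keys = np.stack([m.key() for m in inputs])
        values = np.stack([m.value() for m in inputs])
        scores = softmax(keys @ q)        # affinities between my query and their keys
        updates.append(scores @ values)   # weighted sum of their values flows to me
    for node, u in zip(nodes, updates):   # every node is updated at the end
        node.data = u

# toy usage: three nodes with left-to-right (causal) connectivity
nodes = [Node(4) for _ in range(3)]
edges = {0: [0], 1: [0, 1], 2: [0, 1, 2]}
attention_round(nodes, edges)
```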
speaker 3: And so then, when you have your graph made up of nodes and some random edges, when you actually have these nodes communicating, what's happening is: you loop over all the nodes individually, in some random order, and you are at some node, and you get the query vector q, which is "I'm a node in some graph, and this is what I'm looking for"; that's just achieved via this linear transformation here. And then we look at all the inputs that point to this node, and they broadcast what the things they have are, which is their keys. So they broadcast the keys, I have the query; those interact by dot product to get scores. So basically, simply by doing dot products, you get some kind of an unnormalized weighting of the interestingness of all the information in the nodes that point to me, relative to the things I'm looking for. And then when you normalize that with a softmax, so it just sums to one, you end up using those scores, which now sum to one and form a probability distribution, and you do a weighted sum of the values to get your update. So: I have a query, they have keys, dot products give interestingness, or affinity, a softmax normalizes it, and then a weighted sum of those values flows to me and updates me. This is happening for each node individually, and then we update all of them at the end. And so this kind of message-passing scheme is at the heart of the transformer, and it happens in a more vectorized, batched way that is more confusing, and it's also interspersed with layer norms and things like that to make the training behave better. But that's roughly what's happening in the attention mechanism, I think, on a high level. So yeah. speaker 2: So in the communication . speaker 3: phase of the transformer, this message-passing scheme happens in every head in parallel, and then in every layer in series, with different weights each time. And that's it, as far as the multi-headed attention goes. And so if you look at these encoder-decoder models, you can think of them in terms of the connectivity of the nodes in the graph. You can think of it as: all these tokens that are in the encoder, which we want to condition on, are fully connected to each other, so when they communicate, they communicate fully when you calculate their features. But in the decoder, because we are trying to have a language model, we don't want to have communication from future tokens, because they give away the answer at this step. So the tokens in the decoder are fully connected from all the encoder states, and then they are also fully connected from everything that comes before them. So you end up with this triangular structure in the directed graph. But that's the message-passing scheme that this basically implements. And then you have to also be a little bit careful, because in the cross-attention here, with the decoder, you consume the features from the top of the encoder. So think of it as: in the encoder, all the nodes are looking at each other, all the tokens are looking at each other many, many times, and they really figure out what's in there, and then the decoder only looks at those top nodes. So that's roughly the message-passing scheme. I was going to go into more of an implementation of a transformer; I don't know if there are any questions about this. speaker 2: Could you explain a little bit about self-attention and multi-headed attention; what is the advantage? speaker 3: Yeah.
So, self-attention and multi-headed attention. The multi-headed attention is just this attention scheme, but applied multiple times in parallel; multiple heads just means independent applications of the same attention. So this message-passing scheme basically just happens in parallel multiple times, with different weights for the query, key, and value. You can look at it as: in parallel, I'm seeking different kinds of information from different nodes, and I'm collecting it all in the same node. It's all done in parallel. So heads are really just copy-paste in parallel, and layers are copy-paste in series; maybe that makes sense. And self-attention: when it's self-attention, what that refers to is where each node produces its key, query, and value. As I described it here, this is really self-attention, because every one of these nodes produces a key, a query, and a value from that individual node. When you have cross-attention (there's one cross-attention here, coming from the encoder), that just means that the queries are still produced from this node, but the keys and the values are produced as a function of the nodes coming from the encoder. So I have my queries, because I'm trying to decode, say, the fifth word in the sequence, and I'm looking for certain things because I'm the fifth word; and then the keys and the values, the sources of information that could answer my queries, can come from the previous nodes in the current decoding sequence, or from the top of the encoder. So all the nodes that have already seen all of the encoder tokens many, many times can now broadcast what they contain in terms of information. So, to summarize, cross-attention and self-attention only differ in where the keys and the values come from: either the keys and values are produced from the node itself, or they are produced from some external source, like an encoder and the nodes over there. But algorithmically, it's the same set of operations. speaker 2: Okay, so two questions. The first question is about the message-passing paradigm. speaker 3: So, yeah. speaker 2: So think . speaker 3: of it as: each one of these nodes is a token. I guess I don't have a very good picture of it in the transformer, but this node here could represent the third word of the output in the decoder, and in the beginning it is just the embedding of the word. And then, okay, I have to . speaker 2: think through this analogy . speaker 3: a little more; I came up with it this morning, actually, . speaker 2: or yesterday. So these nodes are basically the vectors? speaker 3: I'll go to the implementation, and then maybe I'll make the connections to the graph. So let me now go, with this intuition in mind, to nanoGPT, which is a concrete implementation of a transformer that is very minimal. I worked on this over the last few days, and here it is reproducing GPT-2 on OpenWebText. So it's a pretty serious implementation that reproduces GPT-2, I would say, provided the compute; this was one node of 8 GPUs for 38 hours, or something like that. And it's very readable, at about 300 lines, so everyone can take a look at it. And yeah, let me briefly step through it. So we're going to have a decoder-only transformer. What that means is that it's a language model.
It tries to model the next word in the sequence, or the next character in the sequence. So the data that we train on is always some kind of text. So here's some fake Shakespeare; well, this is real Shakespeare, and we're going to produce fake Shakespeare. This is called the tiny Shakespeare dataset, which is one of my favorite toy datasets: you take all of Shakespeare, concatenate it, and it's a one-megabyte file, and then you can train language models on it and get infinite Shakespeare, if you like. So we have a text. The first thing we need to do is convert it to a sequence of integers, because transformers natively process, you know, you can't plug text into a transformer; it needs to be encoded somehow. The way that encoding is done, in the simplest case, is that every character gets an integer, and then instead of "hi there" we would have this sequence of integers. So you can encode every single character as an integer and get a massive sequence of integers; you just concatenate it all into one large, long, one-dimensional sequence, and then you can train on it. Now, here we only have a single document. In some cases, if you have multiple independent documents, what people like to do is create special tokens, and they intersperse those documents with these special end-of-text tokens that they splice in between to create boundaries. But those boundaries actually don't have any modeling impact; it's just that the transformer is supposed to learn, via backpropagation, that an end-of-document token means it should wipe its memory. Okay, so then we produce batches. These batches of data just mean that we go to the one-dimensional sequence and we take out chunks of this sequence. So say the block size is eight: the block size indicates the maximum context length that your transformer will process. If our block size is eight, that means we are going to have up to eight characters of context to predict the ninth character in the sequence. And the batch size indicates how many sequences we're going to process in parallel, and we want this to be as large as possible so we're fully taking advantage of the GPU and the parallelism it affords. So in this example, we're doing four-by-eight batches. Every row here is an independent example, sort of, and every row is a small chunk of the sequence that we're going to train on, and then we have both the inputs and the targets at every single point here. So, to fully spell out what's contained in a single four-by-eight batch for the transformer, I've compacted it here: when the input is 47 by itself, the target is 58; when the input is the sequence 47, 58, the target is 1; when it's 47, 58, 1, the target is 51; and so on. So actually, this single batch of examples, which is four by eight, has a ton of individual examples that we are expecting the transformer to learn from in parallel. And so you'll see that the batch rows are learned from completely independently, but the time dimension, along the horizontal, is also trained on in parallel. So your real batch size is more like B times T; it's just that the context grows linearly for the predictions that you make along the time direction in the model. So these are all the examples that the model will learn from in this single batch. So now, this is the GPT class. And because this is a decoder-only model, we're not going to have an encoder, because there's no English we're translating from.
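Before getting to the GPT class, here is a minimal sketch of the data preparation just described, in the spirit of nanoGPT rather than its actual code; the file name input.txt and the helper names are assumptions. Characters map to integers, and each four-by-eight batch packs inputs x and targets y, where y is x shifted one position to the left.

```python
import torch

text = open('input.txt').read()                 # assumes the tiny Shakespeare file is present
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}    # character -> integer
data = torch.tensor([stoi[c] for c in text], dtype=torch.long)  # one long 1-D sequence

block_size = 8   # maximum context length
batch_size = 4   # how many chunks are processed in parallel

def get_batch():
    # sample random starting offsets, then slice out chunks of length block_size
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])          # inputs
    y = torch.stack([data[i + 1:i + 1 + block_size] for i in ix])  # targets: inputs shifted by one
    return x, y

x, y = get_batch()   # both are (4, 8); every (row, t) position is its own training example
```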
speaker 3: We're not trying to condition on some other external information; we're just trying to produce a sequence of words that follow each other, or are likely to. So this is all PyTorch, and I'm going slightly faster because I'm assuming people have taken 231n or something along those lines. In the forward pass, we take these indices, and we encode the identity of the indices via an embedding lookup table: every single integer indexes into a lookup table of vectors in this nn.Embedding and pulls out the word vector for that token. And then, because the transformer by itself processes sets natively, we need to also positionally encode these vectors, so that we have both the information about the token identity and its place in the sequence, from one up to block size. Now, the information about what and where is combined additively: the token embeddings and the positional embeddings are just added, exactly as here. So this x here (then there's optional dropout), this x here basically just contains the set of words and their positions, and that feeds into the blocks of the transformer, which we're going to look into: what is a block here? For now, it's just a series of blocks in the transformer. And then at the end there's a layer norm, and then you're decoding the logits for the next integer in the sequence using a linear projection of the output of this transformer. So lm_head here, short for language model head, is just a linear function. So basically: positionally encode all the words, feed them into a sequence of blocks, and then apply a linear layer to get the probability distribution for the next character. And then, if we have the targets, which we produced in the data loader (and you'll notice that the targets are just the inputs offset by one in time), those targets feed into a cross-entropy loss. So this is just a negative log likelihood, a typical classification loss. So now let's drill into what's here . speaker 2: in the blocks. speaker 3: So in these blocks that are applied sequentially, there is, again, as I mentioned, a communicate phase and a compute phase. In the communicate phase, all of the nodes get to talk to each other, and these nodes are: if our block size is eight, then we are going to have eight nodes in this graph. There are eight nodes in this graph: the first node is pointed to only by itself; the second node is pointed to by the first node and itself; the third node is pointed to by the first two nodes and itself; et cetera. So there are eight nodes here. So you apply (there's a residual pathway in x) the following: you take x out, you apply a layer norm, and then the self-attention, so that these eight nodes communicate. But you have to keep in mind that the batch is four, so we have eight nodes communicating, but there's a batch of four of them, all individually communicating among their own eight nodes; there's no criss-cross across the batch dimension, of course, since there's no batch norm anywhere, luckily. And then, once they've exchanged information, they are processed individually by the MLP, and that's the compute phase. And then, also, here we are missing the cross-attention, because this is a decoder-only model. So all we have is this step here, the multi-headed attention, which is this line, the communicate phase, and then we have the feed-forward, which is the MLP, and that's the compute phase.
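A condensed sketch of the GPT class and the block structure being walked through, in the spirit of nanoGPT rather than a copy of it; the layer sizes are placeholders, and it relies on the CausalSelfAttention module sketched after the next passage.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Block(nn.Module):
    """One transformer block: communicate (attention) then compute (MLP), each on the residual pathway."""
    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, n_head, block_size)  # sketched below
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(  # per-node processing with the 4x expansion factor
            nn.Linear(n_embd, 4 * n_embd), nn.GELU(), nn.Linear(4 * n_embd, n_embd))

    def forward(self, x):
        x = x + self.attn(self.ln1(x))   # pre-norm residual: communicate phase
        x = x + self.mlp(self.ln2(x))    # pre-norm residual: compute phase
        return x

class GPT(nn.Module):
    def __init__(self, vocab_size, block_size, n_embd=64, n_head=4, n_layer=4):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, n_embd)   # token identity: "what"
        self.pos_emb = nn.Embedding(block_size, n_embd)   # token position: "where"
        self.blocks = nn.Sequential(*[Block(n_embd, n_head, block_size) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)      # logits for the next token

    def forward(self, idx, targets=None):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)         # "what" and "where", combined additively
        x = self.blocks(x)
        logits = self.lm_head(self.ln_f(x))
        loss = None
        if targets is not None:                           # targets are inputs shifted by one
            loss = F.cross_entropy(logits.view(B * T, -1), targets.view(B * T))
        return logits, loss
```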
speaker 3: I'll take questions a bit later. Then the MLP here is fairly straightforward. The MLP is just individual processing on each node, just transforming the feature representation of that node. So it's applying a two-layer neural net with a GELU nonlinearity; just think of it as a ReLU or something like that, it's just a nonlinearity. So the MLP is straightforward; I don't think there's anything too crazy there. And then this is the causal self-attention part, the communication phase. So this is kind of the meat of things, and the most complicated part. It's only complicated because of the batching and the implementation detail of how you mask the connectivity in the graph, so that you can't obtain any information from the future when you're predicting your token; otherwise it gives away the information. So if I'm the fifth token, and I'm at the fifth position, then I'm getting the fourth token coming into the input, and I'm attending to the third, second, and first, and I'm trying to figure out what the next token is; well, in this batch, in the next element over in the time dimension, the answer is at the input, so I can't get any information from there. That's why this is all tricky. But basically, in the forward pass, we are calculating the queries, keys, and values based on x. So these are the keys, queries, and values here. When I'm computing the attention, I have the queries matrix-multiplying the keys; this is the dot product in parallel for all the queries and all the keys in all the heads. I forgot to mention that there's also the aspect of the heads, which is also all done in parallel here. So we have the batch dimension, the time dimension, and the head dimension, and you end up with these big multi-dimensional tensors, and it's all really confusing, so I invite you to step through it later and convince yourself that it's actually doing the right thing. Basically, you have the batch dimension, the head dimension, and the time dimension, and then you have the features for them. And so this is evaluating, for all the batch elements, all the head elements, and all the time elements, the simple Python that I gave you earlier, which is the query dot-producted with the keys. Then here we do a masked fill, and what this is doing is basically clamping the attention between the nodes that are not supposed to communicate to negative infinity. And we're using negative infinity because we're about to softmax, and negative infinity will make the attention on those elements zero. And so here we basically end up with the weights, the sort of affinities between these nodes; optional dropout; and then here, the attention matrix-multiplied with v is basically the gathering of the information according to the affinities we've calculated, which is just a weighted sum of the values at all those nodes. So this matrix multiply is doing that weighted sum. And then transpose, contiguous, view, because it's all complicated and batched in these multi-dimensional tensors, but it's really not doing anything; optional dropout; and then a linear projection back to the residual pathway. So this is implementing the communication phase here. Then you can train this transformer, and then you can generate infinite Shakespeare. And you simply do this by starting, since our block size is eight, with some token; say, in this case, you can use something like a newline as the start token.
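Before moving on to sampling, here is a matching sketch of the causal self-attention just walked through: batched queries, keys, and values, the lower-triangular mask filled with negative infinity before the softmax, and the weighted sum of values. This is a condensed, assumed version in the style of nanoGPT, with dropout omitted; it is not the actual source.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Multi-headed, masked self-attention: the communication phase of the block."""
    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.c_attn = nn.Linear(n_embd, 3 * n_embd)   # produces q, k, v in one projection
        self.c_proj = nn.Linear(n_embd, n_embd)       # projection back onto the residual pathway
        # lower-triangular mask: position t may only attend to positions <= t
        mask = torch.tril(torch.ones(block_size, block_size)).view(1, 1, block_size, block_size)
        self.register_buffer("mask", mask)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(C, dim=2)
        # reshape to (B, n_head, T, head_size) so that every head attends in parallel
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))             # scaled dot products
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float('-inf'))  # no looking at the future
        att = F.softmax(att, dim=-1)                                        # affinities sum to one
        y = att @ v                                                         # weighted sum of values
        y = y.transpose(1, 2).contiguous().view(B, T, C)                    # re-assemble the heads
        return self.c_proj(y)
```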
speaker 3: Then you communicate only with yourself, because there's a single node, and you get the probability distribution for the first word in the sequence, or the first character in the sequence, and then you decode the character, and then you bring back the character and re-encode it as an integer, and now you have the second token. And so you go: okay, we're at the first position, and this is whatever integer it is; add the positional encodings; it goes through the sequence of transformer blocks; and again, this token now communicates with the first token, and so on. And so you just keep plucking tokens out, and once you run out of the block size, which is eight, you start to crop, because you can never have a block size of more than eight in the way you've trained this transformer. So we use more and more context up to eight, and then if you want to generate beyond eight, you have to start cropping, because the transformer only works for eight elements in the time dimension. And so all of these transformers, in the naive setting, have a finite block size, or context length. In typical models this will be 1,024 tokens, or 2,048 tokens, something like that. But these tokens are usually BPE tokens, or SentencePiece tokens, or WordPiece tokens (there are many different encodings), so it's not actually that long. And so that's why, as I think Div mentioned, we really want to expand the context size, and it gets gnarly because the attention is quadratic in it in the naive case. Now, if you want to implement an encoder instead of decoder attention, then all you have to do is take this mask and just delete that line. If you don't mask the attention, then all the nodes communicate with each other and everything is allowed, and information flows between all the nodes. So if you want to have the encoder here, the encoder blocks just use attention with that line deleted. So you're allowing whatever you're encoding, say ten tokens, ten nodes, to all communicate with each other, going up the transformer. And then, if you want to implement cross-attention, so that you have a full encoder-decoder transformer, not just a decoder-only transformer or GPT, then you need to also add cross-attention in the middle. So here, there's a self-attention piece, a cross-attention piece, and this MLP. And in the cross-attention, we need to take the features from the top of the encoder. We would need to add one more block here, and this would be the cross-attention; I should have implemented it instead of just pointing, I think, but there would be a cross-attention line here. So we'd have three lines, because we need to add another block, and the queries would come from x, but the keys and the values would come from the top of the encoder, and there would basically be information flowing from the encoder strictly into all the nodes inside x. And then that's it. So these are very simple modifications of the decoder attention. So you'll hear people talk about having a decoder-only model, like GPT; you can have an encoder-only model, like BERT; or you can have an encoder-decoder model like, say, T5, doing things like machine translation. And BERT you can't train using this language-modeling setup that's autoregressive, where you're just trying to predict the next element in the sequence; you're training it with slightly different objectives.
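Returning to the sampling loop described at the start of this passage, a hedged sketch of how it might look, assuming the GPT interface from the earlier sketch (logits plus an optional loss):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size):
    """Autoregressive sampling: feed the context, sample one token, append it, repeat."""
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]            # crop: the model only works for block_size tokens
        logits, _ = model(idx_cond)
        logits = logits[:, -1, :]                  # distribution for the next token only
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_token], dim=1)  # append and continue
    return idx

# usage sketch: start from a single newline character and sample 100 more characters
# start = torch.tensor([[stoi['\n']]])
# shakespeare = generate(model, start, max_new_tokens=100, block_size=8)
```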
speaker 3: You're putting in the full sentence, and the full sentence is allowed to communicate fully, and then you're trying to classify sentiment or something like that. So you're not trying to model the next token in the sequence; these are trained slightly differently, with masking and other denoising techniques. Okay. So that's kind of the transformer. I'm going to continue, but yeah, maybe more questions first. speaker 2: So is the important information in how things are connected? Or is it a dynamic routing that changes? And we've also seen that you can put structure on it just by masking. speaker 3: So I'm not sure if I fully follow. There are different ways to look at this analogy, but one way is that you can interpret this graph as really fixed; it's just that every time you do the communication, we are using different weights. You can look at it that way. So if we have a block size of eight, in my example, we would have eight nodes; here we have two, four, six... okay, so we have eight nodes; you lay them out, and you only connect from left to right. speaker 2: I mean, but for different problems, might it not be better to have a graph that isn't fully connected that way? Why would . speaker 3: usually the connections not change as a function of the . speaker 2: data or something like that? speaker 3: I don't think I've seen a single example where the connectivity changes dynamically as a function of the data. Usually the connectivity is fixed. If you have an encoder and you're training a BERT, you have however many tokens you want, and they are fully connected. If you have a decoder only, you have this triangular thing. And if you have an encoder-decoder, then you have, awkwardly, sort of two pools of nodes. Yeah. speaker 2: A question, and you know much more about this than I do: it almost seems obvious in hindsight, but there are also different pieces in there, like layer norm. Were there earlier attempts at something like this? speaker 3: Yeah, it's really hard to say. That's why I think this paper is so interesting. Usually you'd see a path, and maybe they had a path internally that they just didn't publish, but all you can see is things that didn't quite look like the transformer. I mean, you have ResNets, which have the residual-ness; a ResNet would be kind of like this, but there's no self-attention component. The MLP is there, kind of, in a ResNet. So a ResNet looks very much like this, except there's no self-attention, and you can use layer norms in ResNets as well, I believe; typically they would sometimes be batch norms. So it is kind of like a ResNet; it's kind of like they took a ResNet and they put in a self-attention block in addition to the pre-existing MLP block, which is kind of like a convolution (the MLP is, strictly speaking, a one-by-one convolution). But the ideas are similar: the MLP is just your typical weights, nonlinearity, weights operation. But I will say, it's kind of interesting, because a lot of the intermediate work is just not there, and then they give you this transformer, and then it turns out five years later it hasn't changed, even though everyone is trying to change it. So it's kind of interesting to me that it came as a package, which I think is really interesting historically. And I also talked to the paper authors, and they were unaware of the impact that the transformer would have at the time.
So when you read this paper, actually, it's kind of unfortunate, because this is the paper that changed everything, but when people read it, it's like question marks, because it reads like a pretty random machine translation paper. Like: oh, we're doing machine translation; oh, here's a cool architecture; okay, great, good results. It doesn't sort of know what's going to happen. And so when people read it today, I think they're potentially kind of confused. I will have some tweets at the end, but I think I would have renamed it, with the benefit of hindsight; well, I'll get to it. Yeah, I think that's a good question as well. Currently, I mean, I certainly don't love the autoregressive modeling approach. I think it's kind of weird to sample a token and then commit to it. So maybe there are some hybrids with diffusion, as an example, which I think would be really cool, or we'll find some other ways to edit the sequences later, but still within the autoregressive framework. But I think diffusion is kind of an up-and-coming modeling approach that I personally find much more appealing. When I produce text, I don't go chunk, chunk, chunk and commit; I do a draft one, and then I do a better draft two, and that feels like a diffusion process. So that would . speaker 2: be my hope. speaker 1: Okay, also a question. So, would you say self-attention is sort of like computing an adjacency matrix, with the dot product over the nodes, and then you basically have the edges multiplied by the values, and then you just propagate it? speaker 3: Yes. Yes, right. speaker 1: And do you think there's an analogy between graph neural networks and self-attention? speaker 3: I find graph neural networks kind of a confusing term, because, I mean, yeah, previously there was this notion; I kind of think maybe today everything is a graph neural network, because the transformer is a graph-neural-network-like processor. The native representation that the transformer operates over is sets that are connected by edges in a directed way, and so that's the native representation. speaker 2: And then, yeah, okay. speaker 3: I should go on, because I still have like 30 slides. speaker 2: first. speaker 3: Oh yeah, the root d. I think, basically, if you're initializing with random weights sampled from a Gaussian, then as your dimension size grows, so do your values: the variance grows, and then your softmax will just become a one-hot vector. So dividing by root d is just a way to control the variance and bring it to always be in a good range for the softmax, a nice, diffuse distribution. So it's . speaker 2: almost like an . speaker 3: initialization thing. Okay. So transformers . speaker 2: have been applied . speaker 3: to all the other fields, and the way this was done was, in my opinion, in kind of ridiculous ways, honestly, because I was a computer vision person, and you have convnets, and they kind of make sense. So what we're doing now with ViTs, as an example, is you take an image and you chop it up into little squares, and then those squares literally feed into a transformer, and that's it, which is kind of ridiculous. And so the transformer, in the simplest case, doesn't even really know where these patches come from. They are usually positionally encoded, but it has to sort of rediscover a lot of the structure.
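A hedged sketch of that chop-the-image-into-squares idea behind ViT: each 16-by-16 patch becomes one token (one node) that a transformer encoder then self-attends over. The sizes are common ViT defaults used here as assumptions, not values from the lecture.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, n_embd=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # a strided convolution both cuts the image into patches and linearly projects each one
        self.proj = nn.Conv2d(in_chans, n_embd, kernel_size=patch_size, stride=patch_size)
        self.pos_emb = nn.Parameter(torch.zeros(1, self.num_patches, n_embd))

    def forward(self, imgs):                  # imgs: (B, 3, 224, 224)
        x = self.proj(imgs)                   # (B, n_embd, 14, 14)
        x = x.flatten(2).transpose(1, 2)      # (B, 196, n_embd): a set of patch tokens
        return x + self.pos_emb               # add positional information

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))  # ready to feed into a transformer encoder
print(tokens.shape)                                  # torch.Size([2, 196, 768])
```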
speaker 3: I think of them in some ways as, well, it's kind of weird to approach it that way, but it's just the simplest baseline of chopping up big images into small squares and feeding them in as the individual nodes, and it actually works fairly well. And then this is a transformer encoder, so all the patches are talking to each other throughout the entire transformer, and the number of nodes . speaker 2: here would be sort of like nine. speaker 3: Also in speech recognition: you just take your mel spectrogram and you chop it up into little slices and feed them into a transformer. There are papers like this, but also Whisper; Whisper is a copy-paste transformer. If you saw Whisper from OpenAI: you just chop up the mel spectrogram, feed it into a transformer, and then pretend you're dealing with text, and it works very well. Decision Transformer in RL: you take the states, actions, and rewards that you experience in an environment, and you just pretend it's a language, and you start to model sequences of that, and then you can use that for planning later; that works pretty well. Even things like AlphaFold: we were briefly talking about molecules and how you can plug them in, and at the heart of AlphaFold, computationally, is also a transformer. One thing I also wanted to say about transformers is that I find them super flexible, and I really enjoy that. I'll give you an example from Tesla. You have a convnet that takes an image and makes predictions about the image, and then the big question is: how do you feed in extra information? And it's not always trivial. Say I have additional information that I want the output to be informed by; maybe I have other sensors, like radar, maybe I have some map information, or vehicle type, or some audio. The question is, how do you feed that information into a convnet? Where do you feed it in? Do you concatenate it? How do you add it, and at what stage? With the transformer, it's much easier: you just take whatever you want, you chop it up into pieces, and you feed it in with the set of what you had before, and you let the self-attention figure out how everything should communicate. And that actually, apparently, works. So just chop up everything and throw it into the mix is kind of the approach, and it frees neural nets from this burden of Euclidean space, where previously you had to arrange your computation to conform to the Euclidean space of three dimensions, of how you're laying out the compute; the compute actually kind of happens in normal 3D space, if you think about it. But in attention, everything is just sets, so it's a very flexible framework: you can just throw stuff into your conditioning set, and everything just self-attends over it. So it's quite beautiful to look at. So now, what exactly makes transformers so effective? I think a good example of this comes from the GPT-3 paper, which I encourage people to read: "Language Models are Few-Shot Learners." I would probably have renamed this a little bit; I would have said something like "transformers are capable of in-context learning," or meta-learning, because that's kind of what makes them really special. So basically, the setting they're working with is: okay, I have some context, say a passage (this is just one example of many), and I'm asking questions about it, and then I'm giving, as part of the context in the prompt, the questions and the answers.
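To make the setup concrete, the kind of few-shot prompt being described might look like this. This is a hedged illustration: the passage and question-answer pairs are invented placeholders, and the generate call at the end is hypothetical.

```python
# Few-shot prompt: a passage, several worked question-answer pairs, then a new question.
prompt = """Passage: Tom planted tomatoes in April and harvested them in August.

Q: When did Tom plant the tomatoes?
A: April

Q: When did Tom harvest the tomatoes?
A: August

Q: What did Tom plant?
A:"""

# A language model completes the document, answering the last question by following the
# pattern of the examples above, with no gradient update involved, e.g.:
# answer = language_model.generate(prompt)   # 'language_model' is a hypothetical handle
```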
So I'm giving one example of question and answer, another example of question and answer, another example, and so on. What's really interesting is that with more examples given in the context, the accuracy improves. What that hints at is that the transformer is able to somehow learn in its activations, without doing any gradient descent in the typical fine-tuning fashion. If you fine-tune, you have to give examples with answers and then tune using gradient descent. But it looks like the transformer internally, in its forward pass, is doing something that potentially looks like gradient descent, some kind of meta-learning in the activations as it reads the prompt. In this paper they distinguish the outer loop, stochastic gradient descent, from an inner loop of in-context learning: the inner loop is the transformer reading the sequence, and the outer loop is training by gradient descent. So basically there is some training happening in the activations of the transformer as it consumes a sequence, and it may look very much like gradient descent.
There are some recent papers that hint at this and study it. As an example, in this paper here they propose something called the raw operator, they argue that this operator is implemented by a transformer, and then they show that you can implement things like ridge regression on top of it. So there are papers hinting that maybe there is something that looks like gradient-based learning inside the activations of the transformer. And I think this is not impossible to reason about, because what is gradient-based learning? A forward pass, a backward pass, and an update. Well, that looks like a ResNet, right? Because you're just adding to the weights: you start with some initial random weights, forward pass, backward pass, update the weights; then forward pass, backward pass, update again. That looks like a ResNet, and the transformer is a ResNet. This is much more hand-wavy, but basically there are papers trying to hint at why this could be possible.
And then I have a bunch of tweets that I just copy-pasted here. These were meant for general consumption, so they're a bit more high-level and hype-y, but I'm talking about why this architecture is so interesting and why it potentially became so popular. I think it simultaneously optimizes three properties that are very desirable. Number one, the transformer is very expressive in the forward pass: it's able to implement very interesting functions, potentially functions that can even do meta-learning. Number two, it is very optimizable, thanks to things like residual connections, layer norms, and so on. And number three, it's extremely efficient. This is not always appreciated, but if you look at the computational graph, the transformer is a shallow, wide network, which is perfect for taking advantage of the parallelism of GPUs. I think the transformer was designed very deliberately to run efficiently on GPUs. There's previous work, like the Neural GPU, that I really enjoy as well, which is essentially about designing neural nets that are efficient on GPUs, thinking backwards from the constraints of the hardware, which I think is a very interesting way
to think about it. Oh yeah, so here I'm saying I probably would have called the transformer a general-purpose, efficient, optimizable computer, instead of "Attention Is All You Need." In hindsight, that's what I might have said the paper is proposing: a model that is very general-purpose, whose forward pass is expressive, that is very efficient in terms of GPU usage, and that is easily optimizable by gradient descent and trains very nicely. Again, I have some other hot tweets here; you can read them later, but I think this one was maybe interesting: if previous neural nets are special-purpose computers designed for a specific task, GPT is a general-purpose computer, reconfigurable at runtime to run natural-language programs. The programs are given as prompts, and then GPT runs the program by completing the document. I really like these analogies to computing, personally: it's just a powerful computer, and it's optimizable by gradient descent. Okay, we can read this later; I'll just leave this up maybe. It turns out that if you scale up the training set and use a powerful enough neural net, like a transformer, the network becomes a kind of general-purpose computer over text. I think that's a nice way to look at it. Instead of performing a single fixed sequence of computation, you can design the sequence in the prompt, and because the transformer is both powerful and trained on a large enough, hard enough dataset, it kind of becomes this general-purpose text computer. So I think that's kind of interesting when you look at it that way.
speaker 2: [Question, partially inaudible, about whether RNNs are equally expressive and whether the difference is mostly inductive bias.]
speaker 3: So I think there's a bit of that, yeah. I would say RNNs, in principle, yes, they can implement arbitrary programs. But I think that's kind of a useless statement to some extent, because while they are probably expressive in the sense of raw power, in that they can implement these arbitrary functions, they're not optimizable, and they're certainly not efficient, because they are serial computing devices. If you look at the compute graph, an RNN is a very long, thin compute graph: if you took all the individual neurons and their connectivity, stretched them out, and tried to visualize them, an RNN would be a very long graph, and that's bad. It's also bad for optimizability; I don't exactly know why, but the rough intuition is that when you're backpropagating, you don't want to take too many steps. Transformers are a shallow, wide graph, so from supervision to inputs there is a very small number of hops, and it's along residual pathways, which make gradients flow very easily, with all these layer norms controlling the scales of the activations. So there aren't too many hops, you go from supervision to inputs very quickly, the gradient just flows through the graph, and it can all be done in parallel. You don't need this encoder-decoder RNN thing where you go from the first word, then the second word, then the third word; in a transformer, every single word is processed completely in parallel.
So I think all of these are really important, and I think number three is less talked about but extremely important, because in deep learning, scale matters: the size of the network you can train is extremely important, and if the architecture is efficient on current hardware, you can make it bigger.
speaker 2: You mentioned dealing with multiple modalities of data that you feed in all together. How does that actually work? Do you keep the different data as different tokens, or not?
speaker 3: Yeah, so you take your image and you chop it up into patches, say the first thousand tokens or whatever. Radar could come in as well; I don't actually know the exact representation of radar, but you just need to chop it up and enter it, and then you have to encode it somehow, because the transformer needs to know that these tokens are coming from radar. So you have some kind of special token: the radar tokens are slightly different in their representation, and that difference is learnable by gradient descent. And vehicle information would also come in with a special embedding token that can be learned.
speaker 2: And how does it know the ordering, since it's all just a set?
speaker 3: Right, it's all just a set, with no ordering built in. It's all just a set, but you can positionally encode these sets if you want. Positional encoding means you can hardwire, for example, the coordinates using sines and cosines, but it's often better not to hardwire the position: it's just a vector that is always hanging out at that location, whatever content is there just adds onto it, and this vector is trained by backpropagation. That's how you do it.
speaker 2: The learned encodings seem to work, but it seems like sometimes you might want to put some structure or prior belief into the representation, something better.
speaker 3: I'm not sure I fully understand the question, but the positional encodings have very little inductive bias; they're just vectors hanging out at each location, and you're trying to help the network in some way. I think the intuition is good, but if you have enough data, trying to mess with it is usually a bad thing: trying to inject knowledge by hand when there is enough knowledge in the dataset itself is not usually productive. So it really depends on what scale you're at. If you have infinite data, you actually want to encode less and less; that turns out to work better. If you have very little data, then you do want to encode some biases, and with a much smaller dataset maybe convolutions are a good idea, because you get that bias from your filters. The transformer is extremely general, but there are ways to mess with the encodings to put in more structure. You could, for example, encode sines and cosines and fix them. Or you could go into the attention mechanism and say: if my image is chopped up into patches, this patch can only communicate with this neighborhood; you just do that in the attention matrix and mask out whatever you don't want to communicate. And people really do play with this, because full attention is inefficient.
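(A minimal sketch of the recipe just described, assuming purely learned embeddings: extra modalities become tokens carrying a learnable type embedding, and every slot gets a learned positional vector added to whatever content sits there. The names, sizes, and the "radar" placeholder below are illustrative, not from the talk.)

```python
import torch
import torch.nn as nn

d_model, n_patches, n_radar = 256, 196, 32

# A learned positional embedding: "a vector that is always hanging out at that location",
# added onto whatever content lands there, and trained by backprop.
pos_emb  = nn.Parameter(torch.zeros(1, n_patches + n_radar + 1, d_model))

# Learned modality-type embeddings so the network can tell token sources apart.
type_emb = nn.Embedding(3, d_model)   # 0 = image patch, 1 = radar, 2 = vehicle info

img_tokens    = torch.randn(1, n_patches, d_model)   # projected image patches
radar_tokens  = torch.randn(1, n_radar, d_model)     # projected radar features (illustrative)
vehicle_token = torch.randn(1, 1, d_model)           # a single vehicle-type token

x = torch.cat([img_tokens, radar_tokens, vehicle_token], dim=1)   # one big set of tokens
types = torch.cat([
    torch.zeros(n_patches, dtype=torch.long),
    torch.ones(n_radar, dtype=torch.long),
    torch.full((1,), 2, dtype=torch.long),
])[None, :]

x = x + pos_emb + type_emb(types)   # self-attention is left to figure out who talks to whom
```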
So they will intersperse, for example, layers that only communicate within little patches with layers that communicate globally, and they'll do all kinds of tricks like that. You can slowly bring in more inductive bias if you want to, but the inductive biases are factored out from the core transformer: they're factored out into the connectivity of the nodes and into the positional encodings, and you can mess with this per application. There are probably about 200 papers on this now, if not more. They're honestly kind of hard to keep track of; my browser has something like 200 open tabs. I'm not even sure I want to pick a favorite, honestly.
speaker 1: Yeah, this is very interesting, and people have even compared the transformer to a processor: a processor takes instructions, you store variables, you have memory, and if you want a different program you just run a different one. So maybe you can use a transformer like that.
speaker 3: We might get there. The other idea that I actually like even more is to potentially keep the context length fixed, but allow the network to somehow use a scratchpad. The way this works is that you teach the transformer, via examples in the prompt: hey, you can't remember too much, your context length is finite, but you can use a scratchpad. You do that by emitting a "start scratchpad" marker, writing whatever you want to remember, emitting an "end scratchpad" marker, and then continuing with whatever you want. Later, when it's decoding, you have special logic so that when you detect the start-scratchpad marker, you save whatever it writes there to some external store and allow it to attend over it. So you can teach the transformer dynamically, because it's so good at meta-learning, to use other gizmos and gadgets and to extend its memory that way, if that makes sense. It's just like a human learning to use a notepad: you don't have to keep everything in your brain. Keeping things in your brain is like the context window of the transformer, but maybe you can give it a notebook, and it can query the notebook, read from it, and write to it.
speaker 2: I wonder whether that's what's going on with ChatGPT; it seems to have some sort of memory. I don't know if I detected that.
speaker 3: Did you feel like it was more than just a long prompt that's unfolding?
speaker 2: I didn't try extensively.
speaker 3: I did see a forgetting event, and I kind of felt like the block size had just moved along. Maybe I'm wrong; I don't actually know about the internals of ChatGPT. So, one question was what I think the next architecture will be, and the second, more personal question was what I'm going to work on next. Right now I'm working on things like nanoGPT. I'm basically moving slightly from computer vision and computer-vision-based products toward the language domain, toward ChatGPT and GPT. Originally I had minGPT, which I rewrote into nanoGPT, and I'm working on that, trying to reproduce GPT.
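(A toy sketch of the scratchpad idea just described. The <scratch> markers and the model_step callable are hypothetical placeholders; this is not an API from the talk, just an illustration of intercepting the markers during decoding.)

```python
START, END = "<scratch>", "</scratch>"

def generate_with_scratchpad(model_step, prompt, max_steps=100):
    """model_step is a hypothetical callable: given the text so far, return the next chunk."""
    external_memory = []          # what the model "wrote down" on its notepad
    text = prompt
    for _ in range(max_steps):
        chunk = model_step(text)  # next piece of the completion
        text += chunk
        while START in text:      # intercept any finished scratchpad spans
            pre, rest = text.split(START, 1)
            if END not in rest:
                break             # the note is still being written
            note, post = rest.split(END, 1)
            external_memory.append(note.strip())   # save the note outside the context window
            text = pre + post                      # and free up context space
    return text, external_memory
```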
And I think something like ChatGPT, incrementally improved in a product fashion, would be extremely interesting. I think a lot of people feel that, and that's why it spread so widely. So I think there's something like a "Google++" to build there, which I find really interesting. Those are my general thoughts.
Latest Summary (Detailed Summary)
Overview / Executive Summary
This transcript contains two main parts: the introduction to Stanford's CS25 course (Transformers United V2, Winter 2023) and a guest lecture by Andrej Karpathy on the Transformer model. The course aims to examine how Transformers work, how they are applied across different fields, and where the research frontier lies. Karpathy's lecture reviews the Transformer's trajectory since the 2017 paper "Attention Is All You Need", emphasizing how it spread from its original home in natural language processing (NLP) to computer vision (CV), reinforcement learning (RL), biology (e.g., AlphaFold), and other branches of AI, ultimately converging on a unified architecture that is copied almost everywhere.
Karpathy described the technical landscape before the Transformer (RNNs, LSTMs, and their limitations) and traced the origins of the attention mechanism, citing an email from Dzmitry Bahdanau about his early "soft search" idea. He interpreted the Transformer's core mechanism, multi-head attention, as data-dependent message passing over a directed graph (a communication phase) interleaved with a computation phase implemented by multilayer perceptrons (MLPs), highlighting the key detail of scaling the attention scores to stabilize training. Using his NanoGPT project, Karpathy walked through the implementation of a decoder-only Transformer (such as GPT): tokenization, positional encoding, the Transformer block (self-attention, MLP, residual connections, layer normalization), and the masking used in causal self-attention. He also contrasted decoder-only, encoder-only (e.g., BERT), and encoder-decoder (e.g., T5) architectures.
Karpathy attributes the Transformer's success to three properties: strong expressiveness (enabling in-context learning / meta-learning), ease of optimization (thanks to residual connections, layer normalization, and similar design choices), and efficiency (a shallow, wide structure well suited to GPU parallelism). He likens the Transformer to a "general-purpose, optimizable, efficient computer" that can be reconfigured at runtime via a prompt to run natural-language programs. Remaining challenges include handling longer sequences, extending memory (e.g., a "scratchpad" mechanism), improving controllability, and aligning these models more closely with how the human brain works.
Course Introduction: CS25 Transformers United V2
- Course name and background: CS25 Transformers United V2, offered at Stanford in Winter 2023.
- Core topic: the Transformer family of deep learning models, not robots that literally transform.
- Transformers have revolutionized natural language processing (NLP) and are widely applied in computer vision (CV), reinforcement learning (RL), generative adversarial networks (GANs), speech, and even biology (e.g., AlphaFold2 for protein folding).
- Course goals:
  - Understand how Transformers work.
  - Explore how Transformers are applied in different fields.
  - Learn about the research frontier around Transformers.
- Nature of this lecture: a purely introductory session covering the Transformer's basic building blocks, especially the self-attention mechanism. Later sessions go deeper into models such as BERT and GPT.
- Instructors:
  - Speaker 1 (course presenter): currently on temporary leave from a PhD program [uncertain; transcribed as "literprfrom, the psp program"], leading AI at a collaborative-robotics startup working on general-purpose robots. Research interests include robotics, reinforcement learning, and [unclear: "personal learning and provisions in remodeling"].
  - Stephen (Speaker 3): first-year CS PhD student at Stanford with a master's from CMU. Works mainly on NLP, and more recently also on computer vision and [unclear: "wonton moand", possibly multimodal work]. Hobbies include piano (he and friends are starting a Stanford piano club), martial arts, bodybuilding, TV dramas, anime, and gaming.
  - Ryan (introduced after Stephen): excited to teach the course again and thought last year's lineup of speakers was excellent.
- Fun fact: Speaker 1 mentioned having been the "most outspoken student" last year.
Andrej Karpathy's Lecture: A Deep Dive into Transformers
Transformer Timeline and the Limitations of Earlier Models
- The attention timeline:
  - Before 2017 (the historical era): the dominant models were recurrent neural networks (RNNs) and long short-term memory networks (LSTMs). Early explorations of attention existed but did not scale effectively.
  - 2017: Vaswani et al.'s paper "Attention Is All You Need" is published, marking the birth of the Transformer.
  - 2018-2020: explosive growth of Transformers in NLP, rapidly expanding into other fields such as CV and bioinformatics (AlphaFold). Google reportedly remarked that "every time we hire a linguist, model performance improves" [this observation applies mainly to early 2018 or before; transcribed as "for the first 2018"].
  - 2021 (the era of generative models begins): a wave of Transformer-based generative models appears, such as Codex, GPT, DALL-E, and Stable Diffusion.
  - 2022 to the present: models keep scaling up, applications such as ChatGPT and Whisper appear, and the momentum shows no sign of slowing.
- Limitations of pre-Transformer sequence models (RNNs, LSTMs):
  - Strength: good at encoding history.
  - Weaknesses:
    - Difficulty handling long sequences.
    - Weak context encoding. For example, in the sentence "I grew up in France, ..., I speak fluent ____.", the model must use the distant context "France" to correctly predict "French". Attention handles this very well, whereas LSTMs do poorly.
  - With attention maps, Transformers do content-based contextual prediction better, e.g., resolving which noun a pronoun such as "it" refers to.
Evolution of Transformers and Their Growing Capabilities
- 2021 (about to take off):
  - Many long-sequence problems solved, such as protein folding (AlphaFold) and offline reinforcement learning.
  - Genuine zero-shot generalization emerges.
  - Multimodal tasks and applications, such as DALL-E generating images from text.
- 2022 (already taking off):
  - Distinctive applications appear in audio generation, art, and music generation.
  - Early reasoning abilities emerge, including commonsense, logical, and mathematical reasoning.
  - Alignment and interaction with humans via reinforcement learning from human feedback (RLHF), as in how ChatGPT was trained.
  - Mechanisms for controlling toxicity and bias and for safeguarding ethics.
  - Major progress in diffusion models.
Future Directions and Key Open Problems
- Exciting application areas:
  - Video understanding and generation.
  - Finance and business applications, for example using GPT to write novels (the talk mentions "gbauthor novel" [unclear; possibly "GPT for authoring novels"]).
  - Generalized agents capable of multiple tasks and multimodal inputs (the talk mentions "garthrough" [unclear model name; possibly Gato]).
  - Domain-specific models, such as a medical GPT or a legal GPT, complementing general-purpose large models; an ecosystem of "mixture of experts"-style AI models may emerge.
- Key missing ingredients:
  - Memory: interactions with current models (e.g., ChatGPT) are ephemeral; they lack long-term memory and the ability to store conversation history.
  - Computational complexity: attention scales quadratically with sequence length (O(N²)) and needs to be reduced.
  - Controllability: many model outputs are stochastic; finer control over output content and style is needed.
  - Alignment with the human brain: despite existing research, more work is needed to make these models operate in ways closer to the brain.
Karpathy's Deep Dive: History, Mechanics, and Potential of the Transformer
1. Historical Background Before the Transformer
- AI before 2012 (taking CV as an example):
  - Researchers typically designed elaborate, task-specific feature-extraction pipelines, such as SIFT, HOG, and color histograms, and fed those features into classifiers such as SVMs. Karpathy described it as a "kitchen sink of different kinds of features".
  - These approaches were not only complex but also performed poorly, with frequent prediction errors.
  - Different AI subfields (e.g., NLP vs. CV) used entirely different terminology and methodologies, making it hard to read papers across fields. NLP was full of part-of-speech tagging, morphological analysis, syntactic parsing, and similar terminology.
- 2012 (the AlexNet breakthrough): Krizhevsky et al. showed that scaling up large neural networks on large datasets yields very strong performance, shifting the field's focus toward compute and data scale.
  - Neural networks subsequently spread across every area of AI (CV, NLP, speech, translation, RL, etc.).
  - Papers in different areas began to use similar vocabulary (neural networks, parameters, optimizers), lowering the barrier to learning across fields.
- 2017 (the birth of the Transformer):
  - Not only the toolkits and the neural networks converged: the architecture itself became unified. The Transformer architecture has been copied into nearly every AI task, with the main differences being data preprocessing and how inputs are fed in.
  - Karpathy finds this convergence striking and speculates that it may be a sign that AI is approaching a single, unified, powerful learning algorithm, analogous to the highly homogeneous structure of the cerebral cortex.
2. Origins of the Attention Mechanism
- Early neural language models (2003): Yoshua Bengio et al. used a multilayer perceptron (MLP) to predict a probability distribution over the fourth word given the previous three, an early success of neural networks in language modeling.
- Sequence-to-sequence models (Seq2Seq, 2014): used for machine translation and other variable-length input/output tasks, built on an encoder-decoder architecture (typically LSTMs).
  - Main problem: the encoder bottleneck, i.e., the entire input sentence is compressed into a single fixed-length vector, losing a great deal of information.
- Bahdanau et al.'s attention mechanism (2014/2015): the paper "Neural Machine Translation by Jointly Learning to Align and Translate".
  - Core idea: let the model automatically "soft search" over the relevant parts of the source sentence when predicting each target word, rather than relying on a single fixed-length vector.
  - Implementation: while generating each word, the decoder can "look back" at all of the encoder's hidden states; a soft attention mechanism computes a context vector as a weighted sum of those hidden states, with weights based on the compatibility between the current decoder state and each encoder hidden state (normalized with a softmax), as written out below.
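A compact way to write this "soft search" (the notation below is a standard reconstruction of the Bahdanau-style mechanism, not something quoted from the transcript): for decoder step $i$, with previous decoder state $s_{i-1}$ and encoder hidden states $h_1, \dots, h_{T_x}$,

$$
e_{ij} = a(s_{i-1}, h_j), \qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \qquad
c_i = \sum_{j=1}^{T_x} \alpha_{ij}\, h_j ,
$$

where $a(\cdot,\cdot)$ is a small learned alignment (compatibility) network and $c_i$ is the context vector passed to the decoder.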
- The history of "attention", as revealed by Karpathy's email exchange with Dzmitry Bahdanau:
  - Bahdanau's inspiration came from doing English translation exercises in secondary school, where his gaze moved back and forth between the source and target sequences.
  - He implemented the "soft search" as a softmax plus a weighted average over the encoder hidden states, and it "worked on the first try".
  - The mechanism was originally called "RNN Search", a name Bahdanau found rather plain [transcribed as "blame"; likely "plain"/"bland"]. The better name, "attention", was suggested by Yoshua Bengio during the final editing of the paper.
3. What Made "Attention Is All You Need" (2017) Unique
- Core change: removing the RNN entirely and keeping only the attention mechanism.
- Karpathy's assessment: a landmark and remarkable paper. It was not an incremental improvement; it fused several innovations at once and landed on a very good local optimum in architecture space.
- Key ingredients:
  - Positional encoding: since attention itself operates on (unordered) sets, position information must be injected.
  - Residual network structure: borrowed from ResNet.
  - Interleaving attention with MLPs: attention layers alternate with multilayer perceptron layers.
  - Layer norms.
  - Multi-head attention: multiple attention "heads" applied in parallel.
  - Well-chosen hyperparameters: e.g., the 4x expansion factor in the MLP layer, still used today.
- Architectural resilience: despite years of subsequent research, today's GPT-style models remain very close to the 2017 Transformer at their core.
  - Main changes: layer normalization moved before the attention/MLP layers (pre-norm), plus some innovations in positional encoding (e.g., rotary position embeddings (RoPE) and relative positional encodings).
4. Karpathy's Reading of Attention: Communication and Computation
- Each Transformer block alternates between two phases:
  - Communication phase: implemented by multi-head attention, a data-dependent form of message passing between the nodes of a directed graph.
  - Computation phase: implemented by an MLP, transforming each node's features independently.
- A simplified description of the communication phase (for a single attention head):
  - Each node (token) stores a private data vector.
  - Each node produces three vectors via linear transformations:
    - Query (Q): what information am I looking for?
    - Key (K): what information/features do I contain?
    - Value (V): what will I communicate if my information is selected?
  - For a given target node in the graph:
    - The node emits its query (Q).
    - All source nodes pointing at it broadcast their keys (K).
    - Dot products between Q and each K produce "scores" measuring how interesting each source node's content is to the target node's query, i.e., their affinity.
    - The scores are normalized with a softmax to obtain weights (in the original "Attention Is All You Need" paper, to keep the variance of the dot products in a good range for the softmax and stabilize training, the scores are first divided by a scaling factor tied to the key dimension, $\sqrt{d_k}$).
    - The weights are used to take a weighted sum of the source nodes' values (V), and the result updates the target node.
  - In practice this process is heavily vectorized and batched; a minimal single-head sketch follows this list.
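A minimal, single-head sketch of this communication step in PyTorch (assumed notation: x holds the nodes' private vectors; w_q, w_k, w_v are the linear maps; everything here is illustrative rather than taken from any particular codebase):

```python
import torch
import torch.nn.functional as F

def single_head_attention(x, w_q, w_k, w_v):
    """One attention head as data-dependent message passing over a set of T nodes.

    x: (T, d_model) private vectors of the nodes; w_q/w_k/w_v: (d_model, d_k) projections.
    """
    Q = x @ w_q                       # what each node is looking for
    K = x @ w_k                       # what each node contains
    V = x @ w_v                       # what each node will communicate if attended to
    d_k = K.shape[-1]
    scores = (Q @ K.T) / d_k ** 0.5   # affinities, scaled by sqrt(d_k) to keep the softmax well-behaved
    weights = F.softmax(scores, dim=-1)
    return weights @ V                # each node receives a weighted sum of the values

T, d_model, d_k = 8, 32, 16
x = torch.randn(T, d_model)
out = single_head_attention(x, torch.randn(d_model, d_k),
                            torch.randn(d_model, d_k), torch.randn(d_model, d_k))
print(out.shape)  # torch.Size([8, 16])
```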
5. Self-Attention, Multi-Head Attention, and Cross-Attention
- Multi-head attention: run several independent copies of the attention (communication) process in parallel, each head with its own Q, K, V projection weights, so the model can look for different kinds of information from different perspectives at once.
- Self-attention: Q, K, and V all come from the same set of nodes (e.g., tokens within the encoder, or the current token attending to previously generated tokens inside the decoder).
- Cross-attention: Q comes from one set of nodes (e.g., the decoder), while K and V come from another (e.g., the encoder output).
6. NanoGPT: A Minimal Transformer Implementation
Karpathy presented his NanoGPT project, a concise implementation (roughly 300 lines of code) of a decoder-only Transformer for language modeling that can reproduce GPT-2's performance on OpenWebText.
- Data processing:
  - Text data (e.g., the "tiny Shakespeare" dataset, a 1 MB text file).
  - Tokenization: convert the text into a sequence of integers (e.g., character-level encoding, or a more advanced BPE encoding). A special token such as <|endoftext|> may be used to separate documents.
- Batching:
  - block_size: the maximum context length the Transformer can handle.
  - batch_size: the number of sequences processed in parallel.
  - During training, the input is the sequence x_1, ..., x_t and the targets are x_2, ..., x_{t+1}; a single batch of batch_size * block_size tokens therefore contains a large number of parallel training examples (see the sketch below).
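A minimal sketch of this batching scheme (the get_batch name follows nanoGPT's convention, but the code below is a reconstruction with assumed toy sizes, not the project's actual source):

```python
import torch

block_size, batch_size = 8, 4
data = torch.randint(0, 65, (1000,))   # stand-in for the tokenized text (e.g. 65 distinct characters)

def get_batch():
    # Sample batch_size random starting offsets, then slice out block_size tokens from each.
    ix = torch.randint(0, len(data) - block_size, (batch_size,))
    x = torch.stack([data[i : i + block_size] for i in ix])          # inputs  x_1 ... x_t
    y = torch.stack([data[i + 1 : i + 1 + block_size] for i in ix])  # targets x_2 ... x_{t+1}
    return x, y   # every position in x doubles as a training example for predicting its next token

x, y = get_batch()
print(x.shape, y.shape)  # torch.Size([4, 8]) torch.Size([4, 8])
```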
- The GPT class (PyTorch implementation):
  - The forward method:
    - Input: a sequence of integer token indices.
    - Token embeddings: a lookup table (nn.Embedding) converts each integer ID into a vector.
    - Positional embeddings: a vector is produced for each position in the sequence and added to the token embedding, giving x, which carries both content and position information.
    - x then passes through a stack of Transformer blocks.
    - A final layer normalization (LayerNorm) and a linear layer (the LM head) produce logits for predicting the next token.
    - The loss is computed with cross-entropy (CrossEntropyLoss), as in the condensed sketch below.
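A condensed sketch of this forward pass (structured after nanoGPT but heavily abbreviated; the hyperparameters are arbitrary, and it relies on the Block module sketched in the next subsection):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GPT(nn.Module):
    def __init__(self, vocab_size, block_size, n_layer=4, n_embd=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, n_embd)   # token id  -> vector
        self.pos_emb = nn.Embedding(block_size, n_embd)   # position  -> vector
        self.blocks  = nn.ModuleList([Block(n_embd, block_size) for _ in range(n_layer)])
        self.ln_f    = nn.LayerNorm(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)          # content + position information
        for block in self.blocks:                          # communicate / compute, repeatedly
            x = block(x)
        logits = self.lm_head(self.ln_f(x))                # next-token logits at every position
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(B * T, -1), targets.view(B * T))
        return logits, loss
```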
- The Transformer block (Block):
  - Built around residual connections.
  - First half (communication): LayerNorm -> causal self-attention.
    - Tokens exchange information. In a decoder, the attention is masked so that the prediction at each position can depend only on earlier tokens, never on future ones.
  - Second half (computation): LayerNorm -> MLP.
    - Each node's feature representation is processed independently (typically a two-layer network with an activation such as GELU). A sketch of such a block follows.
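A sketch of one such block (the pre-norm variant used in GPT-style models; the CausalSelfAttention module is sketched in the next subsection, and the layer sizes are illustrative):

```python
import torch.nn as nn

class Block(nn.Module):
    """One transformer block: communicate (attention), then compute (MLP), both on a residual path."""
    def __init__(self, n_embd, block_size, n_head=4):
        super().__init__()
        self.ln1  = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, block_size, n_head)   # communication phase
        self.ln2  = nn.LayerNorm(n_embd)
        self.mlp  = nn.Sequential(                  # computation phase: per-token, no cross-token mixing
            nn.Linear(n_embd, 4 * n_embd),          # the 4x expansion factor from the original paper
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))   # residual connection around attention
        x = x + self.mlp(self.ln2(x))    # residual connection around the MLP
        return x
```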
- Causal self-attention in detail (see the module sketch below):
  - Compute the Q, K, V matrices from the input x.
  - Compute attention scores: Q @ K.transpose().
  - Masking: set the scores for future positions to negative infinity, so that after the softmax their weights are effectively zero.
  - Softmax normalization yields the attention weights.
  - Weighted aggregation of V: weights @ V.
  - The result passes through a linear projection layer and is added back onto the residual path.
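A sketch of a multi-head causal self-attention module implementing these steps (a reconstruction in the spirit of nanoGPT, with illustrative defaults, rather than its exact source):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, n_embd, block_size, n_head=4):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.qkv  = nn.Linear(n_embd, 3 * n_embd)   # Q, K, V projections in one matmul
        self.proj = nn.Linear(n_embd, n_embd)       # output projection back onto the residual path
        mask = torch.tril(torch.ones(block_size, block_size))
        self.register_buffer("mask", mask.view(1, 1, block_size, block_size))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # reshape to (B, n_head, T, head_dim) so every head attends independently
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))              # scaled scores
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))   # no peeking at the future
        att = F.softmax(att, dim=-1)
        y = att @ v                                                          # weighted sum of values
        y = y.transpose(1, 2).contiguous().view(B, T, C)                     # re-merge the heads
        return self.proj(y)
```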
- Text generation:
  - Start from an initial token (e.g., a newline character).
  - The model predicts the next token, which is appended to the end of the current sequence.
  - Repeat. When the sequence grows beyond block_size, the earliest tokens must be cropped to keep the context within length (as in the sketch below).
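A sketch of that sampling loop (it assumes a model whose forward returns (logits, loss), as in the GPT sketch above; the function name and the multinomial sampling choice are illustrative):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size):
    """Autoregressive sampling: feed the sequence in, sample one more token, append, repeat."""
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]               # crop: the model never sees more than block_size tokens
        logits, _ = model(idx_cond)
        probs = F.softmax(logits[:, -1, :], dim=-1)   # distribution over the next token only
        next_token = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_token], dim=1)     # append and continue
    return idx

# e.g. start from a single "newline" token, here assumed to have id 0:
# generate(model, torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100, block_size=8)
```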
- Implementation differences between Transformer architectures:
  - Encoder-only (e.g., BERT): remove the causal mask from self-attention so that all tokens can communicate with each other. Typically used for NLU tasks, with a different training objective (e.g., masked language modeling).
  - Encoder-decoder (e.g., T5): in addition to self-attention, the decoder block adds a cross-attention layer, in which Q comes from the decoder's own state while K and V come from the encoder's final output (see the sketch below).
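A minimal single-head sketch of such a cross-attention layer (illustrative only; real encoder-decoder models use the multi-head version):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttention(nn.Module):
    """Queries come from the decoder stream; keys and values come from the encoder output."""
    def __init__(self, n_embd):
        super().__init__()
        self.q = nn.Linear(n_embd, n_embd)
        self.k = nn.Linear(n_embd, n_embd)
        self.v = nn.Linear(n_embd, n_embd)

    def forward(self, dec_x, enc_out):
        Q = self.q(dec_x)                  # (B, T_dec, C): what the decoder is looking for
        K = self.k(enc_out)                # (B, T_enc, C): what the encoder tokens contain
        V = self.v(enc_out)
        att = (Q @ K.transpose(-2, -1)) / K.size(-1) ** 0.5
        att = F.softmax(att, dim=-1)       # no causal mask: the whole source sequence is visible
        return att @ V                     # (B, T_dec, C)
```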
7. Extending Transformers Across Domains, and Their Flexibility
- How Transformers are applied across fields (a patchification sketch follows this list):
  - Computer vision (ViT): split the image into small patches and treat each patch as a token fed into the Transformer.
  - Speech recognition (Whisper): slice the mel spectrogram into segments and feed them in as tokens.
  - Reinforcement learning (Decision Transformer): model (state, action, reward) sequences as if they were a language.
  - AlphaFold: its core computational module is also a Transformer.
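A minimal sketch of the ViT-style patchification mentioned above (the function name and the 16x16 patch size are illustrative assumptions):

```python
import torch

def patchify(images: torch.Tensor, patch: int = 16) -> torch.Tensor:
    """Split a batch of images (B, C, H, W) into flattened patches (B, N, C*patch*patch).

    Each flattened patch then plays the role of one token in the transformer's input set.
    """
    B, C, H, W = images.shape
    assert H % patch == 0 and W % patch == 0, "image size must be divisible by the patch size"
    x = images.unfold(2, patch, patch).unfold(3, patch, patch)   # (B, C, H/p, W/p, p, p)
    x = x.permute(0, 2, 3, 1, 4, 5).contiguous()                 # (B, H/p, W/p, C, p, p)
    return x.view(B, -1, C * patch * patch)                      # (B, num_patches, patch_dim)

# Example: a 224x224 RGB image becomes 14*14 = 196 "tokens" of dimension 768.
tokens = patchify(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```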
- The Transformer's flexibility:
  - Using Tesla's self-driving stack as an example, Karpathy notes how easy it is to add extra information (radar, map data, vehicle type, audio) to a Transformer model: simply tokenize it, add it to the input set, and let self-attention learn how to integrate it.
  - This approach "frees you more or less from this burden of Euclidean space", because attention operates on sets rather than on data that must conform to a particular spatial structure.
8. Why Are Transformers So Effective?
- In-context learning / meta-learning:
  - Citing the GPT-3 paper "Language Models are Few-Shot Learners": when more task examples are given in the prompt, model performance improves even without any gradient-descent fine-tuning (a toy prompt-construction example follows this list).
  - This suggests that the Transformer can perform some form of learning or adaptation in its activations during the forward pass.
  - Some research (such as the paper proposing the "raw operator") attempts to explain this phenomenon, arguing that the Transformer may internally implement something resembling gradient descent.
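A toy illustration of how such a few-shot prompt is assembled (the passage and Q/A pairs below are made up; the point is that the "training examples" live entirely in the prompt and no weights are updated):

```python
passage = "The Transformer was introduced in 2017 and replaced recurrence with attention."

examples = [  # the k "shots" given in context
    ("When was the Transformer introduced?", "2017"),
    ("What did it replace recurrence with?", "attention"),
]

prompt = f"Passage: {passage}\n\n"
for question, answer in examples:
    prompt += f"Q: {question}\nA: {answer}\n"
prompt += "Q: What architecture is the passage about?\nA:"   # the query the model completes

print(prompt)
# Any adaptation happens in the model's activations while it reads this prompt,
# not in its weights.
```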
- Karpathy's three key advantages of the Transformer (from his tweets):
  - Expressive: the forward pass can implement complex functions, even meta-learning.
  - Optimizable: thanks to residual connections, layer normalization, and similar design choices, gradients propagate effectively.
  - Extremely efficient: the shallow, wide computation graph is well suited to GPU parallelism. Karpathy believes the Transformer was deliberately designed with GPU efficiency in mind.
- Karpathy's analogy for what the Transformer fundamentally is:
  - A "general-purpose, efficient, optimizable computer".
  - "GPT is a general-purpose computer, reconfigurable at runtime via natural-language programs (i.e., prompts), which it 'runs' by completing the document."
9. Selected Q&A and Discussion
- RNNs vs. Transformers on optimizability: RNNs are expressive in principle, but their compute graph is deep and narrow, hard to optimize (vanishing/exploding gradients), and serial, hence inefficient. Transformers are shallow and wide, with short gradient paths aided by residual connections, making them easy to optimize and to parallelize.
- Handling multimodal data: tokenize every modality (image patches, text, radar signals, etc.), give each a special type or positional embedding so the sources can be told apart, and feed them all into the Transformer's input set together.
- Inductive bias: Transformers carry very little inductive bias, which is an advantage when data is plentiful. With little data, models with stronger inductive biases (such as convolutional networks) may do better. More structure can be injected into a Transformer by modifying the attention connectivity (e.g., local attention) or by designing specific encodings.
- Long context and memory:
  - Current models are limited by a fixed-size context window (block_size).
  - One potential solution is to keep the context length fixed but let the network use a "scratchpad": teach the model, via examples in the prompt, to wrap information it wants to keep between special "start scratchpad" and "end scratchpad" markers that are written to and read from external storage, extending its effective memory. This is analogous to a human learning to use a notebook.
- Karpathy's current work: focused on the NanoGPT project, attempting to reproduce GPT. He is very interested in incremental, product-style improvements to something like ChatGPT.
Summary of Key Points
Andrej Karpathy's lecture emphasizes that the Transformer's success is no accident: its architectural design strikes a careful balance between expressiveness, optimizability, and computational efficiency. The core self-attention mechanism (including the crucial scaling trick) gives the model strong abilities to understand context and integrate information, letting it learn complex patterns from large-scale data and even exhibit a form of meta-learning through in-context learning. The Transformer has become a nearly universal architecture in AI, and its potential is far from exhausted; it will continue to evolve toward handling longer sequences, better memory and controllability, and closer approximations of human intelligence.