2024-04-12 | 3Blue1Brown | Transformers (how LLMs work) explained visually
This transcript explains the core technology behind large language models such as GPT: the Transformer architecture. GPT stands for "Generative Pretrained Transformer," meaning the model generates new content, is pretrained on large amounts of data, and uses a Transformer as its key neural network structure.
The Transformer's core function is to predict the next "token" in a sequence (a token is usually a word or a piece of a word). By repeatedly taking in a piece of text, predicting the most likely next token, sampling one token from the resulting probability distribution, appending it to the end of the text, and repeating the process, the model can generate long, coherent passages.
Its internal workflow is roughly as follows:
1. Input processing and tokenization: the input text is broken into tokens.
2. Word embedding: each token is converted into a numeric vector (an embedding) intended to encode that token's meaning.
3. Attention block: the sequence of token vectors passes through an attention block, where the vectors of different tokens "talk" to one another, passing information back and forth and updating their representations. This lets the model capture what a word means in its specific context (for example, "model" means different things in "machine learning model" and "fashion model").
4. Multilayer perceptron (feed-forward layer): after the attention block, the vectors pass through a multilayer perceptron in parallel for further nonlinear transformation, with each vector processed independently.
5. Repetition and output: the attention/MLP combination can be stacked many times. Finally, based on the processed result of the last token, the model produces a probability distribution over all possible next tokens.
Beyond text generation, the Transformer architecture is also widely used for machine translation (its original use case), text-to-image generation (e.g., DALL-E, Midjourney), speech recognition, and speech synthesis. To build a chatbot, one typically sets a system prompt (e.g., defining the role of an AI assistant), uses the user's input as the start of the dialogue, and has the model predict and generate the AI assistant's reply.
The explanation places the Transformer in the broader context of machine learning, emphasizing that it does not perform tasks through explicitly programmed rules. Instead, a flexible structure with a huge number of tunable parameters (GPT-3 has 175 billion) is trained on massive amounts of data to learn patterns and behavior.
Tags
Media details
- Upload date
- 2025-05-14 10:31
- Source
- https://www.youtube.com/watch?v=wjZofJX0v4M
- Processing status
- Completed
- Transcription status
- Completed
- Latest LLM Model
- gemini-2.5-pro-preview-06-05
Transcript
Speaker 1: The initials GPT stand for Generative Pretrained Transformer. So that first word is straightforward enough. These are bots that generate new text. Pretrained refers to how the model went through a process of learning from a massive amount of data, and the prefix insinuates that there's more room to fine-tune it on specific tasks with additional training. But the last word, that's the real key piece. A transformer is a specific kind of neural network, a machine learning model, and it's the core invention underlying the current boom in AI. What I want to do with this video and the following chapters is go through a visually driven explanation for what actually happens inside a transformer. We're going to follow the data that flows through it and go step by step. There are many different kinds of models that you can build using transformers. Some models take in audio and produce a transcript. This sentence comes from a model going the other way around, producing synthetic speech just from text. All those tools that took the world by storm in 2022, like DALL-E and Midjourney, that take in a text description and produce an image, are based on transformers. Even if I can't quite get it to understand what a pi creature is supposed to be, I'm still blown away that this kind of thing is even remotely possible. And the original transformer, introduced in 2017 by Google, was invented for the specific use case of translating text from one language into another. But the variant that you and I will focus on, which is the type that underlies tools like ChatGPT, will be a model that's trained to take in a piece of text, maybe even with some surrounding images or sound accompanying it, and produce a prediction for what comes next in the passage. That prediction takes the form of a probability distribution over many different chunks of text that might follow. At first glance, you might think that predicting the next word feels like a very different goal from generating new text. But once you have a prediction model like this, a simple thing you could try to make it generate a longer piece of text is to give it an initial snippet to work with, have it take a random sample from the distribution it just generated, append that sample to the text, and then run the whole process again to make a new prediction based on all the new text, including what it just added. I don't know about you, but it really doesn't feel like this should actually work. In this animation, for example, I'm running GPT-2 on my laptop and having it repeatedly predict and sample the next chunk of text to generate a story based on the seed text, and the story just doesn't actually really make that much sense. But if I swap it out for API calls to GPT-3 instead, which is the same basic model, just much bigger, suddenly, almost magically, we do get a sensible story, one that even seems to infer that a pi creature would live in a land of math and computation. This process here of repeated prediction and sampling is essentially what's happening when you interact with ChatGPT or any of these other large language models and you see them producing one word at a time. In fact, one feature that I would very much enjoy is the ability to see the underlying distribution for each new word that it chooses. Let's kick things off with a very high-level preview of how data flows through a transformer. We will spend much more time motivating and interpreting and expanding on the details of each step.
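Before moving on to that preview, here is a minimal Python sketch of the repeated predict-sample-append loop just described. The lookup table standing in for the model and the helper names are made up for illustration; a real system would call a trained network (such as GPT-2) at the prediction step.

```python
import random

# Toy stand-in for a trained model: a hand-written table mapping the last token
# to a probability distribution over possible next tokens.
TOY_MODEL = {
    "once": {"upon": 0.9, "more": 0.1},
    "upon": {"a": 1.0},
    "a": {"time": 0.7, "pi": 0.3},
    "time": {"there": 1.0},
    "there": {"was": 1.0},
    "was": {"a": 1.0},
    "pi": {"creature": 1.0},
    "creature": {"appeared": 1.0},
    "appeared": {".": 1.0},
    ".": {"the": 1.0},
    "the": {"end": 1.0},
    "end": {".": 1.0},
}

def predict_next_token_probs(tokens):
    """Return a distribution over the next token given the text so far."""
    return TOY_MODEL.get(tokens[-1], {".": 1.0})

def generate(seed_tokens, num_steps):
    tokens = list(seed_tokens)
    for _ in range(num_steps):
        probs = predict_next_token_probs(tokens)                  # 1. predict a distribution
        choices, weights = zip(*probs.items())
        tokens.append(random.choices(choices, weights=weights)[0])  # 2. sample, 3. append
    return tokens                                                  # 4. repeat until done

print(" ".join(generate(["once"], 10)))
```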
But in broad strokes, when one of these chatbots generates a given word, here's what's going on under the hood. First, the input is broken up into a bunch of little pieces. These pieces are called tokens, and in the case of text, these tend to be words, or little pieces of words, or other common character combinations. If images or sound are involved, then tokens could be little patches of that image or little chunks of that sound. Each one of these tokens is then associated with a vector, meaning some list of numbers, which is meant to somehow encode the meaning of that piece. If you think of these vectors as giving coordinates in some very high-dimensional space, words with similar meanings tend to land on vectors that are close to each other in that space. This sequence of vectors then passes through an operation that's known as an attention block, and this allows the vectors to talk to each other and pass information back and forth to update their values. For example, the meaning of the word model in the phrase a machine learning model is different from its meaning in the phrase a fashion model. The attention block is what's responsible for figuring out which words in the context are relevant to updating the meanings of which other words, and how exactly those meanings should be updated. And again, whenever I use the word meaning, this is somehow entirely encoded in the entries of those vectors. After that, these vectors pass through a different kind of operation, and depending on the source that you're reading, this will be referred to as a multilayer perceptron, or maybe a feed-forward layer. And here, the vectors don't talk to each other. They all go through the same operation in parallel. And while this block is a little bit harder to interpret, later on we'll talk about how the step is a little bit like asking a long list of questions about each vector and then updating them based on the answers to those questions. All of the operations in both of these blocks look like a giant pile of matrix multiplications, and our primary job is going to be to understand how to read the underlying matrices. I'm glossing over some details about some normalization steps that happen in between, but this is, after all, a high-level preview. After that, the process essentially repeats. You go back and forth between attention blocks and multilayer perceptron blocks until at the very end, the hope is that all of the essential meaning of the passage has somehow been baked into the very last vector in the sequence. We then perform a certain operation on that last vector that produces a probability distribution over all possible tokens, all possible little chunks of text that might come next. And like I said, once you have a tool that predicts what comes next, given a snippet of text, you can feed it a little bit of seed text and have it repeatedly play this game of predicting what comes next, sampling from the distribution, appending it, and then repeating over and over. Some of you in the know may remember how, long before ChatGPT came onto the scene, this is what early demos of GPT-3 looked like. You would have it autocomplete stories and essays based on an initial snippet. To make a tool like this into a chatbot, the easiest starting point is to have a little bit of text that establishes the setting of a user interacting with a helpful AI assistant, what you would call the system prompt.
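Stepping back to the data-flow preview for a moment: here is a toy, numbers-only sketch of that same pipeline (embed, one attention block, one MLP block, unembed, softmax), written with numpy, made-up small dimensions, and random weights. It is not the actual GPT architecture; real models stack many such layers and add multiple attention heads, positional information, and the normalization steps the speaker mentions glossing over.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model, n_tokens = 1000, 64, 8      # toy sizes (GPT-3: 50,257, 12,288, up to 2048)

# Weights (would be learned during training; random here)
W_E = rng.normal(size=(d_model, vocab_size)) * 0.02    # embedding matrix: one column per token
W_Q = rng.normal(size=(d_model, d_model)) * 0.02
W_K = rng.normal(size=(d_model, d_model)) * 0.02
W_V = rng.normal(size=(d_model, d_model)) * 0.02
W_up = rng.normal(size=(4 * d_model, d_model)) * 0.02  # MLP weights
W_down = rng.normal(size=(d_model, 4 * d_model)) * 0.02
W_U = rng.normal(size=(vocab_size, d_model)) * 0.02    # unembedding matrix

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

token_ids = rng.integers(0, vocab_size, size=n_tokens)  # stand-in for a tokenized input
X = W_E[:, token_ids].T                                  # (n_tokens, d_model): one vector per token

# Attention block (single head, no masking): the vectors exchange information
Q, K, V = X @ W_Q.T, X @ W_K.T, X @ W_V.T
A = softmax(Q @ K.T / np.sqrt(d_model), axis=-1)         # how much each token attends to each other
X = X + A @ V

# MLP / feed-forward block: each vector is processed independently, in parallel
X = X + np.maximum(0, X @ W_up.T) @ W_down.T

# ...in a real model the attention/MLP pair repeats many times...

logits = X[-1] @ W_U.T                                   # map the last vector to one score per token
probs = softmax(logits)                                  # probability distribution over the next token
print(probs.shape, probs.sum())                          # (1000,) and a sum of ~1.0
```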
And then you would use the user's initial question or prompt as the first bit of dialogue, and then you have it start predicting what such a helpful AI assistant would say in response. There is more to say about an added step of training that's required to make this work well, but at a high level, this is the general idea. In this chapter, you and I are going to expand on the details of what happens at the very beginning of the network and at the very end of the network. And I also wanna spend a lot of time reviewing some important bits of background knowledge, things that would have been second nature to any machine learning engineer by the time transformers came around. If you're comfortable with that background knowledge and a little impatient, you could probably feel free to skip to the next chapter, which is going to focus on the attention blocks, generally considered the heart of the transformer. After that, I want to talk more about these multilayer perceptron blocks, how training works, and a number of other details that will have been skipped up to that point. For broader context, these videos are additions to a mini-series about deep learning, and it's okay if you haven't watched the previous ones. I think you can do it out of order. But before diving into transformers specifically, I do think it's worth making sure that we're on the same page about the basic premise and structure of deep learning. At the risk of stating the obvious, this is one approach to machine learning, which describes any model where you're using data to somehow determine how a model behaves. What I mean by that is, let's say you want a function that takes in an image and produces a label describing it, or our example of predicting the next word given a passage of text, or any other task that seems to require some element of intuition and pattern recognition. We almost take this for granted these days, but the idea with machine learning is that rather than trying to explicitly define a procedure for how to do that task in code, which is what people would have done in the earliest days of AI, instead you set up a very flexible structure with tunable parameters, like a bunch of knobs and dials, and then somehow you use many examples of what the output should look like for a given input to tweak and tune the values of those parameters to mimic this behavior. For example, maybe the simplest form of machine learning is linear regression, where your inputs and your outputs are each single numbers, something like the square footage of a house and its price, and what you want is to find a line of best fit through this data, you know, to predict future house prices. That line is described by two continuous parameters, say the slope and the y-intercept. The goal of linear regression is to determine those parameters to closely match the data. Needless to say, deep learning models get much more complicated. GPT-3, for example, has not two, but 175 billion parameters. But here's the thing. It's not a given that you can create some giant model with a huge number of parameters without it either grossly overfitting the training data or being completely intractable to train. Deep learning describes a class of models that in the last couple decades have proven to scale remarkably well. What unifies them is that they all use the same training algorithm. It's called backpropagation. We talked about it in previous chapters.
And the context that I want you to have as we go in is that in order for this training algorithm to work well at scale, these models have to follow a certain specific format. And if you know this format going in, it helps to explain many of the choices for how a transformer processes language, which otherwise run the risk of feeling kind of arbitrary. First, whatever kind of model you're making, the input has to be formatted as an array of real numbers. This could simply mean a list of numbers. It could be a two-dimensional array, or very often you deal with higher-dimensional arrays, where the general term used is tensor. You often think of that input data as being progressively transformed into many distinct layers, where again, each layer is always structured as some kind of array of real numbers, until you get to a final layer, which you consider the output. For example, the final layer in our text processing model is a list of numbers representing the probability distribution for all possible next tokens. In deep learning, these model parameters are almost always referred to as weights, and this is because a key feature of these models is that the only way these parameters interact with the data being processed is through weighted sums. You also sprinkle some nonlinear functions throughout, but they won't depend on parameters. Typically though, instead of seeing the weighted sums all naked and written out explicitly like this, you'll instead find them packaged together as various components in a matrix-vector product. It amounts to saying the same thing. If you think back to how matrix-vector multiplication works, each component in the output looks like a weighted sum. It's just often conceptually cleaner for you and me to think about matrices that are filled with tunable parameters that transform vectors that are drawn from the data being processed. For example, those 175 billion weights in GPT-3 are organized into just under 28,000 distinct matrices. Those matrices, in turn, fall into eight different categories, and what you and I are going to do is step through each one of those categories to understand what that type does. As we go through, I think it's kind of fun to reference the specific numbers from GPT-3 to count up exactly where those 175 billion come from. Even if nowadays there are bigger and better models, this one has a certain charm as the first large language model to really capture the world's attention outside of ML communities. Also, practically speaking, companies tend to keep much tighter lips around the specific numbers for more modern networks. I just want to set the scene going in that as you peek under the hood to see what happens inside a tool like ChatGPT, almost all of the actual computation looks like matrix-vector multiplication. There's a little bit of a risk of getting lost in the sea of billions of numbers, but you should draw a very sharp distinction in your mind between the weights of the model, which I'll always color in blue or red, and the data being processed, which I'll always color in gray. The weights are the actual brains. They are the things learned during training, and they determine how it behaves. The data being processed simply encodes whatever specific input is fed into the model for a given run, like an example snippet of text. With all of that as foundation, let's dig into the first step of this text processing example, which is to break up the input into little chunks and turn those chunks into vectors.
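As a quick sanity check of the claim about weighted sums, here is a tiny numpy example with made-up numbers showing that each component of a matrix-vector product is exactly a weighted sum.

```python
import numpy as np

W = np.array([[2.0, -1.0, 0.5],      # a matrix of (would-be learned) weights
              [0.0,  3.0, 1.0]])
x = np.array([4.0, 2.0, -2.0])       # a vector drawn from the data being processed

# Each output component is a weighted sum of the input components...
row0 = 2.0 * 4.0 + (-1.0) * 2.0 + 0.5 * (-2.0)
row1 = 0.0 * 4.0 + 3.0 * 2.0 + 1.0 * (-2.0)

# ...which is exactly what the matrix-vector product packages together.
print(W @ x)          # [5. 4.]
print([row0, row1])   # [5.0, 4.0]
```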
I mentioned how those chunks are called tokens, which might be pieces of words or punctuation. But every now and then, in this chapter, and especially in the next one, I'd like to just pretend that it's broken more cleanly into words, because we humans think in words. This will just make it much easier to reference little examples and clarify each step. The model has a predefined vocabulary, some list of all possible words, say 50,000 of them. And the first matrix that we'll encounter, known as the embedding matrix, has a single column for each one of these words. These columns are what determines what vector each word turns into in that first step. We label it WE, and like all the matrices we'll see, its values begin random, but they're going to be learned based on data. Turning words into vectors was common practice in machine learning long before transformers, but it's a little weird if you've never seen it before, and it sets the foundation for everything that follows. So let's take a moment to get familiar with it. We often call this embedding, a word which invites you to think of these vectors very geometrically as points in some high-dimensional space. Visualizing a list of three numbers as coordinates for points in 3D space would be no problem, but word embeddings tend to be much, much higher dimensional. In GPT-3, they have 12,288 dimensions. And as you'll see, it matters to work in a space that has a lot of distinct directions. In the same way that you could take a two-dimensional slice through a 3D space and project all the points onto that slice, for the sake of animating word embeddings that a simple model is giving me, I'm gonna do an analogous thing by choosing a three-dimensional slice through this very high-dimensional space, projecting the word vectors down onto that, and displaying the results. The big idea here is that as a model tweaks and tunes its weights to determine how exactly words get embedded as vectors during training, it tends to settle on a set of embeddings where directions in the space have a kind of semantic meaning. For the simple word-to-vector model I'm running here, if I run a search for all the words whose embeddings are closest to that of tower, you'll notice how they all seem to give very similar towerish vibes. And if you want to pull up some Python and play along at home, this is the specific model that I'm using to make the animations. It's not a transformer, but it's enough to illustrate the idea that directions in the space can carry semantic meaning. A very classic example of this is how if you take the difference between the vectors for woman and man, something you would visualize as a little vector in the space connecting the tip of one to the tip of the other, it's very similar to the difference between king and queen. So let's say you didn't know the word for a female monarch. You could find it by taking king, adding this woman-minus-man direction, and searching for the embeddings closest to that point. At least, kind of. Despite this being a classic example, for the model I'm playing with, the embedding of queen is actually a little farther off than this would suggest, presumably because the way that queen is used in training data is not merely a feminine version of king. When I played around, family relations seemed to illustrate the idea much better. The point is, it looks like during training, the model found it advantageous to choose embeddings such that one direction in this space encodes gender information.
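A sketch of that king + (woman − man) search, using a handful of hand-made, low-dimensional vectors rather than real learned embeddings, just to make the arithmetic concrete.

```python
import numpy as np

# Made-up 4-dimensional "embeddings", chosen by hand so that one coordinate
# loosely plays the role of a gender direction. Real embeddings are learned.
E = {
    "king":  np.array([ 0.9,  0.8,  0.1, 0.0]),
    "queen": np.array([ 0.9,  0.8, -0.8, 0.0]),
    "man":   np.array([ 0.1,  0.0,  0.9, 0.1]),
    "woman": np.array([ 0.1,  0.0, -0.9, 0.1]),
    "tower": np.array([-0.5,  0.3,  0.0, 0.9]),
}

def nearest(vec, exclude=()):
    """Return the vocabulary word whose embedding points most nearly the same way."""
    cos = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return max((w for w in E if w not in exclude), key=lambda w: cos(E[w], vec))

target = E["king"] + (E["woman"] - E["man"])    # add the woman-minus-man direction
print(nearest(target, exclude={"king"}))         # -> "queen" (in this toy setup)
```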
Another example is that if you take the embedding of Italy and you subtract the embedding of Germany, and then you add that to the embedding of Hitler, you get something very close to the embedding of Mussolini. It's as if the model learned to associate some directions with Italianness and others with World War Two Axis leaders. Maybe my favorite example in this vein is how in some models, if you take the difference between Germany and Japan and you add it to sushi, you end up very close to bratwurst. Also, in playing this game of finding nearest neighbors, I was very pleased to see how close cat was to both beast and monster. One bit of mathematical intuition that's helpful to have in mind, especially for the next chapter, is how the dot product of two vectors can be thought of as a way to measure how well they align. Computationally, dot products involve multiplying all the corresponding components and then adding the results, which is good, since so much of our computation has to look like weighted sums. Geometrically, the dot product is positive when vectors point in similar directions, it's zero if they're perpendicular, and it's negative whenever they point in opposite directions. For example, let's say you were playing with this model and you hypothesize that the embedding of cats minus cat might represent a sort of plurality direction in this space. To test this, I'm gonna take this vector and compute its dot product against the embeddings of certain singular nouns and compare it to the dot products with the corresponding plural nouns. If you play around with this, you'll notice that the plural ones do indeed seem to consistently give higher values than the singular ones, indicating that they align more with this direction. It's also fun how if you take the dot product with the embeddings of the words one, two, three, and so on, they give increasing values, so it's as if we can quantitatively measure how plural the model finds a given word. Again, the specifics for how words get embedded are learned using data. This embedding matrix, whose columns tell us what happens to each word, is the first pile of weights in our model. And using the GPT-3 numbers, the vocabulary size specifically is 50,257, and again, technically this consists not of words per se, but of tokens. And the embedding dimension is 12,288. Multiplying those tells us this consists of about 617 million weights. Let's go ahead and add this to a running tally, remembering that by the end we should count up to 175 billion. In the case of transformers, you really want to think of the vectors in this embedding space as not merely representing individual words. For one thing, they also encode information about the position of that word, which we'll talk about later. But more importantly, you should think of them as having the capacity to soak in context. A vector that started its life as the embedding of the word king, for example, might progressively get tugged and pulled by various blocks in this network, so that by the end it points in a much more specific and nuanced direction that somehow encodes that it was a king who lived in Scotland, and who had achieved his post after murdering the previous king, and who's being described in Shakespearean language. Think about your own understanding of a given word. The meaning of that word is clearly informed by the surroundings, and sometimes this includes context from a long distance away.
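The same kind of probe, sketched with hand-made toy vectors: dot the hypothesized plurality direction against singular and plural words and compare. The embeddings here are invented for illustration; only the parameter arithmetic in the final comment comes from the GPT-3 numbers quoted above.

```python
import numpy as np

# Toy embeddings, hand-made so the last coordinate loosely tracks "how plural" a word is.
# In the real model this direction would have to be discovered, not designed.
E = {
    "cat":  np.array([0.7, 0.2, -0.5]),
    "cats": np.array([0.7, 0.2,  0.6]),
    "dog":  np.array([0.6, 0.3, -0.4]),
    "dogs": np.array([0.6, 0.3,  0.7]),
    "one":  np.array([0.1, 0.8, -0.6]),
    "four": np.array([0.1, 0.8,  0.5]),
}

plurality = E["cats"] - E["cat"]          # hypothesized "plurality direction"

for word in ["cat", "cats", "dog", "dogs", "one", "four"]:
    # plural-ish words give larger (more positive) dot products with this direction
    print(word, round(float(E[word] @ plurality), 2))

# The embedding matrix itself, by GPT-3's numbers:
#   50,257 tokens x 12,288 dimensions = 617,558,016 weights (~617 million).
```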
So in putting together a model that has the ability to predict what word comes next, the goal is to somehow empower it to incorporate context efficiently. To be clear, in that very first step, when you create the array of vectors based on the input text, each one of those is simply plucked out of the embedding matrix, so initially each one can only encode the meaning of a single word without any input from its surroundings. But you should think of the primary goal of this network that it flows through as being to enable each one of those vectors to soak up a meaning that's much more rich and specific than what mere individual words could represent. The network can only process a fixed number of vectors at a time, known as its context size. For GPT-3, it was trained with a context size of 2048, so the data flowing through the network always looks like this array of 2,048 columns, each of which has 12,000 dimensions. This context size limits how much text the transformer can incorporate when it's making a prediction of the next word. This is why long conversations with certain chatbots, like the early versions of ChatGPT, often gave the feeling of the bot kind of losing the thread of the conversation as you continued too long. We'll go into the details of attention in due time, but skipping ahead, I want to talk for a minute about what happens at the very end. Remember, the desired output is a probability distribution over all tokens that might come next. For example, if the very last word is professor, and the context includes words like Harry Potter, and immediately preceding we see least favorite teacher, and also if you give me some leeway by letting me pretend that tokens simply look like full words, then a well-trained network that had built up knowledge of Harry Potter would presumably assign a high number to the word Snape. This involves two different steps. The first one is to use another matrix that maps the very last vector in that context to a list of 50,000 values, one for each token in the vocabulary. Then there's a function that normalizes this into a probability distribution. It's called softmax, and we'll talk more about it in just a second. But before that, it might seem a little bit weird to only use this last embedding to make a prediction, when, after all, in that last step, there are thousands of other vectors in the layer just sitting there with their own context-rich meanings. This has to do with the fact that in the training process, it turns out to be much more efficient if you use each one of those vectors in the final layer to simultaneously make a prediction for what would come immediately after it. There's a lot more to be said about training later on, but I just wanna call that out right now. This matrix is called the unembedding matrix, and we give it the label WU. Again, like all the weight matrices we see, its entries begin at random, but they are learned during the training process. Keeping score on our total parameter count, this unembedding matrix has one row for each word in the vocabulary, and each row has the same number of elements as the embedding dimension. It's very similar to the embedding matrix, just with the order swapped, so it adds another 617 million parameters to the network, meaning our count so far is a little over a billion, a small but not wholly insignificant fraction of the 175 billion that we'll end up with in total.
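Keeping that running tally explicit, using the GPT-3 numbers quoted above:

$$
\underbrace{50{,}257 \times 12{,}288}_{W_E} \;+\; \underbrace{50{,}257 \times 12{,}288}_{W_U}
\;=\; 617{,}558{,}016 + 617{,}558{,}016 \;\approx\; 1.24 \text{ billion}
$$

out of the roughly 175 billion total.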
As the very last mini-lesson for this chapter, I want to talk more about the softmax function, since it makes another appearance for us once we dive into the attention blocks. The idea is that if you want a sequence of numbers to act as a probability distribution, say a distribution over all possible next words, then each value has to be between zero and one, and you also need all of them to add up to one. However, if you're playing the deep learning game where everything you do looks like matrix-vector multiplication, the outputs that you get by default don't abide by this at all. The values are often negative, or much bigger than one, and they almost certainly don't add up to one. Softmax is a standard way to turn an arbitrary list of numbers into a valid distribution in such a way that the largest values end up closest to one, and the smaller values end up very close to zero. That's all you really need to know. But if you're curious, the way that it works is to first raise e to the power of each of the numbers, which means you now have a list of positive values, and then you can take the sum of all those positive values and divide each term by that sum, which normalizes it into a list that adds up to one. You'll notice that if one of the numbers in the input is meaningfully bigger than the rest, then in the output, the corresponding term dominates the distribution, so if you were sampling from it, you'd almost certainly just be picking the maximizing input. But it's softer than just picking the max, in the sense that when other values are similarly large, they also get meaningful weight in the distribution, and everything changes continuously as you continuously vary the inputs. In some situations, like when ChatGPT is using this distribution to create the next word, there's room for a little bit of extra fun by adding a little extra spice into this function, with a constant T thrown into the denominator of those exponents. We call it the temperature, since it vaguely resembles the role of temperature in certain thermodynamics equations. And the effect is that when T is larger, you give more weight to the lower values, meaning the distribution is a little bit more uniform. And if T is smaller, then the bigger values will dominate more aggressively, where in the extreme, setting T equal to zero means all of the weight goes to that maximum value. For example, I'll have GPT-3 generate a story with the seed text "Once upon a time there was A", but I'm going to use different temperatures in each case. Temperature zero means that it always goes with the most predictable word, and what you get ends up being kind of a trite derivative of Goldilocks. A higher temperature gives it a chance to choose less likely words, but it comes with a risk. In this case, the story starts out a bit more originally, about a young web artist from South Korea, but it quickly degenerates into nonsense. Technically speaking, the API doesn't actually let you pick a temperature bigger than two. There is no mathematical reason for this. It's just an arbitrary constraint imposed, I suppose, to keep their tool from being seen generating things that are too nonsensical.
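A small numpy sketch of softmax with the temperature constant described above, applied to made-up logits; T = 0 is handled as a pure argmax, matching the limiting behavior the speaker describes.

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Turn an arbitrary list of numbers into a probability distribution.
    Larger T flattens the distribution; T -> 0 puts all weight on the max."""
    logits = np.asarray(logits, dtype=float)
    if T == 0:                                   # limiting case: pick the max outright
        out = np.zeros_like(logits)
        out[np.argmax(logits)] = 1.0
        return out
    z = logits / T
    e = np.exp(z - z.max())                      # subtract the max for numerical stability
    return e / e.sum()

logits = [2.0, 1.0, 0.0, -1.0]                   # made-up unnormalized scores
for T in [0.0, 0.5, 1.0, 2.0]:
    print(T, np.round(softmax_with_temperature(logits, T), 3))
# As T grows, probability shifts from the largest value toward the smaller ones.
```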
So if you're curious, the way this animation is actually working is I'm taking the 20 most probable next tokens that GPT-3 generates, which seems to be the maximum they'll give me, and then I tweak the probabilities based on an exponent of 1/5. As another bit of jargon, in the same way that you might call the components of the output of this function probabilities, people often refer to the inputs as logits. Some people pronounce it one way, some the other; I'm going to say logits. So for instance, when you feed in some text, you have all these word embeddings flow through the network, and you do this final multiplication with the unembedding matrix, machine learning people would refer to the components in that raw, unnormalized output as the logits for the next word prediction. A lot of the goal with this chapter was to lay the foundations for understanding the attention mechanism, Karate Kid wax-on-wax-off style. You see, if you have a strong intuition for word embeddings, for softmax, for how dot products measure similarity, and also the underlying premise that most of the calculations have to look like matrix multiplication with matrices full of tunable parameters, then understanding the attention mechanism, this cornerstone piece in the whole modern boom in AI, should be relatively smooth. For that, come join me in the next chapter. As I'm publishing this, a draft of that next chapter is available for review by Patreon supporters. A final version should be up in public in a week or two. It usually depends on how much I end up changing based on that review. In the meantime, if you wanna dive into attention, and if you wanna help the channel out a little bit, it's there waiting.
Latest Summary (Detailed Summary)
Title: 2024-04-12 | 3Blue1Brown | Transformers (how LLMs work) explained visually
Description: Published April 5, 2024. This video gives a detailed visual explanation of how the Transformer architecture, the core technology behind large language models such as GPT, works. The video is funded directly by viewers: https://3b1b.co/support
Subtitle: This transcript explains the core technology behind large language models such as GPT: the Transformer architecture. GPT stands for "Generative Pretrained Transformer," meaning the model generates new content, is pretrained on large amounts of data, and uses a Transformer as its key neural network structure.
Overview / Executive Summary
The video aims to explain, step by step and visually, how the Transformer, the core component of large language models (LLMs), works. The heart of a GPT (Generative Pre-trained Transformer) model is its "Transformer" architecture, a particular kind of neural network that underlies the current AI boom. The video focuses on ChatGPT-like models, which operate by predicting the next "token" in a text sequence and generate coherent text by repeating that process.
The basic flow of data through a Transformer is: the input text is first broken into tokens; each token is then converted by the embedding matrix into a high-dimensional vector intended to encode the token's initial meaning, with similar meanings landing near each other in the vector space. The sequence of vectors then passes through an attention block, which lets the vectors "talk" to one another and update their meanings based on context. The vectors next enter a multilayer perceptron (MLP), also called a feed-forward layer, for parallel processing. This attention-plus-MLP combination is repeated many times. Finally, the last vector in the sequence (or, during training, every vector) is used, via the unembedding matrix and the softmax function, to produce a probability distribution over the possible next token. By sampling from that distribution and appending the result to the existing text, the model keeps generating content. The video also cites concrete numbers for GPT-3, including its 175 billion parameters, embedding dimension (12,288), vocabulary size (about 50,000), and context window (2,048 tokens), and explains how the "temperature" parameter in the softmax affects the diversity of the generated text. The main purpose of this chapter is to lay the groundwork for understanding the core attention mechanism.
Introduction: The Core Ideas Behind Transformers and GPT
Speaker 1 explains what GPT (Generative Pre-trained Transformer) means, emphasizing that the Transformer, a specific kind of neural network, is the core invention behind the current AI boom.
* Generative: the model can generate new text.
* Pre-trained: the model acquires its initial capabilities by learning from massive amounts of data, and the "pre-" prefix also implies the potential to be fine-tuned for specific tasks with additional training.
* Transformer: a specific kind of neural network / machine learning model, and the foundation of LLMs.
The goal of the video is "to walk through a visually driven explanation of what actually happens inside a transformer, following the data that flows through it step by step."
Transformers are used in a wide range of applications, including:
* Audio transcription (speech to text)
* Synthetic speech (text to speech)
* Text-to-image generation (e.g., DALL-E, Midjourney)
* Language translation (the Transformer's original use case)
This video focuses on ChatGPT-like models, whose core function is to "take in a piece of text... and produce a prediction for what comes next."
Core Function: Generating Text by Predicting the Next Token
Speaker 1 explains how the model generates longer text by predicting the next token.
* Prediction mechanism: the model outputs a "probability distribution" over the chunks of text (tokens) that might come next.
* Text generation process:
1. Start with an initial snippet of text.
2. The model predicts a probability distribution for the next token.
3. Sample a token at random from that distribution.
4. Append the sampled token to the existing text.
5. Repeat the whole process on the updated text.
* Why model size matters: comparing GPT-2 running locally with GPT-3 accessed through API calls, Speaker 1 notes that although the basic model is the same, GPT-3, being "much bigger," can "almost magically" produce a more coherent, sensible story. For example, GPT-3 can infer that a pi creature would live in a land of math and computation.
* Real-time interaction: when users interact with large language models such as ChatGPT, the word-by-word generation they see is essentially this process of "repeated prediction and sampling."
High-Level Overview of Data Flow Inside a Transformer
Speaker 1 outlines the main steps that happen under the hood when a chatbot generates a word:
1. Breaking up the input (tokenization):
* The input is broken into small pieces called "tokens."
* Text tokens: typically words, pieces of words, or other common character combinations.
* Image/sound tokens: may be small patches of an image or little chunks of sound.
2. Turning tokens into vectors (embedding):
* Each token is associated with a "vector" (a list of numbers) meant to encode the meaning of that piece.
* "Words with similar meanings tend to land on vectors that are close to each other in that space."
3. Attention block:
* The sequence of vectors passes through this block, which allows them to "talk to each other and pass information back and forth to update their values."
* Role: figuring out which words in the context are relevant to updating the meanings of which other words, and how those meanings should be updated. For example, "model" means something different in "a machine learning model" than in "a fashion model."
* A word's "meaning" is entirely encoded in the entries of its vector.
4. Multilayer perceptron (MLP) / feed-forward layer:
* The vectors pass through this block, but here they "don't talk to each other"; they all go through the same operation in parallel.
* It can be thought of as "asking a long list of questions about each vector and then updating it based on the answers to those questions."
5. Repetition: the attention and MLP blocks repeat "back and forth" many times (the speaker notes that some normalization steps between blocks are glossed over here).
6. Producing the final output:
* The hope is that "all of the essential meaning of the passage has somehow been baked into the very last vector in the sequence."
* A certain operation on that last vector produces a "probability distribution" over all possible next tokens.
* Long text is generated by looping through prediction, sampling, and appending.
Speaker 1 mentions that a simple starting point for turning such a model into a chatbot is to use a "system prompt" that establishes the role of an AI assistant, then use the user's initial question as the opening of the dialogue and have the model predict the AI assistant's response.
Background for Understanding Transformers: Deep Learning Basics
Speaker 1 stresses that before diving into the details of Transformers, it is essential to understand the basic premise and structure of deep learning.
* The core idea of machine learning: use data to determine how a model behaves rather than explicitly programming a procedure for the task. The model has "tunable parameters," which are adjusted using many example inputs and outputs so that it mimics the desired behavior.
* For example, linear regression fits data by adjusting two parameters, the slope and the intercept.
* GPT-3 has "175 billion parameters."
* What characterizes deep learning:
* It scales remarkably well: even with a huge number of parameters, models can be trained effectively without grossly overfitting or becoming intractable.
* A single unifying training algorithm: "backpropagation."
* Format requirements on the model (so that backpropagation works):
1. Input formatting: the input must be an "array of real numbers," which may be a list, a two-dimensional array, or a higher-dimensional array (a tensor).
2. Layered transformation: the input data is "progressively transformed through many distinct layers," each of which is still an array of real numbers.
3. Output layer: the final layer represents the output, e.g., the probability distribution over the next token in the text-processing model.
* How parameters interact with data:
* Model parameters are almost always called "weights."
* Weights interact with the data only through "weighted sums."
* These weighted sums are usually packaged together as "matrix-vector products" (written out as a formula after this list).
* GPT-3's 175 billion weights are organized into "just under 28,000 distinct matrices," which fall into "eight different categories."
* Distinguishing weights from data:
* Weights: the model's "brains," learned during training, determining how it behaves. Shown in blue or red in the video.
* Data being processed: the specific input fed into the model for a given run (e.g., a snippet of text). Shown in gray in the video.
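Written out as a formula, the weighted-sum/matrix-vector relationship referred to in the list above is simply:

$$(W\mathbf{x})_i = \sum_j W_{ij}\, x_j$$

that is, each component of the output is a weighted sum of the input components, with the weights drawn from one row of the learned matrix.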
Input Processing: Tokenization and Embedding
This is the first step in how a Transformer processes text.
* Tokenization: the input text is broken into "tokens," which may be pieces of words or punctuation. For ease of explanation, tokens are sometimes treated as if they were whole words.
* Word embedding:
* The model has a predefined "vocabulary" containing all possible tokens (say 50,000 of them).
* Embedding matrix (We): this matrix has one column for each token (or word) in the vocabulary. These columns determine the vector each token turns into in that first step.
* Its values start out random and are learned from data.
* Geometric meaning of the vectors: word-embedding vectors are viewed as points in a high-dimensional space.
* In GPT-3 the embedding vectors have "12,288 dimensions."
* "Directions in the space have a kind of semantic meaning."
* Example 1 (similarity): the words whose embeddings are closest to that of "tower" all give off "tower-ish vibes."
* Example 2 (relational analogy): the classic example vector(woman) - vector(man) ≈ vector(queen) - vector(king). Speaker 1 notes that in the model he uses this example is imperfect, presumably because "queen" is used in training data as more than just the feminine counterpart of "king." Family-relation analogies work better.
* Example 3 (countries and leaders): vector(Italy) - vector(Germany) + vector(Hitler) comes out very close to vector(Mussolini).
* Example 4 (countries and food): in some models, vector(Germany) - vector(Japan) + vector(sushi) lands close to vector(bratwurst).
* The role of the dot product: measuring how well two vectors align (see the formula after this list).
* Positive: similar directions. Zero: perpendicular. Negative: opposite directions.
* Example: vector(cats) - vector(cat) may represent a "plurality direction." The dot product of a singular noun with this vector is generally lower than that of the corresponding plural noun. The dot products with the embeddings of the numbers one, two, three, and so on also increase as the number grows.
* Parameter count of the embedding matrix (GPT-3 numbers):
* Vocabulary size: 50,257 tokens.
* Embedding dimension: 12,288.
* Parameter count: 50,257 * 12,288 ≈ 617 million weights.
* Dynamic vectors and context:
* The initial embedding vectors represent individual tokens only, with no context.
* The primary goal of the network is to let each vector "soak up a meaning that's much more rich and specific" than a single word could represent.
* The vectors encode not only word meaning but also positional information (discussed later).
* Context size: the network can only process a fixed number of vectors at a time.
* GPT-3's context size is 2,048 tokens.
* This means the data flowing through the network is always an array of 2,048 columns, each with 12,288 dimensions.
* This limits how much text the model can incorporate when predicting the next word, which can make it "lose the thread" of long conversations.
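The dot product referred to in the list above, written out:

$$\mathbf{u}\cdot\mathbf{v} \;=\; \sum_i u_i v_i, \qquad \mathbf{u}\cdot\mathbf{v}\ \begin{cases} > 0 & \text{vectors point in similar directions} \\ = 0 & \text{vectors are perpendicular} \\ < 0 & \text{vectors point in opposite directions} \end{cases}$$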
Output Processing: Producing the Probability Distribution
This is the final stage of the Transformer's processing pipeline.
* Target output: a "probability distribution" over all possible next tokens.
* Unembedding matrix (WU):
* This matrix maps the "last vector" in the sequence (during training, for efficiency, every vector in the final layer is used to predict the token that follows it) to a list with one value per vocabulary token (the logits).
* Parameter count of WU (GPT-3 numbers):
* Rows: vocabulary size (50,257).
* Columns: embedding dimension (12,288).
* Parameter count: similar to the embedding matrix, about 617 million.
* The running total so far is "a little over a billion" parameters, a small fraction of GPT-3's 175 billion.
* The softmax function:
* Role: turning an arbitrary list of numbers (logits) into a valid probability distribution.
* Ensures every value is between 0 and 1.
* Ensures the values sum to 1.
* The larger input values dominate the output probabilities, while smaller values end up close to zero.
* How it works (formula given after this list):
1. For each number x in the input list, compute e^x (yielding a list of positive values).
2. Compute the sum of those positive values.
3. Divide each e^x by that sum to normalize.
* Temperature (T): a constant introduced into the denominator of the exponents in the softmax.
* e^(x/T)
* Larger T: gives more weight to the lower values, making the distribution "a little more uniform" (more random, more creative).
* Smaller T: the larger values dominate the distribution more aggressively.
* T = 0 (the extreme case): all of the weight goes to the maximum input (the most predictable word).
* Speaker 1 demonstrates GPT-3 generating stories at different temperatures: at temperature 0 the story is trite; at higher temperatures (e.g., simulated above the API limit by adjusting the probabilities of the top 20 tokens the API returns), the story opens more originally but can quickly degenerate into nonsense. The API caps the temperature at 2.
* Logits:
* The inputs to the softmax function, i.e., the "raw, unnormalized output."
* For example, the raw output values obtained by multiplying with the unembedding matrix, after the word embeddings have flowed through the network, are the logits for the next-word prediction.
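Written as a formula, the softmax with temperature described above maps logits $x_1, \dots, x_n$ to probabilities:

$$p_i = \frac{e^{x_i / T}}{\sum_j e^{x_j / T}}$$

with T = 1 giving the plain softmax, larger T flattening the distribution, and T → 0 concentrating all of the weight on the largest logit.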
Purpose of This Chapter and What Comes Next
Speaker 1 wraps up the chapter:
* The main goal was "to lay the foundations for understanding the attention mechanism."
* An intuitive grasp of word embeddings, softmax, dot products, and the central role of matrix operations in the model is essential for mastering the attention mechanism.
* Preview of the next chapter: it will focus on the "attention blocks," generally considered "the heart of the Transformer."
* Later chapters will also cover the MLP blocks, the training process, and other details.
Conclusion
Through an accessible visual breakdown, the video reveals the inner workings of the Transformer model, a cornerstone of modern AI. At its core, text is given a vector representation through word embeddings, contextual dependencies are captured by the attention mechanism, and information is refined by multilayer perceptrons, ultimately enabling accurate sequence prediction. Understanding the semantic space of word embeddings, the central role of matrix operations, and the probability conversion performed by softmax is key to understanding Transformers and the current wave of AI technology.