Generative AI Interview Prep 2024: LLMs, Transformers [Crash Course for AI/ML Engineers]
This audio introduces a generative AI interview crash course aimed at AI/ML engineers. The speaker presents a one-hour short course designed to help job seekers prepare for interview questions related to generative AI and large language models. The course covers seven popular questions, including the definition of generative models, the difference between generative and discriminative models, and the details of the Transformer architecture (embeddings, positional encodings, multi-head attention, layer normalization, residual connections, and more).
In the course content, the speaker first defines generative models, noting that their goal is to learn the joint probability distribution P(X,Y) of the data in order to model the data-generating process, which lets them generate new data instances; they are commonly used for unsupervised learning tasks. Examples of generative models include the GPT series, variational autoencoders (VAEs), and generative adversarial networks (GANs), with applications in image generation, synthetic data, and speech generation.
The speaker then compares generative and discriminative models in detail. Generative models focus on how the data is generated and learn the joint probability P(X,Y), while discriminative models only learn the conditional probability P(Y|X) to distinguish between classes, i.e., they learn the decision boundary. The speaker also covers differences in training focus, data usage, and concrete model examples (for instance, Naive Bayes and hidden Markov models are generative). The crash course is positioned as a starting point for interview preparation, and a more comprehensive eight-hour deep learning interview course with 100 questions is mentioned.
Tags
Media Details
- Upload date: 2025-05-14 10:18
- Source: https://www.youtube.com/watch?v=k8_OEVdWGiU
- Processing status: Completed
- Transcription status: Completed
- Latest LLM Model: gemini-2.5-pro-exp-03-25
Transcript
speaker 1: Hi there. In this short course we are going to cover the seven most popular generative AI questions. So if you're preparing for AI engineering, machine learning or data science interviews related to generative AI and large language models, then this course is for you. In this one-hour short course we cover seven of the most popular generative AI interview questions with detailed answers. It will be a great starting point for anyone preparing for these rigorous interviews, so that you can also answer the follow-up questions you can usually expect in this type of interview. We will talk about generative models, their definition, and the difference between generative and discriminative models. We're also going to discuss the transformer architecture in detail, including the embeddings, the positional encodings, and the different layers, including masked multi-head attention, multi-head attention, the layer normalizations, the residual connections, and so on. And then we'll discuss the concept of positional encodings in detail, and much more. So this will be a great starting point for your interview preparation, but it definitely won't be enough if you want to learn everything in one place as a complete interview preparation package. Therefore, if you start with this one and you'd like to learn more, make sure to check our deep learning interview course, eight hours long, where we cover the 100 most popular deep learning and GenAI interview questions with answers. So without further ado, let's get started. What are generative models? Give examples. Let's first start with generative models. Generative models intend to model how data is generated. They want to learn the joint probability distribution, what we often refer to as P(X, Y), which is the joint probability distribution of two random variables X and Y, where X represents the features of our data and Y represents the labels. So what generative models do is try to model the underlying data distribution: they want to figure out how the data was generated in the first place. In this way they can generate new data instances, and they can also predict probabilities. Besides this, generative models are also great when it comes to unsupervised learning tasks. From machine learning you are most likely familiar with the idea of supervised versus unsupervised learning. In the supervised case we do have the labels, the y values, but in the unsupervised case we don't have labels to supervise the learning process, which means we need to learn using only the features. Think about clustering, outlier detection, dimensionality reduction or data generation: all of these can be considered unsupervised learning tasks, and generative models are great at them. Generative models are particularly useful in scenarios where we need to understand the underlying data distribution or generate new data points. Let's say you want to generate synthetic data, or you have a lot of images and you want to generate new images, images that make sense and are similar to the input data, but are new, novel, fresh.
And this can be done using generative models. All the models you hear about these days, for instance the GPT series, the variational autoencoders, the GANs, so generative adversarial networks, are generative models, and they are largely used for generating new images, generating synthetic data, or generating new speech. For instance, there is a model trained on a large amount of speech; you provide a few examples of your own speech, and the model is then able to generate a new voice recording from a text, in your own voice, even though you have never actually spoken it. That's a generative model at work. The next question is: what is the difference between generative models and discriminative models? We learned in the previous question that generative models have the goal of learning the original data distribution. They want to understand how the data provided to the model was generated in the first place, so they study and model the joint probability distribution of the features and the labels. In the case of discriminative models, all the model cares about is learning the conditional probability P(Y|X), in order to understand how it can correctly discriminate between classes, hence the name discriminative models. If you are familiar with Bayesian statistics, with Bayes' theorem, which is one of the most important probability theorems out there, and with the idea of conditional probability versus joint probability, then this formula will look very familiar to you. This is exactly the idea behind generative versus discriminative models: we are always talking about this idea of posterior probability and prior probability. You don't need to know these details to understand the nature of generative models, but I would highly suggest you at least do some reading, or just Google it, to understand the difference between them. Just think of the posterior distribution as a distribution that we learn by adding more information on top of the prior distribution. The prior distribution is the distribution we assume our data follows before knowing anything; it is just our assumption. Then, once more data comes in, we learn from this data and use it to update our information, and this new distribution, based on the prior distribution and the data, is our posterior distribution. Ideally, the posterior distribution should reflect the actual distribution of the data much better than the prior distribution does. So here you can see that on the left-hand side is the joint probability distribution P(X, Y), where X are the features and Y are the labels. This joint probability distribution is simply equal to P(X), which is the prior distribution, multiplied by P(Y|X), which is the conditional distribution of the labels given the feature space. So, as part of discriminative models we are modeling only the probability of Y given X from the right-hand side, whereas in the case of generative models we also need the prior distribution in order to model the actual underlying data distribution, the joint probability distribution P(X, Y). So let's quickly compare generative models versus discriminative models based on various features.
And those features are the probability learned, the training focus, the data usage, examples of different models from machine learning and deep learning, as well as the key differences. We will also look into the use cases, the computational cost, and how they handle missing data. So as part of this question we will cover multiple aspects, and the reason I'm doing that is because I think it's a good refresher just in case you get follow-up questions. But if you are asked about the difference between the two, you can keep it much simpler than this: just mention a few features, and definitely mention the difference between the probabilities they are trying to learn. So let's begin. When it comes to generative models, the model is trying to learn the joint probability distribution P(X, Y), so the one of the features and the one of the labels, and it then tries to learn how the data is generated. In the case of discriminative models, the model wants to learn the conditional probability P(Y|X): what is the probability of an observation belonging to a certain class, given the corresponding feature space? When it comes to the training focus, the generative model tries to maximize the likelihood of the observed data; it is trying to capture the data structure. The discriminative model tries to learn the decision boundary, the difference between the classes, because this helps the model discriminate or distinguish between different classes. Next is the data usage. Generative models can be used to generate new data points, and they can be used both for supervised and for unsupervised learning. I do have to mention that generative models are being used more and more in unsupervised learning cases, because they are very powerful: they are able to learn and model the underlying data distribution and then help you generate new instances that make sense, so they are similar to the provided input but still novel and innovative data points. Discriminative models focus on distinguishing between classes, so they are mainly used for classification tasks. When it comes to examples, generative model examples are the Naive Bayes approach, because it also tries to model the joint probability distribution of X and Y, and also hidden Markov models. You might be familiar with these from advanced statistics courses, or even from game theory or advanced mathematical courses; they are largely used when performing quantitative analysis, especially in quantitative finance. And of course we also need to mention variational autoencoders, which are a specific, more advanced type of autoencoder made to be of a generative nature and to introduce uncertainty into autoencoders, and also GANs, so generative adversarial networks, which are also popular generative models used widely across the industry. When it comes to discriminative models, examples are logistic regression, SVMs, and also various types of neural networks such as RNNs and CNNs, too.
When it comes to the key difference, generative models try to capture the overall distribution and characteristics of the data for each class, for instance when we're trying to perform classification, whereas discriminative models try to directly learn the decision boundary between the classes in order to distinguish between them. When it comes to use cases, the generative model is more general; as the name suggests, it can be used for data generation, so to generate synthetic data, which can be super handy when you are performing experimentation and you don't have labeled data. You can also use it to understand patterns in your data, and for image generation, not image recognition, so to generate new images, and various other high-value applications. Discriminative models are used in more specific cases; they are less general and are used specifically for classification tasks. When it comes to the computational cost, for generative models the cost, as you might have already expected, is much higher than for discriminative models, which is low, because in the case of generative models the process is more extensive: it takes more resources to continuously learn and add information to the prior distribution, to estimate the posterior distribution, and to properly learn the underlying data structure. Finally, when it comes to handling missing data, generative models are able to handle missing data, whereas the majority of discriminative models struggle a lot with it. What is the self-attention mechanism? The self-attention mechanism is based on the attention mechanism, and the intuition behind the attention mechanism is to mimic the way we humans would approach text translation or text processing. Imagine we are trying to translate a long sentence. The way traditional RNNs or LSTMs or other sequence-based neural networks would approach this is to treat each of the words separately and sequentially, which means that by the time the model reaches the end of the sentence, if it is a long one, it has already forgotten the initial information it saw in the first words of the sequence. With attention, instead, we use all the information in the entire sequence: we take each of the tokens forming the sequence, compare them to the target word we are trying to process, and look at their relationship, trying to understand how intense this relationship is by using the idea of attention scores. We then use these attention scores to contribute to the estimation of the context of that target word, and we do this for all the words in our model. What this means, for example, is that if we look at the Dutch sentence we are trying to translate, "Thomas bezoekt op vakantie Bali in oktober", then in order to estimate the context of the word Thomas, represented by c1, we need to look at each pair of the target word with the corresponding word in the sequence, including the pair of the target word with itself, the word Thomas, and obtain the attention scores. In this case the attention scores corresponding to the first word Thomas are alpha 11, alpha 12, alpha 13, alpha 14, alpha 15, and so on.
What this means is that we will look at the word Thomas and see how much attention the word Thomas should get when we are looking at the context of the word Thomas, so c1; how much attention the word bezoekt should get when we are looking at the context of the word Thomas; and the same holds for the third word, so alpha 13, then alpha 14 for the fourth word and alpha 15 for the fifth word. Then we take a weighted sum using these attention scores to get the context of the word Thomas, so c1. Then we do the same for the second word to get the context of the second word, then the third word's context, and so on. So why are we doing this? The intuition is that we are trying to mimic the way we people approach text processing or translation tasks: when we look at a long text, we look at all the words, and we don't forget the context of the earlier words we saw and translated by the time we approach the end of the sequence. But when we look at the way RNNs and LSTMs work, we can see that those two, due to the vanishing gradient problem, end up being unable to discover these long-range dependencies. What this means in practice is that by the time they approach the end of the sequence, or even the middle of the sequence if it's a very long sentence, they have already forgotten the initial words, which means they will not be able to accurately estimate the context of that word. That is something we want to fix, because we translate or process words and sequences differently: we don't forget the earlier words by the time we reach the middle or the end of the sequence. Therefore, the way to address this problem of RNNs and LSTMs and the way they approach these text processing tasks is by using the idea of the attention mechanism, and specifically the self-attention mechanism. Now let's look into the technicalities behind the self-attention mechanism, which I would suggest you briefly mention when answering this type of question, to show your interviewer that you understand how the attention scores are calculated. So the attention scores, alpha 11, alpha 12, alpha 13, and so on, are all stored in this attention matrix, and the way we compute this attention matrix is by using the query matrix, the key matrix and the value matrix. Here it's important to understand what the queries are, what the keys are, what the values are, and how we make use of the dot product, then scale it, then apply the softmax transformation in order to obtain the attention scores. I will not go too much into the details of these matrices, because in this tutorial we are just trying to answer an interview question rather than explain the entire concept. But just think of the queries as questions that we are asking per token, the keys as answers that we are providing per token, and the values as the actual values we then use from the embeddings. When it comes to the queries and the keys, let's look at an example that will make the interpretation much clearer. The query matrix consists of the query values for the tokens that we have in our entire sequence. Let's say we are trying to process the sentence "the cat chased the mouse". The token cat is a noun, and it acts as a noun in this sentence.
And the query of this token will be something like: I'm a noun searching for the verb, the action, that I'm associated with. So "cat" in this case is a noun that should be associated with the verb "chased". For this token cat, the query will be a question we are asking, saying: well, I'm a noun and I'm searching for an associated action; what is that associated action that is highly related to me? Then the corresponding key, for instance for the word chased, will be: well, I'm a verb and I'm connected to the noun before me, so I'm looking for that noun in my close surroundings. So in this case, by using this key and this query, we are basically answering, with the key, the question we are asking as part of the query. So per token we have the queries that they represent, and then we have a bunch of keys that correspond to those tokens, and we look at the relationship between the two. Most likely we will end up with a high value as the intensity between this query and this key, which means we can then use this to say: well, chased is the verb connected to the noun before it, and the corresponding query that it answers is the one of the cat, "I'm a noun searching for my associated action". Given that there will be a high intensity between this query and this key, we'll be saying: well, this key is actually answering that query, so let's bring them together, which means they contribute to each other's context, and this relationship should account for a large amount of the context estimation. The query for cat and the key for chased are generated based on their values and the positions they take in the sentence. So once we have the query and the key matrices, we take the transpose of the keys, and then we perform the dot product between the query matrix and the transpose of the keys matrix. I will not go into the details behind the dot product and the transpose, because by this time I assume you have the linear algebra background for this specific task: you know what matrices are, what the dot product between matrices is, what a transpose is, and how we perform these operations. If you do want to learn about this, then make sure to check out my other tutorials, because in them I actually go into the details behind this linear algebra; for now I will assume you already know it and this is just a refresher. So we will be taking the scaled dot product between Q and the transpose of K, because we will be using the square root of d_k as the scale, and we will be dividing this dot product between Q and the transpose of K by the square root of d_k. Here d_k is just the size of the model; and if we are using multi-head self-attention, then it will be the size of the head, so the number of features from the embeddings that we are using for that specific head. We'll come to this in the next question; for now just assume that d_k is simply the size of the model, the number of features you are using in your embeddings to describe a single token. Then we apply the softmax transformation to transform these values into the range between zero and one, and we multiply this with the matrix of values V, where the values come from the actual values of the embeddings. So in this way, when using the self-attention mechanism, we ensure that we can find the affinity between each pair of tokens that we have in our sequence.
And we then no longer need to process the text sequentially; instead we can process it in a parallelized way. You might notice that this is something we mentioned on multiple occasions as a limitation of RNNs and LSTMs, because we were unable to parallelize RNNs and LSTMs due to their sequential processing nature, something we no longer have to deal with in self-attention-based mechanisms like transformers. That's the idea behind self-attention mechanisms: to ensure that a language model based on the self-attention mechanism is able to process the text in a parallelized way, which means it can handle long-range dependencies and also remain computationally efficient when dealing with very large, scaled applications. What is multi-head attention, and how does it enable more effective processing of sequences in transformers? When it comes to the attention mechanism, we saw that we were using the entire vector of embeddings per token. So we were using all the embeddings that we initially created per token, let's say 512 features; we were processing them all at the same time without dividing them into sections. This is what we refer to as single-head self-attention: we have just one head, and in one go we are using all the features per token to process and learn about that word. Now, in the case of multi-head self-attention, instead of performing the self-attention process once, we perform it in multiple iterations. So we perform it in the multi-head mechanism using this idea of heads: we carry out this operation of obtaining the attention scores not in one go, but in multiple goes, and the number of times we do this, let's say three turns or four turns, is what we call the heads. So instead of processing it just once in one head, we process it in four separate heads, four separate cases in which we compute the attention scores, and at the end we concatenate all these scores into one attention matrix. The way each head learns will be different from the other heads, so in this way we are able to learn the context behind each pair of words, the intensity between them, in a different way. This means we will have multiple views of the same word, where each view looks at the information from a different perspective: we will be looking at the same word but from a different angle, using different features of that word, because we will only be using one part of the entire embedding vector, and in this way we will learn a different aspect of that token. Behind this, from a mathematical perspective, is the fact that we will have not just a single set of query, key and value matrices, but multiple of them. This graph comes from the original paper, Attention Is All You Need, and you can see that the way multi-head attention works is that it has these different linear layers, and then these shadowed, stacked layers, which represent those multiple heads in which we will be processing our text. Here we have the matrix V, which represents the value matrix, and you can see that we have multiple of them; then we have the key matrices, so per head we have a different keys matrix, and then we have the query matrices, again multiple of them.
And I think it greatly represents the idea of multi-head self-attention, because in single-head self-attention the way we do it is that we use our embeddings to build a single query matrix, a single key matrix and a single value matrix. In the case of multi-head self-attention, you can see that we are decomposing each of those matrices into multiple, smaller versions of them. So here we have a single query matrix, denoted by Q, which is decomposed into Q1, Q2, Q3 and Q4; in this case we have four different heads. We do the same for the key matrices, decomposed into four different key matrices K1, K2, K3, K4, and then we have V1, V2, V3 and V4 for the values. This means that per head, say head one, head two, head three and head four, we have the corresponding, much smaller query matrix, keys matrix and value matrix. This means that every time we will be looking into specific features of the same token, because we have the different tokens and we have the entire size of the model, let's say 512, meaning we are describing each of these tokens by 512 features; and if we divide them into, say, four different matrices, then per head we are only looking at 128 features of the token. So per head, the entire attention mechanism will only be exposed to a different 128 features, characteristics, of the same token, which means we will be learning the same token from different aspects, and this will help us identify the context of these different tokens in a much better way, at higher quality. Once we have these different matrices, we can then, per head, compute the corresponding dot product, so the dot product between the head's query matrix and the transpose of the corresponding keys matrix: let's say Q1, and then the dot product of that Q1 with the transpose of K1. Then we scale it, dividing it by the square root of d_k, where d_k in this case is 128, the size of the head. We take the softmax transformation, we multiply it with the value matrix for that head, and we compute the attention matrix for that head. Once we have computed the attention matrices for head one, head two, head three and head four, what we do is simply concatenate those matrices, and we end up with one big attention matrix that contains all the information, all the attention scores. With this, we are able to provide better quality in terms of identifying the context behind these tokens, and this has proven to give higher quality when processing text for various applications, be it machine translation or any other type of application using transformers. The next question is: what are transformers, and why are they important in overcoming the problems of models like RNNs and LSTMs? Transformers were first introduced in the paper Attention Is All You Need, and this paper entirely changed the way the industry was looking at various types of NLP tasks, from machine translation to content generation, but also applications beyond NLP, natural language processing. Transformers are deep learning models that are based on the self-attention mechanism, and to be more exact, the multi-head self-attention mechanism.
And unlike algorithms such as RNNs or LSTMs, transformers process the text not in a sequential but in a parallelized manner, which means that in one go they process all the information, whereas in the case of RNNs and LSTMs we saw that this process was done sequentially, at every time step using the information from the previous state. This is entirely different in transformers, and this architecture allows for significantly improving the performance of the model in terms of computation, allowing for parallelization, which was a key factor in the speed and efficiency of the algorithm, but also in its quality. Due to its nature and architecture, the transformer introduced not only this parallelization aspect but also the ability to handle long-range dependencies, to be more context-aware, and to generalize and handle various types of generative tasks. The algorithms you might have heard about, like the GPT series or T5, are all based on either the entire transformer model or just part of it, and from there they have been improved and improved to become those state-of-the-art large language models. So let's look into those different aspects and compare transformers versus RNNs and LSTMs. I will not go too much into the detail of the architecture for this specific question; I will go into it step by step in the next question. When it comes to parallelization, as I just mentioned, transformers, unlike RNNs and LSTMs, allow for parallelization. They make it possible to process the entire data in one go and to parallelize the process, which means that we can train the algorithm in a much faster way, unlike RNNs and LSTMs, which are sequential-processing-based algorithms: they process the data sequentially, which means the training process is much slower. The second aspect we can compare is the ability to handle long-range dependencies. Given that RNNs and LSTMs suffer a lot from the vanishing and exploding gradients problem, they are, as a result, unable to handle long-range dependencies: by the time they come close to the end of a long sentence, or even the middle part of it, they start to forget what they saw in the earlier part of the sentence, something we want to avoid when we want to provide high-quality processing of the text. That's something transformers have overcome thanks to the self-attention mechanism. By using the self-attention mechanism, transformers allow us to process the text in a parallelized way, and they no longer suffer from the vanishing gradient problem as much, which also means being able to handle long-range dependencies and to consider all parts of the text at the same time when figuring out the context of one specific word. The other aspect in which transformers differ from LSTMs and RNNs is scalability. Due to the parallelization mechanism and the ability to efficiently train the network, unlike RNNs or LSTMs, transformers are more scalable and can be applied to tasks that go even beyond language processing, such as image recognition or playing games. For all these cases we can build flexible, generalizable models that we can adapt to the scale of the data.
So in this case, both the scalability aspect and the flexibility and generalizability of the models are different, because LSTMs and RNNs are much harder to use for all these other applications, whereas transformer-based models like GPT, for instance, are pre-trained on one type of data, which is usually a very large dataset, with billions of parameters, and then we can fine-tune this pre-trained model and apply it to entirely different sorts of tasks. Therefore they are also called general models, or generative models. Then we can talk about the attention mechanism, which is definitely absent in RNNs and LSTMs, because those models inherently do not make use of the attention mechanism and they process the input differently, whereas transformers use the attention mechanism to process the data in a very efficient, very context-aware way. The next question is: can you explain the architecture of transformers? In this question, your interviewer wants to understand whether you know all the different components that form transformers, which is usually the exact architecture we see in the Attention Is All You Need paper. Here I have taken the figure from the Attention Is All You Need paper, and as you can see, we have the two parts that form the transformer: on the left-hand side we have the encoder part, and on the right-hand side we have the decoder. In the encoder part we are processing the information, and in the decoder part we are trying to reconstruct from this learned representation. In the encoder part we start with the input embeddings. Those are the token embeddings that we are using to represent the tokens in our sequence. This can be, for instance, of size 512, where each token will be represented by 512 features, and each of these features will describe one characteristic, one aspect, of that token. The next thing is to add, on top of these input embeddings, the positional encodings. The positional encodings tell the transformer what the position of that word is in the entire sequence: is this word in the second position of the sequence, in the third, in the middle, at the end? Basically, in this way we are keeping track of where exactly this token is coming from and what its corresponding position is. The positional encodings are based on the sine and cosine functions, and which of them is used depends on whether the position of that element is an even number or an odd number. I will not go too much into the details here specifically, because this is out of the scope of this interview question, but just keep in mind that the positional encodings help us keep track of the position of the token in the entire sequence, and we have, per token, an entire vector, in this case of size 512. We will then have 512 elements per token describing the position for that specific token. So just like in the case of embeddings, where we have 512 token embedding values describing that word, the same also holds for the positions: the position of this token will be represented by a vector of size 512. So once we have handled this initial stage of the network, and we now have the positional encodings added on top of the input embeddings, the next step is to divide this data into queries, keys and values.
So the queries, the keys and the values that we saw as part of the attention score calculation. Once we have this, we supply it to the layer that we call multi-head attention. This multi-head attention layer performs the multi-head self-attention that we saw in the previous interview questions: basically computing, per head, the attention scores and then bringing them all together, concatenating them into a single attention matrix. Then the output of this multi-head attention layer is provided to the next layer. And as you can see here, we have this small arrow that goes from the multi-head attention to the layer where we have Add & Norm. What Add & Norm represents is the layer normalization and the residual connections. We add the two in order to optimize our model, to ensure that the gradient has a shortcut and can flow directly through the network, so we combat the vanishing gradient problem. The layer normalization is applied to optimize the model, indirectly also helping with the problem of vanishing and exploding gradients, but at the same time stabilizing the network, so that we have consistent weight changes and the network can learn the dependencies in a stable way. As you can see here, this arrow from the input not only goes into the multi-head attention, but also goes to the Add & Norm layer, and this is basically the residual connection, because in the residual connection we add the input on top of the output: we add the input of the multi-head attention on top of the output of the multi-head attention, and then we apply the layer normalization to stabilize the network. The next step is to add a fully connected feed-forward network, which will be able to learn from these attention scores and to use the optimization algorithm, that is, backpropagation, optimizing the network to learn the dependencies between these different embeddings. Then we again apply the residual connections and layer normalization to ensure that the network is optimized. Once we have done this N times, because we do this in blocks to stack the layers, the output of the encoder is supplied to the decoder. In the decoder, we take the output embeddings and add to them the positional encodings, like in the case of the encoder. Then we again perform multi-head attention, but this time it is masked multi-head attention, which means that we are masking the upper part, the upper diagonal, of our attention matrix. The reason for that is that we don't want our algorithm to cheat: we want it to predict the next word, and if we already give the algorithm the next words, so those attention scores, then it will be inclined to cheat and to look at those intensities in the upper diagonal, so for the words that come after this word, which is something we want to avoid. For the algorithm to predict the next word, we should ensure that it sees only the preceding words, and not the words that come after the target word, because that would help it cheat and see what that word most likely is. If we know which word comes after the target word, then it is much easier for the algorithm to know what the target word is.
So to prevent the model from cheating and to make it actually predict the next word, we need to mask all the attention scores for the upcoming words. That's the idea behind the masking mechanism, and that's what we are applying here: we are doing the masked multi-head attention. Then again we add the residual connections and the layer normalization to optimize the network. Then we take the values from here and combine them with the keys and the values coming from the encoder in order to perform multi-head attention again. Then we again add the layer normalization and residual connections here, then we add a fully connected feed-forward layer, then again the residual connections and layer normalizations. Once we are done with this, we have successfully reconstructed the words in our sequence, its tokens, and we end up with these word embeddings. These we then need to transform: after training the model and ending up with a model that performs well, or once the stopping criterion is reached, let's say a number of iterations, we perform a linear transformation and then a softmax transformation to end up with probabilities, so with a set of values saying that, for that specific position, the probability of this word is this, the probability of that other word is that. We can then, for instance, either select the value with the highest probability, or randomly select from the top five probabilities, depending on the criterion we are using for selecting from the probabilities. But the idea behind transformers at a high level is this: an encoder and decoder structure, using all these parts in order to perform text processing. The next question is: what are positional encodings and how are they calculated? Positional encodings play a crucial role in transformer models. As you can see, they appear here as part of the encoding, and they appear here as part of the decoding process. So here in the encoder, these are the positional encodings; they are added on top of the input embeddings. And here we have the positional encodings added on top of the output embeddings as part of the decoder architecture. The purpose of using positional encodings in transformers is to provide information about the position of the elements in the sequence. Unlike RNNs or LSTMs, which use the previous information to sequentially process the data, which directly means knowing what the word in the previous step was, so what the position in the sequence is, given that the data is processed sequentially, in the case of transformers we don't have this information, because we are using the self-attention mechanism to look into the word embeddings and to process the text in a parallelized way. Given that we are doing it in a parallelized way, we no longer know where each word is coming from: we just obtain those attention scores and we know what the intensities between these different words are, but we don't know where they come from. And to properly train and then get a sensible output, for instance when translating text, we need to keep in mind that first comes the word Thomas, in that example we saw before, and then the word Bali comes in a later position. For that we need the positional encodings, because unlike RNNs or LSTMs, we don't know what the previous word was.
In RNNs and LSTMs, we knew what the previous word was, and we were using that, combined with the current state and the current input, to process the text and understand what the current context is. In the case of transformers we don't have that information; we are not processing the data sequentially, we are doing it in a parallelized way, and given that, we need the positional encodings to keep track of the positions of the tokens. This helps both in terms of understanding the context behind the words, depending on the positions they take, and in ensuring that the format of the output we produce is the format we want it to be. Because if we are translating a sentence, we don't want the first word to end up in the middle of the sentence when translating it, or the last element of the sequence to end up at the beginning of the sentence once we have translated it. So positional encodings basically have two functionalities: the first is to ensure that the output is in the desired format and that we are doing our job properly when it comes to machine translation; the other is finding the context behind the text, depending on the positions it takes. When it comes to the calculation of the positional encodings, we use the sine and cosine functions from trigonometry. From geometry we know what sine and cosine are, and depending on the position of the feature per token, we know how to calculate the corresponding positional encoding. For all the positions in our vector representation of the token, we know whether the position is an even position or an odd position. For all the even positions we take the following function: the sine of the position divided by 10,000 to the power of 2i divided by d_model, because you might recall that we were representing each of those tokens with a vector of size 512; in the same way, we are going to end up with a vector of size 512 for each of those words. So once we have these positions, per position we know whether it is the first position, second, third, up until 512, and we can use this. For all the cases when the position is an odd number, we use the other function, the cosine of the position divided by 10,000 to the power of 2i divided by d_model, where d_model is 512. And for all the positions in this vector where the position is an even number, we take the sine of the position divided by 10,000 to the power of 2i divided by 512. For instance, if the position is one, so we are looking at the first element of the first token, the first feature, then the position is an odd number. This means we take this one, divide it by 10,000 to the power of two times one divided by 512, and then take the cosine of it. This gives us the first element in our positional encodings, and we continue this process until we end up with a vector of 512 positional encodings per token. The last question is: why do we add positional encodings to transformers but not to RNNs or LSTMs? When it comes to RNNs and LSTMs, we already know what the positions of the tokens are, because we are processing the text sequentially, and for each input we get the corresponding output; also, for each word, we know what the previous word is.
So we are using the previous state information to update the hidden state of that time step, and this helps us keep track of the positional information of the text. This is not the case with transformers, because unlike LSTMs or RNNs, transformers do not process the text sequentially; they process it in a parallelized way, and this means that we somehow need to add information about the positions of these tokens in the entire sequence. There are two reasons for this. First, we use the positional information as a way to estimate the context of those words, to learn their semantic interpretation, their semantic nature, because if we know that a word appears at the beginning of the sentence, the context of this word can be different than if the same word appears among other words at the end of the sentence. So this positional information helps improve the context that the model learns. The second functionality is to ensure that the output we produce makes sense: when we are translating a sentence, we don't want to end up with a case where the first word we are translating appears in the third position or at the end of that sentence; we want the first word to be translated into the corresponding right position when we generate output from our transformer. So those are basically the two reasons why we use positional encodings as part of transformer models, but we don't use them as part of RNNs or long short-term memory networks. So that's it.
Latest Summary (Detailed Summary)
Overview / Executive Summary
This crash course is designed to help AI/ML engineers prepare for generative AI interviews, particularly around large language models (LLMs) and Transformers. In one hour it answers seven of the most popular generative AI interview questions in detail, giving candidates a solid starting point and the ability to handle follow-up questions. The course first introduces the definition of generative models, explaining the core idea of learning how data is generated by modelling the joint probability distribution P(X,Y), and contrasts them in detail with discriminative models (which learn the conditional probability P(Y|X)) across several dimensions: the probability learned, training focus, data usage, model examples, key differences, use cases, computational cost, and handling of missing data. It then dives into the key components of the Transformer architecture: the self-attention mechanism (which mimics human text processing, computes attention scores from the Q, K, V matrices, resolves the long-range dependency problem of RNNs/LSTMs, and enables parallel processing), multi-head attention (running several attention heads in parallel to learn information from different subspaces and improve model quality), the overall Transformer architecture (an encoder-decoder structure with input embeddings, positional encodings, multi-head attention, residual connections, layer normalization, and feed-forward networks), and positional encodings (which use sine and cosine functions to give the parallel-processing Transformer information about token positions in the sequence, something sequential models such as RNNs/LSTMs do not need). The speaker also mentions an eight-hour, 100-question deep learning interview course as a follow-up resource for learners who want more comprehensive coverage.
Generative vs. Discriminative Models
Speaker 1 first introduces the concept of generative models and how they differ from discriminative models.
What are generative models?
- Definition: Generative models aim to model how data is generated. They try to learn the joint probability distribution P(X,Y) of the features X and the labels Y.
- Goal: Understand and model the underlying data distribution.
- Capabilities:
- Generate new data instances.
- Predict probabilities.
- Suited to unsupervised learning tasks such as clustering, outlier detection, dimensionality reduction, and data generation.
- Use cases:
- When the underlying data distribution needs to be understood.
- When new data points need to be generated, for example:
- Generating synthetic data.
- Generating new images that are similar to the input data but novel.
- Generating new speech from a few voice samples and a text prompt.
- Model examples:
- GPT series
- Variational Autoencoders (VAEs)
- Generative Adversarial Networks (GANs)
How do generative and discriminative models differ?
Speaker 1 compares the two model families in detail:
| Aspect | Generative Models | Discriminative Models |
|---|---|---|
| Probability learned | Joint probability distribution P(X,Y) of features X and labels Y | Conditional probability P(Y\|X), the probability of label Y given features X |
| Training focus | Maximize the likelihood of the observed data; capture the data structure | Learn the decision boundary that separates the classes |
| Data usage | Can generate new data points; usable for both supervised and unsupervised learning (especially powerful for unsupervised tasks) | Mainly used for classification, distinguishing between classes |
| Model examples | Naive Bayes, Hidden Markov Models (HMMs), VAEs, GANs | Logistic Regression, Support Vector Machines (SVMs), various neural networks (RNNs, CNNs) |
| Key difference | Captures the overall distribution and characteristics of the data for each class | Directly learns the decision boundary between classes |
| Use cases | More general: data generation, pattern understanding, image generation, etc. | Task-specific: mainly classification |
| Computational cost | Usually higher, since the learning process is more extensive and requires estimating the posterior distribution and the underlying data structure | Usually lower |
| Handling missing data | Can handle missing data | Most models struggle with missing data |
Speaker 1 notes that this distinction can be related to the prior and posterior probabilities of Bayesian statistics, where the joint distribution factorizes as P(X,Y) = P(X) * P(Y|X). Generative models target the full joint P(X,Y), while discriminative models only target the conditional P(Y|X). A minimal sketch of the difference follows.
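The toy numpy sketch below is not from the course; it simply contrasts, under illustrative data, what a generative model estimates (the full joint P(X,Y), from which everything else can be derived or sampled) with what a discriminative model estimates (only P(Y|X)).

```python
# Hypothetical toy example: one binary feature x and a binary label y.
import numpy as np

data = np.array([(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 0)])  # (x, y) pairs
x, y = data[:, 0], data[:, 1]

# Generative view: estimate the full joint distribution P(X, Y) from counts.
joint = np.zeros((2, 2))
for xi, yi in data:
    joint[xi, yi] += 1
joint /= joint.sum()                       # P(X, Y)

# From the joint we recover the prior P(X) and the conditional P(Y | X),
# and we could also sample new (x, y) pairs -- the "generative" part.
p_x = joint.sum(axis=1)                    # P(X)
p_y_given_x = joint / p_x[:, None]         # P(Y | X) = P(X, Y) / P(X)

# Discriminative view: estimate only P(Y=1 | X) directly, never the joint.
p_y1_given_x_direct = np.array([y[x == 0].mean(), y[x == 1].mean()])

print(joint)                               # joint distribution
print(p_y_given_x[:, 1], p_y1_given_x_direct)  # both give the same conditional
```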
Self-Attention Mechanism
Speaker 1 explains how self-attention works and why it matters.
- Core idea: Mimic the way humans process text (e.g., translating a long sentence): while processing a given word, attend to all words in the sentence rather than relying only on the sequential information of the preceding words, as traditional RNNs/LSTMs do, so that earlier information is not forgotten.
- Problems solved: The long-range dependency and vanishing gradient problems of RNNs/LSTMs, which cause the model to forget the beginning of a long sequence by the time it reaches the end.
- How it works:
- For each target word (token) in the sequence, compute attention scores against every word in the sequence, including the token itself.
- These attention scores reflect how much each other word contributes to the context of the current target word.
- A weighted sum over the word representations, using these scores, yields a context-aware representation of the target word.
- For example, when translating the sentence "Thomas bezoekt op vakantie Bali in oktober", estimating the context of "Thomas" (c1) involves computing its attention scores with "Thomas" (α11), "bezoekt" (α12), "op" (α13), "vakantie" (α14), "Bali" (α15), and so on.
- Technical details (Q, K, V):
- Attention scores are stored in an attention matrix computed from the Query (Q), Key (K), and Value (V) matrices.
- Q (Query): the "question" a token asks (e.g., the noun "cat" might query "I am a noun looking for the action associated with me").
- K (Key): the "answer" or identity other tokens offer (e.g., the verb "chased" might key "I am a verb connected to the noun before me").
- V (Value): the actual embedding values of the tokens.
- Formula (see the sketch after this list):
Attention(Q, K, V) = softmax( (Q · K^T) / sqrt(d_k) ) · V
where K^T is the transpose of the key matrix; d_k is the key dimension (the full embedding dimension in single-head attention, or the per-head dimension in multi-head attention, e.g., the 128 the speaker mentions), used to scale the dot product; and softmax turns the scores into a probability distribution.
- Advantages:
- Captures dependencies between any two tokens in the sequence, regardless of distance.
- Allows the text to be processed in parallel rather than sequentially as in RNNs/LSTMs, improving computational efficiency.
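A minimal numpy sketch of scaled dot-product self-attention as described above; the dimensions (4 tokens, d_model = 8) and random projection matrices are illustrative only.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model) token embeddings (plus positional encodings)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # queries, keys, values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # scaled dot product
    weights = softmax(scores, axis=-1)        # attention scores (alpha_ij)
    return weights @ V, weights               # context vectors, attention matrix

rng = np.random.default_rng(0)
d_model = 8
X = rng.normal(size=(4, d_model))             # 4 tokens
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
context, attn = self_attention(X, Wq, Wk, Wv)
print(attn.shape, attn.sum(axis=-1))          # (4, 4), each row sums to 1
```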
Multi-Head Attention
Speaker 1 then explains the multi-head attention mechanism.
- Difference from single-head self-attention:
- Single-head self-attention: one attention computation over the full embedding vector (e.g., 512 dimensions).
- Multi-head self-attention: the attention computation is run several times in parallel (e.g., 4 heads).
- How it works (see the sketch after this list):
- The original Q, K, V matrices (or word embeddings) are split / linearly projected into several smaller sets of Q, K, V matrices, one set per "head".
- For example, with a model dimension of 512 and 4 heads, each head's Q, K, V vectors have dimension 128.
- Each head performs scaled dot-product attention independently.
- Because each head works on a different subspace of the input, the heads can learn different dependencies and contextual aspects of the sequence.
- "Each head will look at the information from a different perspective."
- The outputs (attention matrices) of all heads are concatenated and passed through a further linear transformation to produce the final output.
- Advantages:
- Lets the model jointly attend to information from different representation subspaces at different positions.
- Learns richer, finer-grained relationships between words, improving the quality of the contextual understanding.
- The figure cited from the "Attention Is All You Need" paper clearly shows the multiple parallel attention computations.
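A minimal numpy sketch of multi-head self-attention under the dimensions mentioned above: d_model = 512 is split across n_heads = 4 heads of size d_k = 128, each head computes its own attention, and the head outputs are concatenated and projected. The weight initialization and sequence length are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    seq_len, d_model = X.shape
    d_k = d_model // n_heads                              # per-head size, e.g. 512 / 4 = 128
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Reshape into (n_heads, seq_len, d_k): each head sees a different slice of features.
    split = lambda M: M.reshape(seq_len, n_heads, d_k).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_k)    # per-head scaled dot product
    heads = softmax(scores, axis=-1) @ Vh                 # per-head context vectors
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)  # concatenate the heads
    return concat @ Wo                                    # final linear projection

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 6, 512, 4
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(4))
print(multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads).shape)    # (6, 512)
```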
The Transformer Model
Speaker 1 introduces Transformers and their advantages over RNNs/LSTMs.
- Origin: First introduced in the paper "Attention Is All You Need", which fundamentally changed how NLP tasks (and tasks beyond NLP) are approached.
- Core: Deep learning models built on the multi-head self-attention mechanism.
- Key difference from RNNs/LSTMs:
- Parallel processing: Transformers process the whole sequence in parallel, whereas RNNs/LSTMs process it sequentially.
- Advantages:
- Parallelization: significantly faster training and better computational efficiency.
- Long-range dependencies: self-attention directly models dependencies between any two positions in the sequence, overcoming the vanishing gradient and forgetting problems of RNNs/LSTMs.
- Scalability: thanks to parallelism and efficient training, Transformers scale to large data and more complex tasks, even beyond NLP (e.g., image recognition, game playing).
- Flexibility and generalizability: pre-trained Transformer models (such as the GPT series) are trained on huge datasets and can then be fine-tuned for a wide range of different tasks.
- Attention mechanism: the core of the Transformer, absent from RNNs/LSTMs themselves.
- Impact: Advanced large language models such as the GPT series, T5, and BERT are all based on the Transformer architecture or variants of it.
Transformer Architecture in Detail
Speaker 1 walks through the classic Transformer architecture (from the figure in "Attention Is All You Need").
- Main components: an Encoder and a Decoder.
Encoder
The encoder is a stack of N identical layers. Input processing happens before the stack, and each layer contains two main sub-layers (multi-head self-attention and a feed-forward network), each followed by Add & Norm (a sketch of one such layer follows this list):
1. Input processing:
* Input Embeddings: convert the input tokens into fixed-size vectors (e.g., 512 dimensions).
* Positional Encodings: encode each token's position in the sequence as a vector and add it to the input embedding, because self-attention itself carries no notion of order.
2. Multi-Head Self-Attention Layer:
* Input: the embedded, position-encoded sequence.
* Output: a representation of each token that aggregates weighted contextual information from the whole sequence.
3. Add & Norm (residual connection and layer normalization):
* The input of the multi-head attention sub-layer is added to its output (residual connection), followed by layer normalization.
* Purpose: help gradients flow, speed up training, and stabilize the network.
4. Position-wise Feed-Forward Network (FFN):
* A fully connected feed-forward network applied independently at each position, typically two linear transformations with an activation (e.g., ReLU) in between.
5. Add & Norm:
* Again, the FFN's input is added to its output, followed by layer normalization.
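A minimal numpy sketch of one encoder layer: a self-attention sub-layer and a position-wise feed-forward sub-layer, each wrapped in a residual connection followed by layer normalization ("Add & Norm"). For brevity the attention here is single-head; in the full architecture it would be the multi-head version sketched earlier. Dimensions and weights are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def self_attention(x, Wq, Wk, Wv):
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # scaled dot product
    return softmax(scores) @ V

def feed_forward(x, W1, b1, W2, b2):
    return np.maximum(0, x @ W1 + b1) @ W2 + b2    # two linear maps with a ReLU

def encoder_layer(x, attn_w, ffn_w):
    attn_out = self_attention(x, *attn_w)
    x = layer_norm(x + attn_out)                   # Add & Norm after attention
    ffn_out = feed_forward(x, *ffn_w)
    return layer_norm(x + ffn_out)                 # Add & Norm after the FFN

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 512, 2048, 6
x = rng.normal(size=(seq_len, d_model))            # embeddings + positional encodings
attn_w = tuple(rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(3))
ffn_w = (rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff),
         rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model))
print(encoder_layer(x, attn_w, ffn_w).shape)       # (6, 512)
```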
Decoder
The decoder is also a stack of N identical layers; output processing happens before the stack, and each layer contains three main sub-layers (a sketch of the causal mask follows this list):
1. Output processing:
* Output Embeddings: token embeddings of the target sequence (usually the previous outputs or the ground-truth labels, shifted right by one position).
* Positional Encodings: as in the encoder, position information is added to the output-token embeddings.
2. Masked Multi-Head Self-Attention Layer:
* Multi-head self-attention over the decoder's input sequence.
* Masking: the attention computation masks out the information after the current position ("masking the upper part, upper diagonal of our attention matrix"), so that when predicting the current token the model can only rely on tokens already generated and cannot "cheat" by looking at future information.
3. Add & Norm:
* A residual connection (the sub-layer input is added to its output) and layer normalization are applied, as in the encoder, to help gradients flow, speed up training, and stabilize the network.
4. Encoder-Decoder Attention Layer (Cross-Attention):
* Queries (Q) come from the output of the previous decoder sub-layer (the masked multi-head self-attention layer).
* Keys (K) and Values (V) come from the final output of the encoder.
* This layer lets every decoder position attend to the relevant parts of the input sequence.
5. Add & Norm.
6. Position-wise Feed-Forward Network (FFN): the same structure as the encoder's FFN.
7. Add & Norm.
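A minimal numpy sketch of the causal ("look-ahead") mask used in the decoder's masked multi-head self-attention: positions above the diagonal are set to minus infinity before the softmax, so each token can only attend to itself and the tokens before it. The raw scores here are random placeholders standing in for Q·K^T / sqrt(d_k).

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

seq_len = 5
scores = np.random.default_rng(0).normal(size=(seq_len, seq_len))  # placeholder raw scores

# Boolean mask marking the "future" positions (strictly above the diagonal).
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
masked_scores = np.where(future, -np.inf, scores)

weights = softmax(masked_scores, axis=-1)
print(np.round(weights, 2))   # row i has non-zero weights only for columns <= i
```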
Final Layers
- The output of the decoder stack goes through a Linear Layer that maps the vectors to the vocabulary size.
- A Softmax layer then produces a probability distribution over the vocabulary, used to predict the next token (a minimal sketch follows).
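A minimal numpy sketch of this final step: the decoder output for one position is projected to vocabulary-sized logits, turned into probabilities, and a next token is then chosen either greedily or by sampling from the top-5 candidates, the two strategies mentioned by the speaker. The vocabulary size and weights are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 512, 10_000
decoder_out = rng.normal(size=d_model)            # decoder output for one position
W_vocab = rng.normal(size=(d_model, vocab_size)) * 0.02

logits = decoder_out @ W_vocab                    # linear layer to vocabulary size
probs = np.exp(logits - logits.max())
probs /= probs.sum()                              # softmax over the vocabulary

greedy_token = int(np.argmax(probs))              # pick the most probable token...
top5 = np.argsort(probs)[-5:]                     # ...or sample from the top 5
sampled_token = int(rng.choice(top5, p=probs[top5] / probs[top5].sum()))
print(greedy_token, sampled_token)
```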
Positional Encodings
Speaker 1 explains why positional encodings are needed and how they are computed.
- Role: provide the Transformer with information about the position of the elements in the sequence.
- Why are they needed?
- The Transformer's self-attention processes all tokens in parallel and is itself unaware of order or position; without position information the model cannot distinguish "A B C" from "C B A".
- Context: a token's position matters for its meaning and context.
- Output format: in generation tasks such as translation, the output tokens must come out in the right order.
- Comparison with RNNs/LSTMs: RNNs/LSTMs capture position implicitly through their sequential processing, so they need no extra positional encoding.
- Computation (see the sketch after this list):
- Sine and cosine functions at different frequencies generate the positional encoding vectors.
- For position pos in the sequence and index i in the embedding dimension (0 to d_model − 1):
- Even dimensions (2i): PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
- Odd dimensions (2i+1): PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
- d_model is the embedding dimension (e.g., 512).
- Each position gets a unique positional encoding vector of the same dimension as the word embedding.
- This positional encoding vector is added on top of the corresponding word embedding.
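A minimal numpy sketch of sinusoidal positional encodings following the formulas above: sine on even embedding indices, cosine on odd ones. The sequence length is illustrative.

```python
import numpy as np

def positional_encoding(seq_len, d_model=512):
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1) positions
    i = np.arange(d_model)[None, :]                   # (1, d_model) dimension indices
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])              # even dimensions: sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])              # odd dimensions: cosine
    return pe

pe = positional_encoding(seq_len=10)
# The encoding is simply added on top of the token embeddings:
# embeddings_with_position = token_embeddings + pe
print(pe.shape)   # (10, 512)
```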
Key Takeaways
Speaker 1 stresses that understanding generative models, discriminative models, self-attention, multi-head attention, the full Transformer architecture, and positional encodings is essential for preparing for generative AI interviews. These topics are not only common interview questions but also the foundation for understanding how modern large language models work. The crash course is intended to give candidates a solid base for handling the full range of questions they may face in interviews.