speaker 1: Hi there. In this short course, we are going to cover the seven most popular generative AI interview questions. So if you're preparing for AI engineering, machine learning, or data science interviews with questions related to generative AI and large language models, then this course is for you. In this one-hour short course, we are going to cover seven of the most popular generative AI interview questions with detailed answers. This will be a great starting point for anyone preparing for these rigorous interviews, so that you can also answer the follow-up questions you can usually expect from this type of interview. We are going to cover seven questions, and those will be a great starting point for anyone who wants to start their interview preparation journey. We will talk about generative models, their definition, and also the difference between generative and discriminative models. We're also going to discuss the transformer architecture in detail, including the embeddings, the positional encodings, and the different layers, including multi-head attention, masked multi-head attention, the layer normalizations, the residual connections, etc. And then we are going to discuss in detail the concept of positional encodings, and much more. So this will be a great starting point when it comes to your interview preparation, but it definitely won't be enough if you want to learn everything in one place as a complete interview preparation package. Therefore, if you start with this one and you feel you would like to learn more, then make sure to check our deep learning interview course, with eight hours of content, where we cover the 100 most popular deep learning and GenAI interview questions with answers. So without further ado, let's get started. What are generative models? Give examples. Let's first start with the generative models. Generative models intend to model how data is generated. They want to learn the joint probability distribution, which we often refer to as P(x, y), the joint probability distribution of two random variables x and y, where x represents the features of our data and y represents the labels. So what generative models do is try to model the underlying data distribution. They want to figure out how the data was generated in the first place. In this way, they can then generate new data instances, and they can also predict probabilities. Besides this, generative models are also great when it comes to unsupervised learning tasks. From machine learning, you are most likely familiar with the idea of supervised versus unsupervised learning. In the supervised case, we do have the labels, or the y values, but in the unsupervised case, we don't have labels to supervise the learning process, which means that we need to perform the learning using only the features. Think about clustering, outlier detection, dimensionality reduction, or data generation. All these cases can be considered unsupervised learning tasks, and generative models are great at them. Now, generative models are particularly useful in scenarios where we want or need to understand the underlying data distribution, or when we need to generate new data points. Let's say you want to generate synthetic data, or you have a lot of images and you want to generate new images, images that do make sense and are similar to the input data, but are new, novel, fresh.
And this can be done by using generative models, and all the models that you hear about these days, for instance the GPT series, the variational autoencoders, the GANs (generative adversarial networks), are generative models, and they are largely used for generating new images, generating synthetic data, and generating new speech. For instance, there is a model that is trained on a large amount of speech, you provide a few examples of your own speech, and then the model is able to generate a new voice recording from a text, in your own voice, even though you have never actually spoken it. That's a generative model at work. The next question is, what is the difference between generative models and discriminative models? We learned as part of the previous question that generative models have the goal of learning the original data distribution. They want to understand how the data that has been provided to the model was generated in the first place, so they want to study and model the joint probability distribution of the features and the labels. Whereas in the case of discriminative models, all the discriminative model cares about is learning the conditional probability P(y|x) in order to understand how it can correctly discriminate between classes, hence the name discriminative models. So if you are familiar with Bayesian statistics, with Bayes' theorem, which is one of the most important probability theorems out there, and with the idea of conditional probability versus joint probability, then this formula will look very familiar to you. This is exactly the idea behind generative versus discriminative models: we are talking about the idea of posterior probability and prior probability. You don't need to know these details to understand the nature of generative models, but I would highly suggest you at least do some reading, or maybe just Google it, to understand the difference between them. But do keep in mind that the posterior distribution is a distribution that we learn by adding more information on top of the prior distribution. The prior distribution is the distribution that we assume our data follows before knowing anything; it is just our assumption. Then, once more data comes in, we learn about this data, and we use it to update our information. This new distribution, based on the prior distribution and the data, is then our posterior distribution. Ideally, the posterior distribution should reflect the actual distribution of the data much better than the prior distribution. So here you can see that on the left-hand side is the joint probability distribution P(x, y), where x is the features and y is the labels, and this joint probability distribution is simply equal to P(x), which is the prior distribution, multiplied by P(y|x), which is simply the conditional distribution of the labels given the feature space. So as you can see, as part of discriminative models we are just modeling this part, the probability of y given x from the right-hand side, whereas in the case of generative models we also need the prior distribution in order to model the actual underlying data distribution, the joint probability distribution P(x, y).
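To make this factorization concrete, here is a minimal Python (numpy) sketch with a made-up discrete joint distribution; the numbers are purely illustrative, and the point is only that P(x, y) = P(x) · P(y|x), where a discriminative model would only ever need the P(y|x) part.

```python
import numpy as np

# A toy joint distribution P(x, y) over 3 feature values (rows) and 2 classes (columns).
# The numbers are made up purely for illustration; they only need to sum to 1.
P_xy = np.array([
    [0.10, 0.05],
    [0.20, 0.15],
    [0.30, 0.20],
])

P_x = P_xy.sum(axis=1)                 # marginal P(x): part of what a generative model captures
P_y_given_x = P_xy / P_x[:, None]      # conditional P(y|x): all a discriminative model needs

# The factorization P(x, y) = P(x) * P(y|x) holds exactly:
assert np.allclose(P_xy, P_x[:, None] * P_y_given_x)
print(P_y_given_x)                     # each row sums to 1: a class probability per feature value
```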
So let's quickly compare generative models versus discriminative models based on various features. Those features are the probability learned, the training focus, the data usage, examples of different models from machine learning and deep learning, as well as the key differences. We will also look into the use cases, the computational cost, and how they handle missing data. So basically, as part of this question, we will be covering multiple aspects, and the reason I'm doing that is because I think it's a good refresher just in case you get follow-up questions. But if you are asked about the difference between the two, you can keep it much simpler than this: just mention a few features, and definitely mention the difference between the probabilities they are trying to learn. So let's begin. When it comes to generative models, the model is trying to learn the joint probability distribution P(x, y), so the one of the features and the one of the labels, and it tries to learn how the data is generated. In the case of discriminative models, the model tries to learn the conditional probability P(y|x): what is the probability of the labels, so of an observation belonging to a certain class, given the corresponding feature space? When it comes to the training focus, the generative model tries to maximize the likelihood of the observed data; it tries to capture the data structure. The discriminative model tries to learn the decision boundary, the difference between the classes, because this can then help the model discriminate, or distinguish, between different classes. Next we have the data usage. Generative models can be used to generate new data points, and they can be used for both supervised and unsupervised learning. I do have to mention that generative models are being used more and more in unsupervised learning cases because they are very powerful: they are able to learn and model the underlying data distribution and then help you generate new instances that make sense, so they are similar to the provided input but are still novel and innovative data points. Discriminative models focus on distinguishing between classes, so they are mainly used for classification tasks. When it comes to examples, generative model examples are the Naive Bayes approach, because it also tries to model the joint probability distribution of x and y, and also Hidden Markov Models. You might be familiar with these models from advanced statistical courses, or even from game theory or advanced mathematical courses; they are largely used when performing quantitative analysis, for example in quantitative finance. And of course we also need to mention the variational autoencoders, which are a specific and more advanced type of autoencoder, specifically made to be of generative nature and to introduce uncertainty into autoencoders, and also the GANs, the generative adversarial networks, which are also popular generative models used widely across the industry. When it comes to discriminative models, examples are logistic regression, SVMs, and also various types of neural networks such as RNNs and CNNs too.
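As a quick illustration of these example models, here is a small scikit-learn sketch contrasting a generative classifier (Gaussian Naive Bayes) with a discriminative one (logistic regression) on a synthetic dataset; the dataset and parameters are arbitrary and only meant to show the two styles side by side.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic data purely for illustration.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Generative: models P(x, y) via class priors and per-class feature distributions.
generative = GaussianNB().fit(X_train, y_train)
# Discriminative: models P(y|x), i.e. the decision boundary, directly.
discriminative = LogisticRegression().fit(X_train, y_train)

print("Naive Bayes accuracy:        ", generative.score(X_test, y_test))
print("Logistic regression accuracy:", discriminative.score(X_test, y_test))

# Because Naive Bayes models the data distribution, we can inspect what it learned
# about the features themselves, e.g. the per-class feature means:
print(generative.theta_)
```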
When it comes to the key difference, generative models try to capture the overall distribution and characteristics of the data for each class, for instance when we are trying to perform classification, whereas discriminative models try to directly learn the decision boundary between the classes in order to distinguish between them. When it comes to the use cases, the generative model is more general, as the name suggests. It can be used for data generation, so to generate synthetic data, which can be super handy when you are performing experimentation and you don't have labeled data. You can also use it to understand patterns in your data, or for image generation (not image recognition), so to generate new images, and for various other applications. Discriminative models are used in more specific cases; they are less general, of course, and they are specifically used for classification tasks. When it comes to the computational cost, for generative models the cost, as you might have already expected, is much higher compared to discriminative models, which is low, because in the case of generative models the process is more extensive: it takes more resources to continuously learn and add information to the prior distribution, to estimate the posterior distribution, and to properly learn the underlying data structure. And finally, when it comes to handling missing data, generative models are able to handle missing data, whereas the majority of discriminative models struggle a lot with it. What is the self-attention mechanism? The self-attention mechanism is based on the attention mechanism, and the intuition behind the attention mechanism is to try to mimic the way we humans would approach text translation or text processing. So imagine that we are trying to translate a long sentence. The way the traditional RNNs or LSTMs or other sequential types of neural networks would approach this is to treat each of these words separately and sequentially, which means that by the time the model comes to the end of that sentence, which might be a long sentence, it has already forgotten the initial information that it saw as part of the initial words of the sequence. With attention, instead, we use all the information in the entire sequence. We take each of the tokens forming the entire sequence and compare it to the target word that we are trying to process, and we look at their relationship, trying to understand how intense this relationship is, by using the idea of attention scores. We then use these attention scores to contribute to the estimation of the context of that target word. And we do this for all the words, which means that, for example, if we are looking into a Dutch sequence of five words starting with the word Thomas that we are trying to translate, then in order to estimate the context of the word Thomas, represented by c1, we need to look at each pair of the target word with the corresponding word in the sequence, including the pair of the target word with itself, the word Thomas, and obtain the attention scores. In this case, the attention scores corresponding to the first word Thomas are alpha 11, alpha 12, alpha 13, alpha 14, and alpha 15.
What this means is that we will look at the word Thomas and see how much attention the word Thomas itself should get when we are estimating the context of the word Thomas, so c1; then how much attention the second word should get when we are looking into the context of the word Thomas; the same holds for the third word, so alpha 13, then alpha 14 for the fourth word, and alpha 15 for the fifth word. Then we use a weighted sum of these attention scores to get the context of the word Thomas, so c1. Then we do the same for the second word to get the context of the second word, then the third word's context, and so on. So why are we doing this? The intuition is that we are trying to mimic the way we humans would approach text processing or text translation tasks. When we are looking at a long text, we are looking at all the words, and we are not forgetting the context of the earlier words that we saw and translated by the time we approach the end of the sequence. But when we look at the way RNNs and LSTMs work, we can see that those two, due to the vanishing gradient problem, end up being unable to discover these long-range dependencies. What this means in practice is that by the time they approach the end of the sequence, or even the middle of the sequence if it's a very long sentence, they have already forgotten the initial words, which means that they will not be able to accurately estimate the context of that word, which is something that we want to fix. Because we translate or process words in sequences differently: we do not forget the earlier words by the time we are in the middle or at the end of the sequence. Therefore, the way to address this problem of the RNNs and LSTMs and the way they approach these text processing tasks is by using the idea of the attention mechanism, and specifically the self-attention mechanism. Now let's look into the technicalities behind the self-attention mechanism, which I would suggest you briefly mention when you are answering this type of question, to show your interviewer that you understand how the attention scores are calculated. So the attention scores, alpha 11, alpha 12, alpha 13, etc., are all stored in this attention matrix, and the way we compute this attention matrix is by using the query matrix, the key matrix, and the value matrix. Here it's important to understand what those queries are, what the keys are, what the values are, and how we make use of the idea of the dot product, then scaling it, then applying the softmax transformation in order to obtain the attention scores. I will not go too much into the details of these matrices, because in this tutorial we are just trying to answer an interview question rather than explain the entire concept. But just think of the queries as questions that we are asking per token, the keys as answers that we are providing per token, and the values as the actual values from the embeddings that we then use. When it comes to the queries and the keys, let's look into an example that will make their interpretation, and what they represent, much clearer. The query matrix consists of the query values for the tokens that we have in our entire sequence. Let's say we are trying to process the sentence "the cat chased the mouse". The token "cat" is a noun, and it acts as a noun in this sentence.
And the query of this token will be: "I'm a noun searching for the verb that relates to the action I'm associated with." So "cat" in this case is a noun that should be associated with the verb "chased". For this token "cat", the query will be a question that we are asking, saying: well, I'm a noun and I'm searching for an associated action, so what is that associated action that is highly related to me? Then the corresponding key, for instance for the word "chased", will be: well, I'm a verb and I'm connected to the noun before me, so I am looking for that noun in my close surroundings. So in this case, by using this key and this query, we are basically answering with the key the question that we are asking as part of the query. So per token, we have the queries that represent them, and then we have a bunch of keys that correspond to those tokens, and we look into the relationship between the two. Most likely we will end up with a high value as the intensity between this query and this key, which means that we can then use this to say: well, "chased" is the verb connected to the noun before it, and the corresponding query that it answers is the one of the "cat", so "I'm a noun searching for my associated action." Given that there is a high intensity between this query and this key, we will say: well, this key is actually answering that query, so let's bring them together, which means that they contribute to the context of each other, and this relationship should account for a large amount of the context estimation. The query for "cat" and the key for "chased" are then generated based on their values and the positions that they take in the sentence. So once we have the query and the key matrices, we take the transpose of the keys, and then we perform the dot product between the query matrix and the transpose of the keys matrix. I will not go into the details behind the dot product and the transpose, because by this point I assume that you have the linear algebra background for this specific task: you know what matrices are, what the dot product between matrices is, what a transpose is, and how we perform these operations. If you do want to learn about this, then make sure to check out my other tutorials, because in them I do go into the details behind this linear algebra. But for now I will assume that you already know them, and this is just a matter of a refresher. So we take the scaled dot product between Q and the transpose of K, because we use the square root of d_k as the scale: we divide the dot product between Q and the transpose of K by the square root of d_k. Here d_k is just the size of the model, and if we are using multi-head self-attention, then it will be the size of the head, so the number of features from the embeddings that we are using for that specific head. We'll come to this in the next question; for now, just assume that d_k is simply the size of the model, so the number of features that you are using as part of your embeddings to describe a single token. Then we apply the softmax transformation to transform these into values in the range of zero and one, and we multiply this with the matrix of values V, where the values come from the actual embeddings. So in this way, when using the self-attention mechanism, we ensure that we can find the affinity between each pair of tokens that we have in our sequence.
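Here is a minimal numpy sketch of that scaled dot-product attention formula, softmax(Q K^T / sqrt(d_k)) V; the random matrices simply stand in for the learned query, key, and value projections of a five-token sequence, so the actual numbers are meaningless and only the shapes and the mechanics matter.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal sketch of the formula described above: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # pairwise affinities between tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # row-wise softmax -> attention scores
    return weights @ V, weights

# Toy example: 5 tokens (like the five-word sentence above), d_k = 4. Random numbers stand in
# for the learned projections of the embeddings into queries, keys and values.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 4)) for _ in range(3))
context, attn = scaled_dot_product_attention(Q, K, V)
print(attn.shape)      # (5, 5): one attention score per pair of tokens
print(context.shape)   # (5, 4): one context vector per token
```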
And we then no longer need to process the text sequentially; instead we can process it in a parallelized way. You might notice that this is something we spoke about on multiple occasions as a limitation of RNNs and LSTMs, because we were unable to parallelize RNNs and LSTMs due to their sequential processing nature, something that we no longer have to deal with in self-attention-based mechanisms like transformers. And that's the idea behind the self-attention mechanism: to ensure that the language model based on it is able to process the text in a parallelized way, which means that it can handle long-range dependencies and it can also be computationally efficient when dealing with very large, scaled applications. What is multi-head attention, and how does it enable transformers to process sequences more effectively? When it comes to the attention mechanism, we saw that we were using the entire vector of embeddings per token. So we were using all the embeddings that we initially created per token, let's say 512 features, and we were processing them at the same time without dividing them into sections. This is what we refer to as single-head self-attention: we have just one head, and in one go we use all these features per token to process and learn about that word. Now, in the case of multi-head self-attention, instead of performing the self-attention process once, we perform it in multiple iterations. So we perform it in the multi-head mechanism, using the idea of heads. We then process this operation of obtaining the attention scores not in one go, but in multiple goes, and the number of times we do this, let's say three turns or four turns, is what we call the heads. So instead of processing it just once in one head, we'll be processing it in four separate heads, so four separate cases in which we compute the attention scores, and at the end we concatenate all these scores into one attention matrix. The way each head learns will be different from the other heads, so in this way we are able to learn the context behind each pair of words, the intensity between them, in a different way. This means that we will have multiple views of the same word, where each view looks at the information from a different perspective. So we will be looking at the same word, but from a different perspective, using different features of that word, because we will be using only one part of the entire embedding vector, and in this way we will be able to learn a different aspect of that token. Behind this, from a mathematical perspective, is the fact that we will have not just a single set of query, key, and value matrices, but multiple of them. This graph actually comes from the initial paper, Attention Is All You Need, and you can see that the way multi-head attention works is that it has these different layers, you see "Linear", and then it has these shadows, which actually represent those multiple heads in which we will be processing our text. And here we have the matrix V, which represents the value matrix, and you can see that we have multiple of them; then we have the per-head keys matrices, so per head we will have a different keys matrix, and then we have the query matrices, again multiple of them.
And I think it greatly represents the idea of multi-head self-attention, because in single-head self-attention, the way we do it is that we make use of our embeddings to transform them into a query matrix, a single one, then the key matrix, and then the value matrix. In the case of multi-head self-attention, you can see that we decompose each of those matrices into multiple, smaller versions of them. So you can see here we have a single query matrix, denoted by Q, which is decomposed into Q1, Q2, Q3, and Q4; in this case we have four different heads. Then we have the same for the keys matrices, we decompose them into four different keys matrices, K1, K2, K3, K4, and then we have V1, V2, V3, and V4 for the values. This means that per head, let's say head one, head two, head three, and head four, we have the corresponding much smaller query matrix, keys matrix, and value matrix. It also means that every time we are looking into specific features of the same token, because we have the different tokens and we have the entire size of the model, let's say 512, which means we are describing each of these tokens by 512 features. And if we divide them into, let's say, four different matrices, this means that per head we are looking at only 128 features of the token, so per head, the entire attention mechanism will only be exposed to a different 128 features, or characteristics, of the same token. This means that we will be learning the same token from different aspects, and this will then help us identify the context of these different tokens in a much better way, at a higher quality. Once we have these different matrices, we can then, per head, compute the corresponding dot product, so the dot product between that head's query matrix and the transpose of that head's keys matrix, let's say Q1 with the transpose of K1. Then we scale it, so we divide it by the square root of d_k, where d_k here is 128, the size of the head. We take the softmax transformation, we multiply it with the value matrix for that head, and we compute the attention matrix for that head. Once we have computed the attention matrices for head one, head two, head three, and head four, we simply concatenate those matrices, and we end up with one large attention matrix that contains all the information, all the attention scores. Once we have this, we are able to provide better quality in terms of identifying the context behind these tokens, and this has proven to give higher quality when processing text for various applications, be it machine translation or any other type of application using transformers. The next question is, what are transformers, and why are they important in combating the problems of models like RNNs and LSTMs? Transformers were first introduced as part of the paper Attention Is All You Need, and this paper entirely changed the way the industry was looking at various types of NLP tasks, from machine translation to content generation, but also applications beyond NLP, natural language processing, tasks. Transformers are deep learning models that are based on the self-attention mechanism, and to be more exact, the multi-head self-attention mechanism.
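As a rough sketch of this multi-head idea, the following toy numpy code splits a 512-dimensional representation into four heads of 128 features each, runs scaled dot-product attention per head, and concatenates the results; in a real transformer each head has its own learned Q, K, V projection matrices, which are simplified away here by slicing the input directly.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, num_heads=4):
    """Split the model dimension across heads, attend per head, then concatenate.
    Simplification: each head's Q, K, V are just a slice of X instead of learned projections."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads                        # e.g. 512 / 4 = 128 features per head
    heads = []
    for h in range(num_heads):
        Qh = Kh = Vh = X[:, h * d_head:(h + 1) * d_head]  # this head only sees its own 128 features
        scores = softmax(Qh @ Kh.T / np.sqrt(d_head))     # per-head attention matrix
        heads.append(scores @ Vh)                         # per-head context vectors
    return np.concatenate(heads, axis=-1)                 # concatenate back to (seq_len, d_model)

X = np.random.default_rng(1).normal(size=(5, 512))        # 5 tokens, d_model = 512
print(multi_head_self_attention(X).shape)                 # (5, 512)
```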
And unlike algorithms such as RNNs or LSTMs, transformers process the text not in a sequential but in a parallelized way, which means that in one go they process all the information, whereas in the case of RNNs and LSTMs, we saw that this process was done sequentially, at every time step using the information from the previous state. This is entirely different in transformers, and this architecture allows for significantly improving the performance of the model in terms of computation. It allows for parallelization, which was a key factor in the speed and efficiency of the algorithm, but also in its quality, because due to its nature and architecture, the transformer introduced not only this parallelization aspect, but also the ability to handle long-range dependencies, to be more context-aware, and to generalize and handle various types of generative tasks. The algorithms that you might have heard about, like the GPT series or T5, are all based on either the entire transformer model or just part of it, and from there they have been improved and improved to become those state-of-the-art large language models. So let's look at those different aspects and compare the transformers versus the RNNs and LSTMs. I will not go too much into the details of the architecture for this specific question, but I will go into it step by step in the next question. When it comes to parallelization, as I just mentioned, transformers, unlike RNNs and LSTMs, allow for parallelization, so they make it possible to process the entire data in one go and parallelize the process, which means that we can train the algorithm much faster. Unlike the RNNs and the LSTMs, which are sequential-processing-based algorithms: they process the data sequentially, which means that the training process is much slower. The second aspect that we can compare is the ability to handle long-range dependencies. Given that the RNNs and LSTMs struggle a lot with the vanishing and exploding gradients problem, this also means that, as a result, they are unable to handle long-range dependencies. So by the time they come close to the end of a long sentence, or even the middle part of it, they start to forget what they saw in the earlier part of the sentence, which is something we want to avoid when we want to provide high-quality processing of the text. That is something the transformers have overcome thanks to the self-attention mechanism. By using the self-attention mechanism, transformers allow us to process a text in a parallelized way, and also to no longer suffer as much from the vanishing gradient problem. This also means being able to handle long-range dependencies and to consider all parts of the text at the same time when figuring out the context of one specific word. Then the other aspect we can talk about, in terms of the difference between transformers and LSTMs or RNNs, is scalability. Due to the parallelization mechanism and the ability to efficiently train the network, unlike the RNNs or LSTMs, the transformers are more scalable, and they are suited for tasks that go even beyond language processing, such as image recognition or playing games. For all these cases, we can build flexible, generalizable models that we can adopt and adapt as the data scales.
So in this case, both the scalability aspect and the flexibility and generalizability of the models are different, because LSTMs and RNNs are much harder to use for all these other applications, whereas transformer-based models like GPT, for instance, are pre-trained on one type of data, which is usually a very large dataset, with billions of parameters, and then we can fine-tune this pre-trained model and apply it to entirely different sorts of tasks. Therefore, they are also called general models, or generative models. Then we can talk about the attention mechanism, which is definitely absent in the case of RNNs and LSTMs, because those models inherently do not make use of the attention mechanism and they process the input differently, whereas the transformers make use of this attention mechanism to process the data in a very efficient, and also very context-aware, way. The next question is: can you explain the architecture of transformers? In this question, your interviewer wants to understand whether you know all the different components that form transformers, which is usually the exact architecture that we see as part of the Attention Is All You Need paper. So here I have taken the figure from the Attention Is All You Need paper, and as you can see, we have the two parts that form the transformer: on the left-hand side we have the encoder part, and on the right-hand side we have the decoder. In the encoder part we are processing the information, and in the decoder part we are trying to reconstruct from this learned representation. In the encoder part, we start with the input embeddings. Those are the token embeddings that we are using to represent the tokens in our sequence. This can be, for example, of size 512, where each token will be represented by these 512 features, and each of these features will describe one characteristic, one aspect of that token. The next thing is to add, on top of these input embeddings, the positional encodings. The positional encodings tell the transformer what the position of that word in the entire sequence is. Is this word in the second position of the entire sequence, in the third, in a middle position, in the end position? So basically, in this way we keep track of where exactly this token is coming from and what its corresponding position is. The positional encodings are based on the sine and cosine functions, and which of the two is used is based on whether the position of that element is an even number or an odd number. I will not go too much into the details here, because that is out of the scope of this interview question, but just keep in mind that the positional encodings help us keep track of the position of the token in the entire sequence, and we have, per token, an entire vector, in this case of size 512. So we will have 512 elements per token describing the position of that token, just like in the case of the embeddings, where we have 512 embedding features describing that word; the same also holds for the positions, so we have the corresponding position of that token represented by a vector of size 512. Once we have gone through this initial stage of the network, and we have the positional encodings added on top of the input embeddings, the next step is to derive from this data the queries, keys, and values.
So these are the queries, the keys, and the values that we saw as part of the attention scores calculation. Once we have this, we supply it to the layer that we call multi-head attention. This multi-head attention layer performs the multi-head self-attention that we saw as part of the previous interview questions: basically computing, per head, the attention scores and then bringing them all together, concatenating them into a single attention matrix. Then the output of this multi-head attention layer is provided to the next layer, and as you can see here, we have this small arrow that goes from the multi-head attention to the layer labeled Add & Norm. What Add & Norm represents is the layer normalization and the residual connections. We add the two in order to optimize our model, to ensure that our gradient has a shortcut and can flow directly through the network, so we combat the vanishing gradient problem. The layer normalization is applied to optimize the model, indirectly also helping with the problem of vanishing and exploding gradients, but at the same time stabilizing the network, so ensuring consistent weight changes, so that the network can learn the dependencies in a stable way. As you can see here, the arrow from the input not only goes into the multi-head attention, but also goes to the Add & Norm layer, and this is basically the residual connection, because in the residual connections we are adding the input on top of the output: we are adding the input of the multi-head attention on top of the output of the multi-head attention, and then we apply the layer normalization to stabilize our network. The next step is to add a fully connected feed-forward network, which will then be able to learn from these attention scores and utilize the optimization algorithms, that is, backpropagation, optimizing the network to learn the dependencies between these different embeddings. Then we again apply the residual connections and layer normalization to ensure that the network is optimized. Once we have done this N times, because we do this in blocks so that we can stack these layers, the output from the encoder is supplied to the decoder. In the decoder, we take the output embeddings, and we add to all these output embeddings the positional encodings, like in the case of the encoder. Then we again perform multi-head attention, but this time we perform masked multi-head attention, which means that we are masking the upper part, the upper triangle, of our attention matrix. The reason for that is that we don't want our algorithm to cheat. We want it to predict the next word, and if we already provide the algorithm the next words, so their attention scores, then the algorithm will be inclined to cheat and to look at those intensities in the upper triangle, so for the words that come after this word, which is something that we want to avoid. We want the algorithm to predict the next word, and for the algorithm to predict the next word, we should ensure that the algorithm sees only the preceding words, and not the words that come after that word, because that would help the algorithm to cheat and to see what the word it is supposed to predict actually is. Because if we know what the word that comes after the target word is, then it is much easier for the algorithm to know that the target word is that word. You can see a small sketch of such a causal mask right below.
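Here is that small causal-mask sketch in numpy; positions above the diagonal get a very large negative score before the softmax, so each token can only attend to itself and to the tokens before it. This is only an illustration of the idea, not the exact implementation from the paper or from any specific library.

```python
import numpy as np

def causal_mask(seq_len):
    """Boolean upper-triangular mask: True marks positions that must be hidden,
    so position i may only attend to positions <= i."""
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

def masked_softmax(scores, mask):
    scores = np.where(mask, -1e9, scores)    # masked positions get ~zero weight after softmax
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.random.default_rng(2).normal(size=(4, 4))   # toy raw attention scores for 4 tokens
weights = masked_softmax(scores, causal_mask(4))
print(np.round(weights, 2))   # row i has zeros to the right of position i
```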
So to prevent the model from cheating, and to make it actually predict the next word, we need to mask all the attention scores for the upcoming words. That's the idea behind the masking mechanism, and that's what we are applying here: we are doing the masked multi-head attention. Then again we add the residual connections and the layer normalization to optimize the network. Then we take the queries from here, and we combine them with the keys and the values coming from the encoder in order to perform multi-head attention again, this time between the decoder and the encoder. Then we again add the layer normalization and residual connections here, then we add a fully connected feed-forward network, then again the residual connections and layer normalizations. And once we are done with this, we have successfully reconstructed the representation of the tokens in our sequence, and we end up with these word embeddings. These we then need to transform: after training the model and ending up with a model that performs well, or once the stopping criterion is reached, let's say the number of iterations, we perform a linear transformation and then a softmax transformation to end up with probabilities, so with a set of values that say, for that specific position, the probability of this word is this, the probability of that other word is that. And then we can, for instance, either select the value with the highest probability, or randomly select from the top five probabilities, depending on the criteria that you are using for selecting from the probabilities. But the idea behind the transformers at a high level is this: an encoder and decoder structure, using all these parts in order to perform text processing. The next question is, what are positional encodings and how are they calculated? Positional encodings play a very crucial role as part of the transformer models. As you can see, they appear here as part of the encoding, and they appear here as part of the decoding process. So here, in the encoder, the positional encodings are added on top of the input embeddings, and here we have the positional encodings added on top of the output embeddings as part of the decoder architecture. The purpose of using the positional encodings in the transformers is to provide information about the position of the elements in the sequence. Unlike the RNNs or LSTMs that use the previous information to sequentially process the data, which directly means knowing what the word in the previous step was, so what the position in the sequence is given that we are processing the data sequentially, in the case of transformers we don't have that information, because we are using the self-attention mechanism to look into the word embeddings and to process this text in a parallelized way. Given that we are doing it in a parallelized way, we no longer know where that word is coming from: we just obtain those attention scores and we just know what the intensities between these different words are, but we don't know where they come from. And to properly train, and then get a sensible output, for instance when translating the text, we need to keep in mind that first comes the word Thomas, for instance, in that example that we saw before, and that the following words come in their own specific positions. For that, we need to use the positional encodings, because unlike the RNNs or LSTMs, we don't know what the previous word was.
In RNNs and LSTMs, we knew what the previous word was, and we were using that, combined with the current state and the current input, in order to process the text and understand the current context. In the case of the transformers, we don't have that information: we are not processing the data sequentially, we are doing it in a parallelized way. And given that we are doing it in a parallelized way, we need the positional encodings to keep track of the positions of the tokens, which can then help both in understanding the context behind the words depending on the positions they take, and in ensuring that the format of the output we are producing is the format we want it to be. Because if we are translating a sentence, we don't want the first word to come in the middle of the sentence when translating it, or the last element of the sentence to end up at the beginning of the sentence once we have translated it. So positional encodings basically have two different functions. The first one is to ensure that the output is in the desired format and that we are doing our job properly when it comes to machine translation; the other function is to help find the context behind the text, depending on the positions the words take. When it comes to the calculation of the positional encodings, we use the sine and cosine functions; from trigonometry, we know what sine and cosine are. And depending on the position of the feature per token, we then know how to calculate the corresponding positional encoding. For all the positions in the vector representation of the token, we know whether the position is an even position or an odd position. For all the even positions, we take the sine function: PE(pos, 2i) = sin(pos / 10000^(2i / d_model)). You might recall that we were representing each of those tokens with a vector of size 512; in the same way, we are going to end up with a vector of 512 positional encoding values for each of those words. And for all the cases when the position is an odd number, we use the cosine function: PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model)), where d_model will then be 512. So, for instance, if we are looking at an element whose index in this vector is an odd number, we take the token's position, divide it by 10000 raised to the power of 2i divided by 512, and then take the cosine of it. This gives us that element of our positional encodings, and we continue this process until we end up with a vector of 512 positional encodings per token. The last question is: why do we add positional encodings to transformers, but not to RNNs or LSTMs? When it comes to the RNNs and LSTMs, we already know what the positions of the tokens are, because we are processing the text sequentially, and for each input we get the corresponding output; also, for each word, we know what the previous word is.
So we are using the previous state information to update the hidden state of that time step, and this helps us keep track of the positional information of the text. This is not the case when it comes to transformers, because unlike the LSTMs or RNNs, transformers are not processing the text sequentially; they are processing the text in a parallelized way. And this means that we somehow need to add information about the positions of these tokens in the entire sequence. The two reasons for this are, first of all, to use the positional information as a way to estimate the context of those words, to learn their semantic interpretation, their semantic nature, because if we know that a word appears at the beginning of the sentence, its context can be different than if the same word appears in the surroundings of other words or at the end of the sentence. So this positional information will help improve the context that the model will learn. And the second function is to ensure that the output we are producing makes sense. When we are translating a sentence, we don't want to end up with a case where the word that we are translating, the first word, appears in the third position or at the end of that sentence; we want the first word to be translated into the corresponding right position when we are generating output from our transformer. So those are basically the two reasons why we use positional encodings as part of transformer models, but we do not use them as part of RNNs or LSTMs (long short-term memory networks).
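To wrap up, here is a minimal numpy sketch of the sinusoidal positional encoding formula discussed in the last two questions, with even dimensions using sine and odd dimensions using cosine; the sequence length of 10 is an arbitrary choice for illustration.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model=512):
    """Sketch of the formula:
    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))"""
    positions = np.arange(seq_len)[:, None]             # (seq_len, 1): token positions
    i = np.arange(0, d_model, 2)[None, :]                # even dimension indices 0, 2, 4, ...
    angles = positions / np.power(10000.0, i / d_model)  # one angle per (position, pair of dims)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                         # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                         # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=10)
print(pe.shape)   # (10, 512): one 512-dimensional position vector per token,
                  # added on top of the token embeddings before the encoder/decoder blocks
```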