speaker 1: Welcome to the fifth iteration of our CS25 Transformers class. So Div and I started this class a while back after seeing how transformers, and machine learning and AI in general, became such a prevalent thing, and we predicted it would become an even bigger part of our lives going forward, which does seem to be the case. So as large language models and AI in general take over the world, whether it's through things like ChatGPT, image generation models, video generation models like Sora, and so forth, we felt that having a class where people can come and learn about transformers and how they work, and especially hear from leading experts in industry and academia working on state-of-the-art research in this area, would be very beneficial to everybody's learning and help us progress further within AI and technology in general. So welcome to our class. The way the class works is that each week we typically invite a leading researcher from either industry or academia to come speak about some state-of-the-art topic they're working on in transformers. We have an exciting lineup of speakers prepared for you this quarter. This first lecture will be delivered by us, where we'll go through the basics of transformers. We've divided this lecture a bit differently from previous years in that we have a section on pre-training and data strategies, and then a section focused more on post-training, which has become a very popular topic these days. We'll also touch briefly on some applications of transformers and some remaining weaknesses or challenges that we should hopefully address to further improve the state of AI and our machine learning models. Oh, and we'll start with some instructor introductions. We have a very good team of co-instructors. My name is Stephen. I'm a current third-year CS PhD student here. I previously did my undergrad at Waterloo in Canada, and I've done some research in industry as well, at Amazon and NVIDIA. In general, my research hovers around natural language processing, so machine learning for language and text, looking at things like whether we can improve the controllability and reasoning abilities of large language models, and more recently, cognitive science and psychology inspired work, especially bridging the data gap and the learning-efficiency gap between machine learning models and humans: how human children learn, and how our brains are able to learn so efficiently. I've also done some work with multimodality as well as computer vision, so things like diffusion models and image generation. And just for fun, I also run the piano club here with Karan, and we have an upcoming concert on April eleventh, in case you're interested.

speaker 2: Hi everyone. I'm Karan, a second-year electrical engineering PhD student. I did my undergrad at Cal Poly San Luis Obispo, after which I was a research scientist here before starting my PhD. I'm a little more on the medical imaging and computer vision side, so a lot of my current work is at the intersection of computer vision and neuroscience, working with things like fMRI and ultrasound. I currently work in the SAI Lab, a new lab under Dr. Ehsan Adeli.

speaker 3: Hi everyone. I'm Chelsea.
I'm a first-year master's student in Symbolic Systems, and my general research interests are in multi-agentic frameworks, self-improving AI agents, and overall just improving the interpretability and explainability of models. Previously, I studied applied math and neuroscience, and I did a bunch of interdisciplinary research in computer vision, robotics, cognitive science, things of that sort. Currently I'm working part-time at a VC firm, and over the summer I'll be interning at a conversational AI startup as a machine learning engineer. So I'm very interested in exploring the startup scene here at Stanford, so feel free to reach out.

speaker 4: Hi everyone. I'm Jenny. I'm a current student majoring in Symbolic Systems as well as doing a sociology co-term here at Stanford. My background is primarily in technology, ethics, and policy, so if you have any questions or want to talk about that, I'd love to have a conversation. In the past, I've worked doing product at D. E. Shaw and also research in the tech ethics and policy space. And this summer I'll be working at Daydream, which is an AI fashion tech startup in New York.

speaker 1: And so yeah, Div was unable to join us today, but he's working on his new agent startup, AGI Inc, and is currently on leave from his CS PhD here. He's passionate about robotics, AI agents, and so forth, and later in the quarter he'll likely be giving a lecture on everything to do with AI agents. So if you're interested in that, definitely look forward to it. Previously he's worked at NVIDIA, Google, and so forth, and he's the one who, you know, started this class in the first place.

speaker 4: All right, so I'll go over some of the course logistics. The first announcement is that we have a new website up at cs25.stanford.edu. All of our updates, as well as the speaker lineup, will be posted there in the coming weeks. That will also be where we link the Zoom for people who are not Stanford affiliated, are on the waitlist, or have not been able to gain admission into the class. So we encourage everyone to share this class with their network and ensure that anyone can access it over Zoom. Some takeaways from the course include: a better understanding of transformers and the underlying architecture of many of our large language models; guest speakers who will be talking about applications in language, vision, biology, robotics, and more; exposure to new research, especially from leading researchers all across the country; innovative methods that are driving the next generation of models; as well as key limitations, open problems, and the future of AI.

speaker 1: [inaudible]

speaker 2: Okay, next I'll give a really brief intro to transformers and how attention works. The first step for language is word embeddings. Words aren't numbers, so we obviously can't just pass them into a model as is. So the first step is converting them into dense vectors in a high-dimensional space. This is done through various methods, but the goal is to capture semantic similarity: essentially, that cat and dog are more similar than cat and car, even though the latter pair is more similar from a character standpoint. Doing so enables visualization, learning with transformer models, or arithmetic: like I've shown, king minus man plus woman would approximately be queen in some embedding space. Classical methods for this are word2vec, fastText, and many more these days.
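[Editor's note: to make the embedding-arithmetic idea concrete, here is a toy sketch with hand-made 3-dimensional vectors. The numbers are purely illustrative, not real word2vec or fastText embeddings, but the analogy computation is the same.]

```python
import numpy as np

# Toy "embeddings" (illustrative values only, not from a trained model).
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "car":   np.array([-0.7, 0.3, 0.0]),
}

def cosine(a, b):
    # Cosine similarity: how aligned two vectors are, independent of their length.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman should land closest to queen in this toy space.
target = emb["king"] - emb["man"] + emb["woman"]
best = max(emb, key=lambda w: cosine(emb[w], target))
print(best)  # -> "queen" with these made-up vectors
```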
But static embeddings have limitations, for instance giving the word "bank" the same meaning in a financial bank as in a riverbank. Therefore, the current standard is contextual embeddings, which take into account the context and the sentence that the word is in. Self-attention can be applied here to learn what to focus on for a given token. To do this, you learn three matrices, a query, key, and value, which together comprise the attention process. A quick analogy: imagine you're in a library looking for a book on a certain topic. That topic would be your query. Now, let's say each book has some summary associated with it, a key. You can match your query against the keys and get access to the book you're looking for. The information inside the book would be your value. So in attention, we do a soft match over the values to get information from, say, multiple books, and this comprises the attention operation. And as you can see in this visualization, when you apply this to language, across different layers of the model, different words have connections to the rest of the words in the sentence. The next component is positional encodings, or embeddings, which add order to the sequence. Without these, since you have just linear multiplications here, the model would not know what the first or the last word in the sentence is. Therefore, you add some notion of order through, say, sinusoids. Or in the simplest form, you could think of the first word getting a zero, the second a one, and so on. From here, it's basically just scaling through multiple layers and multi-head attention. More heads to attend to different parts of the sentence, and more parameters, mean that you can capture more diverse relationships from your sequences. And this gives you the final transformer. Transformers today have overtaken pretty much every field: from LLMs like GPT-4o, o3, and DeepSeek, to vision, with models that are getting increasingly better at segmentation and more, to speech, biology, and video. You'll see a lot of these applications throughout the quarter. Large language models are essentially just scaled-up versions of attention and the transformer architecture: you throw a large amount of general text data derived from the web at these models, and they learn to model language very well through a next-token prediction objective. And as you scale up, we've seen emergent abilities pop up: while at a smaller scale you might not be able to do a certain task, once you get to a certain scale, the ability to do that task suddenly appears. Some disadvantages, though, are that these models have very high computational costs, and therefore there are also concerns around climate and the carbon emissions they may produce. And like I was mentioning, larger models are very good at generalizing to many abilities or tasks, and they're essentially plug-and-play with few- or zero-shot learning.
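[Editor's note: as a rough illustration of the query/key/value attention and sinusoidal positional encodings just described, here is a minimal numpy sketch of single-head scaled dot-product self-attention. Shapes and weights are arbitrary; real transformers add multiple heads, masking, feed-forward layers, and many stacked layers.]

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sinusoidal_positions(seq_len, d_model):
    """Classic sin/cos positional encodings from 'Attention Is All You Need'."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # how much each token attends to every other token
    weights = softmax(scores, axis=-1)        # each row sums to 1 (the "soft match")
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model = 5, 16
X = rng.normal(size=(seq_len, d_model)) + sinusoidal_positions(seq_len, d_model)
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))
out, attn = self_attention(X, Wq, Wk, Wv)
print(out.shape, attn.shape)  # (5, 16) (5, 5)
```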
speaker 1: All right, so now I'll talk a bit more about pre-training. Karan explained how the transformer works, but with a language model, especially a large language model, you typically divide training into two stages. There's the pre-training stage, where you train the neural network from scratch, from randomly initialized weights, to give it more general capabilities. And a big portion of this is the data itself. The data is the fundamental fuel that allows your model to learn, because that's what the model is learning from. Your goal with pre-training is to train on a large amount of data to obtain some general level of capabilities and overall knowledge or intelligence. And this is arguably the most important aspect of training, especially because LLMs learn based on statistical distributions, predicting the next token given previous tokens. To effectively learn this, you typically need a large amount of data. So given its importance, how do we maximally leverage it? Smart data strategies for pre-training are definitely one of the most important topics these days. I'll briefly touch upon two of the projects I recently worked on, at two different scales. The first looks at what makes small, childlike datasets potentially effective for language learning, so on the smaller scale. The second looks at smart data strategies for training large models on billions or trillions of tokens, which is the much larger scale. So why are humans able to learn so efficiently? This looks at how human children learn and interact with an environment and learn language, compared to a model like ChatGPT; it's a bit analogous to how the human brain learns language, and learns in general, compared to something like a neural network. Some potential key differences: humans learn continuously. We don't just pre-train; we don't just sit in a chair, have someone read the whole Internet to us, and then stop learning from there. That's unlike a lot of current models, which are more single-pass, pre-trained models. Further, we have more goal-based approaches to learning and interaction with the environment; that's a major reason we learn. Whereas these models are typically just pre-trained on large amounts of data using next-token prediction or autoregression. Further, we learn through continuous multimodal or multi-sensory data. It's not just text only; we're subconsciously exposed to probably hundreds of senses that guide the way we learn and approach our daily lives. Further, I believe our brains are fundamentally different in that we probably learn in more structured or hierarchical manners, for example through compositionality, rather than simply next-token prediction. The focus of this project in particular is more on the data differences: humans are exposed to dialogue from the people we talk to, and storybooks, especially as children, compared to large amounts of data on the Internet. So this is a work that was published. Why do we care about small models and training on small amounts of data? Well, this would greatly improve the efficiency of training and using large language models, and it would open the door to potential new use cases, for example models that can run on your phone, that you can run locally, and so forth. Smaller models trained on less data are also more interpretable and easier to control or align, whether for safety purposes, to reduce bias, and so forth, to ensure people are using them for safe reasons with appropriate guardrails in place. This would also enhance open-source availability, allowing research on and usage of these models by more people around the world, rather than only companies with large amounts of compute.
And in general, this might even help us understand the other direction, which is how humans are able to learn so effectively and efficiently. Yep. So this work is titled "Is Child-Directed Speech Effective Training Data for Language Models?", which I presented at EMNLP in Miami last November. Again, the hypothesis here is that children probably learn fundamentally differently from LLMs. This is why we're able to learn from several orders of magnitude less language data than many of these large language models, which require trillions of tokens. Now, there are several hypotheses. One is that the data we receive as humans is fundamentally different from what LLMs get: rather than just training on Internet data, we actually interact with people, we talk to people, we hear stories that our parents and teachers tell us, and so forth. Another is that maybe the human brain just fundamentally learns differently, so our learning algorithm is just different from large language models'. And another is that maybe it's the way, or the structure in which, we receive this data. Any data we receive is somewhat curricularized: we start off with simple data, simple language as a child, and then learn more complex grammar and hear more complex speech from our parents, coworkers, and so forth. Anything we do, whether it's learning math or something else, we start simple and then move on to more difficult problems. Whereas with language models, you typically don't care too much about ordering or curriculum. So there are multiple different hypotheses here. In order to test some of these, what we did is train small GPT-2 and RoBERTa models on five different datasets. One is CHILDES, which is natural conversational data with children, transcribed. Then we collected a synthetic version called TinyDialogues, which I'll discuss more later. There's BabyLM, which is a diverse mixture of different types of data, including things like Wikipedia data and so forth, so it's closer to your typical large language model pre-training data. And then we also did a bit of testing with Wikipedia as well as OpenSubtitles, which is movie and TV transcriptions. So we collected TinyDialogues. This was inspired by the fact that, as I said, a lot of our learning as children is through conversations with other people. And conversations naturally lead to learning, right? We talk to someone, they give us feedback, we reflect on how the conversation went, so it's both peer feedback and self-reflection. Furthermore, conversations lead not only to learning of knowledge, but also other things like ethics and morals, for example parents or teachers telling us as children what's right or wrong to do. And there are many different types of conversations you can have with many different types of people, leading to a lot of diversity in learning. So what we did is collect a fully grammatical and curricularized conversation dataset with a limited, childlike, restricted vocabulary, using GPT-4. We collected examples that differ by child age, the different participants in the conversation, and so forth. Here are just some examples of data points in our collected dataset. You'll see that as the age goes up, the utterances and conversations become more complex and longer, and the participants also differ appropriately by age.
So we also ran a curriculum experiment, where we ordered the data either by ascending age, so the model first sees two-year-old conversations, then five-year-old conversations, then ten-year-old and so forth, versus descending order, since it's possible a language model might actually learn better from more complex examples first. And then, of course, there's the typical baseline of randomly shuffling all your data examples. We have some basic evaluation metrics targeted at fundamental capabilities: one is basic grammatical and syntactic knowledge, and the other is a free word-association metric called word similarity for assessing more semantic knowledge. You see here, across the different datasets, that training on childlike data actually seems worse than a heterogeneous mixture of Internet-style data like BabyLM. Both metrics degrade quite substantially, especially on CHILDES, the more natural conversation dataset between children and their caregivers. And you'll see that in terms of curriculum, we don't see substantial differences no matter what order you feed the examples to the model in, which is surprising, because as humans we go from simple to more difficult. Looking more closely at convergence behavior, or the loss curves, you'll see here that the training loss has this sort of cyclical pattern depending on the buckets you use for the curriculum. But the validation loss, which is what you really care about for generalization and learning, has the exact same trend no matter what order you feed the examples in, which is, again, a very interesting finding. So overall, we see that diverse data sources like BabyLM seem to provide a better learning signal for language models than purely child-directed speech. We do see, however, that our TinyDialogues dataset noticeably outperforms the natural conversation dataset, likely because that dataset is very noisy, whereas ours is synthetically generated by GPT-4. And again, global developmental ordering using curriculum learning seems to have negligible impact on performance. So overall, we can conclude that it's possible that other aspects of children's learning, not simply the data they're exposed to, are responsible for their efficient language learning, for example learning from other types of information like multimodal information, or the fact that the learning algorithm in our brain is just fundamentally different and more data-efficient than language modeling techniques. If you wish to learn more, our datasets are released on Hugging Face as well as GitHub, and the paper is up on arXiv as well.
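[Editor's note: as a toy illustration of the curriculum conditions compared in the experiment above (ascending age, descending age, random), here is a sketch of how one might order age-bucketed training examples. The example dialogues and dataset structure are made up for illustration and are not from the actual TinyDialogues data.]

```python
import random

# Hypothetical age-bucketed examples, in the spirit of a TinyDialogues-style dataset.
examples = [
    {"age": 2,  "text": "Look! A big red ball."},
    {"age": 5,  "text": "Can we read the dinosaur book before bed?"},
    {"age": 10, "text": "My science project compares how fast plants grow."},
    {"age": 15, "text": "The experiment failed because we didn't control the temperature."},
] * 250  # pretend we have 1000 examples

def order_examples(examples, curriculum="random", seed=0):
    """Return examples in ascending-age, descending-age, or shuffled order."""
    if curriculum == "ascending":
        return sorted(examples, key=lambda ex: ex["age"])
    if curriculum == "descending":
        return sorted(examples, key=lambda ex: ex["age"], reverse=True)
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    return shuffled

# In the reported experiments, all three orderings ended up with very similar validation loss.
for condition in ["ascending", "descending", "random"]:
    ordered = order_examples(examples, condition)
    print(condition, [ex["age"] for ex in ordered[:4]])
```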
So now let's go bigger scale. We were just looking at small models trained on small amounts of data, similar to a human child. Now what about current large models, billions of parameters trained on trillions of tokens? During my last summer internship, I worked on a project with NVIDIA titled "Maximize Your Data's Potential: Enhancing LLM Accuracy with Two-Phase Pretraining." This is about optimized data selection and training strategies in large-scale pre-training. A lot of works like Llama highlighted the effectiveness of different data mixtures, but never really shed light on the exact mixtures and how those decisions were made, whereas we know data blending and ordering is crucial to effective LLM pre-training. So can we shed more light on this? That's what our work does. Firstly, we formalize and systematically evaluate this concept of two-phase pre-training, and we show empirically that it improves over continuous training, which is typically what's done in LLM training, where you just feed in all the data rather than separating it into particular buckets or a schedule. We also do a fine-grained analysis of data blending for these two pre-training phases, and we have this notion of prototyping blends at smaller token counts before scaling up. This two-phase pre-training approach is kind of inspired by how pre-training and post-training work: the first phase is on more general, diverse data, to learn more broadly, and the second shifts to more high-quality, domain-specific data, such as math and so forth. However, it's important to balance quality and diversity in both phases, since upweighting any dataset too much can lead to overfitting. So firstly, does two-phase training actually help? We found that all our two-phase pre-training experiments outperformed the baseline of simply continuing training in a single phase. And this is noticeably better than a randomized mixture of both phases, as well as the natural data distribution compared to our upsampled data distribution for phase two. We also showed that this scales in both model size and data size: if you blow up the token counts as well as the model size, performance further improves with our two-phase pre-training compared to a single phase. This also highlights the effectiveness of prototyping on smaller data blends before scaling up. Furthermore, we investigated the duration of phase two. Should we train on diverse data for a little bit and immediately switch to highly specialized data like math, or should we wait longer? What we found is that performance improves as phase two grows, up to a point around 40% of training, until there are diminishing returns, likely from overfitting: specialized data is, well, more specialized, there's typically less of it, and it's less diverse compared to things like web crawl data, so too much of it can lead to detrimental or diminishing returns. So overall, we see that a well-structured two-phase pre-training approach with careful data selection and management is essential for optimizing LLM performance while maintaining scalability and robustness across different downstream tasks. And in case you're interested, the preprint of this paper is also up on arXiv. So the overall takeaway from these two projects, and what I wanted to get at, is that data effectiveness, especially for pre-training, is not just about the amount of data, but about the quality of the data, the ordering and structure of the data, and how exactly you use it. In our first project, we saw a negligible impact of global ordering in small-scale training, but we saw that phase-based training at larger scales is highly effective. And in general, smart data decisions are essential for models to generalize across tasks. So the takeaway is that our research underscores that effective language modeling isn't just about amassing data, but about smarter data organization that harnesses its structure, quality, and characteristics. And by continuing to refine data-centric approaches, the future of LLM training promises smarter, more efficient, and highly adaptable models.
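[Editor's note: here is a schematic sketch of the kind of two-phase data schedule described above, with a first phase dominated by diverse web-style data and a second phase that upweights high-quality and domain-specific data. The source names, mixture weights, and the 60/40 split are illustrative placeholders, not the actual blends from the paper, which were tuned empirically and prototyped at smaller token counts.]

```python
def tokens_per_source(total_tokens, phase1_frac=0.6):
    """Split a token budget into two phases with different data mixtures.

    The sources and weights below are illustrative only; a real run would
    prototype and tune these blends before scaling up.
    """
    phase1_mix = {"web_crawl": 0.70, "books": 0.15, "code": 0.10, "math": 0.05}
    phase2_mix = {"web_crawl": 0.30, "books": 0.15, "code": 0.25, "math": 0.30}

    schedule = {}
    for phase, (frac, mix) in enumerate(
        [(phase1_frac, phase1_mix), (1 - phase1_frac, phase2_mix)], start=1
    ):
        budget = total_tokens * frac
        schedule[f"phase{phase}"] = {src: int(budget * w) for src, w in mix.items()}
    return schedule

# e.g. a 1T-token run that spends ~40% of its tokens in the specialized second phase
for phase, mix in tokens_per_source(1_000_000_000_000, phase1_frac=0.6).items():
    print(phase, mix)
```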
So now we'll move on to the second stage after pre-training, which is post-training, which Chelsea will talk about.

speaker 3: All right, so we have a pre-trained model. Now what? How do we adapt it to specific tasks and different domains? Some major strategies include fine-tuning, for instance reinforcement learning with human feedback, prompt-based methods, and retrieval-based methods like RAG architectures. One major approach is called chain-of-thought reasoning; I'm sure you've all heard of it by now. It's essentially a prompting technique to think step by step, so the model shows the intermediate steps that provide guidance. This is similar to the way humans think: we typically break down a problem into subsequent steps to help us better understand the problem itself. Another benefit of chain of thought is that it gives some sort of interpretable window into the behavior of the model, and it suggests that there is more knowledge embedded in the model's weights than simply prompting for a direct response reveals. So here is an example of chain of thought. On the left, we have the model solve a problem in a one-shot manner, which turns out to get the wrong answer. On the right, it produces a sequence of reasoning steps and ultimately arrives at the correct answer. Naturally, this brings up an extension of chain of thought called tree of thought. This is another prompting technique, but instead of producing a single reasoning path as chain of thought does, it considers multiple reasoning trajectories and then uses some self-evaluation process to decide on the final output, such as majority voting. In the picture, you can see that tree of thought generates different reasoning paths and selects the best one at the end. Another way is through program of thought, which generates code as the intermediate reasoning steps. What this does is offload some of the problem solving to a code interpreter, so it formalizes language into programs to arrive at more precise answers. We've seen that this sort of problem decomposition seems helpful for different tasks. One way is through Socratic questioning, which uses a self-questioning module to propose subproblems related to the original and solve those in a recursive manner. For instance, if the question is "what fills the balloons," this leads to the next subquestion, "what can make a balloon float," and by decomposing the original problem into subsequent problems, the model can solve it better at the end. Finally, another problem decomposition method is through computational graphs. This formulates compositional tasks as computation graphs by breaking the reasoning down into different subprocedures and nodes. The key takeaway here is that transformers can solve compositional tasks by reducing the reasoning to subgraphs, and this is without developing some sort of systematic problem-solving skill, right?

speaker 1: So Chelsea touched on chain of thought and everything that expands upon it or improves it, and that's mainly a prompting-based method for inference time.
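[Editor's note: to make the prompting side concrete, here is a minimal sketch of chain-of-thought prompting combined with majority voting over several sampled reasoning paths, in the spirit of the self-evaluation/voting idea mentioned above. The `generate` function is a hypothetical placeholder for whatever LLM API you use, and the answer-extraction logic is illustrative.]

```python
import re
from collections import Counter

COT_PROMPT = (
    "Q: A farmer has 3 pens with 7 chickens each. He buys 5 more chickens. "
    "How many chickens does he have now?\n"
    "Let's think step by step, then give the final answer as 'Answer: <number>'."
)

def generate(prompt, temperature=0.8):
    """Placeholder for a call to an LLM API; not a real implementation."""
    raise NotImplementedError

def extract_answer(completion):
    match = re.search(r"Answer:\s*(-?\d+)", completion)
    return match.group(1) if match else None

def self_consistent_answer(prompt, n_samples=5):
    """Sample several chain-of-thought completions and majority-vote the final answers."""
    answers = []
    for _ in range(n_samples):
        completion = generate(prompt, temperature=0.8)  # higher temperature -> diverse reasoning paths
        answer = extract_answer(completion)
        if answer is not None:
            answers.append(answer)
    return Counter(answers).most_common(1)[0][0] if answers else None
```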
Next I'll be talking more about reinforcement learning and feedback mechanisms. These are typically used for things like further fine-tuning a pre-trained model. The most popular is reinforcement learning from human feedback, or RLHF. This trains a reward model directly from human feedback. What you do is take your pre-trained model, get it to generate several responses, and then typically take a pair of responses and have humans rate which one they prefer. You can train a reward model based on this, and then fine-tune the language model against that reward using a reinforcement learning optimization algorithm such as PPO. Now there's an improvement on this called DPO, or direct preference optimization. This more directly trains the model to prefer outputs that humans rank higher, rather than having a separate reward model, which is much more efficient. You can think of it as tying the reward more directly into the loss function itself, by having the LLM maximize the likelihood of generating the preferred responses and minimize the likelihood of the responses the humans did not prefer. There's also an extension of RLHF called RLAIF. This simply replaces the human with an AI: you typically have a pretty good LLM that's able to provide accurate preference judgments about which response it prefers, which is less costly than human annotators. And then you do the same thing: you train a reward model based on the LLM's preferences instead. They found that human evaluators rated RLAIF-tuned outputs about the same as RLHF outputs, showing that this is a more scalable and cost-efficient approach compared to human feedback. But there's one disadvantage, which is that it really depends on the capabilities, or the accuracy of the judgments, of the LLM you're using to provide your preferences. If you're using one that is incapable or very noisy, that's going to hurt your post-training. The next is something that's very hot right now, which was used in DeepSeek-R1 as well as some of their other models, like their math ones. This is called group relative policy optimization, or GRPO. It's a variant of the PPO optimization algorithm, but rather than working from simple pairs of responses, it scores a whole group of responses relative to each other. This provides richer, more fine-grained feedback and is much more efficient than simply ranking pairs of outputs. It helps stabilize training, which is one reason DeepSeek is so much more data- and compute-efficient, and they also saw that it improves things like LLM reasoning, especially on math.
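[Editor's note: for reference, here is a small PyTorch sketch of the DPO objective described a moment ago. It pushes the policy to raise the log-likelihood of the preferred response relative to a frozen reference model, with no separate reward model. The per-example log-probabilities are assumed to be precomputed and summed over response tokens; this is a schematic of the loss, not a full training loop.]

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss over a batch of preference pairs.

    Each argument is a tensor of log-probabilities (summed over the tokens of the
    chosen or rejected response) under the trainable policy or the frozen reference.
    """
    # Implicit "rewards": how far the policy has moved from the reference on each response.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between preferred and dispreferred responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with made-up numbers for a batch of 3 preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5, -20.1]), torch.tensor([-15.0, -9.0, -25.3]),
                torch.tensor([-13.0, -10.0, -21.0]), torch.tensor([-14.5, -9.2, -24.0]))
print(loss.item())
```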
There have also been other variations of RLHF and so forth. One is Kahneman-Tversky Optimization, not sure if I'm pronouncing that correctly, but KTO. This modifies the standard loss function typically used in post-training to account for human biases such as loss aversion. As humans, we typically care more about avoiding disastrous or negative outcomes than achieving positive ones; we're more risk-averse in most cases, although it's very dependent on the person. So they encourage the AI to behave in a similar manner by avoiding negative outcomes, and this basically adjusts the training process to reflect that. They showed that this improves performance on different tasks, although it depends on the task; overall, it shows more human-like behavior on particular tasks. And these are just a subset of the RLHF-style, reinforcement learning and feedback-based algorithms. One I want to touch upon before I finish is personalizing RLHF with variational preference learning. The authors observed that different demographics have different preferences, while typical RLHF averages everything together. So what the authors do is introduce a latent variable for every user preference profile, for example a demographic like children, adults, and so forth, and train reward models conditioned on these latent vectors or factors. This leads to something they call pluralistic alignment, which improves the reward accuracy for these particular demographics or subgroups. It enables a single model to adapt its behavior to different preference profiles and different demographics or groups of people. And now I'll hand it back to Chelsea to talk about self-improvement.

speaker 3: All right. So yeah, let's talk a little bit about self-improving AI agents. What exactly is an AI agent? It's essentially a system that perceives the environment, makes decisions, and takes actions towards achieving some specific goal, and usually this goal is given by a human, for instance game playing, task solving, or research assistance. There are several components of an AI agent. One, it's goal-directed. Two, it can make its own decisions. Three, it can act iteratively. Four, there's usually some sort of memory and state-tracking component to it. Five, some agents can use tools, such as API calls or function calling. And finally, it can learn and adapt on its own.

speaker 1: Okay, yeah.

speaker 3: So self-improvement: basically, models can reflect on their own outputs, leading to iterative improvements over time. This typically consists of several steps: there's some reflection on the model's own internal states, an explanation of its own reasoning process, an evaluation of the quality of its own outputs, and finally it can simulate multi-step reasoning chains. One technique is self-refinement. This is an iterative prompting technique where an LLM critiques and improves its own outputs: it generates an initial response and then refines it over time, using feedback loops to enhance overall performance. An example would be: it generates some answer, then it evaluates itself for weaknesses and inconsistencies, and finally it refines the response based on its own self-critique. Another technique is called self-reflection. This is where a model learns from past mistakes and adjusts future responses based on past failures, so there's usually some sort of long-term memory component to this. An example would be: the model first detects some weak response among its own outputs, then it reflects on its own mistakes and generates an improved answer. And over multiple iterations, accuracy and reasoning should improve over time.
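[Editor's note: here is a minimal sketch of the generate-critique-refine loop just described. As before, `generate` is a hypothetical placeholder for an LLM call, and the prompts and stopping convention are illustrative.]

```python
def generate(prompt):
    """Placeholder for an LLM API call; not a real implementation."""
    raise NotImplementedError

def self_refine(task, max_rounds=3):
    """Iteratively critique and improve an answer using the same model."""
    answer = generate(f"Task: {task}\nGive your best answer.")
    for _ in range(max_rounds):
        critique = generate(
            f"Task: {task}\nDraft answer: {answer}\n"
            "List any factual errors, gaps, or inconsistencies. "
            "If the draft is already good, reply exactly 'LGTM'."
        )
        if critique.strip() == "LGTM":
            break  # the model judges its own output as good enough
        answer = generate(
            f"Task: {task}\nDraft answer: {answer}\nCritique: {critique}\n"
            "Rewrite the answer, fixing every issue raised in the critique."
        )
    return answer
```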
Another technique is called ReAct, which essentially combines reasoning with external actions, such as API calls or retrievals from a database. This gives you a model that can interact dynamically with its environment, so it gets feedback from taking multiple action sequences and incorporates that into its outputs. For instance, the model will generate a reasoning plan, then call some external tool, such as a web search or an API call, and then incorporate the retrieved data into its final response. And finally, this leads us to a framework called Language Agent Tree Search, or LATS. Basically, LATS extends the ReAct framework to incorporate multiple planning pathways; you can think of this as analogous to chain of thought versus tree of thought. It gathers feedback from every path to improve the future search process, which is a kind of verbal, reinforcement-learning-inspired technique, and it uses Monte Carlo tree search to optimize planning trajectories, where in the tree structure every node represents a state and every edge represents an action the agent can take. An example would be: it generates the n best new action sequences and executes them all in parallel, then uses some self-reflection technique to score each one, and then continues exploring from the best state while backing the scores up to the parent nodes. And yeah, all right.

speaker 2: Next I'll be talking about a few other applications of transformers outside of language. I'll start with vision transformers, which have taken vision by storm. The idea here is that, as I talked about, transformers take in sequences, right? But images aren't sequences. However, what the authors of the ViT paper came up with was to split an image up into patches, which can then be embedded to form a sequence. Passing this through a simple transformer yielded very good results, for instance on classification, just by adding an MLP head at the end. You might ask, why apply transformers when CNNs are such a mainstay in the field? The main reason is that when you have a very large dataset, say in the tens of millions of examples, transformers bring in fewer inductive biases. CNNs assume locality, that nearby pixels go together, whereas with transformers, treating your images as sequences, you can see better results when you have enough data to train them. One common architecture that builds on this is CLIP, which uses ViTs for its image encoder. This is the basis of models like GPT-4o and other vision-language models, and it essentially works through contrastive learning: you take a dataset of paired images and text, and you train your model to align the encoded representations of both. So if you have an image of a cat and the word cat, you learn to align those embeddings. And like I mentioned, these have been applied to vision-language models like GPT-4 or 4o. The way these are trained is that you concatenate your encoded image and text, and you train in different stages such that your model learns to take both into account for its responses. These have done very well on benchmarks and tasks, for instance on test questions like I've shown here.
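[Editor's note: a rough sketch of the ViT idea of turning an image into a sequence: split the image into non-overlapping patches, flatten each patch, and project it to the model dimension before feeding it to a standard transformer encoder. The patch size and dimensions are arbitrary, and real ViTs also prepend a learnable class token and add an MLP head for classification.]

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and embed each one as a token."""
    def __init__(self, image_size=224, patch_size=16, in_channels=3, d_model=768):
        super().__init__()
        self.num_patches = (image_size // patch_size) ** 2
        # A strided convolution is the standard trick for "flatten each patch + linear project".
        self.proj = nn.Conv2d(in_channels, d_model, kernel_size=patch_size, stride=patch_size)
        self.pos = nn.Parameter(torch.zeros(1, self.num_patches, d_model))  # learned positions

    def forward(self, images):                      # images: (batch, 3, H, W)
        x = self.proj(images)                       # (batch, d_model, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)            # (batch, num_patches, d_model)
        return x + self.pos

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 768]); now it's a sequence a transformer can take
```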
Next, I'll talk a bit about my work in neuroscience, which applies ViTs to other kinds of data. A mainstay in my field is functional magnetic resonance imaging, or fMRI. Essentially, this captures the amount of oxygen that each voxel, each small part of your brain, is using at a given point in time, and this provides a very detailed proxy for the activity going on in your brain. It can be used to diagnose diseases and to capture various kinds of data for a better cognitive understanding. However, this is very high dimensional: you might have on the order of a hundred thousand to a million voxels in the brain. So the first step to using this data with transformer models is usually averaging across well-known regions, or just grouping voxels together, and this gives you a more computationally tractable number of parcels that you can train on. A traditional tool in this field was to just use linear pairwise correlation maps, and these alone were enough to get pretty good diagnoses of things like Parkinson's. However, with the advent of tons of computer vision techniques, we can apply larger and more sophisticated models to these tasks. One cool, large body of work in this area is dividing the brain up into different functional networks, so let's say your vision system, or your daydreaming network, or control, etcetera, and I'll get into how we use this to guide our work. So like I mentioned, early ML models just took linear correlation maps, making lots of assumptions about the data, and applied typical neural networks to regression or classification tasks, or in some cases graph-based analyses, to try to get a deeper understanding of how different parts of the brain interact with each other. With computer vision, we can take our raw data and just throw that at a transformer model, and that works very well as a pre-training objective. So what we do is, let's say we have some number of ROIs, regions of interest, across time. We can mask out some portion of that data, pass the rest of the data through a transformer model, and have it predict the masked portion. You repeat this across a large dataset and all of your ROIs, and this provides a very good self-supervised training objective for this task. Self-supervised essentially means that there is no paired label data here; we are just using our raw data and posing our objective such that we can learn directly from it. Once you've trained this sort of model, you have dense representations inside the model that can be applied downstream to various tasks, like predicting patient attributes or the risk of disease. And you can also look at the weights that your model has learned to do analyses of brain networks. So in brief, our approach essentially consists of taking the activity in the entire brain, partitioning out some small region, let's say your vision system, and passing the unmasked portion into a transformer model, which learns to predict the masked portion; you can compare this to your ground truth to provide your training objective. One key thing we use here is cross-attention. What we talked about before with language was self-attention, where you are attending to the current sequence you're looking at. In cross-attention, you have two different sequences; say in machine translation, you have one in English and one in French, and you apply attention between the two sequences instead of just within a single sequence. So our most basic architecture takes advantage of this through just a single cross-attention decoder.
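[Editor's note: as a schematic of that idea, here is a small PyTorch sketch where masked brain parcels are predicted from the unmasked ones via a single cross-attention block. The dimensions, parcel counts, and layer choices are illustrative placeholders, not the actual architecture from the speaker's work. The training objective would simply be, say, mean-squared error between the predicted and true masked time series.]

```python
import torch
import torch.nn as nn

class MaskedROIDecoder(nn.Module):
    """Predict activity for masked brain parcels by cross-attending to unmasked ones."""
    def __init__(self, n_parcels=100, n_timepoints=200, d_model=128, n_heads=4):
        super().__init__()
        self.embed = nn.Linear(n_timepoints, d_model)          # one token per parcel's time series
        self.parcel_query = nn.Embedding(n_parcels, d_model)   # learned query per parcel identity
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.readout = nn.Linear(d_model, n_timepoints)

    def forward(self, unmasked_rois, masked_ids):
        # unmasked_rois: (batch, n_unmasked, n_timepoints); masked_ids: (batch, n_masked)
        keys_values = self.embed(unmasked_rois)                 # context from the visible parcels
        queries = self.parcel_query(masked_ids)                 # which parcels to reconstruct
        attended, attn = self.cross_attn(queries, keys_values, keys_values)
        return self.readout(attended), attn                     # predicted time series + attention weights

model = MaskedROIDecoder()
pred, attn = model(torch.randn(8, 90, 200), torch.randint(0, 100, (8, 10)))
print(pred.shape, attn.shape)  # (8, 10, 200) (8, 10, 90)
```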
Having a very small model makes for better interpretability. And like I mentioned, this model just learns to predict masked brain regions from unmasked ones. Once we've done this, we can again analyze the attention weights to gain a deeper understanding of networks, and also apply this to downstream tasks. For some modeling results, here I've plotted the brain activity from different patients, and you can see that the model does pretty well in matching the ground truth for the two networks I've shown here: the salience network, which is involved in your senses and decision making, and the default mode network, or DMN, which is responsible for daydreaming or just recapitulating information when you're not doing a certain task. On the bottom, we have the attention weights for this model, which I've split up by all of the other networks. For instance, on the left, when predicting the salience network, we can see from our model that it is heavily dependent on the default mode and control networks. So this gives us a better understanding of how different brain networks are connected to each other, or how they might share information inside the brain. Other networks, though, like vision, are more self-contained, and we can't predict them very well; subcortical regions, say those involved in memory, we also cannot predict very well. So this is all well and cool, we can predict brain activity, but what can we do with this model? If we simply replace one component of the model with a learnable token corresponding to predicting Parkinson's disease, then we can use the model to predict that ailment. If you look on the right, after some fine-tuning on a labeled dataset, we see clustering in the model's embeddings, which corresponds to close to 70% accuracy in predicting the disease, much higher than using the correlation-based methods or linear assumptions I talked about earlier.

speaker 4: All right, so now that we have some background on these transformer models and a couple of their applications, let's talk about the future and what's next. Overall, these transformer models can enable a lot more applications across every industry and sector. This includes generalist agents, longer video understanding and generation, applications across the finance and business sector, and domain-specific foundation models: for example, one could imagine a doctor GPT or a lawyer GPT, or an insert-field-here GPT. There are also potential real-world impacts like personalized education and tutoring systems, advanced healthcare diagnostics, environmental monitoring and protection, real-time multilingual communication, as well as interactive environments and gaming, for example non-playable characters. What is missing, though? What might we need, and what can we develop in the future? Currently, we're missing: reduced computational complexity, enhanced human controllability, alignment with the language model of the human brain, adaptive learning and generalization across different domains, and finally, multi-sensory multimodal embodiment, like intuitive physics and common sense. One might consider these barriers to developing artificial general intelligence, and these are some of the limitations of current transformer models. Some other things that are missing include infinite and external memory, like neural Turing machines, and infinite self-improvement capabilities, like continual or lifelong learning.
This is another central tenet of human learning that we're not able to replicate at the moment. There's also complete autonomy, including curiosity, desires and goals, and long-horizon decision making, as well as emotional intelligence, social understanding, and of course, ethical reasoning and value alignment.

speaker 1: Right. So there's still a plethora of remaining weaknesses or challenges around transformers, large language models, and AI in general these days, and I'll touch upon a couple of them briefly. The first is, like I mentioned earlier, efficiency: being able to minify things, to have tiny LLMs or models that you can run on your phone, on your smartwatch, etcetera. A big trend these days is using LLMs for everyday applications and purposes, and you want to be able to run them quickly and easily on smaller devices. Right now, there is more and more work on smaller and more efficient open-source models, things like DeepSeek, Llama, and Mistral, but they're still somewhat large and a bit expensive, especially if you're looking to fine-tune; they're not accessible to everybody, especially on smaller devices. So in the future, we want to aim for the ability to fine-tune or run these models locally on whatever device you want. The second is that as our LLMs scale up to trillions of parameters, trained on trillions of tokens from across the Internet, they become a huge black box that is difficult to understand or interpret. It's hard to know what exactly is going on behind the scenes when you ask the model to solve XYZ and it comes up with answers ABC: how exactly did it get there, why did it choose those answers, and so forth. More work on interpretability for LLMs will give us a better idea of what and how to improve, whether there are ways of controlling them, and better alignment, for example being able to prevent them from producing certain outputs that might be unsafe or unethical. There's an area which has gotten even more popular recently called mechanistic interpretability, which tries to understand how individual components or operations, sometimes even down to the individual node level, so very granular, contribute to a model's overall decision-making process, with the goal, again, of unpacking this black box to get clear insight into how exactly these models work behind the scenes. Next, I feel like we're approaching, or already seeing, diminishing returns from simply scaling up. Larger models on more data do not seem to be the be-all, end-all solution. One-size-fits-all, frozen pre-trained models have already started leading to diminishing returns. So pre-training performance, the first half of training LLMs, is likely saturating; hence there's been more focus on post-training methods, everything we've talked about: feedback and RL mechanisms, prompting methods like chain of thought, self-improvement and refinement, and so forth. However, all of these post-training mechanisms are going to be fundamentally limited by the overall performance or capabilities of the base model. You can argue that pre-training is fundamentally what gives the model its foundational knowledge and capabilities, so we should not just stop investigating pre-training because we're hitting scaling limits. Furthermore, too much post-training can actually lead to an issue.
This is called catastrophic forgetting, where the model forgets stuff it learned beforehand, for example during pre-training, because you're overloading it with tons of new information in a new domain or on a new task during post-training. So how do we break through this scaling-law limit? Some potential things to investigate would be new architectures: there are different things like Mamba and state space models, those sorts of architectures, and it would be good to see more investigation of even non-transformer architectures, which is a bit ironic in this class, as it's Transformers United, but we always encourage more diversity and thinking outside the box. Also, again, everything I've talked about: high-quality data and smart data ordering and structuring strategies, and overall improved training procedures, improved algorithms, loss functions, optimization algorithms, and so forth. Another goal, as we've mentioned several times, is to be able to bring these advanced capabilities to smaller models. Furthermore, we would still encourage more theoretical and interpretability research, including things like cognitive science and neuroscience inspired work, which Karan and I have talked about some of, that we've done recently. So the next step will be models that are not just larger, but smarter and more adaptable. Again, there's this one major thing, or major weakness, that I think still bridges the gap between AI and humans, which is continual or lifelong learning: AI systems that can continuously improve by learning after deployment, after being pre-trained, using implicit feedback, real-world experience, and so forth. Essentially, this is infinite and permanent, fundamental self-improvement. We're not just talking about RAG or retrieval, putting knowledge in a retrieval database that you can retrieve from at test time, but updating the brain, the weights of the model, continuously. This is similar to us, right? We're learning every day. I'm learning right now by talking to you; I learn every time I talk to somebody else as I'm going through my daily life. But with these models, after they're frozen or pre-trained, that doesn't really happen. The only way they truly learn, the only way their brain or weights get updated, is through fine-tuning. And again, we don't do that, right? We don't sit in a chair every three months and have someone read the Internet to us or something like that. So there's almost wasted work here: currently, during inference, the models are not actually learning and updating their weights. When ChatGPT is talking to you, it's not truly updating its brain or weights. So this is a very challenging problem, but in our opinion, it's likely one of the keys, potentially, to AGI or truly human-like AI systems. There's different current work that tries to tackle this, things like fine-tuning a smaller model based on traces from a larger model, things like model distillation, related to a lot of the self-improvement work and so forth. But this is, again, not truly continual learning. So some questions are: what mechanisms could potentially enable real lifelong learning? Will it be gradient updates, so actually updating the brain? Will it be things like targeting particular nodes in the architecture?
Will it be things like particular memory architectures, or different parts of the neural network solely focused on continuous updates and learning, or even things like meta-learning, looking at the broader scope of things? One line of work which has seen a bit of traction is model editing. This is related to the work on mechanistic interpretability. The idea is, instead of updating the whole model, if we're given a new fact or a new data point, can we target specific nodes or neurons in the model that we should update? One work called Rank-One Model Editing, or ROME, tries to do this through causal intervention mechanisms to determine which neuron activations most correspond to particular factual predictions, and then updates them appropriately. But as you might suspect, this has a lot of weaknesses. Firstly, it works mainly for knowledge-based things, simple facts. What if we want to update the actual skills or capabilities of a model? We want it to be better at math in general; we want it to be better at advanced and logical reasoning like humans. Then something like model editing based on factual predictions doesn't seem like it will work. Secondly, these methods target one fact at a time, so it's not easy to propagate changes to other nodes based on related facts. For example, let's say we want to update a fact about someone's mother; then we should also update that fact for the person's brother, because they have the same mother. But this sort of approach only updates it for the original person in question, not any of the relatives. That's just one example. There are a lot of other works that have spun up recently in continual learning, which is good, that this area has seen more work, so I will very briefly describe some of these. One is directly related to what I just said about ROME, but it's mass editing of factual knowledge instead of a single fact or memory at a time; it's able to simultaneously modify thousands of facts at once, which might be related to each other, like I said, which is useful. There's a work on continually evolving from mistakes, where the approach actually identifies the LLM's mistakes, somewhat similar to the self-improvement Chelsea was talking about, but incrementally updates the model to self-improve. There are things like lifelong mixture of experts: instead of having a simple, fixed mixture-of-experts architecture, it continually adds new experts for different domains over time while freezing past experts that no longer need to be updated, to avoid things like catastrophic forgetting, which is a very smart approach. Another is called CLOB. This enables continual task learning using only prompting, without updating model weights, by summarizing past knowledge into a compressed prompt memory. However, and this is not a criticism of the work, this is not technically updating the brain or the fundamental capabilities of the model; it's a prompt-only approach. And another one of these is called progressive prompts, which learns a soft prompt vector for each task and progressively composes them together, allowing LLMs to continually learn without weight updates or catastrophic forgetting. But again, my opinion is that continual learning should update the brain, the weights of the model, in some way. So thanks.
That's mainly our lecture. We gave a brief overview of transformers and how they work, talked about pre-training and especially how data is important for that, various post-training techniques, feedback-based mechanisms, prompting mechanisms like chain of thought, self-improvement, some applications to neuroscience, vision, and so forth, and some remaining weaknesses, like the lack of continual learning and data efficiency, and being able to scale down and run these models on our phones. So before we send you off, I know we're ending a bit early. Going forward, every week this class will have a speaker, typically from industry or academia, come in to talk about the state-of-the-art work they're doing, and we have a cool lineup of speakers prepared for you for the remainder of the quarter. For some more logistical things, we'll be posting updates about lectures and so forth on our website, through the mailing list, Discord, and so forth, so please join those if you haven't already. Thank you, guys. Hope you enjoyed the first lecture, and if anybody has any questions, feel free to come up and stick around. Thanks.