speaker 1: Okay. Thank you, everyone, for coming. This is a seminar today focused on NLP, and I'm just really, really honored and happy that Lisa is going to tell us about her work today. Lisa is working really in the core of language modeling, which has become quite a bit more popular over the last few years and is now, I think, even a central part of all of computer science, really. Her work has touched on almost every aspect of language modeling, so much so that it's hard to enumerate it all, but everything from, you know, prompting, parameter-efficient fine-tuning, entirely new architectural approaches, even through to new ways of doing evaluation, and it's very rare to see that kind of breadth. And even more so, in every case her work is known for being very creative, sort of surprising new approaches, where I'm always kind of looking to see, you know, what has she done this time and what can I learn from it? And it's really fun. And so you get to see a few of those examples today. Thank you very much. speaker 2: Thanks, Luke. Thanks for the super kind introduction. Hello, everyone. Today I'll talk about controlling language models. So language models are probabilistic models over a sequence of strings. And they have actually been around for a while, from Shannon's Markov models in the 1940s to large-scale pretraining with GPT-3, but the popularity of language models really surged with ChatGPT. But why did ChatGPT suddenly take off? It is because we can finally control these pretrained language models to do useful tasks. So most people in computer science have used or heard of coding Copilot. It can boost productivity by auto-completing your code. Coding Copilot arises from controlling pretrained language models toward the coding domain. Similarly, when you use Google search, you will see an AI overview at the top of the web page summarizing the answers to your queries. This tool also arises from controlling language models to do search summaries. So control is really at the heart of transforming language models into useful products. And I develop principled methods for controlling language models. So zooming out a little bit, let's look at the broader context of the language modeling pipeline. We start with a pretrained language model, and then we control and adapt the model for a particular use case. And finally, we need to evaluate our model to make sure that the control is actually successful. So let's take a closer look at the first stage. Language models have a wide range of use cases. Many companies use language models for their business, for example, building company-specific chatbots or specialized data processing for different companies. And there are new forms of entertainment by customizing language models to role-play people's favorite historical or fictional characters. And because there are so many use cases, customizing language models has become a service, and many companies are now providing this service. So how should we control the language model for such a wide range of different use cases? Imagine that we get thousands of requests per day to customize models for different use cases. For each request, we would fine-tune a model on the provided data. And this results in thousands of different models, each with billions of parameters. So it is very expensive to both train and store such models.
I have advanced the methodology of fine-tuning by proposing a very parameter-efficient way of adapting language models, allowing people to achieve a thousand-fold improvement in terms of parameter efficiency, and thereby democratizing the customization of language models so that people with fewer resources could also adapt their own models. So now that we've enforced control on the model, what's next? We still need to evaluate the control to make sure that it is actually successful. For example, in this Google AI overview case, the AI summary might be misleading and even harmful. When we ask about the health benefits of running with scissors, it will mention the advantage of boosting the immune system, but completely ignores the risk associated with scissors. We really need to detect failure cases like this. Red teaming is one way to evaluate controls. The White House's executive order on AI has recognized this direction as impactful and important. Concretely, what red teaming does is to search for the prompts or inputs that trigger unwanted behaviors in the response. So this is a particularly challenging search problem, because the search space over possible inputs is exponentially large. Here are some strategies that could break the model. These strategies were discovered in previous red teaming papers, and red teaming is considered successful if it is able to discover one of those strategies. However, our goal is different. We care about covering more failure modes rather than just one of them. And towards this goal, we propose to estimate the posterior distribution, which explicitly accounts for diversity. And it turns out we can cover most of the previously discovered strategies using one method. So overall, we've worked on these challenging control problems and provided effective solutions. However, does control need to be this hard? Why do we need to allocate two extra stages to address control? Is it possible to redesign a language model to be inherently easy to control? So for example, in this red teaming setting, we know that control is hard because it requires decoding from right to left, searching through an exponentially large space over possible inputs. And in fact, any task that breaks the left-to-right generation ordering is very challenging for current language models. This is because most language models generate text one token at a time from left to right. And this is actually a structural limitation that restricts the generation flexibility and makes the model harder to control. But the generation order of text doesn't have to be left to right. And my research rethinks this generation order of text. I develop methods to resolve these control challenges by building a non-autoregressive language model that can generate all the tokens simultaneously. And I will show that this family of models is also controllable by design. So for this talk, I will discuss these three pieces of work: how we apply control via lightweight fine-tuning, how we evaluate control with good coverage, and how we can rethink the existing architecture of language models to build a new model where control is inherently easy. So I'll start with the first part. And this part covers work from this paper called prefix tuning. So as we said before, customizing language models is useful in various settings, including personalization, domain adaptation, and specializing smaller models to perform tasks on devices.
Basically, when we are customizing a language model, we start from p theta, and we adjust the distribution according to the data to obtain p theta prime. So suppose that we want to customize a personalized writing tool for Alice, and we are provided with a dataset of (x, y) pairs, where x is the instruction and y is a highly stylized piece of writing by Alice. This is the fine-tuning objective: it is to maximize the log probabilities of the examples from this dataset. So how should we adapt the language models? Here are two approaches at the two ends of the spectrum: we could do prompting, or we could fine-tune the model parameters. So we know that prompting is quite efficient. It doesn't require updating any model parameters. However, prompting lacks precision and often fails to capture many subtle points. For example, we are trying to imitate Alice's writing style, which is hard to summarize in a natural language prompt, because it would involve a long tail of small preferences, such as detailed word choices and paragraph structures. On the other hand, fine-tuning can match the distribution very well, but it requires updating all the model parameters, so it's very expensive to have to store and update a full model copy for each personalization task. So now the research question is: could we adapt the language model with fewer parameters, and at the same time without sacrificing any task performance? So classically, in the fields of vision and NLP, people have really internalized this idea of freezing some layers of the model and only updating the top few layers. However, this idea for adaptation doesn't quite work: it leads to bad accuracy for the target task, and it also doesn't save much in terms of parameter efficiency, because you will still need to update one fourth of the total parameters. My work on prefix tuning bridges the dichotomy between prompting and fine-tuning. So prefix tuning is inspired by prompting. I observe that the discrete search space of the prompts actually limits the expressivity and makes optimization harder, so I relax the discreteness constraint with continuous free parameters. As a result, prefix tuning optimizes a small continuous task-specific vector, which we call the prefix parameters, denoted by the pink H here, as if it were a sequence of virtual prompt tokens. So this is the new optimization objective. We freeze the model parameters theta and only optimize the prefix parameters H to maximize the likelihood of the data. And this design makes the search space very expressive, because it's now continuous, and also very easy to optimize, because we can leverage tools like gradient descent. So we experiment on a task that converts structured data, such as a table, into natural language descriptions. And we evaluate the quality of the generation with a classic metric called BLEU, for which higher is better. We evaluate the effectiveness of prefix tuning and find that the model can achieve similar performance as full fine-tuning while adjusting a thousand times fewer parameters. In addition, it has an advantage in terms of out-of-distribution generalization, meaning that when the test distribution is different from the training distribution, the prefix-tuned model often attains significantly better performance than full fine-tuning. So intuitively, this is because in prefix tuning, we have preserved the original pretrained model parameters, which are supposed to be very general-purpose.
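To make the setup concrete, here is a minimal sketch of the idea in PyTorch, assuming the Hugging Face transformers library and GPT-2 as the frozen backbone. Note this simplified version only learns virtual tokens at the embedding layer, which is closer to prompt tuning; the actual prefix tuning method optimizes prefix activations (key-value pairs) at every layer.

```python
# Minimal prefix-tuning-style sketch: freeze the pretrained model, learn only a prefix.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
for p in model.parameters():
    p.requires_grad = False                      # freeze the pretrained parameters theta

prefix_len, hidden = 10, model.config.n_embd
prefix = torch.nn.Parameter(torch.randn(prefix_len, hidden) * 0.02)   # trainable prefix H

def loss_on_example(text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    tok_embeds = model.transformer.wte(ids)                            # (1, T, hidden)
    inputs = torch.cat([prefix.unsqueeze(0), tok_embeds], dim=1)       # prepend virtual tokens
    labels = torch.cat([torch.full((1, prefix_len), -100), ids], dim=1)  # no loss on prefix slots
    return model(inputs_embeds=inputs, labels=labels).loss

optimizer = torch.optim.Adam([prefix], lr=1e-3)  # only the prefix receives gradient updates
loss = loss_on_example("Instruction: write a short note in Alice's style.")
loss.backward()
optimizer.step()
```

The important point is that only `prefix` is trained, so each customization request stores a tiny vector instead of a full model copy.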
So it carries this general-purposeness over to the downstream task and attains better extrapolation performance. So for tasks where the rules and the instructions are very concrete and unambiguous, we could still think about prompting. We will show that even in this prompting regime, our prefix tuning idea still applies. So for example, it is quite common for the prompt to include all the detailed rules and instructions, as well as some demonstrating examples, which could lead to very, very long prompts. So long prompts will lead to worse inference latency and higher compute cost. As a result, a natural idea is to figure out how to compress the prompts. So this is where the prefix tuning parameterization shines again. When we are trying to compress the prompt, we can map it into the prefix parameter space. In my paper on gist tokens, we find that we can effectively compress the prompt by 25 times without sacrificing any instruction-following performance. So overall, prefix tuning essentially opened up the research direction of parameter-efficient fine-tuning, also known as PEFT. It has inspired a lot of follow-up work, including LoRA and prompt tuning, and it has also become the de facto way of how people customize language models nowadays. It is widely used at OpenAI, Anthropic, Google, NVIDIA, etc., in their fine-tuning APIs. So now we've discussed how to control the language model to match some data distribution. Next, we will evaluate whether the control is actually successful or not. So this part of the talk covers this paper called eliciting language model behaviors. Here, for example, we have a language model that is supposed to be harmless. When we ask a harmful question about how to build a bomb, the language model provides this really detailed instruction. This is actually harmful, because people might follow this instruction, which will lead to bad consequences. So this is a form of control violation. And for this part of the talk, our goal is to detect such control failures. So red teaming language models is one way to detect failures. Prior work treats red teaming as a search problem: given some unwanted response y, people attempt to search for the input prompt x which will trigger the generation of y. Specifically, we maximize the probability of y given x under the target language model p theta. So one concrete idea is to run coordinate ascent in the input token space. We could gradually swap the tokens in x to maximize the objective, and after a large number of iterations, we will find some sequence of strings that will elicit the harmful response with high probability. However, what the search mechanism discovers is only one of the modes, and in fact, there are many modes of x that can trigger the generation of y with high probability, including some very trivial ones, such as "repeat after me." So in order for evals to be comprehensive, we would really want to have better coverage of the failure cases. And this leads to our research question: maybe search is not enough. How could we achieve better coverage of the unwanted behaviors? To address this, we change the problem formulation. Instead of finding a single string, we try to find a distribution over strings, which covers more of the support by construction. In our work, we cast this as a posterior inference problem, and our goal is to estimate the posterior distribution of x given y.
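Before the posterior view, here is a hedged sketch of the search baseline described above, i.e., coordinate ascent over input tokens to maximize log p theta(y | x). It uses GPT-2 via Hugging Face transformers as a stand-in target model and proposes random candidate swaps, which is simpler than the gradient-guided proposals typically used; it is illustrative only.

```python
# Toy coordinate-ascent search for a prompt x that raises log p(y | x) under GPT-2.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2TokenizerFast.from_pretrained("gpt2")

def log_p_y_given_x(x_ids, y_ids):
    ids = torch.cat([x_ids, y_ids]).unsqueeze(0)
    with torch.no_grad():
        logp = torch.log_softmax(model(ids).logits[0, :-1], dim=-1)   # next-token log-probs
    # sum the log-probabilities of the y tokens only
    return sum(logp[p, ids[0, p + 1]] for p in range(len(x_ids) - 1, ids.size(1) - 1))

x = tok("hello hello hello", return_tensors="pt").input_ids[0]
y = tok(" the most inexhaustible source of magic", return_tensors="pt").input_ids[0]
for _ in range(20):                                   # a few coordinate-ascent steps
    pos = torch.randint(len(x), (1,)).item()
    cand = x.clone()
    cand[pos] = torch.randint(tok.vocab_size, (1,)).item()
    if log_p_y_given_x(cand, y) > log_p_y_given_x(x, y):
        x = cand                                      # keep the swap only if it helps
print(tok.decode(x))
```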
Now, we use Bayes' rule to write out the posterior: we have the prior term and the likelihood term, divided by the normalization constant. So here the normalization constant is actually intractable to estimate, because it would involve marginalizing over all the potential prefixes x that could generate y. So one intuitive idea is to learn to reverse the language model. Let's look at this problem structure. The forward direction is very simple and tractable; this is essentially how we decode from the language model. The backward direction is hard, and this is what we are trying to estimate. So therefore, we can take advantage of this problem structure. We can collect supervised training data using the forward direction, passing x into the language model to obtain y's, and then we can train a model q phi to reverse the language model and predict x from y. So this model that we just trained is a pretty good starting point. However, there's a distribution shift problem, because the y's that we are looking for are actually infrequent failure cases that are unlikely to appear anywhere in this training distribution. So we really need to solve this posterior inference problem more directly. And we use a technique from classic statistics called variational inference. So we want to reverse the model; that is, we want to learn this q phi to approximate the right-to-left posterior of our model, which is p of x given y. And this equation at the top can be rewritten into three terms. First, there is this entropy term of q phi, which measures the diversity and coverage of our distribution. There is the cross-entropy term, which measures the fluency of our generated text under the prior, and there is the expected likelihood term, which captures how effective our q phi is in terms of eliciting the unwanted string y. So we generalize the objective by weighting the entropy and cross-entropy terms separately. And this is because for our specific red teaming problem, there is a lot of uncertainty about the prior. Should the prior distribution contain only fluent generations, or should the prior distribution be more lenient and allow for gibberish? We are not sure. Therefore, we introduce beta one and beta two to account for this new degree of freedom. So when we set beta one and beta two to one, we are in the exact setting of posterior inference. As we feel more and more uncertain about the prior, we can consider increasing the temperature of the distribution, and if we take this to the extreme, then we will obtain beta two equals zero, which is the objective of maximum-entropy RL. So now we've decided on this objective. Our goal is to optimize q phi to have good coverage of the posterior. So this is a pretty hard optimization problem, because it needs to cover multiple modes. So could we simplify this problem using the idea of iterative decomposition? We want to decompose this hard problem into a sequence of simpler problems, each covering one mode. So here's an intuitive demonstration of this idea, and I will formalize it in the next slide. In the first iteration, we run some algorithm to look for one mode. And then for the second iteration, we downweight the discovered mode. And then we run the algorithm on this new reward landscape, and it will discover a different mode. Then for the third iteration, we do the similar thing. We downweight the first two modes that were already discovered. We run the algorithm again, and it will discover yet another new mode.
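To make the three terms concrete, here is one way to write the weighted objective described in this part (a hedged reconstruction from the talk; the paper's exact notation may differ):

$$
\max_{q_\phi}\;\; \underbrace{\mathbb{E}_{x\sim q_\phi}\big[\log p_\theta(y\mid x)\big]}_{\text{elicitation}}
\;+\; \beta_1\, \underbrace{\mathcal{H}(q_\phi)}_{\text{diversity / coverage}}
\;-\; \beta_2\, \underbrace{\mathbb{E}_{x\sim q_\phi}\big[-\log p_{\text{prior}}(x)\big]}_{\text{cross-entropy to the prior (fluency)}}
$$

Setting beta one = beta two = 1 recovers exact posterior inference up to a constant, while beta two = 0 drops the prior term and recovers the maximum-entropy RL objective.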
This way, it is natural to also change the parameterization of q phi into a mixture of distributions, which is more expressive and inherently good at capturing disjoint modes. So here is the objective for each iteration. I'll first provide some intuition here and later show that this is actually equivalent to one step of the Frank-Wolfe optimization algorithm applied to the full objective. So here, the red box is the red teaming term, which captures how well our distribution of prompts can elicit the target response. The blue box here penalizes things that were already discovered in previous iterations. So effectively, it's encouraging the discovery of new modes and encouraging diversity. The orange box is essentially the KL divergence term. It is in charge of regularizing the distribution to prevent it from collapsing or deviating too much from the prior. So as we said, we run this algorithm for multiple iterations, and each iteration will discover some new modes. So how do we aggregate across different iterations? We aggregate by forming a mixture of the distributions from each iteration. So this equation at the top says that at the end of each iteration, we will mix in the new mode, which is s of i, with eta of i as the mixture weight. So as promised earlier, our iterative algorithm is not just ad hoc. It's equivalent to applying the Frank-Wolfe optimization algorithm to the full objective. Here we will show the connection in more detail. So for some background, Frank-Wolfe is a classic optimization algorithm. It is also called the conditional gradient method. So here's a very simple example demonstrating how it works. Suppose that we want to optimize f of x, which is the blue curve. For each iteration, we find the linear approximation of f at the current solution x, which is the brown plane here. And then we solve for the minimizer of this linear approximation, and we obtain the solution s. Finally, we move x in the direction of s to use this for the next iteration. So plugging this conditional gradient method into our full objective, we would first do a linear approximation to the pink and blue boxes. And this is essentially trying to compute the first-order Taylor expansion at the current solution, which is q phi of i minus one. And we obtain this term here, which is exactly the first part of our decomposed objective that we just discussed, the red teaming term. And we do the same for the blue box, and it gives exactly the second term of our decomposed objective, which is the diversity term. And then for this last box, we copy it over directly. Because now we have part of the objective copied over and part of it linearly approximated, our algorithm is a slight generalization of the vanilla Frank-Wolfe algorithm, but they have very similar convergence properties. So now the next step is to do aggregation: to find a convex combination between the intermediate solution and the current solution. And we have executed exactly this. So for our aggregation step, we get a mixture of distributions. So now I will provide a concrete example to demonstrate different iterations of our algorithm and show that we can indeed discover different modes in the posterior. So suppose that our target suffix y is "the most inexhaustible source of magic." And as we said before, we want to find a diverse set of plausible prefixes. So for the first iteration of Frank-Wolfe, "repeat after me" has the highest reward, and the model learns to pick up this pattern of repetition.
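Schematically, the per-iteration problem and the aggregation step described above can be written as follows (a rough reconstruction of the slide; the exact weighting and the form of the regularizer in the paper may differ):

$$
s_i \;\approx\; \arg\max_{s}\;\;
\mathbb{E}_{x\sim s}\big[\log p_\theta(y\mid x)\big]
\;-\; \mathbb{E}_{x\sim s}\big[\log q_\phi^{(i-1)}(x)\big]
\;-\; \mathrm{KL}\big(s \,\|\, p_{\text{prior}}\big),
\qquad
q_\phi^{(i)} \;=\; (1-\eta_i)\, q_\phi^{(i-1)} \;+\; \eta_i\, s_i
$$

The middle term downweights prompts that the current mixture already covers, and the convex combination at the end is exactly the Frank-Wolfe update.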
In the next iteration, we have adjusted the reward to penalize prefixes favored by the previous iteration. So all the repetition strings now get a lower reward, and the highest-reward string turns out to be this one. This reveals a strategy based on continuation and co-occurrence. So basically, using this strategy, we can improve the probability of certain suffixes. And then for the third iteration, we have adjusted the reward again to penalize prompts favored by the previous two iterations. Now the best string is "famous quote from J.K. Rowling." This reveals a strategy of prepending a high-level summary, or citing the source, to increase the probability of certain suffixes. So these are all real qualitative examples, qualitative strategies discovered by our algorithm. We use the word strategy, but in fact, they are still represented by the prompt distribution from each iteration. So the quantitative reward here suggests that our model is outperforming the supervised fine-tuning and reinforcement learning baselines by attaining a higher elicitation reward, meaning that we get a higher log probability of the target suffix given the discovered prefix. So we apply this method to elicit harmful behaviors from language models, and this is also commonly referred to as jailbreaking the language models. So in the past two years, there has been a lot of work studying how to jailbreak language models. And researchers have discovered strategies either manually or algorithmically, and we list them here. So these checkboxes are the sets of strategies that are recovered by our method. We can see that our method is able to cover the majority of these strategies discovered using previous approaches. So for the two cases we failed to cover, we note that the past tense is no longer a successful strategy, probably because the model developers have fixed the error, and the persuasion strategy has low probability under the prior, because the prior typically contains more instruction-like text. So quantitatively, our method improves the attack success rate from 2% to 100% for Llama 8B models. Also, the prompts we discovered generalize to 70B models as well as proprietary models such as GPT-4o and Claude 3.5. So these results suggest that our methods for elicitation can preemptively search for errors in language models, which could then guide the model developers to patch these errors and overall lead to a good ecosystem for model development. So hopefully, I've convinced you that control is useful but hard to enforce. And now let's take a step back. Why does control need to be this complicated? Can we make control easier by design? So recall, in the red teaming part of the talk, we mentioned this problem structure where forward decoding is very easy, because it's naturally enabled by the left-to-right generation ordering of the language model, but the backward direction is intractable, because it's reversing the generation order. However, it doesn't have to be this way. This difficulty is self-imposed, because we are shackled by the existing family of language models, which is left-to-right autoregressive. So let's imagine a better world where language models could be composed with other components in a plug-and-play way, so we could plug in our language model and plug in some control criteria that we care about. The control criteria could be a lot of things. If we constrain on the suffix, then we have recovered the red teaming setting.
If we constrain on the prefix, then we have recovered the regular left-to-right prompting setting. So once we've plugged in all the components, we can infer from the combination of them to decode text that satisfies both of them, meaning that the text is fluent under the language model, and it also satisfies the control criteria. So here's another example of the control criteria. Suppose that we need to parse the language model outputs, and we want them to be exactly in JSON format. Then we could specify this control criterion as a classifier and apply this plug-and-play framework to generate JSON-formatted content. Looking further ahead, the same framework could be extended to mathematical reasoning. We could plug in a math verifier and use that to steer text generation towards producing valid math proofs. So this form of reasoning could be a really exciting future direction in math. So overall, this inference seems really magical. How should we formulate it mathematically? The answer was kind of hinted at in the earlier part of the talk: we formalize it as a posterior inference problem, where we sample from the posterior distribution conditioned on the control criteria. So we sample from p of x given c. So this leads to our research question: how can we design a language model that enables this form of plug-and-play inference? And what are the core ideas that are necessary to get this to work? So there are two ideas that we have gradually built up in this presentation. The first idea is continuous relaxation. This is the core idea underlying prefix tuning. The continuous parameterization space is easy to optimize and good for controlling the language model. The second idea is iterative refinement, and this is the core idea underlying our Frank-Wolfe-inspired algorithm. So doing iterative refinement is good for modeling different modes. The steering advantage from the continuous relaxation would make the model more controllable, and the iterative advantage would make the model more expressive. So we want to design a controllable language model that's built on top of these two pillars. Gaussian diffusion is a class of models that nicely marries these two principles. It is used ubiquitously in vision, such as in DALL-E and Stable Diffusion. However, when we think about using it for language, it is a much harder problem, because language is inherently discrete. So here we are contrasting images and text: we know that images are inherently continuous. So if the prediction is near the ground truth image, or if the prediction is in between two good images, then it would still be perceived as a high-quality image. However, this is not the case for text. If the predicted vector lies between two words, then it is incorrect, because there is no corresponding word there. Therefore, modeling text in the continuous space requires great precision. You might say, okay, we can always map it to the nearest word embedding. So this is a good idea, but it will have some other problem, because the nearest neighbor may not be very coherent with the context. We call this the rounding error. And we'll get to this problem very, very soon. So I've shown you that continuous modeling of discrete text is pretty challenging. Next, we'll get to the more exciting part and show you how we can actually solve this problem. So we design a model called Diffusion-LM.
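Written out, the plug-and-play formulation described here is just Bayes' rule applied to the control criterion (restating the talk's formulation; the notation is mine):

$$
p(x \mid c) \;\propto\; p_{\text{LM}}(x)\; p(c \mid x)
$$

Choosing c to be "the text ends with suffix y" recovers the red teaming posterior p(x | y), while choosing c to be a fixed prefix recovers ordinary left-to-right prompting; a JSON classifier or a math verifier would simply supply a different p(c | x).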
It's a generative model of text that operates largely in the continuous latent space, and it is non-autoregressive, meaning that each time step holds a vector representation of the entire sequence of text. So in Diffusion-LM, we start with a sequence of Gaussian noise vectors. We incrementally denoise them into vectors corresponding to words. And then we project these vectors to a low-entropy distribution over the vocabulary. So since we are refining the whole sequence simultaneously, we are first generating coarse-grained content, such as high-level semantics and syntax, in the early diffusion steps, and later we are generating finer-grained content, such as detailed word choices, in the late diffusion steps. So as we said, Diffusion-LM is a latent variable model. We construct the continuous latents by first embedding the discrete sequence of words into the continuous vector space, using a lookup table to obtain the embedding for each word, and then concatenating them to form the embedding of the sentence. Then we can construct the hierarchy of latent variables according to the Gaussian diffusion process, where we incrementally scale down the current latent. For example, when we are trying to map from x t minus one to x t, we scale down x t minus one by some constant and also add a proper amount of Gaussian noise. And we do so iteratively until we reach x capital T, which is pure Gaussian noise. So having constructed this hierarchy of latent variables, we can now use them to supervise the denoising model. So take the denoising transition from x t to x t minus one as an example: we train a model mu theta, which takes as input x t and the current time step t, and tries to predict the less noisy step x t minus one. So we train this model by minimizing the L2 distance between the prediction mu theta and the ground truth x t minus one. Once we have such a model mu theta trained, we can now generate from this Diffusion-LM. Specifically, we start from Gaussian noise at x capital T. We apply our mu theta model to gradually denoise the vectors. And for each denoising step, take this transition from x t to x t minus one as an example: we parameterize it with a Gaussian distribution centered around the prediction mu theta, with a proper level of Gaussian noise determined by the hyperparameter alpha. And we iteratively run denoising until we reach x zero. Then we round these continuous vectors back to discrete words by mapping each vector to its nearest word embedding in L2 distance. So once we finish that, we have a sequence of words. We can now generate text from Diffusion-LM, but as we've foreshadowed before, there could be rounding errors. For example, we might be generating "be careful or you will rest the glass" instead of the correct version, which is "be careful or you will break the glass." So this occurs because "rest" and "break" are actually very close in the embedding space, because they are mostly interchangeable in many contexts, such as "take some rest," "take some break." However, in this particular case, they are not interchangeable. So if the predicted embedding falls between the two but is slightly closer to "rest," then we'll observe this rounding error. Ideally, the predicted embedding should align exactly with a word embedding rather than landing between two. So when we make any of these minor precision errors, they will accumulate and compound during our iterative denoising process.
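Here is a toy sketch of the two mechanical pieces described above: the forward noising process that builds the hierarchy of latents from word embeddings, and the rounding step that maps continuous vectors back to words. The noise schedule, vocabulary size, and embedding dimension are illustrative assumptions, not the paper's configuration.

```python
# Toy forward (noising) process and nearest-embedding rounding for a Diffusion-LM-style model.
import torch
import torch.nn as nn

vocab_size, dim, T = 1000, 16, 100
embed = nn.Embedding(vocab_size, dim)               # word embedding lookup table
alphas = torch.linspace(0.999, 0.98, T)             # assumed noise schedule
alpha_bar = torch.cumprod(alphas, dim=0)

def q_sample(x0, t):
    """Map clean embeddings x0 to the noisy latent x_t: scale down, then add Gaussian noise."""
    return alpha_bar[t].sqrt() * x0 + (1.0 - alpha_bar[t]).sqrt() * torch.randn_like(x0)

def round_to_words(x):
    """Map each continuous vector to its nearest word embedding (in L2 distance)."""
    dists = torch.cdist(x, embed.weight)             # (seq_len, vocab_size)
    return dists.argmin(dim=-1)                      # discrete word ids

word_ids = torch.randint(vocab_size, (8,))           # a toy 8-token "sentence"
x0 = embed(word_ids)                                 # embed the discrete sequence
x_noisy = q_sample(x0, t=60)                         # one latent in the hierarchy
print(round_to_words(x0) == word_ids)                # rounding recovers the clean embeddings
```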
And next, we will discuss two approaches that can actually address this rounding problem. So let's consider the rounding problem in the context of training. Training mu theta is hard, because the output domain changes at every denoising step t. So we know that the final step will output the word embeddings, whereas earlier steps will output word embeddings at different noise levels. And a single model mu theta will struggle to handle these varying distributions in their outputs. So to address this, we reparameterize the model to always predict x naught instead of predicting this one-step transition. This ensures the output space is always aligned with the word embeddings, and as a result, training becomes easier and prediction becomes more precise. So with this reparameterization, we also need to adjust the decoding steps in order to preserve the iterative refinement structure of diffusion. So after predicting x naught from x t, we will add back the proper amount of noise to reconstruct x t minus one. So this effectively implements a one-step transition from x t to x t minus one. So since we are predicting x naught at each time step, we can take advantage of this and check whether the prediction is aligned with a real word embedding or not. If it isn't, then we can correct the prediction by clamping it to the nearest word embedding. And we call this the clamping trick at decoding time. So since a correct x naught should always lie exactly on top of a word embedding, this trick prevents the precision error from accumulating and ensures stability in the decoding process. So here's the demonstration of the full decoding step. We start from x t minus one. We denoise this vector to predict x naught directly, and then we clamp the predicted x naught to the nearest word embeddings. And finally, we add back some Gaussian noise to obtain x t minus two, completing this transition. So now we've discussed how we could train and decode from our Diffusion-LM. Next, I will fulfill the promise that we made before and show why Diffusion-LM can empower this plug-and-play framework. So our Diffusion-LM is parameterizing a distribution over the continuous x t. It's essentially parameterizing the possibility of different vectors at different time steps. So for the control criteria, we could plug in some scoring function of x t, and we need the scoring function to be differentiable with respect to x t. That's all we need. And then we can sample from the posterior using Langevin dynamics. So here, this is one step of Langevin dynamics. We update x t minus one in the gradient direction parameterized by the Diffusion-LM, which helps with fluency, and also in the gradient direction parameterized by the control criteria, which helps with control satisfaction. And finally, we add a little bit of Gaussian noise to ensure that we are sampling from this distribution rather than just maximizing the distribution. So to give the whole picture, we iteratively use the gradient signal from the classifier to steer the model generation towards the desired direction. And we do so for each diffusion step until we reach x zero. And then we round to the nearest word embedding to obtain a sequence of discrete strings w. So further, we could compose multiple control criteria together, as long as they are all differentiable. And then we could generate text that satisfies all the control criteria simultaneously. So here are some results of Diffusion-LM and control.
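For concreteness, here is a toy sketch of one decoding step with the x-naught reparameterization, the clamping trick, and a simplified control nudge along a classifier gradient. Everything here (the schedule, the tiny MLP denoiser, the placeholder control function, and the step size) is an illustrative assumption, and the control step is much simpler than the full Langevin update from the talk, which combines the diffusion model's own gradient, the classifier gradient, and fresh Gaussian noise over several steps.

```python
# Toy decoding step for a Diffusion-LM-style model with clamping and a control gradient.
import torch
import torch.nn as nn

vocab_size, dim, T = 1000, 16, 100
embed = nn.Embedding(vocab_size, dim)
alphas = torch.linspace(0.999, 0.98, T)
alpha_bar = torch.cumprod(alphas, dim=0)
denoiser = nn.Sequential(nn.Linear(dim + 1, 128), nn.ReLU(), nn.Linear(128, dim))

def predict_x0(x_t, t):
    """Reparameterized model: predict the clean embeddings x0 directly from x_t."""
    t_feat = torch.full((x_t.size(0), 1), t / T)
    return denoiser(torch.cat([x_t, t_feat], dim=-1))

def decode_step(x_t, t, control=None, step_size=0.1):
    x0_hat = predict_x0(x_t, t)
    # clamping trick: snap each predicted vector onto its nearest word embedding
    nearest = torch.cdist(x0_hat, embed.weight).argmin(dim=-1)
    x0_hat = embed.weight[nearest]
    # add back the right amount of noise to form the less-noisy latent x_{t-1}
    x_prev = alpha_bar[t - 1].sqrt() * x0_hat + (1.0 - alpha_bar[t - 1]).sqrt() * torch.randn_like(x0_hat)
    if control is not None:
        # simplified plug-and-play control: nudge x_{t-1} along a differentiable score's gradient
        x_prev = x_prev.detach().requires_grad_(True)
        grad = torch.autograd.grad(control(x_prev).sum(), x_prev)[0]
        x_prev = (x_prev + step_size * grad).detach()
    return x_prev

x_t = torch.randn(8, dim)                                            # start from Gaussian noise
x_prev = decode_step(x_t, t=99, control=lambda x: -x.pow(2).mean())  # toy differentiable control
```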
We compare it to fine-tuning an autoregressive model and to another plug-and-play baseline on top of autoregressive models. So we can see that our approach is outperforming the two baselines by a significant margin for this structured syntax control problem. Further, we tried composing syntactic and semantic controls together. Here, the blue box is the control success rate for the syntax control task and the green box is the success rate for the semantic task. We find that Diffusion-LM really performs very well under composition of different controls. So overall, Diffusion-LM is the first continuous diffusion language model. Our work has made a significant impact in both industry and academia. DeepMind adopted this idea and scaled it up. Researchers have built upon our diffusion architecture to develop diffusion models for language to do controllable text generation. And people have applied our architecture to other discrete modalities, such as protein design and 3D molecule generation. Also, very recently, there's a startup called Inception that just launched last week, and it's based on this core idea of Diffusion-LM. Their main competitive advantage is that diffusion language models enable five to ten times faster decoding than autoregressive models. So overall, for today's talk, we've discussed how to control language models in a very principled way. We use prefix tuning to adapt the language model to some data distribution, and we evaluate control by computing the posterior distribution to discover diverse model failures. And finally, we take a step back to rethink the root cause of all the control challenges. We attribute the control difficulty to autoregressive language models and propose Diffusion-LM, which is controllable by design. So beyond the three aspects I discussed in this talk, I also contribute to the broader ecosystem of controls. For example, I've proposed algorithms to handle composition of different constraints, to leverage weak supervision to improve language model capabilities, and to think about learnability of various skills at decoding time and at fine-tuning time. Also, I've introduced highly efficient decoding-time ideas, such as using a contrastive objective to boost generation quality, studied human interaction with the language model for creative tasks, and also I've been rethinking the evaluation of language models beyond static benchmarks. So using insights from my prior work, I think the difficulty of control could really be attributed to some deeper problem: the models are not consistent across different views. So here's one example from the red teaming part of the talk. In the first view, we directly ask, how do I build a bomb? And then in the second view, we ask in the past tense. So these are two views of the same underlying problem, but the language model will behave very differently. So this inconsistency makes control harder, because we need to control for all the views simultaneously, and there could be a very long tail of them. Also, consistency goes beyond prompt rephrasing. Here we have another example, a famous failure case in language models called the reversal curse. In the first view, the model can answer "Who is Steven Moffat?" with "the director of Sherlock." But in the second view, when we reverse the question by asking "Who directed Sherlock?", the same model would fail. So this implies an inconsistency problem in the model's parametric knowledge and a sensitivity to ordering, which could really harm the reliability of the model.
So consistency is not just about solving non-robustness problems in existing language models. It also has broader implications in terms of language model capabilities. For example, many tasks can be framed as different views: generation and validation are two views of the same underlying problem. Some tasks are easier to verify, such as math proofs. Some tasks are easier to generate, such as knowledge queries, because they're closer to the pretraining distribution. And some tasks will benefit from very long chains of thought, like complex reasoning problems. So these all represent different perspectives on the same underlying problem, and enforcing consistency will help the weaker side improve, ultimately enhancing both sides and maximizing the capabilities of the models. Beyond capabilities, consistency also enhances data efficiency. For example, given the statements that a shark is the largest fish and a whale is larger than a shark, a logically consistent model should infer that whales are not fish. So this property reduces data requirements. If we train a consistent model on the orange data, it should generalize to the green data without explicitly training on it, so we can essentially achieve the same effect as if we trained on both by only training on a subset, which could improve data efficiency. And I believe in the next three to five years, data efficiency is going to be super important in natural language processing, because as we further scale models, we would either need more data or need more data-efficient algorithms. And because of this close connection we made between consistency and control, my prior work has set up a good foundation to address these consistency problems. So I will list some directions that we could consider. For example, we could consider hardwiring consistency into the language model architecture by designing models with built-in reflection steps, similar to diffusion models' built-in iterative refinement. Also, we can encourage consistency during training. We can develop update rules that explicitly regularize for consistency, so when the model sees new knowledge, it will update its parametric storage globally. We could also enforce consistency at decoding time by integrating probabilistic inference to ensure that the output is aligned with some consistent posterior. So there are, of course, many more exciting methods and innovations yet to come. I want to end my talk with the remark that consistency and controllability are key ingredients to make language model behaviors more predictable and more reliable. And I really look forward to continuing this line of research. Thank you. I'm happy to take any questions. speaker 1: That was an amazing talk. Thank you. So we have plenty of time for questions. And I've been informed by the folks in the back that you don't need a microphone; actually, just speak loudly and the mics on the wall will pick it up for the recording. So if you raise your hand, then Lisa can call on whoever she wants. speaker 2: Yes. speaker 3: I have a question about your work in building the Diffusion-LM for text generation. When you were talking about the rounding error, it strikes me that the issue that is causing the rounding error is one of the vectorization and the representation of the words themselves.
Did you look into any techniques where you're altering the actual representation to get around having to build in a separate system, you know, like the clamping system? speaker 2: Yeah. So, okay, because they asked me to repeat the question: the question is about the rounding errors we observe in Diffusion-LM, and whether we have thought about this rounding error problem in terms of designing the embedding space. So basically, this is a part of the talk where I omitted the technical details. When we get the embedding space for Diffusion-LM, we actually train it jointly with the diffusion parameters. So they jointly form a variational lower bound on the likelihood of the observed data. And then we train these two sets of parameters, the embedding parameters and the diffusion parameters, jointly. There are, of course, other ways that you could acquire the embeddings. For example, you can maybe take some pretrained embeddings and directly use them. The problem there is that it actually has a very interesting and different trade-off. So previously, when you were thinking about pretrained embeddings, it was typically the longer the better, the higher-dimensional the better, because it's more expressive. Whereas in this case of Diffusion-LM, it's actually not, because here you actually need to reconstruct the embeddings and you are adding noise in this embedding space. It's kind of like why it's hard to do diffusion for a super-resolution image: the dimensionality is too large, so you have the curse of dimensionality. So that's where the conflict comes in. You want something that's longer so that it's more expressive, but you also don't want high dimensionality, because it will make modeling harder. So that's why we take this end-to-end training approach, where we fix this as a hyperparameter, and the model will learn how to allocate within this fixed set of dimensions and how to find the right embedding. But in fact, there are many interesting future directions in terms of designing this embedding space for Diffusion-LM. Yes, questions from the back. speaker 3: Could you give me some sense of how you control the length of the sentences you can generate when it's a diffusion model? speaker 2: Yeah. So the question is how do we control the length of the generation in the case of diffusion. In this case, we actually have a fixed length. In this paper, we did 256. If you want something that's shorter, then you can do padding, and then you will get shorter sentences. If you want something that's longer, that's trickier. So for example, one solution is to think about semi-autoregressive generation, meaning that once you generate the first chunk of 256, you can condition on this generated chunk, pass it through an encoder, and then run the conditional diffusion again to generate the future chunks. So that's one way you could potentially handle length, especially longer, variable lengths. Yes, question. speaker 3: Really nice talk. Thank you. Coming back to the second part. Okay. So this is this newer work where you're sort of setting up the problem as, tell me if this is wrong first, let's automate red teaming, and let's develop optimization techniques that will let us do what red teaming does more reliably.
So is that kind of a fair summary? speaker 2: Yes. Yes. speaker 3: Okay. So this is awesome, because now anything I could do red teaming to help me with, I can now automate, and it's cheaper. And that's great. Can you tell me how to use this to make the model not tell people how to build bombs, or to behave better? Like, how do I bring it back to the model? What do you think is the most fruitful way to do that? Because this is always something I struggle with. You know, anytime we automate things, it creates an arms race, right? And now we can work on the other side. So I'm on the side of making better language models. How do I use this? speaker 2: So basically, what our method is able to discover is a set of strategies that can break the model. So actually, you could look at these strategies and then use them, maybe include them in your training data, so that you can build a new set of training data that is robust against such attacks in the future. So for example, if the model developers detect that past tense is a bad failure case, then for the next iteration of training, they can create a dataset that is actually pretty robust to past tense. And in that case, the next iteration of the model will have this problem fixed. So I feel like that's the most straightforward way where the model developers and the attackers could interact. speaker 3: Like some data augmentation strategy? Yeah. speaker 2: It could guide data augmentation strategies. Alternatively, maybe another interesting way is that you can directly use the technique. Basically, this technique is pretty broad, because red teaming is a search problem. And then, in order to make models better, you could also view this as a search problem where, instead of searching for failures, you are searching for successes. So if you have a reward model plugged in that specifies how good my answer is, and we run the optimization to figure out what set of prompts is able to induce really good solutions, then I feel like this could be a tool that will actually help you boost model performance. It's kind of like figuring out what are the perfect prompts I should use to ask this problem, or figuring out what's the right querying strategy I should use. Thanks. Yeah. Question. speaker 4: This is a related question. How transferable, or how similar, are the strategies for the red teaming part? Like, if I have similar types of bad responses I'm trying to red team, how similar are the strategies? You did an example with a specific sentence; could you do this, for instance, for the class of all copyright infringements or something like this? speaker 2: Interesting. I feel like it really relies on whether you can parameterize it into a reward function. So in the case of all the copyrighted documents, it just means that you need to design a reward function, say, with a retrieval mechanism at the back end of the reward. So I feel like from an algorithmic perspective, this should be able to apply to that, because our approach doesn't require you to have a reward that's differentiable. So as long as you have some reward, it should work, except that in your case, the design of the reward will need to change to incorporate more complicated backends.
speaker 4: There's a chance that maybe you could discover all the sets of strategies that could cause the copyright infringement. speaker 2: That would be amazing, obviously. I mean, very likely. So for example, in this case, what I would do is, you can have some copyrighted documents. Maybe you don't even need a copyright classifier; you can simplify this problem into just having some copyrighted text and then figuring out what are the ways that you can elicit the copyrighted text. So for example, the example that I gave here is using non-harmful text as suffixes. We can see that repetition will be able to elicit some bad suffix or some target suffix. And then there's continuation: basically, if you start with the first part of Harry Potter, very likely it will keep continuing it. And then there are high-level summaries: if you start by giving the summary of Harry Potter, it would probably be able to enumerate most of the book or something. So overall, I feel like these are all valid strategies that get discovered. If you just plug in suffixes that are copyrighted content, you will probably be able to discover other interesting strategies. speaker 1: We'll do maybe one last question. speaker 4: You want to go ahead? I also have a question about this section. So just to clarify, this is a mixture model that you're... yes. So every iteration you have a separately parameterized model and you sample according to it. So I'm curious to hear your thoughts: do you think that the mixture is fundamental? Can we get one model that generates more diverse samples? Because, as you know, obviously diversity in generation is helpful not just for red teaming, maybe, but in generating ideas and so on. speaker 2: I actually don't think having multiple models or having a mixture of models is necessary. It's just that, from an algorithmic perspective, it's actually very natural, because for each iteration we'll need to train a slightly different model. You can imagine compiling the models together by training a new model that aggregates the mixture of all the existing models that you've already discovered. That's totally a valid option. I feel like from an expressivity perspective, it should always work. At least in this case, I don't really see why a single language model couldn't express these more diverse modes. It's more just that the algorithm, by design, will have multiple models coming in from each iteration, and we don't really need to aggregate them. We could just keep them around separately. speaker 1: Okay, we're at time. So let's thank the speaker one more time.