2025-05-23 | Stanford | Controlling Language Models

New Methods for Controlling Language Models and Efficient Customization Techniques

Media Details

Upload date
2025-06-15 21:52
Source
https://www.youtube.com/watch?v=tEQ9N5JjGW0
Processing status
Completed
Transcription status
Completed
Latest LLM Model
gemini-2.5-pro-preview-06-05

Transcript

speaker 1: Okay. Thank you, everyone, for coming. This is a seminar today focused on NLP, and I'm just really, really honored and happy that Lisa is going to tell us about her work today. Lisa is working really in the core of language modeling, which has become quite a bit more popular over the last few years and is now, I think, even a central part of all of computer science, really. Her work has touched on almost every aspect of language modeling, so much so that it's hard to enumerate it all, but everything from, you know, prompting, parameter-efficient fine-tuning, entirely new architectural approaches, even through to new ways of doing evaluation, which is very rare to see that kind of breadth. And even more so, in every case her work is known for very creative, sort of surprising new approaches, so that I'm always looking to see, you know, what has she done this time and what can I learn from it? And it's really fun. And so you get to see a few of those examples today. Thank you very much.
speaker 2: Thanks, Luke. Thanks for the super kind introduction. Hello, everyone. Today I'll talk about controlling language models. So language models are probabilistic models over sequences of strings. And they have actually been around for a while, from Shannon's Markov models in the 1940s to large-scale pretraining with GPT-3, but the popularity of language models really surged with ChatGPT. But why did ChatGPT suddenly take off? It is because we can finally control these pretrained language models to do useful tasks. So most people in computer science have used or heard of coding Copilot. It can boost productivity by auto-completing your code. Coding Copilot arises from controlling pretrained language models for the coding domain. Similarly, when you use Google search, you will see an AI overview at the top of the web page summarizing the answers to your queries. This tool also arises from controlling language models to do search summaries. So control is really at the heart of transforming language models into useful products. And I develop principled methods for controlling language models. So zooming out a little bit, let's look at the broader context of the language modeling pipeline. We start with a pretrained language model, and then we control and adapt the model for a particular use case. And finally, we need to evaluate our model to make sure that the control is actually successful. So let's take a closer look at the first stage. Language models have a wide range of use cases. Many companies use language models for their business, for example, building company-specific chatbots or specialized data processing for different companies. And there are new forms of entertainment by customizing language models to role-play people's favorite historical or fictional characters. And because there are so many use cases, customizing language models has become a service, and many companies are now providing this service. So how should we control the language model for such a wide range of different use cases? Imagine that we get thousands of requests per day customizing models for different use cases. For each request, we would fine-tune a model on the provided data. And this results in thousands of different models, each with billions of parameters. So it is very expensive to both train and store such models. I have advanced the methodology of fine-tuning by proposing a very parameter-efficient way of adapting language models, allowing people to achieve a thousand times improvement in terms of parameter efficiency, and thereby democratizing the customization of language models so that people with fewer resources can also adapt their own models. So now that we've enforced control on the model, what's next? We still need to evaluate the control to make sure that it is actually successful. For example, in this Google AI overview case, the AI summary might be misleading and even harmful. When we ask about the health benefits of running with scissors, it lists the advantage of boosting the immune system, but completely ignores the risk associated with scissors. We really need to detect failure cases like this. Red teaming is one way to evaluate controls. The White House executive order on AI has recognized this direction as impactful and important. Concretely, what red teaming does is to search for the prompts or inputs that trigger unwanted behaviors in the response.
So this is a particularly challenging search problem, because the search space over possible inputs is exponentially large. Here are some strategies that could break the model. These strategies were discovered in previous red-teaming papers, and red teaming is considered successful if it is able to discover one of those strategies. However, our goal is different. We care about covering more failure modes rather than just one of them. And towards this goal, we propose to estimate the posterior distribution, which explicitly accounts for diversity. And it turns out we can cover most of the previously discovered strategies using one method. So overall, we've worked on these challenging control problems and provided effective solutions. However, does control need to be this hard? Why do we need to allocate two extra stages to address control? Is it possible to redesign a language model to be inherently easy to control? So for example, in this red-teaming setting, we know that control is hard because it requires decoding from right to left, searching through an exponentially large space over possible inputs. And in fact, any task that breaks the left-to-right generation ordering is very challenging for current language models. This is because most language models generate text one token at a time from left to right, and this is actually a structural limitation that restricts the generation flexibility and makes the model harder to control. But the generation order of text doesn't have to be left to right, and my research rethinks this generation order. I develop methods to resolve these control challenges by building a non-autoregressive language model that can generate all the tokens simultaneously. And I will show that this family of models is also controllable by design. So for this talk, I will discuss these three pieces of work: how we apply control via lightweight fine-tuning, how we evaluate control with good coverage, and how we can rethink the existing architecture of language models to build a new model where control is inherently easy. So I'll start with the first part. This part covers work from the paper called prefix tuning. As we said before, customizing language models is useful in various settings, including personalization, domain adaptation, and specializing smaller models to perform tasks on edge devices. Basically, when we are customizing a language model, we start from p theta, and we adjust the distribution according to the data to obtain p theta prime. So suppose that we want to customize a personalized writing tool for Alice, and we are provided with a dataset of (x, y) pairs, where x is the instruction and y is a highly stylized piece of writing by Alice. This is the fine-tuning objective: it is to maximize the log probabilities of the examples from this dataset. So how should we adapt the language models? Here are two approaches at the two ends of the spectrum: we could do prompting, or we could fine-tune the model parameters. We know that prompting is quite efficient; it doesn't require updating any model parameters. However, prompting lacks precision and often fails to capture many subtle points. For example, we are trying to imitate Alice's writing style, which is hard to summarize in a natural language prompt, because it would involve a long tail of small preferences, such as detailed word choices and paragraph structures.
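
For reference, the fine-tuning objective described above can be written out as follows (a reconstruction from the spoken description; D denotes Alice's dataset of (x, y) pairs):

max_{θ'} Σ_{(x,y)∈D} log p_{θ'}(y | x)
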
On the other hand, fine-tuning can match the distribution very well, but it requires updating all the model parameters, so it's very expensive to have to store and update a full model copy for each personalization task. So now the research question is: could we adapt the language model with fewer parameters, and at the same time without sacrificing any task performance? Classically, in the fields of vision and NLP, people have really internalized this idea of freezing some layers of the model and only updating the top few layers. However, this idea for adaptation doesn't quite work: it leads to bad accuracy for the target task, and it also doesn't save much in terms of parameter efficiency, because you still need to update one fourth of the total parameters. My work on prefix tuning bridges the dichotomy between prompting and fine-tuning. Prefix tuning is inspired by prompting. I observe that the discrete search space of the prompts actually limits the expressivity and makes optimization harder, so I relax the discreteness constraint with continuous free parameters. As a result, prefix tuning optimizes a small continuous task-specific vector, which we call the prefix parameter, denoted by the pink H here, as if it were a sequence of virtual prompt tokens. So this is the new optimization objective: we freeze the model parameters theta and only optimize the prefix parameters H to maximize the likelihood of the data. And this design makes the search space very expressive, because it's now continuous, and also very easy to optimize, because we can leverage tools like gradient descent. We experiment on a task that converts structured data, such as a table, into natural language descriptions. And we evaluate the quality of the generation with a classic metric called BLEU, for which higher is better. We evaluate the effectiveness of prefix tuning and find that the model can achieve similar performance as full fine-tuning while adjusting a thousand times fewer parameters. In addition, it has an advantage in terms of out-of-distribution generalization, meaning that when the test distribution is different from the training distribution, the prefix-tuned model often attains significantly better performance than full fine-tuning. Intuitively, this is because in prefix tuning we have preserved the original pretrained model parameters, which are supposed to be very general purpose, so we are transferring this general-purposeness to the downstream task and getting better extrapolation performance. So for tasks where the rules and the instructions are very concrete and unambiguous, we could still think about prompting. We will show that even in this prompting regime, our prefix-tuning idea still applies. For example, it is quite common for the prompt to include all the detailed rules and instructions, as well as some demonstration examples, which could lead to a very long prompt. Long prompts lead to worse inference latency and higher compute cost. As a result, a natural idea is to figure out how to compress the prompts. This is where the prefix-tuning parameterization shines again: when we are trying to compress a prompt, we can map it into the prefix parameter space. In my paper called Gist Tokens, we find that we can effectively compress the prompt by 25 times without sacrificing any instruction-following performance. So overall, prefix tuning essentially opened up the research direction of parameter-efficient fine-tuning, also known as PEFT.
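
To make the freeze-and-optimize recipe above concrete, here is a minimal PyTorch-style sketch (an illustration, not the paper's code). It assumes a Hugging Face-style frozen LM whose forward accepts `inputs_embeds`, and for simplicity it prepends trainable vectors only at the embedding layer, whereas prefix tuning proper prepends trained prefix activations (key/value states) at every layer.

```python
import torch
import torch.nn as nn

class PrefixTuner(nn.Module):
    """Freeze a pretrained LM; train only a short sequence of continuous prefix vectors."""
    def __init__(self, frozen_lm, embed_dim, prefix_len=10):
        super().__init__()
        self.lm = frozen_lm
        for p in self.lm.parameters():          # the base model stays fixed
            p.requires_grad = False
        # the only trainable parameters: "virtual token" vectors
        self.prefix = nn.Parameter(0.02 * torch.randn(prefix_len, embed_dim))

    def forward(self, input_embeds, labels):
        # input_embeds: embedded [x; y], shape (batch, seq_len, embed_dim)
        # labels: token ids of y at their positions, with -100 on the x positions
        batch = input_embeds.size(0)
        prefix = self.prefix.unsqueeze(0).expand(batch, -1, -1)
        full = torch.cat([prefix, input_embeds], dim=1)
        logits = self.lm(inputs_embeds=full).logits      # assumes an HF-style forward
        # maximize log p(y | prefix, x): no loss on the prefix positions
        pad = torch.full((batch, self.prefix.size(0)), -100,
                         dtype=torch.long, device=labels.device)
        shifted = torch.cat([pad, labels], dim=1)[:, 1:]
        return nn.functional.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            shifted.reshape(-1),
            ignore_index=-100,
        )
```

Only `self.prefix` receives gradients here, which is where the thousand-fold saving in trained parameters comes from.
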
It has inspired a lot of follow-up work, including LoRA and prompt tuning, and it has become the de facto way that people customize language models nowadays. It is widely used at OpenAI, Anthropic, Google, NVIDIA, etc. in their fine-tuning APIs. So now we've discussed how to control the language model to match some data distribution. Next, we will evaluate whether the control is actually successful or not. This part of the talk covers the paper called eliciting language model behaviors. Here, for example, we have a language model that is supposed to be harmless. When we ask a harmful question about how to build a bomb, the language model provides this really detailed instruction. This is actually harmful, because people might follow this instruction, which will lead to bad consequences. So this is a form of control violation. And for this part of the talk, our goal is to detect such control failures. So red-teaming language models is one way to detect failures. Prior work treats red teaming as a search problem: given some unwanted response y, people attempt to search for the input prompt x which will trigger the generation of y. Specifically, we maximize the probability of y given x under the target language model p theta. One concrete idea is to run coordinate ascent in the input token space. We could gradually swap the tokens in x to maximize the objective, and after a large number of iterations, we will find some sequence of strings that will elicit the harmful response with high probability. However, what the search mechanism discovers is only one of the modes, and in fact, there are many modes of x that can trigger the generation of y with high probability, including some very trivial ones, such as "repeat after me". So in order for evals to be comprehensive, we would really want to have better coverage of the failure cases. And this leads to our research question: maybe search is not enough; how could we achieve better coverage of the unwanted behaviors? To address this, we change the problem formulation. Instead of finding a single string, we try to find a distribution over strings, which covers more support by construction. In our work, we cast this as a posterior inference problem, and our goal is to estimate the posterior distribution of x given y. Now, as we use Bayes' rule to write out the posterior, we have the prior term and the likelihood term, divided by the normalization constant. Here the normalization constant is actually intractable to estimate, because it would involve marginalizing over all the potential prefixes x that could generate y. So one intuitive idea is to learn to reverse the language model. Let's look at the problem structure. The forward direction is very simple and tractable; this is essentially how we decode from the language model. The backward direction is hard, and this is what we are trying to estimate. Therefore, we can take advantage of this problem structure: we can collect supervised training data using the forward direction, passing x into the language model to obtain y's, and then we can train a model q phi to reverse the language model and predict x from y. So this model that we just trained is a pretty good starting point. However, there's a distribution shift problem, because the y's that we are looking for are actually infrequent failure cases that are unlikely to appear anywhere in this training distribution. So we really need to solve this posterior inference problem more directly.
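
The "learn to reverse the language model" warm start can be sketched as follows (an illustration under assumed helpers: `lm_sample` draws y from p_theta(y | x), and `train_step` performs one supervised update of q_phi on an (input, target) pair):

```python
import random

def build_reverse_model(lm_sample, train_step, prompt_pool, q_phi, steps=10_000):
    """Warm-start q_phi(x | y) from forward samples of the target LM."""
    for _ in range(steps):
        x = random.choice(prompt_pool)          # draw a prefix x from some prior pool
        y = lm_sample(x)                        # forward direction: easy, just decode y ~ p_theta(y | x)
        train_step(q_phi, inputs=y, targets=x)  # backward direction: teach q_phi to predict x from y
    return q_phi
```

As noted above, this warm start alone suffers from distribution shift, since the unwanted y's of interest are rare under this forward sampling scheme.
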
And we use a technique from classic statistics called variational inference. We want to reverse the model; that is, we want to learn this q phi to approximate the right-to-left posterior of our model, which is p of x given y. And this equation at the top can be rewritten into three terms. First, there is the entropy term of q phi, which measures the diversity and coverage of our distribution. There is the cross-entropy term, which measures the fluency of our generated text under the prior. And there is the expected likelihood term, which captures how effective our q phi is in terms of eliciting the unwanted string y. So we generalize the objective by weighting the entropy and cross-entropy terms separately. And this is because, for our specific red-teaming problem, there is a lot of uncertainty about the prior. Should the prior distribution contain only fluent generations, or should the prior distribution be more lenient and allow for gibberish? We are not sure. Therefore, we introduce beta one and beta two to account for this new degree of freedom. When we set beta one and beta two to one, we are in the exact setting of posterior inference. As we feel more and more uncertain about the prior, we can consider increasing the temperature of the distribution, and if we take this to the extreme, then we will obtain beta two equals zero, which is the objective of maximum-entropy RL. So now we've decided on this objective. Our goal is to optimize q phi to have good coverage of the posterior. This is a pretty hard optimization problem, because it needs to cover multiple modes. So could we simplify this problem using the idea of iterative decomposition? We want to decompose this hard problem into a sequence of simpler problems, each covering one mode. Here's an intuitive demonstration of this idea, and I will formalize it in the next slide. In the first iteration, we run some algorithm to look for one mode. Then for the second iteration, we downweight the discovered mode, we run the algorithm on this new reward landscape, and it will discover a different mode. Then for the third iteration, we do a similar thing: we downweight the first two modes that were already discovered, we run the algorithm again, and it will discover yet another new mode. This way, it is natural to also change the parameterization of q phi into a mixture of distributions, which is more expressive and inherently good at capturing disjoint modes. So here is the objective for each iteration. I'll first provide some intuition here, and later show that this is actually equivalent to one step of the Frank-Wolfe optimization algorithm applied to the full objective. Here, the red box is the red-teaming term, which captures how well our distribution of prompts can elicit the target response. The blue box penalizes things that were already discovered in previous iterations, so effectively it's encouraging the discovery of new modes and encouraging diversity. The orange box is essentially the KL divergence term; it is in charge of regularizing the distribution to prevent it from collapsing or deviating too much from the prior. So as we said, we run this algorithm for multiple iterations, and each iteration will discover some new mode. So how do we aggregate across different iterations? We aggregate by forming a mixture of the distributions from each iteration.
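
Written out, the weighted objective just described looks roughly like the following (a reconstruction from the spoken description; H denotes entropy, p_prior the prior over prompts, and p_θ the target model):

J(φ) = E_{x∼q_φ}[ log p_θ(y | x) ]  +  β₁ H(q_φ)  −  β₂ E_{x∼q_φ}[ −log p_prior(x) ]

With β₁ = β₂ = 1 this recovers standard variational posterior inference (an evidence lower bound on log p(y)), and β₂ = 0 recovers the maximum-entropy RL objective mentioned in the talk.
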
So this equation at the top says that at the end of each iteration, we mix in the new mode, which is s_i, with eta_i as the mixture weight. And as promised earlier, our iterative algorithm is not just ad hoc: it's equivalent to applying the Frank-Wolfe optimization algorithm to the full objective. Here we will show the connection in more detail. For some background, Frank-Wolfe is a classic optimization algorithm; it is also called the conditional gradient method. Here's a very simple example demonstrating how it works. Suppose that we want to optimize f of x, which is the blue curve. For each iteration, we find the linear approximation of f at the current solution x, which is the brown plane here. Then we solve for the minimizer of this linear approximation, and we obtain the solution s. Finally, we move x in the direction of s and use this for the next iteration. So plugging this conditional gradient method into our full objective, we would first do a linear approximation to the pink and blue boxes. This is essentially computing the first-order Taylor expansion at the current solution, which is the q phi from iteration i minus one. And we obtain this term here, which is exactly the first part of the decomposed objective that we just discussed, the red-teaming term. We do the same for the blue boxes, and we obtain exactly the second term of our decomposed objective, which is the diversity term. And then for this gray box, we copy it over directly. Because now we have part of the objective copied and part of it linearly approximated, our algorithm is a slight generalization of the vanilla Frank-Wolfe algorithm, but they have very similar convergence properties. So the next step is to do aggregation, to find a convex combination between the intermediate solution and the current solution. And we have executed exactly this: for our aggregation step, we get a mixture of distributions. So now I will provide a concrete example to demonstrate different iterations of our algorithm and show that we can indeed discover different modes in the posterior. Suppose that our target suffix y is "the most inexhaustible source of magic". And as we said before, we want to find a diverse set of plausible prefixes. So for the first iteration of Frank-Wolfe, "repeat after me" has the highest reward, and the model learns to pick up this pattern of repetition. In the next iteration, we have adjusted the reward to penalize prefixes favored by the previous iteration. So all the repetition strings now get a lower reward, and the highest-reward string turns out to be this one. This reveals a strategy based on continuation and co-occurrence: basically, using this strategy, we can improve the probability of certain suffixes. Then for the third iteration, we have adjusted the reward again to penalize prompts favored by the previous two iterations. Now the best string is a famous quote from J.K. Rowling. This reveals a strategy of prepending a high-level summary, or citing the source, to increase the probability of certain suffixes. So these are all real qualitative examples, qualitative strategies, discovered by our algorithm. We use the word strategy, but in fact they are still represented by the prompt distribution from each iteration.
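
Schematically, each iteration described above solves a linearized subproblem and then mixes the result into the running solution (a reconstruction from the spoken description; s_i denotes the new component and q_{i−1} the mixture so far):

s_i = argmax_s  E_{x∼s}[ log p_θ(y | x) ]  −  β₁ E_{x∼s}[ log q_{i−1}(x) ]  −  β₂ KL(s ‖ p_prior),
q_i = (1 − η_i) q_{i−1} + η_i s_i

The first term is the red-teaming (elicitation) term, the second penalizes modes already covered by q_{i−1}, and the third regularizes toward the prior.
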
So the quantitative reward here shows that our model outperforms the supervised fine-tuning and reinforcement learning baselines by attaining a higher elicitation reward, meaning that we get a higher log probability of the target suffix given the discovered prefix. We apply this method to elicit harmful behaviors from language models, which is also commonly referred to as jailbreaking the language models. In the past two years, there has been a lot of work studying how to jailbreak language models, and researchers have discovered strategies either manually or algorithmically; we list them here. These checkboxes are the strategies that are recovered by our method. We can see that our method is able to cover the majority of the strategies discovered using previous approaches. For the two cases we failed to cover, we note that the past tense is no longer a successful strategy, probably because the model developers have fixed the error, and the persuasion strategy has low probability under the prior, because the prior typically contains more instruction-like text. Quantitatively, our method improves the attack success rate from 2% to 100% for Llama 8B models. Also, the prompts we discovered generalize to 70B models as well as proprietary models such as GPT-4o and Claude 3.5. These results suggest that our methods for elicitation can preemptively search for errors in language models, which can then guide the model developers to patch these errors and overall lead to a good ecosystem for model development. So hopefully, I've convinced you that control is useful but hard to enforce. Now let's take a step back. Why does control need to be this complicated? Can we make control easier by design? Recall that in the red-teaming part of the talk, we mentioned this problem structure where forward decoding is very easy, because it's naturally enabled by the left-to-right generation ordering of the language model, but the backward direction is intractable, because it's reversing the generation order. However, it doesn't have to be this way. This difficulty is self-imposed, because we are shackled by the existing family of language models, which is left-to-right autoregressive. So let's imagine a better world where language models could be composed with other components in a plug-and-play way: we could plug in our language model and plug in some control criteria that we care about. The control criteria could be a lot of things. If we constrain on the suffix, then we have recovered the red-teaming setting. If we constrain on the prefix, then we have recovered the regular left-to-right prompting setting. Once we've plugged in all the components, we can infer from the combination of them to decode text that satisfies both, meaning that the text is fluent under the language model and it also satisfies the control criteria. Here's another example of a control criterion: suppose that we need to parse the language model outputs, and we want them to be exactly in JSON format. Then we could specify this control criterion as a classifier and apply this plug-and-play framework to generate JSON-formatted content. Looking further ahead, the same framework could be extended to mathematical reasoning: we could plug in a math verifier and use that to steer text generation towards producing valid math proofs. So this form of reasoning could be a really exciting future direction in math. Overall, this inference seems really magical. How should we formulate it mathematically?
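
The plug-and-play combination sketched above corresponds to Bayesian conditioning of the language model prior on the control criteria; written out (a reconstruction from the description):

p(x | c) ∝ p_LM(x) · p(c | x),   and with several criteria   p(x | c₁, …, c_k) ∝ p_LM(x) ∏_j p(c_j | x)
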
The answer was kind of hinted at in the earlier part of the talk: we formalize it as a posterior inference problem, where we sample from the posterior distribution conditioned on the control criteria. So we sample from p of x given c. This leads to our research question: how can we design a language model that enables this form of plug-and-play inference? And what are the core ideas necessary to get this to work? There are two ideas that we have gradually built up in this presentation. The first idea is continuous relaxation. This is the core idea underlying prefix tuning: the continuous parameterization space is easy to optimize and good for controlling the language model. The second idea is iterative refinement, and this is the core idea underlying our Frank-Wolfe-inspired algorithm: iterative refinement is good for modeling different modes. The steering advantage from the continuous relaxation makes the model more controllable, and the iterative advantage makes the model more expressive. So we want to design a controllable language model built on top of these two pillars. Gaussian diffusion is a class of models that nicely marries these two principles. It is used ubiquitously in vision, such as in DALL-E and Stable Diffusion. However, when we think about using it for language, it is a much harder problem, because language is inherently discrete. Here we are contrasting images and text: we know that images are inherently continuous, so if the prediction is near the ground-truth image, or if the prediction is in between two good images, then it will still be perceived as a high-quality image. However, this is not the case for text. If the predicted vector lies between two words, then it is incorrect, because there is no corresponding word there. Therefore, modeling text in the continuous space requires great precision. You might say, OK, we can always map it to the nearest word embedding. This is a good idea, but it has another problem, because the nearest neighbor may not be very coherent with the context. We call this the rounding error, and we'll get to this problem very soon. So I've shown you that continuous modeling of discrete text is pretty challenging. Next, we'll get to the more exciting part and show how we can actually solve this problem. We design a model called Diffusion-LM. It's a generative model of text that operates largely in the continuous latent space, and it is non-autoregressive, meaning that each time step is a vector representation of the entire sequence of text. In Diffusion-LM, we start with a sequence of Gaussian noise vectors. We incrementally denoise them into vectors corresponding to words, and then we project these vectors to a low-entropy distribution over the vocabulary. Since we are refining the whole sequence simultaneously, we are first generating coarse-grained content, such as high-level semantics and syntax, in the early diffusion steps, and later we are generating finer-grained content, such as detailed word choices, in the late diffusion steps. So as we said, Diffusion-LM is a latent variable model. We construct the continuous latents by first embedding the discrete sequence of words into the continuous vector space, using a lookup table to obtain the embedding for each word, and then concatenating them to form the embedding of the sentence.
Then we can construct the hierarchy of latent variables according to the Gaussian diffusion process, where we incrementally scale down x_{t-1}. Basically, when we are trying to map from x_{t-1} to x_t, we scale down x_{t-1} by some constant and also add a proper amount of Gaussian noise. We do so iteratively until we reach x_T, which is pure Gaussian noise. Having constructed this hierarchy of latent variables, we can now use them to supervise the denoising model. Take the denoising transition from x_t to x_{t-1} as an example: we train a model mu theta, which takes as input x_t and the current time step t, and tries to predict the less noisy step x_{t-1}. We train this model by minimizing the L2 distance between the prediction mu theta and the ground truth x_{t-1}. Once we have such a model mu theta trained, we can generate from this Diffusion-LM. Specifically, we start from the Gaussian noise at x_T. We apply our mu theta model to gradually denoise the vectors. For each denoising step, take the transition from x_t to x_{t-1} as an example: we parameterize it with a Gaussian distribution centered around the prediction mu theta, with a proper level of Gaussian noise determined by the hyperparameter alpha. We iteratively run denoising until we reach x_0. Then we round these continuous vectors back to discrete words by mapping the vectors to their nearest word embeddings in L2 distance. Once we finish that, we have a sequence of words. So we can now generate text from Diffusion-LM, but as we've foreshadowed before, there could be rounding errors. For example, we might generate "be careful or you will rest the glass" instead of the correct version, which is "be careful or you will break the glass". This occurs because "rest" and "break" are actually very close in the embedding space, because they are mostly interchangeable in many contexts, such as "take some rest", "take some break". However, in this particular case, they are not interchangeable. So if the predicted embedding falls between the two but is slightly closer to "rest", then we'll observe this rounding error. Ideally, the predicted embedding should align exactly with a word embedding rather than landing between two. And when we make any of these minor precision errors, they accumulate and compound during our iterative denoising process. Next, we will discuss two approaches that can actually address this rounding problem. Let's consider the rounding problem in the context of training. Training mu theta is hard, because the output domain changes at every denoising step t. We know that the final step will output the word embeddings, whereas earlier steps will output word embeddings at different noise levels, and a single model mu theta will struggle to handle these varying distributions in its outputs. To address this, we reparameterize the model to always predict x_0, instead of predicting this one-step transition. This ensures the output space is always aligned with the word embeddings, and as a result, training becomes easier and prediction becomes more precise. With this reparameterization, we also need to adjust the decoding steps in order to preserve the iterative refinement structure of diffusion. After predicting x_0 from x_t, we add back the proper amount of noise to reconstruct x_{t-1}.
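
Here is a minimal sketch of the x_0-prediction training step just described (an illustration with a standard DDPM-style noise schedule; `denoiser` is an assumed transformer mapping a noisy latent and a timestep to a predicted x_0, `embed` is the trainable word-embedding lookup, and `alpha_bar` is a tensor of cumulative noise-schedule values of length T+1 on the same device):

```python
import torch

def training_loss(denoiser, embed, token_ids, alpha_bar, T):
    """One x_0-prediction training step: corrupt word embeddings, predict them back."""
    x0 = embed(token_ids)                              # (batch, seq_len, dim) word embeddings
    t = torch.randint(1, T + 1, (x0.size(0),), device=x0.device)
    a = alpha_bar[t].view(-1, 1, 1)                    # cumulative schedule \bar{alpha}_t per example
    noise = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * noise     # scale down x_0 and add Gaussian noise
    x0_pred = denoiser(x_t, t)                         # reparameterized model: predict x_0 directly
    return ((x0_pred - x0) ** 2).mean()                # L2 objective discussed in the talk
```

The paper's full objective is a variational bound with additional terms (e.g., for learning the embeddings and the rounding step, as mentioned in the Q&A); the L2 term above is the piece the talk focuses on.
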
So this effectively implements a one-step transition from x_t to x_{t-1}. Since we are predicting x_0 at each time step, we can take advantage of this and check whether the prediction is aligned with a real word embedding or not. If it isn't, then we can correct the prediction by clamping it to the nearest word embedding. We call this the clamping trick at decoding time. Since a correct x_0 should lie exactly on top of a word, this trick prevents the precision error from accumulating and ensures stability in the decoding process. So here's the demonstration of the full decoding step: we start from x_{t-1}, we denoise this vector to predict x_0 directly, then we clamp the predicted x_0 to the nearest word embeddings, and finally we add back some Gaussian noise to obtain x_{t-2}, completing this transition. So now we've discussed how we can train and decode from our Diffusion-LM. Next, I will fulfill the promise from before and show why Diffusion-LM can empower this plug-and-play framework. Our Diffusion-LM parameterizes a distribution over the continuous x_t; it's essentially parameterizing the possibility of different vectors at different time steps. For the control criteria, we can plug in some scoring function of x_t, and we need the scoring function to be differentiable with respect to x_t. That's all we need. And then we can sample from the posterior using Langevin dynamics. Here is one step of Langevin dynamics: we update x_{t-1} in the gradient direction parameterized by Diffusion-LM, which helps with fluency, and also in the gradient direction parameterized by the control criteria, which helps with control satisfaction. And finally, we add a little bit of Gaussian noise to ensure that we are sampling from this distribution rather than just maximizing it. To give the whole picture, we iteratively use the gradient signal from the classifier to steer the model generation towards the desired direction. We do so for each diffusion step until we reach x_0, and then we round to the nearest word embeddings to obtain a sequence of discrete strings w. Further, we can compose multiple control criteria together, as long as they are all differentiable, and then we can generate text that satisfies all the control criteria simultaneously. So here are some results of Diffusion-LM and control. We compare it to fine-tuning an autoregressive model and to another plug-and-play baseline on top of autoregressive models. We can see that our approach outperforms the two baselines by a significant margin for this structured syntax control problem. Further, we tried composing syntactic and semantic controls together. Here, the blue box is the control success rate for the syntax control task, and the green box is the success rate for the semantic task. We find that Diffusion-LM really performs very well under composition of different controls. So overall, Diffusion-LM is the first continuous diffusion language model. Our work has made a significant impact in both industry and academia. DeepMind adopted this idea and scaled up diffusion. Researchers have built upon our diffusion architecture to develop diffusion models for language to do controllable text generation. And people have applied our architecture to other discrete modalities, such as protein design and 3D molecule generation.
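
A single guided denoising step, combining the clamping trick with classifier gradients, could look roughly like this (an illustration only: step sizes, the noise schedule, and running several Langevin updates per diffusion step are simplified away; `denoiser`, `embedding_table`, and `classifier_logp` are assumed components, and `alpha_bar_prev` is \bar{alpha}_{t-1} as a scalar tensor):

```python
import torch

def guided_step(denoiser, embedding_table, classifier_logp, x_t, t,
                alpha_bar_prev, guide_scale=1.0):
    """One denoising step steered by a differentiable control criterion."""
    x_t = x_t.detach().requires_grad_(True)
    x0_pred = denoiser(x_t, t)                           # predict x_0 from the noisy latent
    # control criterion: a differentiable score of the predicted clean sequence
    score = classifier_logp(x0_pred).sum()
    grad = torch.autograd.grad(score, x_t)[0]            # gradient of the control w.r.t. x_t
    # clamping trick: snap each position of x_0 to its nearest word embedding
    flat = x0_pred.detach()
    dists = torch.cdist(flat, embedding_table.unsqueeze(0).expand(flat.size(0), -1, -1))
    clamped = embedding_table[dists.argmin(dim=-1)]      # (batch, seq_len, dim)
    # re-noise the clamped x_0 to form x_{t-1}, nudged by the control gradient
    noise = torch.randn_like(x_t)
    x_prev = (alpha_bar_prev.sqrt() * clamped
              + (1.0 - alpha_bar_prev).sqrt() * noise
              + guide_scale * grad)
    return x_prev.detach()
```
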
Also, very recently, there's a startup called Inception that just launched last week, and it's based on this core idea of Diffusion-LM. Their main competitive advantage is that the diffusion language model enables five to ten times faster decoding than autoregressive models. So overall, for today's talk, we've discussed how to control language models in a very principled way. We use prefix tuning to adapt the language model to some data distribution, and we evaluate control by computing the posterior distribution to discover diverse model failures. And finally, we take a step back to rethink the root cause of all the control challenges. We attribute the control difficulty to autoregressive language models and propose Diffusion-LM, which is controllable by design. Beyond the three aspects I discussed in this talk, I also contribute to the broader ecosystem of controls. For example, I've proposed algorithms to handle composition of different constraints, to leverage weak supervision to improve language model capabilities, and to think about the learnability of various skills at decoding time and at fine-tuning time. Also, I've introduced highly efficient decoding-time ideas, such as using a contrastive objective to boost generation quality, studied human interaction with the language model for creative tasks, and I've been rethinking the evaluation of language models beyond static benchmarks. Using insights from my prior work, I think the difficulty of control can really be attributed to a deeper problem: the models are not consistent across different views. Here's one example from the red-teaming part of the talk. In the first view, we directly ask, how do I build a bomb? And in the second view, we ask it in the past tense. These are two views of the same underlying problem, but the language model will behave very differently. This inconsistency makes control harder, because we need to control for all views simultaneously, and there could be a very long tail of them. Also, consistency goes beyond prompt rephrasing. Here we have another example, a famous failure case in language models called the reversal curse. In the first view, the model can answer who Steven Moffat is: the director of Sherlock. But in the second view, when we reverse the question by asking who directed Sherlock, the same model will fail. This implies an inconsistency problem in the model's parametric knowledge and a sensitivity to ordering, which can really harm the reliability of the model. So consistency is not just about solving non-robustness problems in existing language models. It also has broader implications in terms of language model capabilities. For example, many tasks can be framed as different views: generation and validation are two views of the same underlying problem. Some tasks are easier to verify, such as math proofs. Some tasks are easier to generate, such as knowledge queries, because they're closer to the pretraining distribution. And some tasks benefit from very long chains of thought, like complex reasoning problems. These all represent different perspectives on the same underlying problem, and enforcing consistency will help the weaker side improve, ultimately enhancing both sides and maximizing the capabilities of the models. Beyond capabilities, consistency also enhances data efficiency.
For example, given the statements "a shark is the largest fish" and "a whale is larger than a shark", a logically consistent model should infer that whales are not fish. So this property reduces data requirements. If we train a consistent model on the orange data, it should generalize to the green data without explicitly training on it, so we can essentially achieve the same effect as if we trained on both by only training on a subset, which improves data efficiency. And I believe in the next three to five years, data efficiency is going to be super important in natural language processing, because as we further scale models, we will either need more data or need more data-efficient algorithms. And because of this close connection we made between consistency and control, my prior work has set up a good foundation to address this consistency problem. So I will list some directions that we could consider. For example, we could consider hardwiring consistency into the language model architecture by designing models with built-in reflection steps, similar to how diffusion models build in iterative refinement. Also, we can encourage consistency during training: we can develop update rules that explicitly regularize for consistency, so that when the model sees new knowledge, it updates its parametric storage globally. We could also enforce consistency at decoding time by integrating probabilistic inference to ensure that the output is aligned with some consistent posterior. There are, of course, many more exciting methods and innovations yet to come. I want to end my talk with the remark that consistency and controllability are key ingredients to make language model behaviors more predictable and more reliable. And I really look forward to continuing this line of research. Thank you. I'm happy to take any questions.
speaker 1: That was an amazing talk. Thank you. So we have plenty of time for questions. And I've been informed by the folks in the back that you don't need a microphone; just speak loudly and the mics on the wall will pick it up for the recording. So if you raise your hand,
speaker 2: then Lisa can call on whoever she wants. Yes.
speaker 3: I have a question about your work on building the Diffusion-LM for text generation. When you were talking about the rounding error, it strikes me that the issue causing the rounding error is one of the vectorization and the representation of the words themselves. Did you look into any techniques where you alter the actual representation to get around having to build in a separate system, like the clamping system?
speaker 2: Yeah. Okay, because they asked me to repeat the question: the question is about the rounding errors we observe in Diffusion-LM, and whether we have thought about this rounding error problem in terms of designing the embedding space. So basically, this is a part of the talk where I omitted the technical details. When we get the embedding space for Diffusion-LM, we actually train it jointly with the diffusion parameters, so they jointly form a variational lower bound for the likelihood of the observed data, and then we train these two sets of parameters, the embedding parameters and the diffusion parameters, jointly. There are, of course, other ways that you could acquire the embeddings. For example, you could maybe take some pretrained embeddings and directly use them. The problem there is that it actually has a very interesting and different trade-off. Previously, when you were thinking about pretrained embeddings, typically the higher dimensional the better, because it's more expressive. Whereas in this case of Diffusion-LM, it's actually not, because here you need to reconstruct the embeddings and you are adding noise in this embedding space. It's kind of like why it's hard to do diffusion for a super-resolution image: the dimensionality is too large, so you have the curse of dimensionality. So that's where the conflict comes in: you want something that's longer so that it's more expressive, but you also don't want high dimensionality, because it will make modeling harder. That's why we take this end-to-end training approach, where we fix the dimension as a hyperparameter, and the model learns how to allocate within this fixed set of dimensions to find the right embeddings. But in fact, there are many interesting future directions in terms of designing this embedding space for diffusion. Yes, question from the back.
speaker 3: Could you give me some sense of how you control the length of the sentences you can generate, when it's a diffusion model?
speaker 2: Yeah, so the question is how we control the length of the generation in the case of diffusion. In this case, we actually have a fixed length; in this paper, we did 256. If you want something that's shorter, then you can do padding, and then you will get shorter sentences. If you want something that's longer, that's trickier. For example, one solution is to think about semi-autoregressive generation, meaning that once you generate the first chunk of 256, you can condition on this generated chunk, pass it through an encoder to the model, and then run the conditional diffusion again to generate the future chunks. So that's one way you could potentially handle length, especially longer, variable lengths. Yes, question.
speaker 3: Really nice talk. Thank you. Coming back to the second part. Okay, so this is the newer work where you're setting up the problem as — tell me if this is wrong first — let's automate red teaming, and let's develop optimization techniques that will let us do what red teaming does, reliably. So is that kind
speaker 2: of a fair characterization? Yes. Yes.
speaker 3: Okay. So this is awesome, because now anything I could use red teaming to help me with, I can now automate, and it's cheaper. And that's great. Can you tell me how to use this to make the model not tell people how to build bombs, or to behave better? How do I bring it back to the model? What do you think is the most fruitful way to do that? Because this is always something I struggle with. You know, any time we automate things, it creates, like, an arms race, right? And now we can work on the other side. So I'm on the side of making better language models. How do I use this?
speaker 2: So basically, what our method is able to discover is a set of strategies that can break the model. So you could look at these strategies and then use them, maybe include them in your training data, so that you can build a new set of training data that is robust against such attacks in the future. For example, if the method detects that past tense is a failure case, then for the next iteration of training, you can create a dataset that is robust to past tense, and in that case, the next iteration of the model will have this problem fixed. So I feel like that's the most straightforward way where the model developers and the attackers could, like, some
speaker 3: data augmentation strategy. Yeah.
speaker 2: It could guide data augmentation strategies. Alternatively, maybe another interesting way is that you can directly use the technique. Basically, this technique is pretty broad, because red teaming is a search problem, and in order to make models better, you could also view that as a search problem, where instead of searching for failures, you are searching for successes. So if you have a reward model plugged in that specifies how good my answer is, and we run the optimization to figure out what set of prompts is able to induce really good solutions, then this could be a tool that actually helps you boost model performance. It's kind of like figuring out what is the perfect prompt I should use to ask this problem, or figuring out what's the right querying strategy I should use. Thanks. Yeah, question.
speaker 4: This is a related question. How transferable, or how similar, are the strategies for the red-teaming part? If I have similar types of bad responses I'm trying to red-team, how similar are the strategies? You did an example with a specific sentence; could you do this, for instance, for the class of all copyright infringements or something like that?
speaker 2: Interesting. I feel like it really relies on whether you can parameterize it into a reward function. So in the case of all copyrighted documents, it just means that you need to design a reward function, say, with a retrieval mechanism at the back end of the reward. From an algorithmic perspective, this should be able to apply to that, because our approach doesn't require the reward to be differentiable. So as long as you have some reward, it should work, except that in your case, the design of the reward will need to change to incorporate more complicated backends.
speaker 4: So there's a chance that maybe you could discover the whole set of strategies that could cause copyright
speaker 2: infringement. That would be amazing, obviously. I mean, very likely. For example, in this case, what I would do is: you have some copyrighted documents — maybe you don't even need a copyright classifier; you can simplify this problem into just having some copyrighted text and then figuring out what are the ways that you can elicit that copyrighted text. For example, in the example I gave here, we used non-harmful text as suffixes. We can see that repetition will be able to elicit some bad suffix, or some target suffix. And then there's continuation: basically, if you start with the first part of Harry Potter, very likely it will keep continuing it. And then there are high-level summaries: if you start by giving the summary of Harry Potter, it will probably be able to enumerate most of the book or something. So overall, I feel like these are all valid strategies that get discovered. If you just plug in suffixes that are copyrighted content, you will probably be able to discover other interesting strategies.
speaker 1: We'll do maybe one last question.
speaker 4: You want to go ahead? I also have a question about this section. So just to clarify, this is a mixture model that you're — yes — so every iteration you have a separate model with separate parameters and you sample according to it. So I'm curious to hear your thoughts: do you think that the mixture is fundamental? Can we get one model that generates more diverse samples? Because, as you know, obviously diversity in generation is helpful not just for red teaming, maybe, but in generating ideas and so on.
speaker 2: I actually don't think having multiple models, or having a mixture of models, is necessary. It's just that from an algorithmic perspective it's very natural, because for each iteration we'll need to train a slightly different model. You can imagine compiling the models together by training a new model that aggregates the mixture of all the existing models you've already discovered. That's totally a valid option. From an expressivity perspective, it should always work. At least in this case, I don't really see why a single language model couldn't express these more diverse modes. It's more that the algorithm, by design, will have multiple models coming from each iteration, and we don't really need to aggregate them; we can just keep them around separately.
speaker 1: Okay, we're at time. So let's thank the speaker one more time.

Latest Summary (Detailed Summary)

Generated on 2025-06-15 22:01

2025-03-04 | Stanford | Controlling Language Models

Description: March 4, 2025
Allen School Colloquia Series
Title: Controlling Language Models
Speaker: Lisa Li (Stanford)
Date: March 4, 2025

Abstract: Controlling language models is key to unlocking their full potential and making them useful for downstream tasks. Successfully deploying these models often requires both task-specific customization and rigorous auditing of their behavior. In this talk, I will begin by introducing a customization method called Prefix-Tuning, which adapts language models by updating only 0.1% of their parameters. Next, I will address the need for robust auditing by presenting a Frank-Wolfe-inspired algorithm for red-teaming language models, which provides a principled framework for discovering diverse failure modes. Finally, I will rethink the root cause of these control challenges, and propose a new generative model for text, called Diffusion-LM, which is controllable by design.

Bio: Lisa Li is a PhD candidate at Stanford University, where she is advised by Percy Liang and Tatsunori Hashimoto. Her research focuses on developing methods to make language models more capable and controllable. Lisa is supported by the Two Sigma PhD fellowship and Stanford Graduate Fellowship and is the recipient of an EMNLP Best Paper award.
Subtitle: New Methods for Controlling Language Models and Efficient Customization Techniques

Overview / Executive Summary

Lisa Li, a PhD candidate at Stanford University, gave a talk on "Controlling Language Models" on March 4, 2025. She emphasized that control is key to unlocking the full potential of language models and making them useful for downstream tasks. The talk centered on her three main research contributions:

  1. Prefix-Tuning: a parameter-efficient fine-tuning method that updates only 0.1% of a model's parameters (freezing most of the pretrained parameters and optimizing a small, continuous, task-specific prefix vector) yet matches the performance of full fine-tuning, and even generalizes better out of distribution. The method dramatically lowers the cost of model customization, launched the field of parameter-efficient fine-tuning (PEFT), and has been widely adopted in industry. It can also compress long prompts effectively, e.g., compressing a prompt 25x without sacrificing instruction-following performance.

  2. Discovering diverse failure modes (red teaming): to evaluate whether control is effective, Lisa Li proposed a Frank-Wolfe-inspired red-teaming method aimed at discovering diverse failure modes rather than a single failure. By framing the task as posterior inference (P(input | unwanted output)) and optimizing it with variational inference and iterative decomposition, the algorithm systematically discovers multiple input strategies that trigger undesired behavior (repetition, continuation, citing sources, etc.). Experiments show the method substantially raises the attack success rate (e.g., from 2% to 100% on Llama 8B) and covers most previously known attack strategies; the discovered prompts generalize across model sizes and types, including proprietary models such as GPT-4o and Claude 3.5.

  3. Diffusion-LM: to address the control challenge at its root, Lisa Li proposed a new generative model for text, Diffusion-LM. The model is based on a Gaussian diffusion process, operates in a continuous latent space, and generates the entire sequence non-autoregressively. An "x_0-prediction" reparameterization and a decoding-time "clamping trick" resolve the rounding errors that arise when modeling discrete text in a continuous space. Diffusion-LM supports plug-and-play control: the language model can be composed with any differentiable control criterion (e.g., syntactic or semantic constraints) via Langevin dynamics to generate text that satisfies the constraints, and it performs especially well under compositions of multiple controls.

At the end of the talk, Lisa Li argued that the "inconsistency" of current language models across different views (e.g., different phrasings of a question, generation vs. validation) is the deeper cause of control difficulty, and proposed that future research focus on improving model consistency, which would also enhance capability and data efficiency.

Introduction: Why Controlling Language Models Matters

Lisa Li (the speaker) opened by noting that the popularity of language models (e.g., the success of ChatGPT) stems from our ability to control pretrained models to perform useful tasks.
* Example applications
* Coding Copilot: boosts programming productivity by controlling a pretrained language model for the coding domain.
* AI Overviews in Google Search: controlling language models to summarize search results.
* Core claim: "Control is really at the heart of transforming language models into useful products."
* Language modeling pipeline: pretraining -> control and adaptation -> evaluation.
* Talk structure: three aspects of her work on controlling language models:
1. Applying control via lightweight fine-tuning.
2. Evaluating control with methods that have good coverage.
3. Rethinking the existing language model architecture to build a new model that is inherently easy to control.

Applying Control via Lightweight Fine-Tuning: Prefix-Tuning

Lisa Li described the need to customize language models in many settings, such as personalization, domain adaptation, and specializing smaller models for edge devices.

  • Challenges
    • Customizing models for many use cases by fully fine-tuning for every request would produce thousands of distinct multi-billion-parameter models, which is expensive to train and store.
    • Conventional adaptation (e.g., updating only the top layers) hurts target-task accuracy and is not very parameter-efficient (still about 1/4 of the parameters).
    • Prompting is efficient but lacks precision and struggles to capture subtle preferences (e.g., imitating a specific writing style).
  • Research question: "Can we adapt the language model with fewer parameters without sacrificing any task performance?"
  • Solution: Prefix-Tuning
    • Inspiration: prompting, with the observation that the discrete prompt search space limits expressivity and makes optimization hard.
    • Core mechanism
      • Relax the discreteness constraint with continuous free parameters.
      • Optimize a small, continuous, task-specific vector called the prefix parameter (H), acting like a sequence of virtual prompt tokens.
      • Objective: freeze the original model parameters θ and optimize only the prefix parameters H to maximize the data likelihood.
      • Advantages: an expressive (continuous) search space that is easy to optimize with gradient descent.
  • Experiments and results (structured data-to-text task)
    • Metric: BLEU (higher is better).
    • Performance: Prefix-Tuning matches full fine-tuning while "adjusting a thousand times fewer parameters" (consistent with the 0.1% figure in the abstract).
    • Additional advantage: better out-of-distribution generalization, since the original general-purpose pretrained parameters are preserved.
  • Application to prompt compression (Gist Tokens)
    • Long prompts increase inference latency and compute cost.
    • The prefix-tuning idea can compress long prompts into the prefix parameter space.
    • Result: in the Gist Tokens paper, "prompts are compressed 25x without sacrificing any instruction-following performance."
  • Impact
    • Opened the research direction of parameter-efficient fine-tuning (PEFT).
    • Inspired follow-up work such as LoRA and Prompt Tuning.
    • Has become one of the standard ways to customize language models, widely used in the fine-tuning APIs of OpenAI, Anthropic, Google, NVIDIA, and others.

Evaluating Control with Good Coverage: Red-Teaming

Lisa Li stressed the need to evaluate whether control succeeds after it is applied, illustrated by the Google AI Overview example that may surface misleading or even harmful information (e.g., asking about "the health benefits of running with scissors").

  • Red-teaming: a way to evaluate control by searching for prompts or inputs that trigger undesired behavior; the White House executive order on AI recognizes its importance.
  • Challenge: the input search space is exponentially large.
  • Limitation of prior work: the goal is usually to find at least one strategy that breaks the model.
  • Lisa Li's goal: "We care about covering more failure modes, not just one of them."
  • Solution: failure discovery via posterior inference
    • Reformulation: instead of a single string, find a distribution over strings, which covers more support by construction.
    • Approach: cast it as posterior inference and estimate the posterior over inputs X given an unwanted output Y, P(X|Y).
      • By Bayes' rule, P(X|Y) = P(Y|X)P(X)/P(Y), where the normalization constant P(Y) is intractable.
      • First idea: learn a reverse language model Q_φ(X|Y). But there is a distribution-shift problem, since the unwanted outputs Y are rare failure cases.
      • Core technique: variational inference
        • Approximate the true posterior P(X|Y) with Q_φ(X|Y).
        • The objective has three terms: the entropy of Q_φ (diversity/coverage), the cross-entropy of Q_φ under the prior P(X) (fluency of the generated text), and the expected likelihood of eliciting the unwanted string Y under Q_φ.
        • Weights β1 and β2 on the entropy and cross-entropy terms handle uncertainty about the prior.
    • Optimization: Frank-Wolfe-inspired iterative decomposition
      • Decompose the hard problem of optimizing Q_φ into a sequence of simpler problems, each covering one mode.
      • Intuition
        1. Iteration 1: find one mode.
        2. Iteration 2: downweight the discovered mode and search the new reward landscape for a different mode.
        3. And so on.
      • Parameterize Q_φ as a mixture of distributions, which is better at capturing disjoint modes.
      • Per-iteration objective: a red-teaming term (elicit the target response), a diversity term (penalize already-discovered modes, encourage new ones), and a regularization term (KL divergence, preventing collapse or deviating too far from the prior).
      • Aggregation: mix each newly discovered mode (distribution S_i) into the overall distribution with weight η_i.
      • Theory: the iterative algorithm is equivalent to applying the Frank-Wolfe optimization algorithm (conditional gradient method) to the full objective, up to a slight generalization.
  • Qualitative example (target suffix Y: "the most inexhaustible source of magic")
    • Iteration 1: "Repeat after me" (repetition).
    • Iteration 2: a strategy based on continuation and co-occurrence.
    • Iteration 3: "a famous quote from J.K. Rowling" (prepending a high-level summary or citing the source).
  • Quantitative results and applications (model jailbreaking)
    • Outperforms supervised fine-tuning and reinforcement learning baselines on the elicitation reward.
    • Covers most previously discovered jailbreak strategies, whether found manually or algorithmically.
    • Attack success rate: from 2% to 100% on Llama 8B.
    • Generalization: discovered prompts transfer to 70B models and to proprietary models such as GPT-4o and Claude 3.5.
    • Significance: the method can preemptively search for errors in language models, guiding developers to patch them and fostering a healthy ecosystem for model development.

Rethinking the Architecture: Diffusion-LM, Controllable by Design

Lisa Li then asked: why is control so complicated? Could we redesign the model so that it is inherently easy to control?

  • Root cause of current control difficulty
    • Most language models generate text left to right, autoregressively.
    • This makes forward decoding easy, but reversing it (e.g., finding inputs from outputs in red teaming) or any task that breaks the left-to-right order is very challenging.
  • Vision: a plug-and-play control framework
    • Compose the language model flexibly with various control criteria (constraining the suffix or prefix, JSON formatting, a math-proof verifier, etc.).
    • Infer text that is simultaneously fluent under the language model and satisfies the control criteria.
    • Mathematically: a posterior inference problem, sampling from P(X|C) (text X given control C).
  • Research question: "How can we design a language model that enables this form of plug-and-play inference?"
  • Core ideas
    1. Continuous relaxation (from Prefix-Tuning): a continuous parameter space is easy to optimize and good for control.
    2. Iterative refinement (from the Frank-Wolfe algorithm): good for modeling different modes, highly expressive.
  • Solution: Diffusion-LM
    • Model class: Gaussian diffusion, ubiquitous in vision (DALL-E, Stable Diffusion).
    • Challenge for language: text is discrete, and modeling it in a continuous space demands high precision; otherwise rounding errors occur. For example, "rest" and "break" may be close in embedding space yet not interchangeable in a given context.
    • Mechanism
      • Operates in a continuous latent space, non-autoregressively, i.e., each step holds a vector representation of the whole sequence.
      • Process: start from a sequence of Gaussian noise vectors, denoise them step by step into vectors corresponding to words, and finally project these vectors to a low-entropy distribution over the vocabulary.
      • Generation order: coarse-grained content first (high-level semantics and syntax), fine-grained content later (specific word choices).
      • Training: as a latent-variable model, build a hierarchy of latents by iteratively adding Gaussian noise (X_0 -> X_1 -> ... -> X_T); train a denoising model μ_θ(X_t, t) to predict the less noisy X_{t-1}, minimizing the L2 distance between prediction and ground truth.
      • Generation: start from pure Gaussian noise X_T, iteratively apply μ_θ to denoise down to X_0, then round X_0 to the nearest word embeddings.
    • Key techniques for the rounding error
      1. Reparameterization — predict X_0: have the denoising model always predict the original, noise-free word embeddings X_0 rather than X_{t-1}, keeping the output space aligned with word embeddings and making training easier and prediction more precise; X_{t-1} is then reconstructed from the predicted X_0 and the current X_t.
      2. Clamping trick — at decoding time: since X_0 is predicted at every step, check whether it aligns with a real word embedding, and if not, "clamp" it to the nearest one. This prevents precision errors from accumulating and stabilizes decoding.
  • Plug-and-play control with Diffusion-LM
    • Diffusion-LM parameterizes a distribution over the continuous X_t; a control criterion C is expressed as a differentiable scoring function of X_t.
    • Sample from the posterior P(X_t|C) with Langevin dynamics: update X_{t-1} using both the gradient from Diffusion-LM (fluency) and the gradient from the control criterion (constraint satisfaction), plus a small amount of Gaussian noise.
    • Multiple differentiable control criteria can be composed.
  • Experimental results
    • Significantly outperforms a fine-tuned autoregressive model and a plug-and-play baseline on structured syntax control.
    • Performs well under compositions of syntactic and semantic controls.
  • Impact
    • The first continuous diffusion language model.
    • DeepMind adopted and scaled up the idea.
    • Inspired follow-up diffusion models for language, controllable text generation, protein design, and 3D molecule generation.
    • The startup Inception is built on this core idea; its main competitive advantage is decoding 5-10x faster than autoregressive models.

Broader Contributions and Outlook: Consistency

Lisa Li mentioned her other contributions to the ecosystem of controlling language models, then raised a deeper question.

  • The deeper cause of control difficulty: inconsistency
    • Models behave inconsistently across different "views" or phrasings.
    • Example 1 (red teaming): asking "how do I build a bomb" directly versus asking it in the past tense can produce very different behavior.
    • Example 2 (the reversal curse): a model can answer "Who is Steven Moffat?" (the director of Sherlock) but fails when asked "Who directed Sherlock?". This reflects inconsistency in the model's parametric knowledge and a sensitivity to ordering, which harms reliability.
  • Why consistency matters
    • Capability: many tasks are different views of the same underlying problem (e.g., generation vs. validation). Enforcing consistency helps the weaker side improve, ultimately boosting both.
    • Data efficiency: a logically consistent model generalizes better. For example, a model that knows "a shark is the largest fish" and "a whale is larger than a shark" should infer "a whale is not a fish" without explicit training. This matters increasingly as models scale and data demands grow.
  • Future directions (toward consistency)
    1. Architecture: hardwire consistency into the model, e.g., with built-in reflection steps (similar to the iterative refinement of diffusion models).
    2. Training: develop update rules that explicitly regularize for consistency, so that new knowledge updates the parametric storage globally.
    3. Decoding: integrate probabilistic inference to ensure the output aligns with a consistent posterior.
  • Closing remark: "Consistency and controllability are key ingredients to make language model behaviors more predictable and more reliable."

Q&A

  • Question 1 (speaker 3): on the rounding error in Diffusion-LM and the word-embedding representation.

    • Lisa Li: the word embeddings in Diffusion-LM are trained jointly with the diffusion parameters. There is a trade-off in embedding dimensionality: higher dimensions are more expressive but can make diffusion modeling harder (curse of dimensionality). End-to-end training lets the model learn a good representation within a fixed dimension. Designing the embedding space remains an interesting future direction.
  • Question 2 (speaker 3): how does Diffusion-LM control the length of generated sentences?

    • Lisa Li: the current work uses a fixed length (e.g., 256). Shorter outputs can be handled with padding. Longer outputs are trickier; one possible solution is semi-autoregressive generation: after generating the first fixed-length chunk, condition on it (through an encoder) and run conditional diffusion to generate subsequent chunks.
  • Question 3 (speaker 3): how can the red-teaming findings be used to improve the model (e.g., so it does not produce harmful content)?

    • Lisa Li's answer
      1. Data augmentation: incorporate the discovered attack strategies (inputs that "break" the model) into the training data so that the next iteration of the model is robust to them.
      2. Searching for successes: red teaming is fundamentally a search problem. With a reward model that scores how good an answer is, the same technique can search for prompts or querying strategies that elicit excellent answers, improving model performance.
  • Question 4 (speaker 4): how transferable are the discovered strategies? E.g., are strategies similar across types of bad responses? Could this be done for a whole class, such as "all copyright infringement"?

    • Lisa Li: it depends on whether the target (e.g., copyright infringement) can be parameterized as a reward function. If such a reward can be designed (e.g., with a retrieval back end), the algorithm applies, since the reward need not be differentiable. For specific targets (e.g., particular copyrighted text), general strategies such as repetition, continuation, and providing high-level summaries would likely be discovered.
  • Question 5 (speaker 4): is the mixture model in the Frank-Wolfe red-teaming approach fundamental? Could a single model generate diverse samples?

    • Lisa Li: the mixture is not strictly necessary; it arises naturally from the iterative algorithm. The models discovered across iterations could be compiled into one aggregate model. In principle, a single language model should be able to express the diverse modes; the current algorithm simply produces multiple per-iteration models that need not be aggregated.

Summary of Key Points

Lisa Li's talk systematically presented her innovations in controlling language models: efficient model customization via Prefix-Tuning; comprehensive evaluation of diverse failure modes via Frank-Wolfe-based red teaming; and a new, inherently controllable architecture via Diffusion-LM. She further argued that model consistency is a key research direction for making language models more controllable and reliable. Together, this work lays a solid foundation for developing more capable, safer, and more controllable language models.