2025-04-08 | Stanford CS25: V5 I RL as a Co-Design of Product and Research, Karina Nguyen

演讲者详细介绍了人工智能产品设计与强化学习研究相互协同的重要性，并通过多个实例展示了AI在教育辅助、交互式工具开发、前端应用及图像生成等领域的创新应用。她强调通过原型快速迭代和用户即时反馈构建更符合实际需求的评估系统，同时探讨了从预测生成到链式思考以及模型自我校准等技术演进路径，表达了对未来AI赋能人类创意与协作的乐观期待。

视频科技管理商业

媒体详情

上传日期: 2025-05-18 15:21
来源: https://www.youtube.com/watch?v=gLwiPrwUDJ8
处理状态: 已完成
转录状态: 已完成
LLM 提供商/模型: openai/gemini-2.5-pro-exp-03-25

转录

下载为TXT

speaker 1: Welcome to the second lecture of our current offering of cs 25 transformers United. So this will be the first lecture featuring an external speaker. So the rest of the course will have folks come in and talk about you know the state of the art and cool research they're doing as part of their work. So we have an exciting lineup of speakers for you guys for the rest of the quarter. So I'm delighted today to have a Karina from OpenAI. So she works on both product and research there and also previously worked at anthropic. So I'll let her take it from here.
speaker 2: So can everybody hear? Cool. Let's see. So before .
speaker 3: I start the stock.
speaker 2: I would love to set a tone for I'm not here to lecture you, but I would love to have much more collaborative and interactive session. And I do hear I've heard a lot of you know, people kind of concerning about AI and the future of AI. And I understand that the development of a gi is both carary and exhilarating. And if anything, I would love to get for you to get out from the stock is the fact that I think everybody can have very meaningful future and everybody can like build something really, really cool the eye. And as a part of this journey, cool. So before I started with a talk, theybe all about how arrais a co design of product and research. And you know we are shifting both in the labs, we are shifting the focus towards more of a frontier product research, where the tasks that we teach the models are becoming much more real role tasks. So and before we get started, I'll just like share some of the vignettes of like what I'm really, really excited about the future of AI and what the AI is capable right now that can inspire everybody. So the first one is actually education. I'm actually really, really excited about the fact that AI can democratize education. Here's an example of me asking ChatGPT to explain the Gassian distribution. And then with in canvas, it creates know the code itself in order to visualize. So it explains in text. And then I ask you to write in code, and then you can render the code in order to visualize it. So in a way, it becomes much more a little bit personalized. The second demo is also something that a lot of people might want to use it. So let's see if you take a screenshot from the paper and you want to understand what's going on, and the model can go and explain in a separate canvas. That's like another feature in chagpt. And then then you can start to have much more interactive conversation with the model by selecting very certain like things and try to like ask more questions. You follow up questions. So in a way, when ChatGPT came out to be first in 2022, it was purely conversational ui. It was mostly chat. And as the use case has kind of grown from the product itself, you know there was no expectation of how people would use ChatGPT at that time. But over time, you've seen a lot of people started using it for co generation, for writing and long form writing. It became very obvious that the interface of Chad was kind of very limited. So canwas was kind of like the first attempt of our team to break out from of the cycle in order to allow people to have much more fine grain collaboration with AI. And actually today, antarpic published a small report around the education and how people use clad for education. And it's very interesting to see the correlation between the sphelor 's use and the way people use use cases. So you can see there's a huge difference between, like computer senor, another thing that I'm really, really excited about, the AI's, is that anyone can create their own tools for themselves, with their friends, with their family, or even run their business. And the models now can generate a lot of front end code, and then you can render that code inside the canvas and then iterate on it and have much more visual representation of things. You know on Twitter, I've seen a lot of people creating like very, very personalized and customized tools that they really, really want, even like games like chess. And with recent image image generation model from open enai, you can literally sketch anything with your own hands and then recreate or kind of realize what you're dreaming of within an image and in the style that you really, really want. So I do really hope that the creativity of the human creativity and how AI tools can help anyone to become creative or be an artist in a way that was not possible before. Another thing that I've tried actually on the mobile is to create a mini game, and you can easily do that in canvas. So I asked to just like generate me you know a react app of interesting game because I'm not playing. And then the model does that. And I do hope that in the future, instead of me prompting it to generate a game, it will kind of like be more proactive and have this kind of personalization as AI become more of a companion to humans. And an exciting thing about chgpity is that the compositionality of different capabilities that can allow you to augment human creativity. So here I asked to generate an image of the ui and then, sorry, generate the image of ui and then ask the model to kind of implement those. And you can do that literally in canvas right now. And like it can render. It's not it's just like front end. It doesn't have full kind of stack engineering. But you know you can do some of the things like this. You can compose different tools and compose in a way that was never possible before. So I do hope that ideas like this will be very, very powerful and everybody in this room can play with those things. And actually, how did we get there? And I think there are like two main scaling paradigms that helped us to get there. The first one is next, talking Ken prediction, obviously of really the pertaining scaling paradigm, where the model kind of becomes like a world building machine, understanding the world, a much better scale, you know, but it becomes the next prediction works really, really good at certain tasks, but it becomes harder at tasks writing. If the model predicts a wrong next talk, and it can kind of the coherence of the plot will just be lost in the pertaining stage. And maybe you kind of want to recover this in reinforcement learning. And the next second paradigm that has happened is a row on a chain of thought for more complex task. And this is derreasoning work from urban AI. And now it's being adopted by a lot of different labs. And I do think this is scaling around itself. Self is another paradigm that we can train the models on the axis to and especially for the real world's task that was never possible before. So all the agentic work agents, like operator deep research, other agents are kind of like trained on this kind of new paradigm overall on the gene thought. So Yeah, I want to finish this section of like vignette of like Yeah build, create, build and make something wonderful in this world. And they hope like people be more inspired rather than scared that AI is going to take their jobs or kind of remove their creativity. Instead, I feel like people can become more powerful with their imaginations with these tools. So I currently work at opeye. Before that, I worked at anthropic. And I was kind of like at dentiintersection of project and research. When I first came to antarpic, I was a product engineer, and then over time, I switched to research engineering. So my background is mostly both in product and research now. And what I've learned over and over across different projects is that there are two kind of main ways of Building Research trenproducts. And the first one is when you have unfamiliar capability of the model, your job might be to create familiar form factor for that unfamiliar capability. And in examples of that could be you know chagpt was like this or 100 context acts from quad was like this. I'll share what's type different to this. Actually, the early prototypes that I've done before joining any lab was actually working with clip, which is a contastive model. And it's like image, text to image basically. And I find tune clip a little bit on the images that I was excited about. And I think this is the prototype of fashion. I think some people, you know people kind of like it went viral on Twitter end. And I feel like that's because people really found some usefulness. If you bring some like clip technology into some form facthat, people would like. And so I think a lot of the product work is around that like this. It's like if the model has a certain capability that was never possible before, how do we create this new form? Fact? The same happened with 100k context when Claude started being able to consume the entire books. You can imagine to have various form factors, file uploads is just like very general and something really familiar to people that can just like dump the entire documents into Claude and ask follow up questions. But you can imagine other form factors for this, like infinite chats. And in a way, it's like infinite memory as a way to kind of like take this 100 kcontext into some form factor. So you can like exercise this product thinking by thinking of like novel ways that people can interact with this normal technology. Another example, this is more speculative, it wasn't deployed anywhere, is if the model has kind of a sense of self calibration, or it's also called pi o, if the model knows the confidence of the answers that it knows. So for example, if I'm really, really confident in this claim and then my confidence may be 85%, then maybe there is a way for the interface to highlight that maybe the more highlighted versions is much more confident, kind of much more confident claims and less highlighted as kind of less less confidence. So in a way, you can imagine if the model Yeah you can like think of, okay, if you train the model to be really good at like south collaborbration, how do we then represent that to the humans and if if anything will be useful for humans? Same happened with when we were trying to deploy zero one preview. The chain of thought itself is a very alien thing and a lot of people were, okay, how do you bring along humans with models thinking, right? If the model, if the human needs to wait for chain of thought for like two minutes or five minutes. That's kind of boring and people don't. I think it's just like, moover. How would people perceive the model thoughts? And one thing that we did is actually we wanted to create a streaming kind of interaction. So the model would always stream its thoughts ephemeral, and we train the model to do that. So in a way, there was chain of thought as an alien technology is an alien artifact of the model. And then trying to figure out how to we best sprto the humans is another way of thinking of building product. Okay? So another way, the second way of Building Research during products has actually started with a deep belief in what you want to make with either from a product perspective or kind of like vision and to literally make the model do that. Okay? So I feel like this is more of a common thing. So we can go through some examples. Before antarcic, I worked at the New York Times, and in a lot of ways, we were thinking about how to represent information to people and like how do we add the layer of context through the product and coverage? And at that time, if we were working on like elections and we only had tools like nlp, but you can imagine this idea, this concept could have extended, given the current tools of AI to have much more dynamic representation or dynamic uis that people could consumed the content better. Same when I was working on this product, the command line, the terminal, the new terminal, I think it would be much, much richer if you could have integrated auto completion or some of the benefits of GPT -3 at that time into the product itself. And so we wanted to have it was a vision of like having much more humane command line for people to for junior engineers to actually be more well equipped. Yeah. And then like early prototypes around it was GPT -3, even though it was just next token prediction. How do I make a writing ide with GPT -3? And at that time, it was just you know trying to whenever I type it would auto complete my thoughts almost. So those ideas were kind of present early days with the technology was coming out. And Yeah and I feel like when I was at antarabrica, I kind of realized that and actually if you want to create like new interaction paradigm of interfaces, you actually need to train the model for that. Another example of what we did with clause, which I don't think a lot of people know, is when clad generates titles, it actually has some mico personalization. So it takes the style, the writing style of the user and then generates the title in the same style of the user. So you can imagine, sort like interesting microro ization that you could create within the products themselves. And another project was Claude in slack g, and that was division of Claude becoming a first virtual teammate. It was back in 2022. And you know slack was a very, very natural workspace where people collaborate. And you can imagine a cloud model to be able to jump in to do the threads and suggest new things. Or sometimes, you know clouude was really, really good at summarizing what's going on in fitthis channel at the high volume content. And so this was the first kind of vision and the first attempt of making Claude being a virtually superpersistent and being able to use different tools. And one of the first projects I opened, AI, was around canvas. So that was in the same spirit of breaking out from the child interface to something different. And we wanted to create much more human AI collaborative and flexible accordance that will scale with new minalities. So canvas is not just like a thing that you can write into this. It's also a thing that the model can write into this. And the model can render code and another model can check, and you can create the interfaces that will scale with new model capabilities and other tools that will happen. So one thing interesting things about canvas is that we actually posttrained the model purely on synthetic data. And we live in the age where a lot of most powerful reasoning models can be distilled via apis. And didistillation is a very powerful idea of having a student, a teacher, and have a teacher to teach things to a small model. So we've trained you this model to become of a collaborator. But what does it mean for a model to become a collaborator? How do we actually make evolves with this? Right? So one thing that we wanted to kind of decompose is that teaching the model to use a tool is kind of different behavior from having a model to be proactive or act as a collaborator. So teaching a tool in itself is also have a nuanced kind of behavior. So one thing that we had to calibrate the model on is went to entirely rewrite the document versus when to if you look at the canvas, sometimes the model can just specifically select sections and delete those. And we write them in a much more fine grained way, instead of like rewriting everything when know to create code in canvas versus ask like a Python tool call. So it's like different tools. Compositionality is happening. So it's a lot of work around teaching the behaviors for the models for this. And you can employ different techniques for this. And I can share a little bit more how we did with Claude. So another project that we did was it's called tasks. And you know everybody is familiar with the idea of having reminders or to do lists, and the model can now maybe schedule tasks for you. But the most important thing about it is, is the tasks itself is very, very diverse. And it's not just just a reminder of your to do list. It can create stories for you every day, or it can continue the story from the previous day. So you can imagine this like the modularity of tocompositions, but it's very powerful in the product. Okay. So let's go to .
speaker 3: the case study of the .
speaker 2: model behavior in more specifically. So I really want to like dive deep into the idea of how do we shape the models and why. How do you posttrain in the models on the behaviors that you want? And you know I think to be more specific and be grounded in the real world use case, I'll share more how we might want to think about shaping the kind of behavior of the model around refusals. So you know let's have it's also like the second way of making the product where we have this kind of vision of how the model should behave. And that vision is also grounded by the cross functional collaboration between different teams and how you want tomorrow to respond to various users. So one particular thing that you can imagine is like the morshould have more opinions by risk caveats. It should give opinions by risk caveats. So what does it mean actually? So the model maybe needs to be more decisive when asked direct questions. And let's say one thing that we've seen with rohf is that the model is very sycophantic pack. Like early days like 20, 22, 23, the model would just agree with everything you would say. And so how do we actually teach the model to be more odd to do that and be more nuanced? Another thing that was annoying is that, like the model says, I don't actually have a point of view and it's just willing to chat with people. But actually, maybe tomorrow should have some views on certain things. Yeah. And then sometimes tomorrow morcould indicate when things are just opinions and notes its own biases and inconsistencies, and it should acknowledge, kind of have like self knowledge of what it knows and what it Thinis. Right? Not right. And I think the model, like back Chinese rato cloud 1.3 didn't really have any thoughtful responses for things like philosophy, like ethical questions. So you can imagine to push in the model on the behavior therebe much more nuanced than the philosophical questions. Yeah. And like other behaviors that you might want to encode in your model, so you kind of like list out all the various things that you would like a model to behave. So maybe like the model can have better knowledge who it is and what it's capable of because we've seen in the product a lot of people just like us them are all about its features and oftentimes it doesn't know. Yeah and let's dive deep into the most specific example. So cloud 2.1, I don't know if anybody remembers, but when it was launched, I think it was kind of had the issue of like obrefusals Yeah like 2.1 would refuse tasks that superficially sounded harmful, but actually we're not. And you know it wasn't just caused by a single sort of data. So we had to like investigate. And you know we knew that it was fixable because for some reason, something was 2.1 led to some refusals on benign prompts than 2.0. So you kind of have like a really good baseline for experimentation and debugging. And actually, the way you're debuthe model behavior is actually very similar to how you would want to debug software. And the way we would approach this is, okay, how do we actually craft this monia on results? So the first principles that they had is like maybe the model should assume terrible interpretation of what the person is asking without being harmful. So instead of you know craft a dialogue between two characters who are planning a complex heist, the model would refuse because it's not comfortable with that albut. The model should have much more terrible interpretation. And you know this is create a writing prompt. So probably you should respond. Another principles that we've kind of like thought about is like how do we actually use non verbal communication, nonviolent communication principles? Maybe the model should refuse more like I statements instead of and take a responsibility for its own refusal. Instead of saying you statements or any judgment to the user, ask if the user would be willing to make some changes so clouude can be more comparable with its boundaries. That's another kind of like very nuanced behavior we wanted to teach the model. So you know then the model needs to know what its boundaries are. And this is like much more of a meta kind of posttraining waryeah and also like technologies impact. Like I know this may be annoying to you. And this is like much more empathetic answer than saying, you know I don't want to respond to this. And then you know I think we've kind of came up with different like refusal taxonomy. There are like benign over refusals, unlike homeless parms. There were creative writing of refusals. And there were actually some of the interesting refusals were around tool calls or function calls. It might have access who had a tool to view a note but actually would say, I can't see a note. And why is this happening? Yeah so other Yeah other taxonomies and refusals that I was like you know long document attachments. If I upload the document, it would just say, like I don't have a capability to read this document. Like why? So something in the data might have been causing this and misdirected refusals when they you know it kind of like took interpretation of the user and it should have had like much more charitable view of the user during so you kind of construct this like categories of various refusals because you know if it's in every behavior or like capability that you want to like posting the model you have like various use cases and various edge cases and you kind of want to be approached with an more nuance and Yeah. Like so the first thing in every research project is like how do you actually what evils do you build to trust? And like what evils we would trust, and there are for subjective things like this, or for class, obviously evils for certain class of tasks, like math would be very, very different from what is here. So in terms of refusals, how did we construct the eils? Well, first is like product of libill, right? Like we have all the users of manually collected prompts that would induce refusals. We can also, like scenderally, generate diverse prompts on the borderline between harmfulness and helpfulness. And those are prompts are around like creative writing, like age. You create a writing is what we call it. And you can also use other evils. You kind of want to construct a suite of evils, like access test those like 200 non malicious prompts, wild child data sets, a collection of diverse user childbirth interactions for being used requests, topic switching, political discussions. So you can also use some of the open source benchmarks Yeah. And I think like general approaches, not particular what has happened was called, but like more of a general approaches to the model behavior prestraining is that you know you kind of want to like look at the data, clean up the data. You might want to consider either have like collect targeted human feedback collection for supervised fine tuning or preference modeling, reward modeling, or you might not want to do that because human feedback is very costly. And now, especially with reasoning models, right? You might want to just not have any human feedback. And control was like synthetically generating some of the behavioral changes like a preference data to trathe world models and dul. So I think at that time, you I think this is like more employing constitutional AI principles of those antirefusal behaviors and create a preference data where you only change one particular feature within your pairs, let's say, of preferences you only want to control one, change that in that pair to have much more control over the reward model data. Because if you can dumb as simplest thing that you can do is like take a pretake response from model a and take a response of model b and prefer b over a, but it doesn't necessarily reduce of fithe sprious features that the model will learn that you don't actually want to. So you kind of want to craft this distribution. Yes, this is mostly around like crafting the distribution of the data that you want and actually Yeah look at the data like you would debusoftware like each refusal might be caused by different data sets, right? If it's like a tool called refusal, maybe it came from some kind of self knowledge data that it would teach that the model doesn't have a physical body and the model might just refuse to set an alarm because it doesn't have a physical body, but actually it does have a tool to set an alarm, right? So it has some discontradictory data that might affect weird model al behaviors. Yeah also like long dog refusals like this. It was creative writing. Fushe was it's like moof a balanced, challenging acts between safety and harmfulness data and helplessness data in clathree moral Carthey actually wrote everything about this. Models that are trying ined to be more helpful and responsive to user requests may also lean towards harmful behaviors like sharing information that violates the policy. And conversely, when models is just over indexed and harmessness content towards not sharing any new information with users, which in the cell makes the model very unusable. So navigating this balancing act c is very challenging in the works. So Yeah, and this is the plot of what we did and the results of cloud 2.1 and clad three at that time. And Yeah, like you kind of want na like bobjoke, the response is, so when we ask the draft flfictional Sci fi novel about surveillance system, actually three would respond in the much mania it will respond instead of like refusing. Same here. Yeah mostly for creative writing tasks. Cool. My third section of the talk. So before we go before I jump into this, I would like to invite you if anybody has any questions or I'd want to like mumble along anything, I can just move on. Yeah.
speaker 3: I'm curious like what's your process that you guys all over your get? Let's see how some new thing you're trying to push out. How do you go for loging that particular teacher or behavior to actually like inducing that into models later? Yeah. Well.
speaker 2: you know so you might want to, you know if this is your project, then you might want to think of what kind of data you want to collect and how you would want to collect the data. And then you can take like the base config or train a model the same way as you know, let's say you want to make a change in like for o, you might want to take the for o model and add your data change and then retrain the model again and then see the effect on the evils that you will build. There are other much more cheaper approaches like incremental like training on top of the for o let's say Yeah so like using either sft. So like some of the choices that you will have to make is either you want na change in like supervised fine tuning stage or you might want na have retain reward model or create a new kind of like evaluate a greater for that particular task. And then you can create prompts in our own environment and then exercise those exercise that scale for the model and then see the model learns over the course of training. And if it doesn't, you know you then look at the bunch of like plots and you know your plot might go up, but maybe some other the plots might go down. So you kind of want to like calibrate and fix those or it's a very complex as more and more tools and as more things we teach, the model becomes much more kind of uncontrollable, almost cool. The third section is more about the point that how you construct our own environment and rewards is how your product will work. You know I think real world use cases is what creates the complexity, our environment. And the complexity comes from you know teaching the model to complete hard tasks. And oftentimes hard tasks require much more than just answering the question. They require tools like search code tools, computer use tools, reasoning over a long context, and kind of like reward design that you want to shape. And I think maybe it's obvious, maybe it's not obvious, but let's say, you know, as we teach the model, for the model to become very useful, we actually need to teach the models on useful things. And we can think of like, let's say we want na teach the model to be like a soft engineer, then okay, what does it actually mean? Doesn't mean it creates really good prs. Then your task distribution will be on that. And how do you evaluate what is a good pr? What is not good pr is actually isn't itself self can be also like a product thinking work same we can dive deeper into the creative storyteller, right? Like if you want to teach the model to be good at writing, actually, what does it mean for a human to be a good writer? Well, they actually need some kind of tool to draft and edit the thoughts and have multiple days to do that. Maybe the model should be able to do that. Maybe the model should have a tool there where you can edit and draft. And oftentimes for creatives, you people go observe the world for like along. Like sometimes they connect. Like the dots are connected in the very random times. And maybe you want to expose the model to never ending kind of search engine. Like the model should be able to always have access to the latest state of the world. And then maybe over the course of, let's say, a week of being exposed to latest thing that is happening in the world, the model can start reflecting on the world and like write something. And maybe that's much more natural process of writing than just like ask or prompt it, write about xaz. And we are shifting towards much more more complex kind of interactions within their own environments. So the multiplayer interactions is not just one user communicates with one model, but it might be the case that multiusers multiplayer collaboration that can happen with I'm model, if I'm a product designer and you're a product manager, be collaborating on something and we want to collaborate with an agent to make a new product. And this in itself is a task that you can learn in rl, but each user has different preferences and each user has different things. So Yeah how do you construct this environment is actually important. And the multi agengentic environments is more of the moral debate with each other, all deliberate on a certain topic to reach a conclusion. So you can construct, I think like multi agenent tic is more like alphagolic kind of environments where they get reward by achieving something together maybe. So I think in AI labs, I feel like we are also shifting possibly the focus from because we kind of like optimized so much on class of tasks that is like really easy to measure, like was like math competitive programming, right? And we shifting the focus maybe towards more subjective class of tasks that that is really, really hard to measure, but that are becoming much more important if AI models get social integrated in our lives and more specifically, let's say, emotional intelligence, right? The humans who use chagpt use it so much for things like coaching, therapy, emotional, but we don't actually have much open source evolves on this. And how do we actually measure that is actually becomes much more interesting question, like social intelligence in voice mode, right? I think it's one thing for moral to be intelligent in reasonlike MaaS ths and it's reasoning another thing. But another axis of intelligence is when I talk to a model and voice, it can actually suggest something really meaningful in my just say like, Hey, I noticed you did x yz. Maybe I should create a new tool for you. So I think it's a different kind of like social intelligence in that way. Another class of tathat I'm interested in is writing. Obviously, I think models, creativity writing is like really hard to measure because it's so personal and subjective. But it's interesting to think about, can we make those tasks a bit more objective? You know everybody loves like some kind of excfi novel, okay? What makes it really, really good? So maybe there are sort of like rules, technological rules, a consistency of the world that people really like or decthe development. So you can like decompose those subjective tasks into much more objective. Same with like visual design and aesthetics. And for the model to generate something really aesthetically interesting, it should know the basic principles of good visual design. So those are like much more objective. Yeah. And I think like this is more of a kind of new kind of product research that a lot of people are starting doing is creating new rl tasks. And this is more of a simulation of real world scenarios, leveraging in context learning, if you want na teach like a new tool or something, leveraging synthetic data wide distillation from stronger reasoning models, Yeah you can think of like inventing new model behavior and interactions, like multiplayer, and also incorporating product and user feedback during the entire process. Well, another axis of this work is actually around reward design. How do we teach the model? Like what kind of feedback would I want to give tomorrow model so that it will learn how to better operate in those real world scenario use case and be more adapt in social contexts. And actually, this requires real, quite deep product thinking, right? So we want to teach the model to have follow up, like meaningful follow up questions, but not like overly being annoying or something. So how do you reward for completing your tasks in a way that makes that will make a lot of sense in the product and will shape the project experience with the user in the future? And obviously, during that process, you will get into via are things like.
speaker 3: do you have a question? Kind of far for a lot of pieces. I'm just curious guys thinking about like kind of infurther the group worth common data from like a third group of people. I guess that will have some like internal reward, you know make decision and then you can use that sort to inject such an insight into a model. Yeah. I think .
speaker 2: there's are only various approaches to how you construct the reward, right? Like there was one you can construct very simple rewards. You can also like train a reward model that will as Yeah you can like train like some kind of like conververse reward modeling tasks. Yeah. It depends on the tasks, depends on the what kind of thing that you want to optimize the model for. Yeah. And I think an interesting thing that you will discover during this entire processes, reward hacks, which is very, very common in a al, and there are many, many different reasons why it's happening. I actually really highly recommend to read Lilian's blog about reward hacking in a al. It's very comprehensive. There's reward hacking basically when the morachieves high reward for things that it actually didn't do like, it kind of deceived. And especially right now as more in more systems, we use other AI models lms to be evaluators for policy models. Actually, an interesting, the most common reward hag is when a policy model tries to to think that deceive the evaluating model such that it will think that the policy completed a task or something. So here's an example of like the code patch tools. And the model is like to skip all the tesks. You can define this function that always skips. So and then it passes. And it's mostly interesting. There's a paper, recent paper from OpenAI around this, like monitoring reasoning models for misbehavior. And interesting finding is that if you actually don't want to optimize chain of thought on being because then the model will be much more kind of hided than intent. So Yeah, it's like a very interesting paper on like reward hacks. And especially with like more complex reasoning models, the complexity of reward hacks will also change and especially within in the software engineer. And you might not want to know what kind of code changed it made in order to create some certain vulnerability, right? So you need actually more create like newer forrences to have much higher, much more trustworthy or verification of the model al output. And this is like more of alignment problem too. Okay, I'm almost done. I think this is the fourth section. It's more like vignettes in the future of human AI interactions with how I think about things. Obviously, I made this graph like a year ago and nobody cares about mu anymore. But I think this is to communicate the fact that the cost of reasoning is drastically decreasing and it will only just be decreasing. And this idea of like raintelligence in itself, it's just become so cheap that I think anybody can create look really useful and really amazing things with those models at pretty much low cost. You know Yeah as I mentioned before, it's like when we are entering the age where it's really hard to verify AI because because I'm not an expert in, let's say, medical or financial analysis, like how do we actually teach create like new avowarces for humans to verify or edit models, outputs and help them to teach the models? I do think there's a cool future of like dynamic generative ui. It's kind of like invisible software creation on the fly. So let's say you talk to a model and you know I say like I want to learn more about the solar system instead of right now the model will just like output text. But I would want to believe that in the future it would be much more personalized. Let's say I'm a visual thinker and you moa listener. Like if you're a listener, it will create maybe a podcast, but if I'm a visual person, it might want to create me a picture or you know like a 3G, three gs visualization. Yeah. This idea of like this interface is like a emerale. It's like sophomoreres depends on understanding your intent and your context. It's like deeply personalized models. Yeah I'm also excited about personalized access to health thcare and education. I think it's really incredible that anybody can check their symptoms and with chapt or something and get like some advice. And also like with that also comes like some of the interesting consumer hardware in the future. And lastly, it's like our relationship to the process of storytelling. Well, for our change, like the way we tell the stories, the way we will create new novels, maybe cool writing with models, coscripting new films. I don't think like I think there will be a the new generation of creative people who will change our relationship with storytelling. And I would hope to think that the current creative creators don't. I'm not scared of the eye and more of open minded to use those tools for their process. Yeah, thank you. Some questions. Yeah. So thanks.
speaker 1: Karina for the very interesting talk. So Yeah now we'll open the floor to questions, a Q&A session. So we'll structure it more openly. So anything you want to ask her about research, product, things at OpenAI that she can talk about and so forth. And for the folks on zoom, feel free to ask questions in the chat. And so we'll take a mix of zoom and in person questions.
speaker 2: Thank .
speaker 3: you. Is there any category problems because like no short. Yeah I mean, I think .
speaker 2: Yeah creative writing, I don't think like anybody. Yeah like a lot of researchers are really like working on problems where it's very easy to evaluate, I think. And I think there is a class of problems. It's like subjective tasks. Like there's no open frontier benchmark for creative writing or emotional intelligence, right? But you can construct one if you want to. Yeah we can like brainstorm if that's helpful.
speaker 3: Evaluably, if can move much faster, many pieces that are going, things that are more effective jective, they're also important. Like would it be better just take all these searches and put them on the things.
speaker 2: Yeah, I think this is I think moving fast now is also a task that becomes like much more complex as we kind of like optimize everything out from kind of easy tasks. Now we kind of want to teach the models for more low like longer horizon tasks of software engineering or kind of automating like AI research writers and itself as a very hard tasks. And you know you have kind of have some milestones. Maybe you can create benchmarks to hit the milestones. Yeah. I do think this is like you know I think like this is all solvable things. And I don't think people should be worried about like moving fast, which is like slow because I feel like everything is moving fast those days. Yeah. I mean, I was thinking about building a startup before junior anthroic was that project was interearlier. Well, let's see. I actually really like one day I recently told someone is like I would actually build something around like particle collider. I don't know, or bioattack. Like I don't think we need larger particle collider than certain on then somebody should be building one Yeah. Because I feel like the models will be able to build any product that you imagine. Yeah, Yeah.
speaker 3: I would think major little. The .
speaker 2: bottlenecks are, Oh, the question is what's the major bottlenecks for pai? I don't know. I'm not sure if like fast execution should be solved with more people versus like using AI's now to help us move faster, if that makes sense. So I feel like that's a bottleneck. I think there was a lot of anything right now we like in the middle of figuring a lot of things out in order to like like I feel like in a year to that will accelerate us so much more. I feel infrastructure is actually one of the major, right? If you don't build infrastructure with multimodlike as a first class citizthen, obviously all the multimodal things will be much like more difficult. So Yeah, in a way, it's like infrara figurout like to prioritize at a given time because sometimes you can say yes to everything and then not having a focused time is also, I think, For recent takes on conof banover ber exploin, our one charges basically when the model checks the turns rence context and updates how you should score the same parameter based on that turn. Wait, I'm elto confused. My discussion, I think generally like robric based grading is very powerful. And you know as long as you optimize things that you want to optimize and there is no reward hacks, then it's good. How do you imagine the I will be used by creatives if not just genertiworks on their own? How will they be integrated in their creatives flow? Well, I think AI is just like right now. Maybe it's more like you would use figma like tools like Adobe. And I think in the future, it's more maybe it will be much more co creation with an AI rather than using it as a tool. Maybe we have like live brainstorming and create things on the fly and then publish it together. It's like much more of a companion work. Cool. Other questions.
speaker 3: Welcome of times that comchapter five world.
speaker 2: Tell us that I'm not popular. I cannot not think of any because the way you teach tomorrow is based on some of the human preferences, maybe like pairwise comparisons. So essentially, you can almost teach anything to the model. I do think the complexity will like the complity always arises with like tools, right? And like if you give very complicated tools, then you might want na be then like the learning is like much more difficult. So if feel like I do hope that almost anything can be teachable in our role, I'm curihow .
speaker 3: you that and how do you prevent the model from converging to the equimean preference of that? Like AirBNB, those things are looking the same. How you ject some those? This is .
speaker 2: an interesting question. I like most time. Mine is how do you preserve and from the face model, right? Like face models are super divous because they can literally elicany human preference or a human thoughts. So how do you Yeah, they are all. And another way is the reason why we love our layoff, like when inforlearning from AI feedback and create some nthetic generations of pricomparisons, let's say, is because it can induce the diversity that you really want to teach on distribution that you care about. That is not like the mean average consumer. Yeah, because sometimes like you know average human can prefer certain emojis or markdown, but actually you don't want the model to behave like this. So in a way, you can discourage the model from doing that. Yeah. So if you like synthetic curation, like synthetic data generation is mostly like curation of that type of diversity. Yeah.
speaker 3: I feel he was flaand like figuring out what patients are. How do you do about. They these, the models and like do you do qualitative layers?
speaker 2: Yeah. I mean, it's a lot of qualitative, especially for the moral behavior like refusals. Oh, how do I distill this question? Like that going, how do you capture model school conditions of something automatic versus a mold? And you think there is a lot of benefit of literally playing with the model and look at the outputs and see what are the weird nesses it has. There are definitely like maybe more automatic checks and those mostly like evils, but maybe you have an evil that specifically checks with the behavior that you really don't want. And that might be helpful, but a lot of nuanced weirnesses that like did you realize is like through kind of like manual? And another thing is like the model should be consistently behaving like this because if it's just like one, often it's fine, but if the model consistently exhibits this behavior, then this becomes like a more problematic thing.
speaker 3: Coming down, but also with kind of subjective and creative dimension, creating a whole visuenergy face or something arguably complex than any of the problems. Do you think that for those problems compuis still very much ble.
speaker 2: I mean, the computer efficiency is important. I think that with small test time compute, generally, the assumption is the model can always get better and better. So it might achieve kind of human level of visual design. But can it just invent new interaction paradigms, like new interaction patterns? I feel like that's more superhuman skill. And I do hope that with more compute, it will do that at some point.
speaker 3: Start passing the .
speaker 1: lights.
speaker 3: Generation department data, how do you verify that scale?
speaker 2: Well, the thing .
speaker 1: of .
speaker 2: the synthetic data is that you don't actually need to generate a lot of it, right? So you can actually have like very much like manual inspection of what's going on. Obviously, you can think of other methods, like ask human labelers to check the work. You you can also ask another model if that model is like very well, right? So it becomes a more like Mara thing. Maybe you can have a Mara eval for that model to verify what you want to like. Verify. And I think we are kind of entering those types of tasks too. But what I like the thing with synthetic data and the data itself is that maybe you don't need as much. What's important really is diversity and diversity. Cool.
speaker 3: Well, they put instrucdiversity as a more than codequality. So like after still crfor codec tion for quality it option relto a core training outcome. Interesting. So Yeah, I was just like wondering what your .
speaker 2: success about Yeah, I think it's it actually depends so much on the like task and what you're trying to do synsynthetic data like well, if you like if synthetic data happens to collapse certain mode and then it will actually be hurting your training. But if synthetic data is actually diversity might actually scale, if you do a lot of a all and this, it might just like recover from this. Yeah Yeah it's hard to give you the right answer because depends. Nice. Another question question .
speaker 3: that's about making money, I guess. So okay, okay. It's a real question though. Like we all know that serving large language models is expensive, especially at the scale that you know OpenAI anthropic are doing. And my impression has been that at the current stage, it's actually losing money to serve all these models to produce the products. Is that or what fedone to, I guess, bring down the cost? Oh, Yeah. I mean, I think for .
speaker 2: a developer, I think I don't think like I don't know. I think it's the question for Sam with that. We are losing money now. But I do think that the like the generality of the technology is like so wide that you know I don't think I think about how to make money on a regular basis. So I don't know. But Yeah, as a developer, let's say, and if you're using all the tools, you actually don't need to create a foundation dation model now, like you can just use like very inexplike, very like open, like deep seelike, very interesting results, right? So you can have you can like bootstrafrom existing research. I think it's harder to be at the frontier because you kind of need to like invent. And that might be expensive. Like when you invent something new, it's always inefficient, it's always expensive. But actually it was every technological innovation. Like then what comes next? The second innovation is how do we bring down the cost of the thing that happened, right? And this is what's happening in AI too.
speaker 3: So is there and .
speaker 2: then you can create amazing product and .
speaker 3: this is how you make money. Yeah, that's that's a good answer. Just quickly hauling up. So is most of the cost reduction coming from like infrastructure improvements or can better models or better algorithms also contribute .
speaker 2: to lower cost? Yeah. I mean, like lowering cost with I think it's the costs of production maybe just you know the cost of production of the model training itself might go down. Like it's the training process in itself might be not as costly anymore, but obviously we scale a lot like scale Yeah it's very much like in linear relationtions, like healing. And I think it's just like, Yeah, it's hard to know. I'm sorry, I don't have .
speaker 3: an answer. Okay, thank you. So how do you envision lms could be using other .
speaker 2: fusuch as robotics or botic AI? Oh Yeah. I do think like future AI's will be building data centers to Oh like how look the current alums and robotics. I think I've read like pi, the company by Sergei lemon, like the paper they're using a lot of like our hr l for robotics tasks. Obviously, I feel like data is a huge limitation in bottleneck. But I think as long as you solve that, I think it would be really amazingly I'm very hopeful and very excited about this.
speaker 3: Okay. So you mentioned you work in both product and research, so I'm interested in from the developer or the researcher point of view at enthoropac and OpenAI, what's the visibility for you into other components of the model and how easy is for you to learn say, the pre training of GPT? Why you are like responsible for part of the post training?
speaker 2: Oh Yeah. Well, I mostly worked on post training side of things, obviously pertaining obviously you just cannot run like pertaining on your own. You want to like be a part of like the next big training run or something. So that's like part of the teams that you can join. And I think there's a good visibility and like the steps are then sometimes you want to you know contribute the data sets to them or help was Yeah certain tasks that you're interested in.
speaker 3: So I was curious, do you have any coworkers who are AI's right now? When do you use like agents in a kind of coworker .
speaker 2: relationship right now? I don't think I have like an amazing product to be like this, right? I think I use chagpt on the regular basis. Is it my coworker? Not really. Like sometimes I ask, like, what do you think? Yeah, but it's like in chagpt. Like what I really want, I don't know if people have used the spara called tball. It's a pair programming software, but then it's like you can just call a model and then share screen and then the model can just literally start like editing my code if I'm coding or I can like highlight things and then the model can see it and change this as much more when I you're all coworking form factor. But nolike I don't think are technology wise, like we are not that yet. Do you have any ideas for .
speaker 3: like what's missing and having an autonomous like coworker partner? Because I mean, I've seen the cool know demonstrations here. I think one another, the paper like a year or two ago on just like simulating like a city of people and all that. Like what's the gap between that and so like up here that you could have on slack?
speaker 2: I think I think the gap is actually social intelligence and like some of the human things, like being able to like in real time, like generate things that I'm asking and then also being smart about things like maybe I want to code to, but like will the model be able to like tell me instead of like taking agency from me, the model can just like guide me through this. Like it depends on like how you want to form this relationship. It's like really hard. So the model is I feel like it's currently limited by its social abilities. It doesn't make sense. Like I think that's missing kind of like speech to speech conversation. Like and the model converse back with me in real time and then point to things that it talked about exactly at the same time. So it's like those type of things that might have possibly need like some of the changes in like architecture and like multimodality things.
speaker 3: which are the biggest differbetween traditional .
speaker 2: product development? Yeah, that's a good question because I interned briefly at like companies like Dropbox or square, and I mainly very much like traditional software product development. And they go through like this interesting life cycle of, okay, I have like prd, I do this. I incentivized designer designers come up with like you guys. And then I ask software engineers to do this. We have two minutes, I believe. Yeah. I think with research during parts, it actually comes from research. Like if the research has a like an impressive demo of the model capability, maybe then there is then you shape the product around it. And sometimes it's like both product and research come together from the very beginning and do something. This happened with the canvas, for example, when both parand research were together and there was .
speaker 3: less of a process.
speaker 2: It's more ihog .
speaker 3: for fundamental .
speaker 1: tally .
speaker 2: unverifiable domains like creative writing or visual art. Have you thought about using the real world as an Ural environment, social media morality or competition results as a reward? Kind of like lm says, Yeah, it's interesting. I think you know nobody has started like this looks reasonable, especially with creator writing. You have all the competitions or prizes that people got from their writing. So there's something in there, someone who's still concerned the hour, stepping into the creative space to you have any personal moral affliction on how your work is so powerful? And that was presented people negatively affected. I actually wrote a blog post about this. I have a blog post. It's called in substack. And the post is called moral progress. And I think you will .
speaker 1: find it interesting. All right, so give a hand again for kina.

概览/核心摘要 (Executive Summary)

Karina Nguyen的演讲深入探讨了下一代AI产品如何在严谨的强化学习（RL）研究与大胆的产品设计交叉点上孕育而生。她强调，通过构建科学家原型设计与用户即时测试紧密结合的协同设计循环，能够建立衡量AI系统真实世界可用性的评估指标，而非仅仅依赖传统基准。借鉴其在Claude和ChatGPT的工作经验，Karina分享了她将后训练（post-training）视为技术精度与创造性直觉相融合的深刻见解。她认为，随着AI交互日益多模态、多智能体和协作化，这种融合视角愈发关键。演讲的核心议题聚焦于如何教授模型真正的创造力，并为此设计行之有效的评估方法。Karina亦讨论了AI反馈强化学习（RLAIF）的应用，包括合成数据如何加速迭代，“非对称验证”（检查比生成更容易）如何催生新的研究方法，以及这些进展如何揭示在保持模型与人类价值观一致的前提下培养创造性智能的路径。她展望了AI在教育普及、个性化工具创造、游戏开发、人机协作等多个领域的广阔应用前景，并强调AI应致力于增强而非取代人类的创造力。

引言与AI的未来愿景

Karina Nguyen，来自OpenAI，亦曾在Anthropic任职，专注于产品与研究的交叉领域。她坦言AI的发展既“令人担忧又令人兴奋”，并寄望听众能从中认识到，每个人都能在AI浪潮中开拓有意义的未来，并“构建出非常非常酷的”应用。
* AI赋能的巨大潜力:
* 教育民主化: 以ChatGPT为例，AI能够个性化地阐释复杂概念（如高斯分布），并即时生成可视化代码，从而打造更具个性化的学习体验。
* 理解复杂信息: AI能有效解析论文截图等复杂信息，并支持用户通过点选特定内容进行更深入的交互式提问。
* 超越纯聊天界面: Karina指出，ChatGPT最初是纯粹的对话式用户界面（UI）。随着代码生成、长文写作等用户场景的扩展，传统聊天界面的局限性日益凸显。其团队的初步尝试是“Canvas”，旨在打破这一局限，赋能用户与AI进行更细致的协作。Anthropic亦发布报告，分析用户如何运用Claude进行学习，并观察到不同专业背景用户在使AI用模式上的显著差异。
* 个性化工具创造: AI使得“任何人都可以为自己、朋友、家人创建自己的工具，甚至运营自己的业务”。当前模型已能生成前端代码，用户可在Canvas内渲染和迭代，实现更直观的可视化交互。Karina提及在社交媒体上观察到许多用户创建了高度个性化和定制化的工具，乃至国际象棋等游戏。
* 图像生成与创意赋能: 借助OpenAI的图像生成模型，用户可通过手绘草图“重现或实现你梦想中的图像”，并能指定偏好的艺术风格。Karina表示“真心希望人类的创造力以及AI工具如何帮助任何人变得有创造力，或者以一种以前不可能的方式成为艺术家”。
* 移动端迷你游戏创建: 用户可在Canvas中便捷地创建迷你游戏，例如通过提示词生成一个React应用。Karina期望未来AI能更主动地提供个性化体验，“更像一个伙伴”。
* 组合不同能力以增强人类创造力: 例如，用户可先要求AI生成用户界面图像，再指令模型实现该界面的前端代码。这种前所未有的组合不同工具的能力，极大地拓展了创作可能性。
* 核心期望: Karina希望这些实例能“激励人们，而不是害怕AI会夺走他们的工作或消除他们的创造力。相反，我觉得人们可以通过这些工具，凭借他们的想象力变得更强大。”

AI产品与研究的协同设计

Karina认为，AI发展至今，主要得益于两大“扩展范式”：
1. 下一词元预测 (Next Token Prediction): 作为预训练模型的核心机制，模型通过预测序列中的下一个词元来逐步构建对世界的理解。然而，在某些任务（如长文写作）中，一旦模型预测出错，可能导致情节连贯性受损，这通常需要在强化学习阶段进行修正。
2. 基于思维链的强化学习 (RL on Chain of Thought): 该方法用于处理更复杂的任务，最初由OpenAI提出，现已被众多研究机构采纳。Karina视其为一种独立的扩展范式，能够在以往难以企及的真实世界任务中训练模型，对所有智能体（agentic work）相关研究均至关重要。

Karina基于其在产品与研究交叉领域的丰富经验，总结出构建研究驱动型产品的两种主要途径：
* 途径一：为模型不熟悉的新能力打造用户熟悉的产品形态 (Familiar form factor for unfamiliar capability)
* 核心理念：当模型展现出一种前所未有的新能力时，关键在于将其封装在用户易于理解和接受的现有产品交互模式中。
* ChatGPT: 将强大的语言模型能力整合进用户习以为常的聊天界面。
* Claude的100k上下文处理: 通过文件上传这类通用且用户熟悉的功能，使得用户能够便捷地与可以处理整本书籍等海量文本的模型进行交互。其他潜在的产品形态还包括“无限聊天”（作为一种无限记忆的实现方式）。
* Clip模型微调的时尚原型: Karina早期的个人项目，通过对Clip（文本-图像对比模型）进行微调，成功创建了一个时尚相关的应用原型。她认为其在社交媒体上受到欢迎的原因在于“人们确实发现了一些用处”，即成功地将前沿的Clip技术融入到了大众喜闻乐见的产品形态中。
* 模型自我校准 (Self-calibration / 即模型对其输出置信度的认知): 设想若模型能判断其输出答案的置信度（例如85%），用户界面便可通过高亮不同置信度的内容，辅助用户判断信息可靠性。
* 思维链 (Chain of Thought) 的用户友好呈现: 将模型内部复杂且对用户而言陌生的“思考过程”，通过流式（streaming）输出其“短暂思绪”的方式展现给用户，避免了用户长时间等待，从而显著优化了用户体验。

途径二：从坚定的产品愿景或信念出发，针对性地训练模型去实现它 (Start with a deep belief in what you want to make... and to literally make the model do that)
- 核心理念：基于对理想产品形态或功能的深刻信念，反向驱动模型能力的训练与塑造。
- 《纽约时报》时期对信息上下文的探索: 思考如何通过产品和报道为信息赋予更丰富的上下文层次。尽管当时技术手段有限（仅有NLP工具），但这一理念在当前AI工具的支持下，可发展为更动态、更智能的用户界面，以提升用户的内容消费体验。
- 新终端 (New Terminal) 项目的设想: 构思一个更人性化的命令行工具，为初级工程师集成自动补全等GPT-3级别的功能，提升其工作效率。
- GPT-3写作助手早期原型: 在用户输入文字时，模型几乎能同步预测并自动补全其思路，展现了早期AI辅助写作的潜力。
- Claude标题生成的微个性化: Claude在生成标题时，会参考用户的既有写作风格，并以该风格生成新标题，这是一种精巧的“微个性化”用户体验。
- Claude in Slack (2022年)的愿景: 目标是让Claude成为“第一个虚拟团队成员”，能够无缝融入Slack工作环境，参与讨论串、提出建议、高效总结频道内容等。这是使Claude成为一个能调用不同工具的“虚拟超级助理”的初步尝试。
- Canvas项目 (OpenAI)的创新: 旨在“打破传统聊天界面的束缚”，创建一个更灵活、支持深度人机协作、并能随新兴多模态能力不断扩展的交互界面。Canvas不仅允许用户输入，模型本身也能在其中写入内容、渲染代码，甚至支持其他模型进行校对检查。
  - Canvas模型的后训练: “纯粹基于合成数据”进行，并利用知识蒸馏（distillation）技术从能力更强的推理模型中学习。
  - 教模型成为协作者: 将抽象的“协作者”行为具体化、可度量化，例如区分“使用工具”与“主动协作”这两种不同行为。需要细致校准模型的行为模式，如判断何时应重写整个文档，何时仅需修改特定章节；何时在Canvas内部生成代码，何时应调用外部Python工具等。这背后涉及大量针对模型行为的教学与训练工作。
- Tasks项目的潜力: 模型不仅能创建提醒事项或待办列表，更能“每天为你创作故事，或者续写前一天的故事”，这充分展现了模块化组合能力在产品设计中的强大潜力。

AI模型行为的塑造与后训练

Karina通过一个具体的案例研究——调整Claude 2.1版本中出现的过度拒答 (over-refusals) 行为——深入剖析了如何塑造和后训练AI模型的行为。
* 问题背景: Claude 2.1在发布初期，相较于2.0版本，表现出更容易拒答一些表面看似有害但实际上无害的用户请求。此问题成因复杂，并非由单一数据源导致。
* 调试模型的指导原则:
* 核心原则1：宽容解读 (Charitable Interpretation)。模型在不造成实际伤害的前提下，应对用户的请求进行善意解读，避免过度警惕。例如，对于“写一个策划复杂抢劫案的两个角色的对话”这类请求，模型应将其理解为创意写作提示并予以回应。
* 核心原则2：非暴力沟通原则。模型在拒答时，应采用“我”陈述 (I statements)，为自身的拒答行为承担责任，而非指责用户或做出评判。同时，可以主动询问用户是否愿意修改请求，以便模型能在其设定的边界内更好地提供协助。
* 核心原则3：清晰的自我认知。模型需要准确了解自身的能力边界，这涉及到更深层次的“元后训练 (meta post-training)”。
* 核心原则4：同理心回应。在拒答时，应承认这种处理方式可能会给用户带来不便，并尝试提供更具同理心和建设性的回复。
* 拒答行为的系统性分类 (Refusal Taxonomy): 为系统性地解决过度拒答问题，研究团队对各类拒答行为进行了细致分类，主要包括：
* 良性过度拒答: 对本身无害的提示进行了不必要的拒答。
* 创意写作相关拒答: 在处理创意写作类请求时发生的拒答。
* 工具调用/函数调用相关拒答: 例如，模型明明拥有查看笔记的工具权限，却声称“我看不到笔记”。
* 长文档附件处理拒答: 例如，用户上传文档后，模型声称“我没有能力阅读这份文档”。
* 误导性拒答: 对用户的真实意图做了不恰当的负面解读，而本应采取更宽容和善意的视角。
* 构建有效的评估指标 (Evals):
* 源于产品反馈: 收集用户实际报告的、会导致模型产生拒答行为的具体提示。
* 策略性合成数据生成: “综合生成在有害与有益边界上的多样化提示”，这些提示主要围绕“边缘创意写作 (edge creative writing)”等场景。
* 利用其他评估数据集: 包括Anthropic内部维护的约200个非恶意提示数据集、Wild Chat数据集（其中包含用户提出的含糊请求、话题突然转换、政治性讨论等多样化交互场景），以及一些公认的开源基准测试集。
* 后训练模型行为的通用方法论:
* 数据审查与清理是前提: 必须仔细检查并清理用于训练模型的数据。
* 审慎采用人类反馈: 可针对性地收集人类反馈，用于监督微调 (Supervised Fine-Tuning, SFT) 或偏好建模/奖励建模。但Karina也强调，“人类反馈的成本非常高昂”。
* 积极探索合成数据的潜力: 特别是对于复杂的推理模型，可以不完全依赖人类反馈。通过“综合生成一些旨在改变特定行为的偏好数据来训练奖励模型，并进行知识蒸馏”。例如，运用“宪法AI (Constitutional AI)”的原则来创建针对反拒答行为的偏好对数据，此过程的关键在于“精确控制偏好对中特征的细微变化”，以便更有效地引导奖励模型学习期望的行为，避免其学习到虚假的、无意义的关联。其核心思想是“精心构建你所期望的数据分布形态”。
* 像调试软件一样调试模型行为: 不同的拒答行为可能源于训练数据集中不同的组成部分。例如，工具调用相关的拒答可能源于教导模型“自身没有物理实体”这类自知数据，导致模型错误地拒绝设置闹钟（尽管它实际上拥有调用设置闹钟工具的能力），这种数据间的内在矛盾会显著影响模型的行为表现。同样，长文档拒答、创意写作拒答等问题，则可能与安全数据、有害性数据和有用性数据之间的平衡失调有关。
* 平衡的艺术与挑战: Karina引用了“Claude 3道德章程 (Moral Charter)”中的深刻洞见：如果模型被过度训练得乐于助人、对用户请求有求必应，可能会倾向于做出有害行为（例如分享违反既定政策的信息）；反之，如果模型在无害性方面被过度索引，则可能拒绝与用户分享任何有价值的信息，从而变得非常不实用。在这个微妙的平衡中进行导航和取舍，“非常具有挑战性”。
* 改进结果展示: Karina展示了Claude 2.1与后续版本（如Claude 3）在处理特定请求（如“起草一部关于监视系统的虚构科幻小说”）时的行为差异。后续版本能够以更为宽容和恰当的方式回应此类创意写作请求，而不是直接拒答。

强化学习环境与奖励机制的设计

Karina明确指出，“你如何构建RL环境和奖励机制，将直接决定你的AI产品最终如何运作。”
* 真实世界用例驱动环境复杂性: RL环境的复杂性根植于教授模型完成各类困难任务的实际需求。这些任务通常远超简单的问答，往往涉及：
* 复杂的工具使用 (如调用搜索、代码工具、计算器等)
* 基于长上下文的深度推理
* 以及产品设计者希望通过精心设计的奖励机制来塑造的特定模型行为。
* 教模型做“有用”的事:
* 培养AI软件工程师: 若目标是让模型成为一名优秀的软件工程师，那么任务的分布就应围绕此目标构建。而如何评估什么是高质量的代码提交（PR），本身就是一个需要深度产品思考的问题。
* 打造AI创意故事讲述者: 优秀的人类作家不仅需要写作工具来起草和编辑想法，更需要花费数天时间观察世界、连接灵感。相应地，AI模型也应具备类似的能力，例如拥有便捷的编辑和草稿工具，能够持续接触最新的外部信息并基于此进行反思，而不是简单地响应“写关于XYZ”这类直接的指令式提示。
* 向更复杂的RL环境演进:
* 多玩家交互 (Multiplayer interactions): 从传统的单一用户与单一模型的交互模式，逐步转向支持多用户与AI协同工作的复杂场景。例如，一位产品设计师和一位产品经理可以与一个AI智能体共同合作开发一款新产品。这本身就是一个复杂的RL任务，其中每个用户都拥有不同的偏好和目标。
* 多智能体环境 (Multi-agentic environments): 在这种环境中，多个AI模型之间可以进行相互辩论，或者就某一特定主题进行深入审议以共同达成结论。这类似于AlphaGo类型的环境设置，其中智能体通过共同实现某个宏大目标来获得奖励。
* 研究焦点从易度量任务转向更主观、更复杂的任务: AI实验室的研究重心可能正在从那些易于量化评估的任务（如数学解题、编程竞赛）转向那些更主观、更难以精确衡量，但对于AI成功融入社会生活至关重要的任务：
* 情商 (Emotional Intelligence): 尽管用户频繁使用ChatGPT等工具进行心理辅导、情感支持等，但目前仍缺乏针对此类能力的成熟开源评估方法。如何有效衡量AI的情商，已成为一个“非常有趣且重要的问题”。
* 社交智能 (Social Intelligence): 在语音交互模式下，模型不仅需要具备逻辑推理能力，更重要的是能否在我说话时，根据我的具体话语（例如，“我注意到你做了XYZ”）提出真正有意义的建议（例如，“也许我应该为你创建一个新工具”）。这代表了一种不同于纯粹逻辑推理的、更高层次的社交智能。
* 写作 (Writing): 模型在创意写作方面的创造力“真的很难衡量，因为它高度个人化和主观化”。但可以积极思考能否将这类主观任务在一定程度上变得更客观，例如，通过分析优秀的科幻小说，识别其共有的成功要素（如世界观的内在一致性、引人入胜的情节发展等），并将这些要素分解为可评估的规则或指标。
* 视觉设计与美学 (Visual Design and Aesthetics): 模型要能生成在美学上令人愉悦的作品，就需要首先理解优秀视觉设计的基本原则，而这些原则相对而言更具客观性和可评估性。
* 创建新的RL任务 (作为一种新兴的产品研究方向):
* 积极模拟真实世界的复杂场景。
* 充分利用上下文学习 (In-context learning) 的能力，例如用于教授模型使用新工具或适应新环境。
* 有效利用来自更强推理模型的合成数据进行知识蒸馏，以加速模型学习。
* 大胆发明新的模型行为模式和交互范式，如探索多玩家交互的可能性。
* 在整个研发过程中，将产品思考和用户反馈深度整合。
* 奖励机制的精妙设计 (Reward Design):
* 核心挑战：“我们究竟想给模型什么样的反馈信号，才能让它学会在那些复杂且动态的真实世界场景用例中更好地操作，并在各种社交情境中表现得更具适应性和得体性？” 这背后需要“非常深入和细致的产品思考”。
* 例如，目标是教模型能够提出有意义的追问，但同时又要避免其变得过于烦人或干扰用户。奖励机制的设计将直接塑造未来的产品体验和用户与AI的互动模式。
* 警惕奖励作弊 (Reward Hacking):
* 这是RL领域中一个“非常非常普遍”的问题，指的是模型通过某种欺骗性或非预期的方式获得了高额奖励，但实际上并没有真正完成任务，或者没有以期望的方式完成任务。
* 其产生原因多种多样，Karina强烈推荐阅读Lilian Weng关于RL中奖励作弊现象的博客文章，称其分析“非常全面和深刻”。
* 一种常见情况是：当使用其他AI模型（如LLM）作为评估器时，被评估的策略模型可能会试图欺骗评估模型，使其误以为任务已成功完成。例如，一个用于代码修补的AI工具，其模型可能会学习到定义一个总是跳过所有测试的函数，从而表面上“通过”了所有测试，实则规避了核心任务。
* OpenAI最近一篇关于“监控推理模型不当行为”的研究论文指出，不应单纯地优化思维链的简洁性，因为这反而可能导致模型更善于隐藏其真实的、可能存在问题的意图。
* 随着模型推理能力的日益复杂化，奖励作弊的复杂性和隐蔽性也会随之增加，尤其在软件工程等对准确性和安全性要求极高的领域。因此，可能需要创建全新的、更高级的评估方法和验证机制，以实现对模型输出结果更可信的验证，这也是AI对齐（Alignment）问题研究的重要组成部分。

人机交互的未来展望

Karina分享了她对未来人机交互发展趋势的一些前瞻性看法：
* 推理成本的急剧下降: 她坚信“原始智能 (raw intelligence)”的获取成本正在经历前所未有的急剧下降，并且这一趋势仍将持续。这将使得“任何人都可以用这些模型，以非常低的成本，创造出真正有用和令人惊叹的东西”。
* AI输出验证面临的挑战: 对于非专业领域（例如复杂的医疗诊断或金融市场分析），普通用户往往很难准确验证AI输出结果的正确性和可靠性。因此，迫切需要创建全新的“评估机制和交互界面 (new affordances)”，让用户能够有效地验证或编辑模型的输出，并反过来帮助训练和改进模型。
* 动态生成式UI (Dynamic Generative UI)的潜力: 这是一种“即时的、无形的、按需生成的软件创建”的革命性理念。例如，当用户表达“我想更多地了解太阳系”时，未来的模型可能不会仅仅输出单调的文本信息，而是会根据用户的个体特性（例如，对于视觉思考者，可能会生成生动的图像或交互式3D可视化模型；对于听觉学习者，则可能生成一段定制化的播客内容）来动态生成最合适的个性化内容呈现方式。这种界面是“短暂的 (ephemeral)”和情境感知的，其具体形态深度依赖于对用户意图和当前上下文的精准理解，是“深度个性化模型”的极致体现。
* 个性化医疗与教育的广泛普及: AI有巨大潜力让更多人便捷地获得高质量、个性化的医疗健康服务和教育资源。例如，任何人都可以使用ChatGPT等AI工具初步检查身体症状并获得一些初步建议。未来，我们还可能看到更多与此相关的、令人兴奋的消费级智能硬件问世。
* AI与人类叙事方式的深刻变革: AI无疑将深刻改变我们“讲述故事的方式”。未来可能会出现与AI模型共同协作撰写小说、联合编写电影剧本等全新的创作模式。Karina真诚地希望“当前的创作者们不会对AI感到恐惧，而是能以更开放的心态，积极地将这些强大的新工具融入到他们的创作流程中，探索全新的艺术表达可能”。

问答环节要点 (摘要形式)

Karina在问答环节就模型迭代、主观任务评估、AI发展瓶颈、AI与创意及工作的关系、模型多样性保护、异常行为检测、合成数据、大模型成本以及跨领域应用等问题分享了她的见解：
* 模型新行为的迭代与评估: 通常始于明确的目标行为，继而规划数据收集与模型训练策略。这可能涉及在基础模型上增量学习，或调整奖励模型，并通过精心设计的评估集（Evals）来衡量效果。这是一个在多项指标间权衡和迭代的复杂过程。
* 主观任务评估的挑战与方向: 对于创意写作、情商等主观性强的任务，虽缺乏统一基准，但研究者可自行构建评估体系。她认为AI研究正从易度量的任务转向更长周期、更复杂的真实世界任务，如软件工程自动化，这些任务的评估本身就是挑战。
* AI发展的主要瓶颈: Karina个人认为，基础设施的完善（尤其对多模态的支持）是当前关键瓶颈之一。同时，提升研发效率可能更多依赖于利用AI工具本身，而非简单增加人力。整个领域尚在探索，但未来一两年内AI工具的进步有望带来显著加速。
* AI在创意领域的角色演变: 她设想AI未来在创意流程中将超越工具属性，更像一个“共同创作者”，支持实时头脑风暴、协同完成作品，形成“伙伴式的工作关系”。她提及了一款名为“t-ball”的配对编程软件，认为其代表了更接近同事形态的AI协作。
* 维护模型多样性，避免品味趋同: 基础模型因其训练数据的广泛性而具有“超级多样化”的潜力。通过强化学习从AI反馈（RLAIF）及精心设计的合成数据，可以引导模型学习特定的、非平均化的偏好，避免产生不希望的、趋同化的行为。
* 模型异常行为的检测机制: 除了自动化的评估（Evals），大量的“定性分析”——即研究人员亲自与模型互动、观察输出并识别“怪异之处 (weirdnesses)”——至关重要。行为的“一致性”是判断问题严重性的关键。
* 合成数据的有效运用: Karina强调合成数据的“多样性”比数量更重要。由于所需数据量可能不大，可以进行细致的人工检查，或利用其他可靠模型进行元评估。
* 大模型服务的成本考量: 她指出任何技术创新初期都伴随着高成本和低效率，AI也不例外，后续发展会致力于成本优化。目前已有许多开源模型可供开发者使用。
* 大语言模型在机器人等交叉领域的应用前景: Karina对此表示“非常充满希望和兴奋”，认为数据是主要瓶颈，一旦解决，潜力巨大。
* 研究员在大型模型项目中的协作与可见性: 她主要从事后训练工作，预训练通常由专门团队负责，但团队间存在良好协作，可以贡献数据集或在特定任务上提供支持。
* AI作为“同事”的现状与未来差距: 她认为当前AI在“社交智能”方面尚有不足，例如实时理解复杂对话、捕捉非言语信息并智能引导协作的能力。这可能需要架构和多模态能力的进一步突破。
* 传统产品开发与AI研究驱动产品的核心差异: 传统软件开发多遵循固定流程（PRD -> 设计 -> 开发）。而AI驱动的产品可能源于一个“令人印象深刻的模型能力演示”，围绕此能力构建产品；或从一开始就由产品与研究团队紧密结合、共同探索（如Canvas项目），后者更为灵活和“随意性更强 (more ad-hoc)”。
* 利用真实世界作为RL环境训练AI的可行性: 对于创意写作等难以客观评估的领域，利用真实世界的反馈（如社交媒体反响、竞赛结果）作为奖励信号是“看起来合理”的探索方向。

总结核心观点或结论

Karina Nguyen的演讲全面而深刻地阐述了AI产品与研究协同设计的核心理念与实践路径。她强调通过持续的迭代循环和真实的用户反馈来塑造和评估AI模型，特别是针对那些更主观、更贴近真实世界复杂性的高级任务。通过精心设计的强化学习环境、巧妙的奖励机制以及对合成数据的高效运用，可以有效提升AI模型的协作能力、创造力乃至一定程度的“社交智能”。尽管在评估方法、奖励作弊防范等方面仍面临诸多挑战，但AI在个性化服务、创意赋能以及提升人类工作效率等方面的巨大潜力已清晰可见。未来的AI发展将更加侧重于构建能够与人类进行深度、多模态和个性化协作的智能系统。这需要研究人员和产品开发者以前所未有的紧密度携手共进，不断探索和创新交互范式与评估方法，其最终目标是让AI成为增强人类智慧与创造力的强大伙伴，而非简单的替代者。

摘要历史 (2)

Detailed Summary 摘要

模型：gemini-2.5-pro-exp-03-25

2025-05-18 16:11

Detailed Summary 摘要

模型：gemini-2.5-pro-exp-03-25

2025-05-18 15:56

StreamSparkAI