Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 11 - Benchmarking by Yann Dubois

This talk focuses on the central role of benchmarking and evaluation in machine learning. The speaker, Yann Dubois, points out that evaluation runs through a model's entire life cycle, from training, development, and model selection to deployment and academic publication, but different stages emphasize different properties of an evaluation method (such as speed, cost, differentiability, trustworthiness, task relevance, and whether the metric is absolute). Academic benchmarks (such as MMLU) are crucial for driving the field forward, where reproducibility, standardization, and long-term validity matter more than short-term metric perfection. The talk then distinguishes two kinds of evaluation tasks in NLP: closed-ended tasks (such as sentiment analysis and textual entailment), which have fixed answers and mature evaluation methods, and open-ended tasks (such as text generation), which have many possible answers and are much harder to evaluate.

Media details

Upload date
2025-05-16 20:35
Source
https://www.youtube.com/watch?v=TO0CqzqiArM

Transcript

speaker 1: Great. So I think let's get started, because we have a lot to cover today. My name is Yann, for those who don't know me; I'm a third-year PhD student advised by Tatsu and Percy, and today I'll be talking about benchmarking and evaluations. Benchmarking and evaluation are honestly something that I think not enough people look at in academia, but if you really want to put something in production, and you really care about, let's say, real-world machine learning, evaluation is really key. So let's talk about that. An overview of what we'll cover: first, different reasons for measuring performance. Then I'll talk about text classification and how you measure performance there, then text generation and how you measure performance there, and finally how you evaluate current large language models and some issues and challenges with the ways we actually perform evaluations. Okay. So my mental model of how you actually develop a machine learning model is that at first you will be training your model. Here, measuring performance is really key, because you need to have a loss, and you need to know how to optimize it. Then, once you are optimizing your loss, the second step is basically development. Usually this is hyperparameter tuning, or, for example, early stopping during training: if you see that your model is not performing that well, or that some overfitting is happening, you might decide to stop, or you might decide to change the learning rate during training. So development is kind of the second step, and here you need to measure performance because you need to know how to actually do hyperparameter tuning and change hyperparameters. Then the third step is essentially model selection: if I have a task that I really care about, which model performs best for my task? That might be a model that I have trained, or it might be a model that another group has trained. And finally, at least in the real world, you would decide to deploy your model, and here measuring performance is really key, because you need to know whether your model is good enough to put in production. In the parallel universe that we live in, there's also publishing, where you basically need to evaluate your model on standard benchmarks; the reason we do that is essentially to communicate to different groups the quality of our model. So at every step of this pipeline, you really need to measure performance, and that's what we will talk about today. What is key to understand is that at different steps, you need to measure performance in different ways; there is really not a single, ideal way of measuring performance. For example, on the left, when you train your model, you need a way of measuring performance that is super fast, super cheap, and differentiable, because with neural networks you're basically backpropagating through the loss, so it needs to be differentiable. And you really cannot allow your model to find shortcuts to optimize the loss when that's not what you actually want to optimize. As you move more to the right, you will measure performance less often.
So it's fine if it's more expensive, but you really need your evaluation metrics to be higher quality, because the risks if you put a bad model in production are higher. During the development stage, you need a way of measuring performance that is fast, cheap, and also avoids shortcuts, because when you do hyperparameter tuning you're essentially also optimizing over a certain objective. Model selection can be a little bit less fast and less cheap, but you will still have to do it many times. And most importantly, when you deploy your model, you really want the way you evaluate performance to be trustworthy, because once you put something in production, there's kind of no way to go back for whatever happened during the time it was in production. You also want things to be very task-specific: if I care about a certain task when I put my model in production, you really need to evaluate on that specific task; I don't care about other tasks. And finally, you need your metrics to be absolute. The reason I'm highlighting that is that in the three other steps you really just care about comparing between things, which is very different from wanting a threshold that says, if I have less than 95% accuracy, I'm not putting my model in production. Okay. Now let's talk about publishing. This is a little bit different from evaluation in the real world. When you do academic benchmarking and you evaluate your models on academic benchmarks, you want the benchmark to be reproducible and standardized, basically because for the next five or six or ten years, everyone will be evaluated on that one benchmark, and you want papers in three years to be comparable to yours. So it's really important that your evaluations are reproducible; honestly, you don't really care about that in the real world. You also want things to be easy to work with, because researchers usually don't want to do more work than they need to, and they usually don't have that many resources, so it needs to be fast and cheap. And finally, one thing I really want to highlight is that for the academic benchmarks we usually have, it's fine if the metrics we use are not perfect, because what really matters is that, over ten years, the direction your metric points in reflects how the field is moving: if the metric says things are better over ten years, then in reality the field has made some progress. So at a meta level, it's fine if we use crude metrics in academia. You also need to balance between difficulty and simplicity. What I mean by that is that if your benchmark is way too complicated, then all methods will have essentially random performance, so no one will use your benchmark; and if your benchmark is too simple, then the baseline will be so good that no one will use your benchmark, because no one can beat the baseline. This is really something that is specific to academia; in the real world, you're not going to be able to change that, you just perform based on how good your model is. So that's why I want to highlight this: usually people talk about evaluations, but there are really different ways of evaluating and different reasons why we evaluate. Does that all make sense? Also, feel free to ask questions.
speaker 2: Great. Okay.
speaker 1: So, benchmarks in academia. This is really how we drive the field. This is the MMLU benchmark; I think Archit briefly mentioned it, and I'll talk about it again later. This is the most standard benchmark right now, and you can see that in the last four-ish years it has gone from 25% accuracy, which is essentially random because it's multiple choice with four choices, to around 90-ish percent accuracy. So yeah, benchmarking is really what drives progress in the field. And again, you see what I meant earlier: it's not really the differences between small points that matter, at least in academia. You have to take a step back and think about what matters, which is how models perform over ten years, and making sure that the model on the top right here is better than the model on the bottom left, even if the benchmark is not perfect. And I think MMLU is a pretty good one in that sense. Okay. So there are two main types, at least classically, of tasks in NLP. Closed-ended tasks, which I'll talk about first: think about classification, where you know exactly the correct label for the task you're performing. Here, this is the IMDb dataset, where you're asked to say whether a sentence has positive or negative sentiment. The text is "Read the book, forget the movie," so this is sentiment classification about the movie, and here it's basically negative. And then there's open-ended evaluation. Think about ChatGPT: how do you evaluate something like that, where there's really no single correct answer, there are many possible correct answers, and they all have different qualities? So we're going to distinguish between those two. Closed-ended evaluation, as I just said, is basically defined as a task where there's a limited number of potential answers, think fewer than ten, and often there's just one or maybe a few correct possible answers. This really is standard machine learning: if you think about standard classification, you can just compute accuracy, you can look at your precision and your recall. There's nothing special here about NLP. That is not to say that it's simple, it's just that there's nothing special about NLP here. So, some closed-ended tasks: I already told you about sentiment analysis, which is usually a binary classification task where you just have to say whether the sentiment is positive or negative. Another task is entailment. For sentiment analysis, the typical benchmarks, which I always put next to the task, are IMDb and SST from Stanford. For entailment it's SNLI, also from Stanford, where you have some text, here "A soccer game with multiple males playing," and a hypothesis, "Some men are playing a sport," and you have to say whether the hypothesis is implied, or entailed, by the text. Here, it is. Other tasks: part-of-speech tagging, where the typical benchmark is the Penn Treebank, and named entity recognition, where it's the CoNLL benchmark. A few other tasks; you don't need to know all of them, but just to give you a brief overview: coreference resolution. This is actually a pretty challenging NLP task where you have to say which pronoun refers to which noun. You have the sentence "Mark told Pete many lies about himself, which Pete included in his book. He should have been more truthful." And now you have to say what "he" refers to, whether "he" refers to Pete or to Mark.
And then there's question answering, where you basically have a long text, the task asks a question, and you're supposed to provide an answer based on the text you were given. So those are some examples of closed-ended tasks, and again, the key here is that the way we evaluate them is just standard machine learning: you can look at accuracy, precision, recall, F1 scores. Hopefully you all know about these types of metrics, but if you don't, you should look at Chris Piech's class; his lecture is online and it's actually really good on the different metrics. The way people evaluate some of these benchmarks is usually by looking at many of them concurrently. The most common multitask benchmark, I would say, is called SuperGLUE. Here, in the columns, you have all the different tasks in SuperGLUE, I think there are eight or nine, and then you really just look at the average performance across these benchmarks and you get a ranking from that. That is kind of an attempt to measure general language capabilities. This is what people used to do, I would say, until maybe two years ago; I will tell you about what people do now around the end of the lecture. But yeah, SuperGLUE is definitely something you should at least be aware of. As an example of the tasks in SuperGLUE, one is BoolQ, where you simply have some text and some question, and you have to say whether the answer is yes or no. That's very easy to evaluate: you just look at accuracy or precision and recall. Entailment we already talked about. Then there are others like coreference resolution, which we also talked about, and word meaning, where you have two sentences containing the same word and you have to say whether it actually means the same thing in both sentences. For example, "bank" could mean a river bank or a bank for money, and you have to say whether in these two sentences it refers to the same concept. And there are some question answering tasks too. So this is SuperGLUE. Are there any questions? Cool. So again, although I've said many times that this is essentially just classical machine learning, I want to emphasize that it doesn't mean it's simple, and you really have to think carefully about what you do when you use these types of closed-ended tasks. In particular, you're going to have to choose whether you look at accuracy, precision, recall, F1 scores, ROC curves, AUC curves; if you don't know these names, you should really check out the scikit-learn documentation or the lecture from Chris Piech that I mentioned above, both of which are really good. Depending on which metric you choose, you will end up selecting very different types of algorithms. The usual example is spam: say you want to classify whether an email is spam or not. Most emails are not spam, thankfully, at least I hope so. Let's say that 90% of emails are actually not spam and only 10% of them are spam. If you look at accuracy, then a trivial classifier that always predicts the most likely label will get 90% accuracy. And if you don't really know your dataset, 90% accuracy seems good, but in reality it means that you're not classifying anything. That's why you want to look at precision, recall, and F1 scores.
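To make that spam example concrete, here is a minimal sketch (with made-up labels where 90% of emails are not spam, assuming scikit-learn is available) of how a majority-class predictor gets 90% accuracy while its precision, recall, and F1 for the spam class are all zero:

# Sketch: why accuracy is misleading on imbalanced data (hypothetical labels).
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# 90 non-spam (0) and 10 spam (1) emails, as in the lecture's example.
y_true = [0] * 90 + [1] * 10
# A trivial classifier that always predicts the majority class "not spam".
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))                    # 0.9, looks great
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0, never flags spam
print(recall_score(y_true, y_pred))                      # 0.0
print(f1_score(y_true, y_pred))                          # 0.0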
Anyways, I will not talk too much about that, because again, this is not specific to NLP, but that doesn't mean it's easy. Another issue is that once you have multiple different tasks, there's the question of how you aggregate these metrics. Right before, I told you, oh, you just take the average of all of these things. That honestly is a really terrible thing to do, but it's actually what people do. These columns actually mean very different things: some of them are accuracies, others are F1 scores, others are correlations, and you just average everything. I can't remember which benchmark, but I remember a few years ago there was one benchmark where for one of the columns you actually had better performance if the value was lower, and people still took an average of these things, until someone realized it and they were like, maybe we should put a minus sign there. So yeah, be careful, and don't always assume that what people do in academia, or what people do in general, is correct; you should think a little bit about it. Now, some other questions I want you to think about: where do those labels come from? I said there's usually a right answer, but how you actually get those labels is unclear, and I will tell you about some issues in the next slide. Also related to that, there might be some spurious correlations, and that's what we're going to talk about right now. We already talked about SNLI, so entailment. Here you have your premise, "The economy could still be better," and the hypothesis, "The economy has never been better," and you have to say whether the hypothesis is implied by the premise. What this paper from 2019 found is that all the different models were performing really well, but if you classified based on the hypothesis alone, you could also perform really well. So even if you did not look at the premise, which seems like something you need to take into account because it's part of the task, you could perform well. The reason is that they realized that when humans wrote the hypotheses, they were asked to write a hypothesis which is not entailed by the premise, and the way humans usually do that is by adding a negation. So if you only look at the hypothesis and you see a negation, it's very likely that it's not entailed by the premise. So again, even though this is standard machine learning, be really careful about what metric you use and where the labels come from, and don't just use what people do, thinking that if there were an issue, someone would have noticed. So yeah, that is spurious correlations.
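Just to illustrate the kind of shortcut that paper found, here is a toy sketch (my own made-up examples, not the real SNLI data or models) of a hypothesis-only "classifier" that never reads the premise and still looks deceptively competent by keying on negation words:

# Toy illustration of the hypothesis-only shortcut: ignore the premise entirely
# and guess "not entailed" whenever the hypothesis contains a negation word.
NEGATIONS = {"not", "never", "no", "nobody", "nothing"}

def hypothesis_only_predict(hypothesis: str) -> str:
    words = hypothesis.lower().split()
    return "not entailed" if any(w in NEGATIONS for w in words) else "entailed"

print(hypothesis_only_predict("The economy has never been better"))  # not entailed
print(hypothesis_only_predict("Some men are playing a sport"))       # entailed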
Any questions on closed-ended tasks? Cool. Okay, open-ended evaluations. I'm going to mostly talk about these, because this is what is specific to NLP. An open-ended evaluation, or open-ended task, is essentially the opposite of a closed-ended task, which is to say that there are many possible correct answers and you cannot enumerate all of them. So you really can't use standard machine learning metrics anymore. And even beyond the fact that you cannot enumerate all the possible answers, there are usually different levels of correctness. If I ask you to write a book, or if I ask ChatGPT to write a book, it might be a decent book, but there might be a better book that it could have written, or that another model could write. So it's not just right and wrong; it's a continuum. Standard examples of open-ended tasks: the two most common ones are summarization and translation. For summarization, you have a long piece of text and you have to summarize it in less than some number of characters. The standard benchmark is the CNN/DailyMail benchmark. The way they actually collected that dataset is that they took a lot of CNN articles, and you know how at the top of CNN articles you have bullet points that say what the most important things in the article are? They used those as essentially the gold summary. So this is the classic one. And for translation, you basically have sentences in two different languages and you have to translate from one to the other. Those are the classical tasks. What people currently do, I would say the most standard task right now, is instruction following. Instruction following is kind of the mother of all tasks, in the sense that you can view any previous task as just a chatbot, or some question that you ask to, basically, ChatGPT. Classification? I could just ask ChatGPT to do that. Summarization? I could ask ChatGPT to do that. So essentially, you can view a chatbot as the most general type of task: you can ask it to perform any possible task, and it should just provide the answer for that task. This is what we call instruction following. And as you might expect, evaluation is very hard in that domain, and that's what we'll talk about later: how do you evaluate something like ChatGPT? Okay. So, types of evaluation methods for text generation, or open-ended tasks. The classical ones are content overlap metrics, which I'll talk about first; that's really just comparing the words between a reference answer, a gold answer that humans wrote, and the actual generation that you got from your model. Then there are model-based metrics, where you basically turn evaluation into machine learning: you train a model to become an evaluator. And then there's human evaluation, which is usually seen as the gold standard for open-ended tasks. So, content overlap metrics. As I just said, this is really just comparing word by word, or groups of words, between the generated sequence and some reference. Here I have the generated sequence being "the woman went to the hardware store," and the gold reference, the one written by humans, being something like "they walked to the grocery store." I actually don't even know what the task is, but then what you do is that you compare the two sentences by looking at the lexical similarity between those two texts. This is super fast and efficient, and the way you usually do it is by using n-gram overlap metrics. What I mean by this is that the simplest possible thing is just to check, for every word in the generated sequence, whether it appears in the reference sequence, and if it does, you increment your score. N-gram overlap is essentially the same thing, but instead of looking at a single word, you look at bigrams, trigrams, and so on, multiple words next to one another. The most common overlap metrics are BLEU and ROUGE. BLEU means blue and ROUGE means red; that's not what they stand for, though, and I always forget what they actually stand for. But basically, BLEU is an n-gram overlap metric that looks at precision, while ROUGE looks at recall.
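As a rough picture of what these metrics compute, here is a minimal sketch of unigram overlap precision (the BLEU flavor) and recall (the ROUGE flavor) on the two sentences above; real BLEU additionally uses higher-order n-grams, clipping, and a brevity penalty, and ROUGE comes in several variants:

# Sketch: clipped unigram overlap, reported as precision and recall.
from collections import Counter

def unigram_precision_recall(generated: str, reference: str) -> tuple[float, float]:
    gen, ref = Counter(generated.lower().split()), Counter(reference.lower().split())
    overlap = sum((gen & ref).values())          # clipped word matches
    precision = overlap / max(sum(gen.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    return precision, recall

print(unigram_precision_recall("the woman went to the hardware store",
                               "they walked to the grocery store"))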
So as I alluded to before, even here you have to think about whether you care about precision or recall. These metrics are not ideal, but until, I would say, two years ago, they were the gold standard for translation and summarization. For translation, people use BLEU. Let's say I'm translating from French to English: I look at the generated sequence in English and the actual reference sequence in English, and I want to know how many of the bigrams that I generated appear in the reference sequence. There's one additional thing, which is that BLEU doesn't only look at precision, because you could get very high precision by predicting something very short. For example, if you only ever generated the word "the," you would most likely get very high precision, because "the" usually appears in every sentence, or, let's say, a full stop. So there's also a length penalty. And ROUGE does kind of the opposite: it just looks at recall. So those are the common content overlap metrics. And just to illustrate why they are not ideal: well, they have many issues, but one of them is that they don't take into account the semantic relatedness between words. Imagine that Chris asks you, "Are you enjoying the CS224N lectures?" Of course, the gold answer is "Heck yes!"; that's the reference answer. Now let's say that the model just generates "Yes!". Here, if I look at the BLEU score, I will get essentially a 67% BLEU score, because two of the unigrams that I generated are in the gold reference. If I generate "You know it!", then I will only have a single token in the generated sequence that appears in the reference sequence, which is the exclamation point, so I get a much lower BLEU score. And if I just say "Yep!", then that doesn't appear at all in the reference sequence, so I get a zero BLEU score, which is a false negative, because it literally means the same thing as "Heck yes!". So hopefully you see that these metrics really have issues. You can also have false positives: for example, if you say "Heck no!", then most of the tokens are the same, so you get a 67% BLEU score, but it means something completely different. Does that make sense? Any questions? Cool. So very naturally, now that you know everything about word embeddings, you might ask: why do we look at words, when we could look at learned representations, which actually maintain the semantic similarity between words? And this is exactly what people have done. Around 2019, I think, and actually even before, around 2016, they took some word embeddings, they associated every word in the reference sequence with a word embedding and every word in the generated sequence with the corresponding word embedding, and they started comparing the word embeddings. A very simple way of comparing word embeddings is just to take the average of the word embeddings in the reference sequence and the average of the word embeddings in the generated sequence, and then maybe look at cosine similarity. There are smarter ways of doing it, but honestly, at this point, that's not that important. So one simple thing you can do is averaging.
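Here is a minimal sketch of that idea on the "heck yes" example; the tiny embedding table is entirely made up for illustration (in practice you would use pretrained vectors such as word2vec or GloVe):

# Sketch: compare generation and reference via averaged word embeddings + cosine.
import numpy as np

TOY_EMBEDDINGS = {            # hypothetical 3-d word vectors, for illustration only
    "heck": np.array([0.1, 0.9, 0.2]),
    "yes":  np.array([0.8, 0.7, 0.1]),
    "yep":  np.array([0.8, 0.6, 0.1]),
    "no":   np.array([-0.7, 0.6, 0.0]),
}

def sentence_vector(text: str) -> np.ndarray:
    vecs = [TOY_EMBEDDINGS[w] for w in text.lower().split() if w in TOY_EMBEDDINGS]
    return np.mean(vecs, axis=0)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

ref = sentence_vector("heck yes")
print(cosine(sentence_vector("yep"), ref))      # high: same meaning, no word overlap
print(cosine(sentence_vector("heck no"), ref))  # lower, despite sharing the word "heck"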
Another point: as you know by now, word embeddings don't really take into account the context in which a word appears. So a better way of getting good representations for words is by looking at something like BERT. What you can do is take a BERT model, pass the generated sequence through it, and get some embeddings; then you take BERT again, the same model, pass the reference sequence through it, and get some other embeddings; and then you do, again, some comparison. BERTScore is a pretty famous paper, and they do some smart comparison, but it's not that important to understand exactly what they do. What is important is that they do some smart averaging over those word embeddings. Cool. Any questions? Okay, so that was the simplest type of learned method, which is word matching. Another slightly more complicated one is called BLEURT, also pretty famous, which is a mix between BLEU and BERT. The way they did it is that they took a pretrained BERT, then they did some continual pretraining by trying to predict the BLEU score and some other metrics, and then they fine-tuned. That's the important part: they fine-tune their pretrained model to actually do the evaluation they care about. So let's say I have a lot of different sequences and I have some human annotations of how I should evaluate them: I can just treat that as a normal machine learning task and fine-tune my BERT to do the evaluation. So this is BLEURT. Any questions? Yes?
speaker 2: wouldn't it cause .
speaker 1: the same problems? Ms, as if you're appreciating fas blue, then your would learn the ability to model languages semantically in the first place. Yeah, that's a very good point. So actually, I also find it kind of surprising. So they did two things. First they do the real pretraining of Birand. Then they do continual pretraining for predicting blue. And the reason why is because usually they say we have a lot of sequences in our data set that are unlabeled. So we have like some reference sequences and some generated sequences, but we don't have the human annotation of whether this is good or bad. So we will treat that as an unsupervised learning objective. So what do you use for the unsupervised learning objective while you have to use something? And they basically use blue and they also use actually bird score. So they use like many different tasks, and they basically do multitask learning. Okay. So one important issue with all these methods is that really they are, they can only be as good as the references are. And in reality, the references are usually not that good. So this is in, this is a paper that looks at summarization of news. So basically, as I said before, most of the news summarization benchmarks, they usually take the reference summary as being the bullet points that you find at the top of an article. And this is usually not that good. So here, what you see on the left, this is what if you look at the correlation between the x axis being the human rate of the human evaluated performance of every model. And in the y axis, you see the Roul, which is just a variant of Rouge. And you look at whether basically these two are correlated. And what you see is that they are essentially not correlated, which means that Rouge l on standard references is really not correlated to what humans would say is a good summary. That is not to say that Rouge is a bad score. That is to say that actually the references are bad. Because if you look at the exact same thing, but now you ask experts to write very good summaries, then you see that the correlation actually increases by a decent amount. Still not perfect. Rouge is definitely not perfect, but at least it's much better. So this is to say that the metric itself is not always perfect, but not only this, the references are usually actually not great. Cool. So that begs a very natural question, which is, can we just dump and basically move away from reference based evaluation? So as we just said, reference based evaluations are the ones that compare human written references to some model outputs using some different type of metrics. And those used to be the standard metrics for evaluating or the standard benchmarks for evaluating nlp tasks, I would say up to like two or three years ago. Right now, I think papers still have to always show the blue scores, like for example, in translation because reviewers one dose, but I don't think anyone in the real world actually uses them, but I might be wrong with that. So Yeah, so blue Rouge bird score. Oh, and I was mostly talking about blue and Rouge. Birscore is actually still decently used and actually pretty good. Okay. So reference free evaluation. So reference free evaluation is basically you have a model and you ask it to give a call, but don't know your human references. 
The way this used to be done is essentially by taking a model like BERT again, but instead of comparing a reference answer and a generated answer, you just ask it to take the input and predict a score. That's one simple way of doing it, and it used to really not work well. I say "used to" because, since basically ChatGPT and GPT-4, what people do now, and honestly it works super well, is just ask GPT-4 to do the same task you would ask a human to do. So you give it a very long text, then you give it the generated summary, and you ask: how good is it, essentially? And that works surprisingly well. Common benchmarks here are AlpacaEval and MT-Bench; there are many others. Honestly, most people have started using these types of techniques now, and we'll talk about AlpacaEval in a bit. Good. Okay. So let's talk a little bit about human evaluation before looping back to GPT-4. As we saw, the metrics up to now all have some shortcomings, and they're definitely not as good as asking humans directly, partly because they are based on references. So human evaluation is really the gold standard for open-ended tasks. And not only is it the gold standard way of doing evaluation, it's also the gold standard for developing new automatic evaluations: every time you develop a new automatic evaluation, you will want to compare it to what humans would have predicted. Yeah, okay. So, doing human evaluation. At first it might seem very simple: you basically ask humans to evaluate the quality of some generated text. Seems simple, right? But actually it's super complicated; it's a real challenge and it has many issues. Oh, sorry, one additional thing before that: you don't only ask the human for an overall judgment, you usually also ask them to evaluate along different axes, for example the fluency of the text, the coherence of the text, common sense, style, grammaticality, redundancy, and whatever other axes you might care about. Another thing to note is that you should absolutely never compare different human evaluations to each other. If one paper says, oh, humans have evaluated the fluency of our text to be, I don't know, four out of five, and another paper says three out of five, they used different humans and different ways of prompting the humans, so it's absolutely not comparable. Okay, so let's go back to some of the issues. As I said, human judgments are regarded as the gold standard, but they definitely have issues. First, it's super slow; as you might expect, humans are definitely not as fast as automatic metrics. Second, at least in academia, it's still pretty expensive, because when you pay your workers well, it's expensive to do human evaluation properly. Another part is inter-annotator disagreement. If I take two random people in this room and I ask them to evaluate the quality of a generated text, I can assure you that you will really not agree. Especially if it's subjective, it's really bad. But even if you talk for an hour beforehand about how you should be evaluating generations, I can almost guarantee that you will still disagree on many of the evaluations.
To give you an example: when we were doing AlpacaFarm last year, we basically had to take some inputs, take two models (think ChatGPT, Alpaca, and these types of models), have the two models each predict an answer, and then ask humans to say which answer they prefer. This is a very simple task, and, as I'll talk about later, this is what a lot of people basically use right now for evaluating models like ChatGPT. So a natural question is whether humans are good at doing that. And what we saw is this: we were five researchers doing that, and the five of us talked for two or three hours, we wrote extremely detailed rubrics about how to do the evaluations, and still, when we labeled things independently, we only agreed 67% of the time. 50% would be random. And we were really trying to do our best; we were working on this project, so it's not as if we were trying to do it quickly. So, really, people disagree. Of course, if you then allow discussions between the annotators, agreement actually improves, but then it becomes even slower and more expensive. Intra-annotator disagreement: this is something that is extremely annoying, which is that if I ask a human, or if I ask myself right now, to evaluate something, versus in three hours, after I have dinner or after I go for a run, I will actually give different annotations. Yes?
speaker 2: for the samples we might to pay by like .
speaker 1: you mean for validating? Yeah. So this is a very good question. Honestly, there's no good answer. The usual way that people do it is that you look at some statistical like some statistical metrics basically where you're like, okay, I want to compare between these two models. I'm going to look at, I'm going to basically perform A T test, and I'm want to know that my p value is less than a certain amount. What people usually do also when they have human annotations, I unfortunately didn't put a slide on that, but they have metrics for computing the intra annotator, basically agreement, and they try to achieve a certain intranannotator agreement. And if not, they will essentially ask for more humans or for relabelings. Yeah, it's not reproducible. And this is like partly because of the two things that we said before, but also partly because Yeah I mean mostly because of the two things before. So this is an interesting paper. Think I forgot which here. I think it's from 2021, but I'm not sure where. Basically they say, and I read from the abstract here, just 5% of human evaluations are repeatable in the sense that there are no prohibitive barriers to repetition, and sufficient information about experimental design is publicly available for rerunning them. So this is a paper that analyzed, I think, 128 different papers that were published across like five years, I think, between 2015 and 2020, and they found that essentially only 5% of those papers were reproducducible. So honestly, working with humans is hard. That's definitely something to remember. Another part is that humans only basically evaluate precision and not recall. So what I mean by that is that if you show me what the model generated, I can only evaluate that generation. I cannot evaluate all the other possible generations that it could have generated, because then you really have to sample a lot of things that would become way too slow and way too expensive. And finally, usually the incentives are not aligned. So what you want is for the humans to basically do the best possible evaluations. What crowd workers usually want is basically to maximize the amount of money that they get paid per hour. So to give you, again, a concrete example, when we were doing our pka, fari think we were paying relatively well in the sense that we were paying 1.5 times the minimum wage in California. And then we divide, basically, we looked at how much time we would spend to do the thing, basically to evaluate a single example, the best we could. And then we divided by that time to basically know how much we would pay for every example. And what we realized is that they ended up being paid, I think, two or 2.5 times the minimum wage, because they were just doing things like two, three times faster than us. And I don't I mean, we could be slow, but I think what was happening is that they were just trying to maximize the dollars that they were getting per hour. And as a result, they were finding like shortcuts for doing their evaluations. And this is something that you really see in a lot of papers. For example, in our case, you saw that humans really preferred longer answers. And of course, if you give me two very long, like two sorry generations, and you ask me with minimal amount of work to say which one is better, like if I see a longer one, I'm like, probably there are more details. Probably it's better anyways. It's not to say that everyone is like that, but definitely it's the incentives on online. 
So you have to be careful with this. Other challenges: first, you have to decide how to describe your task; you really have to give very detailed rubrics for how the humans should evaluate it. Then there's the question of how you show the task to the humans. For example, the order in which you give examples is actually really important; in our case, we had two examples side by side, and which one is on the left and which one is on the right is also very important. All these things really matter. Of course, you can randomize these things away, but it adds challenges. What metrics to use, I mean, this is not specific to humans. Selecting the annotators is also very complicated. You might think, okay, I have some money, I can go on Amazon Mechanical Turk and just ask people to do some annotations. But in reality, you want to have good annotators. The way it usually works on MTurk is that you say, here's a task, and I want, like, 30 different people to do these annotations, and then they start annotating; and if they don't reach the level that you want, you pay for what they've annotated until then and you work with someone else afterwards. So then there's the question of how you decide whether they achieved the performance you want. You probably have to do some gold labeling yourself beforehand, and then look at their accuracy on that, and at their agreement with you and with the other researchers on your team. So it is very complicated. And not only that, you have to monitor it over time. There are different ways you can monitor that over time, again looking at accuracy: a typical thing is that in every batch of examples they label, you include a few examples for which you already know the gold label, and you see how well they perform on those. Another thing to look at is the time they take to annotate. Yeah. Okay, so that was about humans. Human evaluation is hard, but it is the gold standard. Okay, now let's talk about reference-free evaluation and chatbots. I already told you about this briefly: how do you evaluate something like ChatGPT? This is extremely complicated, because you can ask it any task you want, and it can answer with text that is arbitrary and really long, and that just makes evaluation extremely hard. As I suggested before, the usual way it's done is that you take two models, you put them side by side, you ask them the same question, and then you ask either some humans or some model, as we will see afterwards, which one is better. The most common benchmark right now for this kind of human evaluation is called Chatbot Arena, where anyone can go online and play for free with some of the best models out there, and all they ask you is to say whether you prefer the answer on the right or the answer on the left. And then once they reach a crazy amount of data, 200,000 human votes, for example, they basically add the model to a leaderboard. The way they build the leaderboard is, I don't know if you know how chess ratings work, but they basically compute Elo ratings: they treat everything as if it were a tournament, so that not every model has to play against every other model, and then they get Elo scores. Okay.
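To give a feel for what an Elo rating is, here is a minimal sketch of the classic chess-style update applied to a stream of pairwise votes. This is only the spirit of what Chatbot Arena does; their actual leaderboard fits Bradley-Terry-style ratings over all votes, so treat the constants and update rule here as illustrative:

# Sketch: Elo updates from pairwise "which answer is better" votes.
def expected_score(r_a: float, r_b: float) -> float:
    # Probability that model A beats model B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1 - s_a) - (1 - e_a))

ratings = {"model_a": 1000.0, "model_b": 1000.0}
# A hypothetical stream of human votes: True means model_a's answer was preferred.
for a_won in [True, True, False, True]:
    ratings["model_a"], ratings["model_b"] = elo_update(
        ratings["model_a"], ratings["model_b"], a_won)
print(ratings)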
So what's missing with this side-by-side human eval? As I said, this really is the gold standard for evaluating chat LLMs, but there are still some challenges. First, it's basically random people online asking random questions and providing their preferences, so that may not be representative, although, arguably, when you have that many examples it becomes actually pretty representative of what people would want. So it's probably better than whatever else we have, but it's still not ideal. And then the really big issue is that this takes a huge community effort and a lot of people working on it. It also takes a lot of time to get new models onto the benchmark, and only the notable models, think the OpenAI models, Claude, the Google ones, and the Facebook ones, are going to be benchmarked. You will never have, for your random model, 200,000 people willing to annotate it for free. So this is an issue. And again, as we talked about in the first slides, even for those big companies, this is definitely not something they can use for development of their models; it's something that comes at the end, maybe for model selection. Okay. So how do we make it faster? One very natural solution is to ask a large language model to do the evaluation for you. Imagine that I want to compare GPT with Mistral: I basically ask GPT-4 to evaluate which one is better. This is surprisingly good, and I will show you some results afterwards. Common versions of this are AlpacaEval and MT-Bench, probably the two most common ones. When we started doing that, around last year, for the AlpacaFarm problem I told you about, we found that using GPT-4 for evaluation is, at least if you look at the prices now, about 100 times faster and 100 times cheaper than using human evaluation. But, and this is very surprising, the agreement with humans is actually higher than humans' agreement with themselves. What I mean by that is the following, and this is what we found: say I have a pool of four humans, I take out one human, and I look at the agreement between that human's preferences and the mode of the preferences of the three others, in a leave-one-out fashion. That agreement will be lower than if I ask the model to predict the mode of the humans' preferences. So in some ways, models are more highly correlated with humans than humans are with themselves, which is very surprising, and I'll say a little more about it in two seconds. When we did that, we actually used it in place of collecting human preferences for RLHF; that's what we call RLAIF, as I think Archit told you about last week. So, going back to this surprising result that models are more highly correlated with humans than humans themselves: the reason is that humans actually have high inter-annotator disagreement, so they have high variance, while models will essentially always be very consistent. Maybe not perfectly, there's still some stochasticity, but essentially they will always predict the same label, so they have very little variance. So on this plot, on the x-axis, we estimated the variance, and you see that the human has a variance of around 31 or 33.
Whereas if you look at the red point, which is basically just asking GPT-4 to do the evaluations, even though the bias is still pretty high (bias, by definition, is zero for humans, and for GPT-4 it's around 32%), the variance is much lower than the humans'. So this is why the agreement can actually be higher: it's really because there's no, or very little, variance in LLMs. Yeah, does that make sense? "So it means the bias relative to the human labels is higher than for humans?" Exactly. Which is actually a good sign in some ways, because the low variance makes things much easier for research; the bad sign is that the bias is so high. Okay, so, things to be careful about when you work with this, and this applies both to humans and to LLMs: there will be some spurious correlations. We already talked about spurious correlations, but you will see a lot of them here. One very common example is length: as I told you before, if you ask crowdworkers which output they prefer, they are highly biased towards longer outputs. Here, the blue is humans, around, I think, 70% preference for longer outputs, and models have around the same bias. Another example is a preference for lists: if an output contains bullet-point lists, both models and humans tend to prefer it. Another bias, or spurious correlation, is position: which answer you put on the left and which you put on the right when you ask humans to label, and the same thing happens with models. But this one is usually pretty easy to control for; you just randomize. Another issue is GPT-4 self-bias. Very naturally, you might wonder: if I ask GPT-4 to evaluate itself, it will probably be biased and prefer itself over other models. And this is true, but less than you might think; I'll tell you about it later. Okay. So, AlpacaEval. Wait, until what time do I have? Oh, thanks. Great. Okay, AlpacaEval. AlpacaEval is the benchmark we developed when we were working on Alpaca. As I told you before, one thing that's very important is what you use for development, so basically for hyperparameter tuning. What happened is that we basically did not trust many of the benchmarks that were out there at that point for instruction following, so we developed a very small benchmark for ourselves; this is what we were using while fine-tuning, and then it kind of became its own thing. So, AlpacaEval in a few numbers: it has very high correlation with Chatbot Arena (if you look at the correlation between the rankings on Chatbot Arena and on AlpacaEval, it's 98%, so very high), and it takes around three minutes and about $10 to evaluate a model. The way it works, I think I already mentioned: you take an instruction, you generate an output from the model you're evaluating and from a baseline model you're comparing to, and you ask GPT-4 to give the probability that it prefers the model you're evaluating over the baseline. Then you do some reweighting. The reason you do some reweighting is that these models, as I said, are very biased towards longer outputs, so you reweight such that longer outputs get a slightly lower preference, and then you average across your entire dataset and you get a win rate. So that's how it works. Any questions?
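Here is a minimal sketch of that pairwise LLM-as-a-judge loop. The prompt, the judge model name, and the aggregation are simplified stand-ins rather than the actual AlpacaEval implementation (which uses the judge's token probabilities plus a length-controlled reweighting); it assumes the openai Python package and an API key in the environment:

# Simplified LLM-as-a-judge: ask a judge model which of two outputs is better,
# then aggregate the per-example preferences into a win rate vs. the baseline.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def judge_prefers_model(instruction: str, baseline_out: str, model_out: str) -> bool:
    # A real setup would also randomize which answer is shown as A vs. B
    # to control for the position bias discussed above.
    prompt = (
        "Which response follows the instruction better? Answer with 'A' or 'B' only.\n\n"
        f"Instruction: {instruction}\n\n"
        f"Response A: {baseline_out}\n\n"
        f"Response B: {model_out}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # example judge model
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1,
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("B")

# Hypothetical evaluation set: (instruction, baseline output, evaluated model output).
eval_set = [
    ("Explain overfitting in one sentence.",
     "Overfitting is bad.",
     "Overfitting is when a model memorizes the training data and generalizes poorly."),
]

wins = sum(judge_prefers_model(*ex) for ex in eval_set)
print(f"win rate vs. baseline: {wins / len(eval_set):.0%}")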
So, system-level correlation. What you see on the x-axis is basically AlpacaEval, I mean, a slight transformation of it, but essentially AlpacaEval scores, and on the y-axis is Chatbot Arena, which is the gold standard, and you see that things are relatively highly correlated. On the lower plot, you see the correlation between different benchmarks and Chatbot Arena, and you see that MT-Bench and AlpacaEval, the two that use LLMs for evaluation, are relatively highly correlated with Chatbot Arena; and MMLU, an automated one that doesn't use an LLM judge, is also very highly correlated. So, I told you very briefly that we had to do some reweighting. I'm not going to tell you how we do it, but I want to tell you why. One of the issues that we realized a little bit too late is that if you take something like GPT-4 and you just prompt it to give much more detailed answers, its win rate, so its performance on your benchmark, goes from 50% to 64.3%. That's this one, 64.3. If you ask it to be more concise, it decreases to 22.9%. And that really doesn't fit our mental model of what a benchmark should be doing: if I just tweak the prompt a little bit, I don't want my model to completely change its ranking. So that's why we had to do some reweighting, and you see that after the reweighting, the performance when you ask the model to be more verbose is very close to the performance without any prompt tuning.
speaker 1: Cool. So I told you slightly, or very briefly, before about self-bias. I do want to say that I'm pretty surprised about this result, but self-bias exists, and it's not as high as you might think. Here, on the rows, you see the different models being evaluated, and on the columns you see who is evaluating, that is, which model you're using as the evaluator. And you actually see that regardless of which model you evaluate with, the ranking stays the same. So even though, if I look at Mistral evaluated by Mistral, it gives itself a much higher score, it still prefers Claude and GPT-4. So it's not as bad as you might think; it's still bad, though. Cool. Okay. So that leads me to talking about current evaluation of LLMs. I'd say there are three main ways that people currently evaluate LLMs. The first one is perplexity, which is essentially just looking at training or validation losses. The second one is averaging over everything, which is actually surprisingly more common than you might think. And the third one is the arena-like setup, where you basically have comparisons between models and you use either humans or models to do the evaluation. Usually, how it works is that for pretrained models, let's say when Llama 4 comes out or when GPT-5 comes out, they mostly show perplexity and averages over everything; and the fine-tuned models usually tend to show averages over everything and performance on arena-like evaluations. The reason is that for models that are fine-tuned, the log-likelihoods they predict are usually not calibrated for your dataset. So what do I mean by averaging over everything? I would say the two most common benchmarks that basically look at everything are HELM and the Hugging Face Open LLM Leaderboard. It's really just a collection of a lot of different automatically evaluated benchmarks, and you evaluate across all of them. So what are some of the common benchmarks people use? One measures math performance: GSM8K, which is a pretty common one, basically grade-school math. MMLU is multiple-choice question answering on things like math, science, and history. LegalBench is on the legal side, and there's MedQA, which I believe is in HELM, and is based on medical licensing exams. So you ask many, many different questions that you can automatically evaluate, and you hope that by taking averages it tells you how well your model performs. So that's kind of the newer version of SuperGLUE, I would say. One benchmark I want to highlight, probably the most widely used one and the one that people trust the most, is MMLU: massive multitask language understanding. I think maybe Archit mentioned it last week, but this is basically multiple-choice questions over 57 different tasks, things like formal logic, conceptual physics, and econometrics. Here's an example: "What is true for a type Ia supernova?" with choices like "This type occurs in binary systems" and "This type occurs in young galaxies," and you basically have to say which answer is correct. That seems very simple; I mean, the task is not simple, but the way you evaluate seems simple. And then, say, high school biology: a question about a population of giraffes and an environmental change, where the answer is an example of directional selection. So that seems simple, but actually it's also more complicated than you might think.
And I will tell you about that later; but that's probably the most common benchmark, the one people actually look at. For example, when Mark Zuckerberg announced that Llama 3 was out, he talked about MMLU scores, which I find kind of crazy. Other capabilities that people look at: coding. Coding is a very common thing to evaluate, for a few different reasons. One, because if a model performs well on code, these models usually actually perform well on reasoning, which is pretty cool; so it's highly correlated with things that people care about. Two, a lot of us are coders, so we like to have better models to help us code. And three, it's actually pretty easy to evaluate, because you can write test cases: you ask the model to generate some code, or a function, to do something, and then you just run the tests and see whether they succeed or not. Yes? "For some of these, multiple choice makes sense, but if it's short-answer QA, how would you say something is correct with an automatic method?" Yeah, I actually don't know. Huh, I actually don't know; I should check, sorry. So I don't know specifically for this one, but HotpotQA and BeerQA are other QA datasets, and they compute F1 over the answers, and they also have an exact match metric, which is pretty punitive, because if you say "President Reagan" and the answer is "President Ronald Reagan," it counts against you; but anyway, they use an exact match for that. Yeah, cool. Thanks. Okay, so: MMLU, coding. Another thing people are starting to look at is agents. I think Shikhar is going to give a lecture on it, so I'm not going to talk too much about it, but one cool thing that LLMs can do right now is call APIs and then take actions in the real world, or take control of your computer; you should not give it control of your computer. So a natural question is: how do you evaluate these types of things? This is a real challenge, because, for example, if I really wanted to evaluate how good it is at coding, or how good it is at doing things in my terminal, I would need to give it access to my terminal, and I really don't want to give my LLM access to my terminal. So you really need sandboxed environments. For the specific case of the terminal it's pretty easy to sandbox, but once you want to evaluate a model that, I don't know, pings people on Slack or writes things in your emails, then you have to build an entire sandboxed environment for all the applications that you want your LLMs to have access to. So this is actually really complicated, and something that people really have to deal with in the real world, or at least will have to, because right now it's still not really in production.
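Going back to coding evaluation for a second, here is a minimal sketch of the execution-based idea (in the spirit of HumanEval-style benchmarks, not any specific benchmark's harness): run the model-generated code against test cases and count it as a pass if they all succeed. The hard-coded "generated" solution and the tests are made up, and exec-ing untrusted model output like this is exactly why real evaluations run inside a sandbox:

# Execution-based evaluation sketch: does the generated code pass the tests?
generated_code = """
def add(a, b):
    return a + b
"""

def passes_tests(code: str) -> bool:
    namespace: dict = {}
    try:
        # In a real harness this would run inside a sandbox, with timeouts.
        exec(code, namespace)
        assert namespace["add"](2, 3) == 5
        assert namespace["add"](-1, 1) == 0
        return True
    except Exception:
        return False

print(passes_tests(generated_code))  # True, so this sample counts as a pass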
So this is the x-axis, which is essentially perplexity, and the y-axis, which is just the average over many different tasks. What you see is that models that perform well on perplexity actually have higher average scores. As a result, a lot of people, when they develop, end up just looking at perplexity, and they trust it enough that they don't do the downstream evaluations. I would not recommend doing that, but if you need something quick and dirty, it usually works pretty well. One thing to be careful with, though, is that perplexities are not comparable across different datasets, so you really have to be careful about which perplexity you're looking at. And two, it depends on the tokenizer. So if you take Llama 3 and compare it to Gemini, even on the same dataset, you'll get different scores, and they're not comparable. Yes? — The easy answer — I mean, it's not the only answer, but the easy answer — is that if the vocabulary changes, if the size of the vocabulary changes, then clearly the upper bound is different. A sequence? Yeah, but I'm not talking about that. I'm talking about the fact that — just think about it — if you have a vocabulary size of one, then you always have to predict the same thing. So your entropy is upper-bounded by the log of the cardinality of your vocabulary, and your perplexity is going to depend on that. Cool. And the last one is arena-like evaluation. As I already told you, you basically compare different models — you make them fight against each other, essentially — and you get Elo ratings at the end. A more general way of saying it is that you really just let the users decide, and that also works pretty well. Okay: issues and challenges with current evaluations. First, consistency issues. If you look at multiple-choice questions — you see this on the top left and top right — if you just change A, B, C, D to random symbols, the generations you get are actually going to be different, and then the rankings between different models will be different. So even things that are very simple, like selecting out of four choices, depend heavily on exactly how you format these choices. One real example, which is what I was alluding to before, is MMLU. MMLU seems really simple to evaluate: you just ask which option the model prefers. But actually, for a very long time — I think for nearly a year — there were three main implementations of MMLU, and people were comparing between them with no idea that they gave different scores. The two main differences were: one, people used different prompts, which clearly gives different answers. But two, they were using different ways of getting the most likely prediction. One of them, for example, said: I have the four choices, and to get the most likely answer — say the correct answer is D — I will just look at the most likely answer out of A, B, C, D. Even if some other token has a higher likelihood, I will not look at it, because I'm basically doing constrained decoding. If I do constrained decoding here, I will say the model's answer is D. But if I actually just look at the most likely token overall, I will not get the correct answer.
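To make the two implementations just described concrete: with constrained decoding only the option letters compete, while the "free" variant takes the argmax over the whole vocabulary, which may be some other token entirely. A toy sketch, where the dictionary of next-token log-probabilities is a made-up stand-in for one forward pass of a model:

```python
def constrained_choice(logprobs, options=("A", "B", "C", "D")):
    """Only the option letters compete: pick the most likely one among A-D."""
    return max(options, key=lambda letter: logprobs.get(letter, float("-inf")))

def unconstrained_choice(logprobs):
    """The whole vocabulary competes: pick the single most likely next token."""
    return max(logprobs, key=logprobs.get)

# Hypothetical next-token log-probabilities after "Answer:" for one MMLU question.
logprobs = {"A": -2.3, "B": -2.9, "C": -3.1, "D": -1.6, "The": -0.9, " I": -1.2}

print(constrained_choice(logprobs))    # "D"  -> counted correct if the gold answer is D
print(unconstrained_choice(logprobs))  # "The" -> counted wrong under the stricter scheme
```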
So those were two different implementations. And a third implementation, which seems really different, is that instead of generating the answer token — which is basically the letter A, B, C, or D — you look at what the likelihood is, after the question, that the model would generate each full answer text. So you look at the log-likelihood — essentially the perplexity — of predicting each answer. And that gives very different results. If you look at the top right, you see that Llama 65B's MMLU on HELM was 63.7, and with the original MMLU implementation 63.6, but on the Harness, which is the thing that Hugging Face actually uses, it was 48.8. So that's a
speaker 2: huge difference. Yeah — what do HELM, Harness, and the original correspond to? Can you say which of those three does what?
speaker 1: Yeah. I can't remember which one does what, but each of them does something different. Actually, it's not the case anymore — the middle column changed what they were doing, so they now match the other two. But at the time — I'm not sure which one; my guess would be that it was the last one, but I'm not entirely sure. Okay, questions? Cool. Another issue: contamination. So here you have Horace He — if you don't follow him on Twitter, you should — and he basically said he was looking at code benchmarks and found that on pre-2021 problems GPT-4 was getting ten out of ten on Codeforces questions, but on more recent problems, after 2021, it was getting zero out of ten, which seems very, very strange. That strongly points to contamination: the Codeforces dataset was probably in the pretraining data. And of course, if you essentially train on your test set, then you're going to perform really well. And Susan Zhang — also someone to follow — said something similar about Phi-1.5, which is a model from Microsoft. What is challenging here is that with closed models, there are really two problems. One is that these models are pretrained on so much data that even if we had access to the data, it would be hard to know whether they were pretrained on your test set. But two, these are closed-source models, so you don't even have access to the dataset — you have no idea whether they were pretrained on that data. Overfitting issues. That's related, but slightly different. Here you see how much time it took for standard datasets to reach, in scare quotes, human-level performance. What you see is that for the recent ones, where you really have this pretraining, human-level performance is reached in less than about six months. We don't really know whether that's because of contamination or simply because a lot of people are developing and doing hyperparameter tuning on these test sets, but it's clearly an overfitting issue. So how do you alleviate that? One: you can have private test sets. There's a paper from, I think, two weeks ago that presented GSM1K, which is the same thing as the GSM8K we saw before — the math dataset — but they basically recollect, or resample, the dataset. Then they look at how well different models perform on both GSM1K and GSM8K, and what you see is that at least the open-source models perform much worse on the new dataset than on the one people are able to tune on. This is not the case for Claude and GPT-4. Another one is Dynabench, or just dynamic test sets more generally. Ideally, every x number of days you would have new instructions or new inputs to the models, and your dataset would basically be dynamic. That's essentially also what Chatbot Arena does. So that definitely helps. Another way of alleviating contamination is to try to estimate whether the models were actually trained on your test set. One very simple way of doing it, which I think works relatively well, is just looking at the probability of different answers: if your model is really confident about a certain answer, then it was probably trained on that answer.
Another one, which is also really cool, is looking at the order of your test set. If a model was trained — or pretrained — on the test set, then most likely it thinks that example two comes after example one. So if you swap example one and example two and you see drops in log-likelihood, then most likely the model was actually pretrained on that dataset. Cool.
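A hedged sketch of that order-swap heuristic: compare the model's log-likelihood of the test set in its canonical order against shuffled orderings of the same examples. Here `sequence_logprob` is a placeholder for whatever function gives you the log-probability of a text under your model; it is not a real API.

```python
import random

def order_swap_score(examples, sequence_logprob, n_shuffles=10, seed=0):
    """
    Contamination heuristic: if the model strongly prefers the dataset's
    canonical ordering over random orderings of the same examples, the test
    set (in that order) was plausibly seen during pretraining.
    """
    rng = random.Random(seed)
    canonical = sequence_logprob("\n\n".join(examples))
    shuffled_scores = []
    for _ in range(n_shuffles):
        perm = examples[:]
        rng.shuffle(perm)
        shuffled_scores.append(sequence_logprob("\n\n".join(perm)))
    # A large positive gap is suspicious; near zero is what you'd expect from
    # a model that never saw the dataset in this particular order.
    return canonical - sum(shuffled_scores) / len(shuffled_scores)
```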
speaker 1: Any questions here? Okay. So another issue is that there's really a monoculture of NLP benchmarking. What I mean by this is mostly the fact that we all just look at English. This is a paper from 2021 or 2022, I think, that looks at ACL 2021 — probably the main conference in NLP — and at the best papers, the oral papers. They found that out of the 461 papers, 70% of them only look at English and 40% of them only look at accuracy — so essentially just performance. So there are very few papers that look at multilinguality, or even at efficiency, interpretability, or fairness. And there's a similar paper that analyzes another conference in 2008, with essentially the same finding — so unfortunately, it doesn't seem to improve over time. The thing is, there are actually a lot of benchmarks for multilinguality. I just highlighted a few here — MEGA, GlobalBench, XTREME. Those have at least 30 or 40 languages and many, many different tasks. So it's not that we don't have the benchmarks; it's that there are, unfortunately, no incentives in academia to actually evaluate on them. So if you have the chance, use those benchmarks. Another issue is that we reduce everything to a single metric. I already told you that the way we aggregate metrics is usually kind of broken in some of these super-benchmarks. But also, we only look at performance — and in the real world we really care about computational efficiency too, we care about biases, and we care about many other aspects, and most of these benchmarks don't consider those. Another part is that we usually average across every example; we say that every example has the same value, essentially the same weight. That's definitely unfair to minoritized groups. But more than that, if you think about agents, for example, where one example might be how well the model writes code that will actually be put in production, versus just answering your daily question about where to buy the best burger — the value you get out of these examples is very different, and right now, when we evaluate, we don't consider that. So that's, I think, a real issue. And we also don't take into account that different people have different preferences. So, a few shout-outs. One, considering computational efficiency: MLPerf has a great benchmark where, instead of trying to maximize performance on a certain benchmark, they say: I want to achieve that performance in the least amount of time. So now you consider both accuracy and speed, either for training or for inference. For biases, DiscrimEval is a good dataset from Anthropic where they have templates, asking questions like whether someone should keep their insurance or not. They have templates where they change the race or the gender of the person, and they see how the decisions made by the model change. And, unfortunately but unsurprisingly, you see that some groups are much more discriminated against than others. Other biases in our evaluations: I already told you a bit about multilingual issues, but honestly, this issue about English is much more prevalent than you would think.
For example, BLEU and ROUGE scores really assume that you have access to words — that you know how to tokenize and get words. I used to work with Thai and Vietnamese: with Vietnamese you have spaces between syllables rather than words, and with Thai you have no spaces between words at all, so you have no idea how to even run BLEU or ROUGE, really. It's much more than just the data — all our algorithms are really focused on English, or at least Western languages. And then biased LLM-based evaluation. One thing I told you is that it's really cool that you can now essentially use GPT-4 for labeling. But that also means that, given that GPT-4 is very consistent, if it has some biases, then most of the NLP community will essentially have those biases scaled up. So one benchmark tries to look at whose opinions LLMs reflect by default — this is actually pretty cool work that looks at the output distribution of LLMs on public opinion surveys, trying to understand which groups' opinions LLMs reflect. And they find that after pretraining only, the models are actually relatively fine — they're not too aligned with any single group. But after fine-tuning — this is in red — you see that the models really start being optimized toward certain preferences, which is unsurprising, because that's how we actually train them. And typically these models mostly answer as if they were white or Southeast Asian respondents. I think the Southeast Asian part is actually pretty interesting — probably because a lot of the human data used for supervised fine-tuning and for RLHF was labeled by people in Southeast Asia, which would explain why these models have these types of views — and usually also highly educated. Okay. So this is the main challenge, the challenge of all challenges. We saw that there are many challenges in evaluation, at least in academic benchmarking, but the biggest one is that there's really no incentive for us to move to anything else. This is actually a pretty interesting paper that looks at many machine translation papers from 2019 to 2020, and they found that 82% of papers only evaluated BLEU scores. And as we said, BLEU scores have many, many issues. We know there are many better metrics, but people are still not incentivized to look at anything else — and actually, reviewers will usually ask you to show BLEU scores. So it's not just that you're not incentivized to look at something else; you're incentivized to continue. And it kind of makes sense, because you want to be able to compare to methods from two or three years ago. But it also means that it's hard for the academic field to switch to other benchmarks. This is really specific to academia, though. In reality, if you know your metric is bad, just switch. Okay, evaluation takeaways. First, I mentioned that there are different types of evaluation with different desired properties. Then I talked about close-ended tasks and how you evaluate those — the fact that it's basically standard machine learning, but that even so, you have to think carefully about how you evaluate them.
Then there are open-ended tasks, where you typically look at content-overlap metrics — things like BLEU and ROUGE — and BERTScore. Then you have chatbot evaluations, which are extremely difficult, but which people have started doing with, essentially, LLM-based evaluations. And then we talked about challenges: one of them being consistency, another contamination, and a third biases. In reality, honestly, the best evaluation is just to check your outputs. I think too many people just believe numbers — never just believe numbers. I remember when we initially did Alpaca: we sort of believed the eval, but it was once we started playing with it that we went, okay, this thing is actually — I mean, at that time — good. Now it would be a pretty bad model, but at that time we said, this thing is actually pretty good, we should do something about it, even though on standard academic benchmarks it was maybe pretty bad. So yeah, don't rely only on numbers. And I'm happy — what time is it? — to take any other questions that you may have.
speaker 2: A question about this whole issue of bias, which we're really trying to deal with but are sweeping under the rug here. If we have a problem in a very specialized domain, and we go and run reference-free evals using, let's say, GPT-4 — is it considered bad practice to check a subset of these GPT-4 evals, marking them up ourselves, and inserting our own bias into the process by actually looking at many, many data points?
speaker 1: So just to make sure I understand your question: you're saying that if we look at the answers ourselves, we might be incorporating
speaker 2: some biases there. Yes — but we should look at the answers to make sure that GPT-4 isn't being biased when it looks at them. There's this tension here, and I don't know — in a controlled psych experiment you would blind yourself to these answers. How do you deal with this?
speaker 1: Yeah, that's a good question. I actually don't quite know. But one thing: I feel less concerned about the biases of a single person. My issue with the GPT-4 biases is that they're the same across every model, so things really scale up and it becomes a monoculture. And I think that's much worse than if everyone incorporates a little bit of their own bias in their own direction. I'm not saying that's the best answer, but I think it's slightly better than just going with whatever biases the model has.
speaker 2: Yeah, following up on that: how does one avoid the situation where one is trying to solve a problem with a model, one evaluates it with ChatGPT or GPT-4, one starts looking at it and says, okay, this is good — this is great — and everyone else in the world, and GPT-4, thinks it's a terrible, terrible model, and it's just some academic pressuring themselves into publishing something that doesn't actually work? How does the field structurally avoid situations like that?
speaker 1: Well, I think that's one reason why people want standardized benchmarks, and why every reviewer actually wants standardized benchmarks: at least, even though everyone knows they're wrong, people understand how they are wrong. So that's one perspective. Another thing — which doesn't completely answer your question, but I think could be a potential solution — is that the way I view GPT-4 is as something that is really good at performing what I want it to perform right now. The thing is, I'm not very specific about what I want it to perform, and as a result it will come in with its own biases, which come from its pretraining data or fine-tuning data. A potentially better way of doing it is that I could write down exactly what I want. Right now, when we prompt GPT-4, we basically ask a simple question, like: how good is this summary, out of five? A much better way would probably be to write a very detailed rubric of everything that has to be in the answer for it to be a good answer. And if you think about it, that's exactly what professors do when they grade for a class: they say, okay, Yann is an okay TA, but I cannot trust him blindly, so I will write a very detailed rubric, and I trust that he can apply that rubric. I think that's also how we should be thinking about GPT-4. This is not how we currently do it. Any other questions?
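To make the rubric idea from that last answer concrete, here is a hedged sketch of what a rubric-style judge prompt could look like. The rubric items are illustrative, and `query_llm` is a placeholder for whichever judge model you call, not a real API:

```python
RUBRIC_JUDGE_PROMPT = """You are grading a summary against an explicit rubric.

Rubric (award 0 or 1 per item):
1. States the main finding of the source text.
2. Mentions who did the work and on what data.
3. Contains no claims absent from the source text.
4. Is at most three sentences long.
5. Is fluent, grammatical English.

Source text:
{source}

Summary to grade:
{summary}

Return one line per rubric item as "item: 0 or 1 - short justification",
then a final line "total: <0-5>".
"""

def rubric_score(source: str, summary: str, query_llm) -> str:
    """Fill in the rubric prompt and send it to whatever judge model you use."""
    return query_llm(RUBRIC_JUDGE_PROMPT.format(source=source, summary=summary))
```

The point of the design is simply that the criteria live in the prompt, written by you, rather than being left implicit in the judge model's own preferences.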

Latest Summary (Detailed Summary)

Generated 2025-05-16 20:59

Executive Summary

This lecture, given by Stanford PhD student Yann Dubois, takes a deep look at the importance, methods, challenges, and current practice of benchmarking and evaluation in natural language processing (NLP). It stresses that performance evaluation matters at every stage of model development (training, development, model selection, deployment) and in academic publishing, but that each stage demands different properties from the evaluation: training needs metrics that are fast, cheap, and differentiable; deployment needs metrics that are trustworthy, task-specific, and absolute; academic publishing needs metrics that are reproducible and standardized.

The lecture divides NLP evaluation into two broad categories. Close-ended tasks (text classification), such as sentiment analysis and textual entailment, are evaluated much like standard machine learning (accuracy, precision, recall, F1, and so on), though one must watch for dataset biases and spurious correlations (such as negation words in SNLI). Open-ended tasks (text generation), such as summarization, translation, and instruction following (chatbots), are harder to evaluate. Methods include content-overlap metrics (BLEU, ROUGE, which lack semantic understanding), model-based metrics (BERTScore, BLEURT), and human evaluation, which is treated as the gold standard. Human evaluation, however, faces many challenges: high cost, slowness, inter- and intra-annotator inconsistency (researcher agreement was only 67%), lack of reproducibility (only 5% reproducible), and misaligned incentives.

Current trends in evaluating large language models (LLMs) include multi-task benchmarks such as MMLU (accuracy rose from 25% to roughly 90%), code generation (easy to evaluate via test cases), agent evaluation (requiring sandboxed environments), and arena-style evaluation (such as Chatbot Arena). An important development is using LLMs (such as GPT-4) as evaluators (as in AlpacaEval), which is far faster and cheaper than human evaluation (about 100x on both counts) and correlates strongly with human judgments (AlpacaEval correlates with Chatbot Arena at 98%), though their own biases (length preference, self-preference) must be kept in mind.

The lecture closes with pervasive problems in current evaluation practice: consistency issues (different MMLU implementations yield very different scores), data contamination (models pretrained on test sets), overfitting, a monolingual (English) and single-metric (accuracy) bias (70% of ACL papers evaluate only English; 82% of MT papers use only BLEU), neglect of efficiency and fairness, and a lack of incentives in academia to improve evaluation. The speaker's closing advice: "the best evaluation is just to check your outputs" — never trust numbers blindly.

Different Reasons for Measuring Performance

Speaker 1 (Yann Dubois) points out that the purpose of measuring performance, and the properties required of the metrics, differ across the stages of a machine learning model's life cycle and in academic research.

  • Evaluation needs across the model development pipeline:
    1. Training:
      • A loss function is needed to guide optimization.
      • Metrics must be: super fast, super cheap, and differentiable.
      • The model must not be able to exploit shortcuts that optimize the loss without achieving the real goal.
    2. Development:
      • For example hyperparameter tuning and early stopping.
      • Metrics must be: fast and cheap, and shortcut-resistant.
    3. Model selection:
      • Choosing the best model for a specific task.
      • Metrics can be: somewhat slower and more expensive, but still need to be run many times.
    4. Deployment:
      • Deciding whether the model meets the bar for production.
      • Metrics must be: trustworthy, task-specific, and absolute (not merely relative comparisons).
      • Speaker 1 stresses: "you need to know whether your model is good enough to put in production."
  • Evaluation needs for academic publishing:
    • Evaluate models on standard benchmarks so results can be communicated and compared across groups.
    • Metrics must be: reproducible and standardized.
    • They should be easy to work with — given researchers' limited resources, fast and cheap.
    • Speaker 1's view: an imperfect metric is acceptable for academic benchmarks as long as it steers the field in the right direction over several years. "At the meta level, if we use rough metrics in academia, that's fine as long as they show the field moving in the right direction over a decade."
    • Benchmarks must balance difficulty: too hard and every method performs at random; too easy and the baselines are already near the ceiling — either way, progress cannot be measured.

Benchmarks in Academia

Speaker 1 regards academic benchmarks as key drivers of progress in the field.

  • The MMLU (Massive Multitask Language Understanding) benchmark as an example:
    • One of the most standard benchmarks today.
    • Over roughly the past four years, model accuracy on it has risen from about 25% (random, one in four) to about 90%.
    • This shows that benchmarks really do drive progress in the field.
    • The macro view matters: what counts is not tiny score differences, but ensuring that over the long run the top-ranked models really are better than earlier ones, even if the benchmark itself is imperfect.

Text Classification / Close-ended Evaluation

Speaker 1 defines close-ended tasks and how they are evaluated.

  • Definition: tasks with a limited number of possible answers (usually fewer than 10), typically with one or a few correct answers.
  • Evaluation methods:
    • Standard machine learning territory: accuracy, precision, recall, F1, ROC curves, AUC, and so on.
    • Speaker 1 suggests that listeners unfamiliar with these metrics consult relevant courses (he mentioned the CS224 lectures by Professor Chris Piech) or the scikit-learn documentation.
  • Typical close-ended tasks and benchmarks:
    • Sentiment analysis: usually binary classification (positive/negative).
      • Benchmarks: IMDb, SST (Stanford Sentiment Treebank).
    • Entailment: deciding whether a hypothesis can be inferred from a premise.
      • Benchmark: SNLI (Stanford Natural Language Inference).
    • Part-of-speech tagging:
      • Benchmark: Penn Treebank.
    • Named entity recognition:
      • Benchmark: CoNLL.
    • Coreference resolution: deciding which noun a pronoun refers to; a challenging NLP task.
    • Question answering: answering questions given a text.
  • A multi-task benchmark — SuperGLUE:
    • A collection of close-ended tasks (e.g., BoolQ — yes/no QA, CoLA — grammatical acceptability, RTE — entailment, WiC — word sense disambiguation).
    • General language ability is measured by averaging performance across tasks.
    • Speaker 1 notes that this practice (averaging metrics with different units) is problematic, calling it "a very bad thing to do, but it is what people do," and recalls a benchmark where a lower-is-better metric was averaged in anyway.
  • Challenges and caveats for close-ended evaluation:
    1. Metric choice matters (see the sketch after this list):
      • Spam classification example: if 90% of emails are not spam, always predicting "not spam" reaches 90% accuracy, but the model is useless. Precision, recall, and F1 are what matter here.
    2. Metric aggregation: simply averaging different metric types (accuracy, F1, correlation coefficients) as in SuperGLUE is questionable.
    3. Where do those labels come from? How labels were collected can introduce problems.
    4. Spurious correlations:
      • The SNLI case: a 2019 paper found that models could score well on SNLI using only the hypothesis (never looking at the premise), because humans writing "not entailed" hypotheses tend to add negation words. Models can learn that shortcut.
      • Speaker 1's warning: "Even though this is standard machine learning, be very careful about which metric you use and where the labels come from. Don't assume that if there were a problem, someone would already have found it."
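A small sketch of the spam example above: a classifier that always predicts "not spam" reaches 90% accuracy on a 90/10 split, while its precision, recall, and F1 on the spam class are zero. The metrics are hand-rolled for illustration; in practice one would reach for scikit-learn.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# 90% of mails are not spam (0); the "model" always predicts not spam.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                             # 0.9 -- looks great
print(precision_recall_f1(y_true, y_pred))  # (0.0, 0.0, 0.0) -- the model is useless
```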

Text Generation / Open-ended Evaluation

Speaker 1 focuses on evaluating open-ended tasks, since they are the more distinctively NLP problem.

  • Definition: many possible correct answers that cannot all be enumerated, and correctness comes in degrees (a continuum rather than right/wrong).
  • Typical open-ended tasks and benchmarks:
    • Summarization: shortening a long text.
      • Benchmark: CNN/DailyMail (the bullet points at the top of news articles serve as "gold" summaries).
    • Translation: converting text from one language to another.
    • Instruction following: e.g., chatbots (ChatGPT), viewed as the "universal task" that subsumes classification, summarization, and many other subtasks. Extremely hard to evaluate.
  • Types of evaluation for open-ended tasks:
    1. Content-overlap metrics:
      • Compare word or phrase overlap between the generated text and a (human-written) reference.
      • Fast and efficient, based on n-gram overlap.
      • BLEU (Bilingual Evaluation Understudy): precision-oriented, common for translation; penalizes overly short generations.
      • ROUGE (Recall-Oriented Understudy for Gisting Evaluation): recall-oriented, common for summarization.
      • Problems (see the sketch after this list):
        • No semantic understanding: synonyms and paraphrases are missed. For the reference "Heck yes!", generating "Yes" (BLEU 67%) can score higher than "You know it!" (lower BLEU), while "Yep" (BLEU 0%) means the same thing.
        • False positives: "Heck no!" can score high because of lexical overlap despite meaning the opposite.
    2. Model-based metrics:
      • Word embeddings: compare embedding similarity between generation and reference (e.g., cosine similarity of averaged embeddings).
      • BERTScore: uses contextual embeddings from a pretrained model such as BERT; usually better than plain word embeddings.
      • BLEURT: a learned metric. Start from BERT pretraining, continue pretraining to predict metrics such as BLEU, then fine-tune on human-annotated evaluation data.
        • Speaker 2 asked whether pretraining on BLEU would reproduce BLEU's problems; Speaker 1 replied that BLEURT's continued-pretraining stage uses BLEU and BERTScore as unsupervised targets because many sequence pairs have no human annotations.
      • Dependence on reference quality:
        • A study of news summarization showed that ROUGE-L correlates poorly with human judgments when the articles' own bullet points are used as references, but much better when expert-written summaries are used — in other words, "reference quality is often low."
    3. Reference-free evaluation:
      • No human-written reference needed.
      • Earlier approaches scored input/output pairs directly with models like BERT; these did not work well.
      • The current trend: use a large language model (such as GPT-4) as the evaluator. Given the input and the model-generated summary, simply ask GPT-4 how good it is. This works surprisingly well.
      • Common benchmarks: AlpacaEval, MT-Bench.
    4. Human evaluation:
      • Treated as the gold standard for open-ended tasks and as the reference point for developing new automatic metrics.
      • Raters are usually asked to judge several dimensions (fluency, coherence, commonsense, style, grammar, redundancy).
      • Important caveat: never compare human evaluation numbers across studies — the raters, criteria, and prompts all differ.
      • Challenges and problems:
        • Slow.
        • Expensive (especially in academia).
        • Inter-annotator disagreement: even after detailed discussion and written guidelines, raters disagree. In the AlpacaFarm project, five researchers agreed only 67% of the time on which model output they preferred (50% is chance), after three hours of discussion and rule-writing.
        • Intra-annotator disagreement: the same rater can judge differently at different times (e.g., before vs. after lunch).
        • Not reproducible: an analysis of 128 papers from 2015-2020 found that only 5% of human evaluations were described in enough detail to be repeated.
        • Measures precision only, not recall: raters can only judge the specific outputs the model produced, not all the other good outputs it could have produced.
        • Misaligned incentives: crowdworkers aim to maximize their hourly wage and may take shortcuts rather than give the highest-quality judgments. AlpacaFarm paid 1.5x minimum wage, yet workers finished 2-3x faster than the researchers, which can degrade quality (e.g., preferring longer answers).
        • Complicated setup: task descriptions, presentation order (left/right position matters), metric choice, annotator screening and ongoing monitoring (e.g., via "canary" examples with known answers).
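A toy illustration of the overlap problem described above, using plain unigram precision rather than full BLEU (which also has higher-order n-gram and brevity terms): paraphrases get zero overlap with the reference, while an opposite-meaning answer scores high.

```python
def unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate tokens that also appear in the reference."""
    cand, ref = candidate.lower().split(), set(reference.lower().split())
    return sum(tok in ref for tok in cand) / len(cand)

reference = "Heck yes"
for candidate in ["Yes", "You know it", "Yep", "Heck no"]:
    print(f"{candidate!r}: {unigram_precision(candidate, reference):.2f}")
# 'Yes': 1.00          -- fine
# 'You know it': 0.00  -- same meaning, zero credit
# 'Yep': 0.00          -- same meaning, zero credit
# 'Heck no': 0.50      -- opposite meaning, half credit
```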

Evaluating Current Large Language Models (LLMs)

Speaker 1 surveys the main methods and benchmarks currently used to evaluate LLMs.

  • Chatbot Arena:
    • Currently one of the most popular human evaluations of LLMs.
    • Users interact with two anonymous models and then pick the one they prefer.
    • After collecting a large number of human votes (e.g., 200,000), models are ranked with an Elo rating system, as in chess (see the sketch after this list).
    • Problems: random users and questions may not be representative (though volume helps); it requires substantial community effort and time; new or little-known models struggle to attract enough attention and votes. Not usable during model development.
  • Using LLMs as evaluators (LLM-based evaluation):
    • For example, using GPT-4 to judge which of two models (say GPT-3.5 vs. Mistral) produced the better output.
    • AlpacaEval:
      • Built by Speaker 1's group, originally to get a reliable development-set evaluation for Alpaca fine-tuning.
      • Result: roughly 100x faster and 100x cheaper than human evaluation.
      • Surprising finding: GPT-4 agrees with human preferences more often than human annotators agree with each other. Humans are noisy (high variance within and across annotators) while the model's predictions are stable (low variance), even though the model may carry systematic biases.
      • AlpacaEval workflow: given an instruction, model A and model B each produce an output; GPT-4 judges which it prefers; a length-bias correction (reweighting) is applied; results are averaged into a win rate.
      • Correlation with Chatbot Arena's Elo ranking: 98%.
      • The length-bias problem: without correction, simply prompting for "more detailed" answers raises the AlpacaEval win rate from 50% to 64.3%, while "more concise" lowers it to 22.9%. The reweighted correction mitigates this.
    • Self-preference (self-bias): an LLM judge may favor itself or related models, but Speaker 1 finds the effect "not as bad as you might think" — different LLM judges produce largely the same ranking, even though the absolute scores differ.
    • Common benchmarks: AlpacaEval, MT-Bench.
  • The three main ways current LLMs are evaluated:
    1. Perplexity: based on training or validation loss.
      • Highly correlated with downstream task performance; many developers look only at perplexity.
      • Caveat: perplexities are not comparable across datasets or across tokenizers.
    2. Averaging over everything:
      • e.g., HELM (Holistic Evaluation of Language Models), the Hugging Face Open LLM Leaderboard.
      • Aggregates results from a large number of automatically evaluated benchmarks.
      • Common sub-benchmarks:
        • Math reasoning: GSM8K (grade-school math problems).
        • Multi-task QA: MMLU (multiple-choice questions across 57 subjects such as formal logic, physics, and economics). Speaker 1 notes that Mark Zuckerberg cited MMLU scores when announcing Llama 3.
        • Legal: LegalBench.
        • Medical: MedQA (medical licensing exams).
    3. Arena-style comparisons: e.g., Chatbot Arena — let the users decide.
  • Other important evaluation dimensions:
    • Code generation (coding):
      • Common benchmark: HumanEval.
      • Relatively easy to evaluate (via unit tests).
      • Coding ability usually correlates with a model's reasoning ability.
    • Agents:
      • Models calling APIs, controlling a computer, and so on.
      • Very hard to evaluate; the key requirement is sandboxed environments for safety, especially when the agent needs access to real systems (terminal, email, Slack).
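A minimal sketch of the Elo bookkeeping behind arena-style rankings referenced above: every human vote is treated as one "game" between two models, and ratings are nudged toward the observed outcomes. This is the standard chess-style online update; real leaderboards typically use a more careful statistical fit over all votes rather than this incremental rule.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(ratings, winner: str, loser: str, k: float = 32.0) -> None:
    """Nudge both ratings after one pairwise human vote."""
    e_w = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1 - e_w)
    ratings[loser] -= k * (1 - e_w)

ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
for winner, loser in votes:
    update_elo(ratings, winner, loser)
print(sorted(ratings.items(), key=lambda kv: -kv[1]))  # model_a ends up on top
```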

Issues and Challenges with Evaluations

Speaker 1 lists the many difficulties facing current NLP evaluation practice.

  1. Consistency issues:
    • The MMLU case: even for simple multiple-choice questions, small changes in the setup (option labels A/B/C/D vs. random symbols, different prompts, generating the option letter vs. computing the log-likelihood of each option) cause large changes in scores and rankings. For nearly a year there were three mainstream MMLU implementations giving different scores, yet researchers compared across them. For example, Llama 65B scored 63.7% on HELM's MMLU and 63.6% with the original implementation, but only 48.8% on the Harness (used by Hugging Face).
  2. Contamination:
    • Models may have seen test data during pretraining.
    • Example: a researcher found GPT-4 solved 10/10 pre-2021 Codeforces problems but 0/10 newer ones — strong evidence of contamination. Similar concerns were raised about Microsoft's Phi-1.5.
    • For closed-source models, contamination is nearly impossible to verify because the pretraining data is inaccessible.
  3. Overfitting issues:
    • Models reach "human-level" on popular benchmarks very quickly, possibly due to contamination, possibly because many researchers tune hyperparameters against these test sets.
  4. Mitigations for contamination and overfitting:
    • Private test sets: e.g., GSM1K (a re-collected version of GSM8K); open-source models usually do worse on the new set than the old one, while closed models (Claude, GPT-4) stay relatively stable.
    • Dynamic test sets: refreshing test data regularly, as in Dynabench and Chatbot Arena.
    • Detection methods:
      • Compare the model's confidence in different answers.
      • Shuffle the order of test examples and look for drops in log-likelihood (a model that saw the dataset in order during pretraining is sensitive to the ordering).
  5. The monoculture of NLP benchmarks:
    • English-centric: an analysis of ACL 2021 papers found that 70% evaluated only English.
    • Accuracy-centric: the same analysis found that 40% looked only at accuracy, ignoring efficiency, interpretability, and fairness.
    • Many multilingual benchmarks exist (e.g., XTREME, MEGA, and other global benchmarks the speaker cited, typically covering at least 30-40 languages and many tasks), but academia has little incentive to use them.
  6. Reduction to a single metric:
    • Most benchmarks look only at performance, ignoring computational efficiency and bias.
    • Every example is weighted equally, which is unfair to minoritized groups and ignores that different examples have very different real-world value (e.g., generating production code vs. answering an everyday question).
    • Different users' different preferences are not taken into account.
  7. Biases in our evaluations:
    • Computational efficiency: MLPerf is a good benchmark that measures time to reach a target performance.
    • Fairness and bias: DiscrimEval (from Anthropic) uses templates (varying race, gender, etc.) to test whether model decisions are biased. The results show that some groups are indeed discriminated against.
    • Multilingual issues: metrics like BLEU/ROUGE assume clear word boundaries (whitespace tokenization) and break for languages such as Thai and Vietnamese; the algorithms themselves are designed around English and other Western languages (see the sketch after this list).
    • Bias in LLM-based evaluation: judges like GPT-4 carry their own biases, and widespread use amplifies them. Research shows that LLM opinions (especially after fine-tuning) tend to reflect particular groups (e.g., white, Southeast Asian, highly educated respondents), likely reflecting where the annotation data came from.
  8. The core challenge: no incentive to change:
    • Despite the known problems with benchmarks like BLEU, they remain dominant: one study found that 82% of machine translation papers from 2019-2020 evaluated only BLEU scores.
    • Researchers and reviewers stick with old metrics for comparability with prior work, which slows the adoption of better evaluations.
    • Speaker 1: "This is really specific to academia. In the real world, if you know your metric is bad, just switch."
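To make the tokenization point from the biases item above concrete: word-overlap metrics implicitly assume you can split text on whitespace, which silently breaks for Thai (no spaces between words) and behaves oddly for Vietnamese (spaces delimit syllables, not necessarily words). The Thai and Vietnamese strings below are simple example sentences meaning roughly "I like to eat rice"; the point is only what `split()` does to them.

```python
english = "I like to eat rice"
thai = "ฉันชอบกินข้าว"           # same sentence, written with no spaces between words
vietnamese = "Tôi thích ăn cơm"  # spaces separate syllables, not necessarily words

print(english.split())     # ['I', 'like', 'to', 'eat', 'rice']  -> 5 "words"
print(thai.split())        # ['ฉันชอบกินข้าว']                     -> 1 "word"
print(vietnamese.split())  # ['Tôi', 'thích', 'ăn', 'cơm']         -> syllable pieces
# Any BLEU/ROUGE-style overlap computed on these "words" means very
# different things across the three languages.
```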

Conclusions and Takeaways (Evaluation Takeaways)

  • A recap of the evaluation types covered (close-ended, open-ended, LLM evaluation), their desired properties, and their challenges (consistency, contamination, bias).
  • Speaker 1's final advice: "the best evaluation is just to check your outputs." Don't trust numbers blindly; hands-on use and inspection of model behavior are essential. Early in the Alpaca project, the team saw the model's potential by actually using it, even though its scores on standard academic benchmarks were unremarkable.

Speaker 2 raised several clarifying questions during the lecture: whether BLEURT's pretraining on BLEU inherits BLEU's problems; how to automatically validate short-answer QA (Speaker 1 was unsure for that specific dataset, and another audience member added that HotpotQA and similar datasets use F1 and exact match); and how to handle one's own bias versus GPT-4's bias when using GPT-4 as an evaluator in a specialized domain. Speaker 1 argued that individual bias is less worrying than the systematic, scaled-up bias of a single shared judge like GPT-4, and suggested giving GPT-4 a detailed grading rubric — the way a professor gives a TA a grading rubric — rather than asking for a free-form judgment.