speaker 1: Great. So I think let's get started, because we have a lot to cover today. My name is Yann. For those who don't know me, I'm a third-year PhD student advised by Tatsu and Percy, and today I'll be talking about benchmarking and evaluation. Benchmarking and evaluation are honestly something that I think not enough people look at in academia. But if you really want to put something in production, and you really care about, let's say, real-world machine learning, evaluation is really key. So let's talk about that.

An overview of what we'll cover: first, different reasons for measuring performance. Then text classification and how you measure performance there, then text generation and how you measure performance there. And finally, how do you evaluate current large language models, and some issues and challenges with the way we actually perform evaluations.

Okay. My mental model of how you actually develop a machine learning model is this. At first you will be training your model. Here, measuring performance is really key because you need a loss, and you need to know what to optimize. Once you are optimizing your loss, the second step is basically development. Usually this is hyperparameter tuning, or, for example, early stopping: if you see during training that your model is not performing that well, or that some overfitting is happening, you might decide to stop, or to change the learning rate during training. So development is the second step, and here you need to measure performance because you need to know how to do hyperparameter tuning and how to change hyperparameters. The third step is essentially model selection: if I have a task that I really care about, which model performs best for my task? That might be a model I have trained, or a model that another group has trained. And finally, at least in the real world, you would decide to deploy your model. Here, measuring performance is really key because you need to know whether your model is good enough to put in production. In the parallel universe that we live in, there's also publishing: you basically need to evaluate your model on standard benchmarks, and the reason we do that is essentially to communicate the quality of our model to different groups.

So at every step of this pipeline you really need to measure performance, and that's what we will talk about today. But what is key to understand is that at different steps, you need to measure performance in different ways. There is really not a single ideal way of measuring performance. For example, on the left, when you train your model, you need a way of measuring performance that is super fast, super cheap, and differentiable, because with neural networks you're basically backpropagating through the loss, so it needs to be differentiable. And you really cannot allow shortcuts where your model optimizes the loss even though that's not what you actually want to optimize. As you move more to the right, you will measure performance less and less often.
So it's fine if it's more expensive, but you really need your evaluation metrics to be higher quality, because the stakes are higher once you put a model in production. During the development stage, you need a way of measuring performance that is fast, cheap, and also avoids shortcuts, because when you do hyperparameter tuning you're essentially also optimizing over a certain objective. Model selection can be a little bit less fast and less cheap, but you will still have to do it many times. And most importantly, when you deploy your model, you really want the way you evaluate performance to be trustworthy, because once you put something in production, there's no way to go back and undo what happened while it was in production. You also want things to be very task-specific: if I care about a certain task when I put my model in production, you really need to evaluate on that specific task; I don't care about other tasks. And finally, you need your metrics to be absolute. The reason I highlight that is that in the three other steps you really just care about comparing between things, which is very different from wanting a threshold that says: if I have less than 95% accuracy, I'm not putting my model in production.

Okay. Now let's talk about publishing. This is a little bit different from evaluation in the real world. When you do academic benchmarking and evaluate your models on academic benchmarks, you want the benchmark to be reproducible and standardized. The reason is that for the next five, six, or ten years, everyone will be evaluated on that one benchmark, and you want papers in three years to be comparable to yours. So it's really important that your evaluations are reproducible — honestly, you don't really care about that as much in the real world. You also want things to be easy to work with, because researchers usually don't want to do more work than they need to, and they usually don't have that many resources, so it needs to be fast and cheap. And one thing I really want to highlight: for the academic benchmarks we usually have, it's fine if the metrics we use are not perfect, because what really matters is the direction the metric points over ten years — basically how the field is moving. If the metric says things are getting better over ten years, what matters is that the field really has made some progress. So at a meta level, it's fine if we use crude metrics in academia. You also need to balance difficulty and simplicity. What I mean is that if your benchmark is way too complicated, then all methods will have essentially random performance, and no one will use your benchmark. And if your benchmark is too simple, then the baseline will be so good that no one will use it either, because no one can beat the baseline. This is really something specific to academia: in the real world, you're not going to be able to change the task — you're judged on how good your model is. That's why I want to highlight this: people usually just talk about "evaluation", but there are really different ways of evaluating and different reasons why we evaluate. Does that all make sense? Also, feel free to ask questions. 
speaker 2: Great. Okay. 
speaker 1: So, benchmarks in academia. This is really how we drive the field. This is the MMLU benchmark — I think Archit briefly mentioned it, and I'll talk about it again later — which is the most standard benchmark right now. You can see that in the last four-ish years, it has gone from 25% accuracy, which is essentially random because it's multiple choice with four choices, to around 90-ish percent accuracy. So yeah, benchmarking is really what drives progress in the field. And again, you see what I meant before: it's not really the small differences between points that matter, at least in academia. You have to take a step back and think about how models will perform over ten years, and make sure that the model on the top right here is better than the model on the bottom left, even if the benchmark is not perfect. And I think MMLU is a pretty good one in that sense.

Okay. So there are two main types of tasks in NLP, at least classically. First, closed-ended tasks, which I'll talk about first. Essentially, think about classification, where you know exactly the correct label for the task you're performing. Here, this is the IMDb dataset, where you're asked to say whether a sentence has positive or negative sentiment. The text is "Read the book, forget the movie" — so this is sentiment classification of movie reviews, and here it's basically negative. And then there's open-ended evaluation. Think about ChatGPT: how do you evaluate something like that, where there's really no single correct answer — there are many possible correct answers, and they all have different qualities. So we're going to distinguish between those two.

So, closed-ended evaluation. As I just said, a closed-ended task is basically defined as a task where there's a limited number of potential answers — think fewer than ten — and often there's just one, or maybe a few, correct possible answers. This really is standard machine learning: if you think about standard classification, you can just use accuracy, or look at your precision and recall. There's nothing special about NLP here. That is not to say that it's simple; it's just that there's nothing NLP-specific about it. Some closed-ended tasks: I already mentioned sentiment analysis, which is usually a binary classification task where you just have to say whether the sentiment is positive or negative. Another task is entailment. For sentiment analysis, the typical benchmarks — I always put them next to the task — are IMDb and SST from Stanford. For entailment it's SNLI, also from Stanford, where you have some text — here, "A soccer game with multiple males playing" — and a hypothesis — "Some men are playing a sport" — and you have to say whether the hypothesis is implied, or entailed, by the text. Here it is. Other tasks: part-of-speech tagging, with the Penn Treebank as the typical benchmark, and named entity recognition, with the CoNLL benchmark. A few other tasks — you don't need to know all of them, but just to give you a brief overview — coreference resolution. This is actually a pretty challenging NLP task where you have to say which pronoun refers to which noun. You have the sentence "Mark told Pete many lies about himself, which Pete included in his book. He should have been more truthful." And now you have to say what "he" refers to — whether it refers to Pete or to Mark.
And then there's question answering, where you basically have a long text and a question, and you're supposed to provide an answer based on the text you were given. So those are some examples of closed-ended tasks. And again, the key here is that the way we evaluate them is just standard machine learning: you can look at accuracy, precision, recall, F1 scores. Hopefully you all know about these kinds of metrics, but if you don't, you should look at Chris Piech's class — I think it's CS 224 — his lecture is online and it's actually really good on the different metrics.

The way people evaluate some of these benchmarks is usually by looking at many of them concurrently. The most common super- or multi-task benchmark is called SuperGLUE. Here, in the columns, you have all the different tasks in SuperGLUE — I think there are eight or nine — and then you just look at the average performance across these benchmarks and you get a ranking from that. This is an attempt to measure general language capabilities. It's what people used to do, I would say, until maybe two years ago; I'll tell you about what people do now around the end of the lecture. But yeah, SuperGLUE is definitely something you should at least be aware of. Some examples of tasks in SuperGLUE: one is BoolQ, where you simply have some text and a question, and you have to say whether the answer is yes or no. That's very easy to evaluate — you just look at accuracy or precision and recall. Entailment we already talked about. Then there are others like coreference resolution, which we also talked about, and meaning of words, where you have two sentences containing the same word and you have to say whether it actually means the same thing in both. For example, "bank" could mean a river bank or a bank for money, and you have to say whether the two sentences refer to the same concept. And there are some question answering tasks too. So that's SuperGLUE. Are there any questions? Cool.

So again, although I've said many times that this is essentially just classical machine learning, I want to emphasize that it doesn't mean it's simple. You really have to think carefully about what you do when you use these closed-ended tasks. In particular, you're going to have to choose whether you look at accuracy, precision, recall, F1 scores, ROC curves, AUC. If you don't know these names, you should really check out the scikit-learn documentation or the lecture from Chris Piech that I linked above — both are really good. Depending on which metric you choose, you will end up preferring very different kinds of algorithms. The usual example is spam: say you want to classify whether an email is spam or not. Most emails are not spam — thankfully, at least I hope so. Let's say 90% of emails are not spam and only 10% are spam. If you look at accuracy, then a trivial classifier that always predicts the most likely label gets 90% accuracy. If you don't really know your dataset, 90% accuracy sounds good — but in reality it means you're not classifying anything. That's why you want to look at precision, recall, and F1 scores anyway.
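As a minimal illustration of that spam example (a sketch, not from the lecture — the labels are made up and scikit-learn is used for the metrics):

```python
# Why accuracy is misleading on imbalanced data like the spam example above.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# 10 emails: 1 means spam, 0 means not spam (only ~10% are spam).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# A "classifier" that always predicts the majority class (not spam).
y_majority = [0] * 10

print(accuracy_score(y_true, y_majority))                    # 0.9 -- looks great
print(precision_score(y_true, y_majority, zero_division=0))  # 0.0
print(recall_score(y_true, y_majority))                      # 0.0 -- catches no spam
print(f1_score(y_true, y_majority, zero_division=0))         # 0.0
```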
I will not talk too much more about that because, again, it's not specific to NLP — but that doesn't mean it's easy. Another issue is that once you have multiple different tasks, there's the question of how you aggregate these metrics. I just told you that you take the average over all of these columns. Honestly, that is a pretty terrible thing to do, but it's actually what people do. These columns mean very different things: some of them are accuracies, others are F1 scores, others are correlations — and you just average everything. I can't remember which benchmark it was, but a few years ago there was one benchmark where one of the columns was a metric where lower was better, and people still averaged everything, until someone realized it and said: maybe we should put a minus sign there. So yeah, be careful, and don't always assume that what people do in academia — or what people do in general — is correct. You should think a little bit about it yourself.

Now, some other questions I want you to think about. Where do those labels come from? I said there's usually a correct answer, but how you actually get those labels is not obvious — I'll tell you about some issues on the next slide. And related to that, there might be some spurious correlations, which is what we're going to talk about right now. We already talked about SNLI, the entailment task. Here you have the premise, "The economy could still be better", and the hypothesis, "The economy has never been better", and you have to say whether the hypothesis is implied by the premise. What this paper from 2019 found is that all the different models were performing really well — but if you classified based on the hypothesis alone, you could also perform really well. So even if you never looked at the premise, which seems like something you need to take into account because it's part of the task, you could perform well. The reason is that when the humans wrote the hypotheses, they were asked to write a hypothesis that is not entailed by the premise, and the way humans usually do that is by adding a negation. So if you only look at the hypothesis and you see a negation, it's very likely that it's not entailed by the premise. So again, even though this is standard machine learning, be really careful about which metric you use and where the labels come from, and don't just reuse what everyone else does, assuming that if there were an issue, someone would have noticed. So that is spurious correlations. Any questions on closed-ended tasks? Cool.
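As a toy illustration of that hypothesis-only artifact (made-up code, not from the paper): a "classifier" that never reads the premise and just looks for negation words already behaves suspiciously well on examples like the one above.

```python
# Toy sketch: predict "not entailed" whenever the hypothesis contains a
# negation word, ignoring the premise entirely.
NEGATIONS = {"not", "no", "never", "nobody", "nothing"}

def hypothesis_only_guess(premise: str, hypothesis: str) -> str:
    words = set(hypothesis.lower().split())
    return "contradiction/neutral" if NEGATIONS & words else "entailment"

print(hypothesis_only_guess("The economy could still be better.",
                            "The economy has never been better."))
# -> "contradiction/neutral", without ever reading the premise
```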
Okay — open-ended evaluation. I'm going to mostly talk about this, because it's what is specific to NLP. An open-ended task is essentially the opposite of a closed-ended task, which is to say that there are many possible correct answers and you cannot enumerate all of them, so you really can't use standard machine learning metrics anymore. And beyond the fact that you cannot enumerate all the possible answers, there are usually different levels of correctness. If I ask you, or ask ChatGPT, to write a book, it might be a decent book, but there might be a better book that it, or another model, could have written. So it's not just right or wrong — it's a continuum.

Standard examples of open-ended tasks: the two most common ones are summarization and translation. For summarization, you have a long piece of text and you have to summarize it in less than x characters. The standard benchmark is the CNN/DailyMail benchmark. The way they actually collected that dataset is that they took a lot of CNN articles — and you know how at the top of CNN articles you have bullet points that say what the most important things in the article are? They use those as, essentially, the gold summary. So that's the classic one for summarization. For translation, you have sentences in two different languages and you have to translate from one to the other. Those are the classical tasks. What people currently do — I would say the most standard task right now — is instruction following. Instruction following is kind of the mother of all tasks, in the sense that you can view any previous task as just a question you ask a chatbot, basically ChatGPT. Classification? I could just ask ChatGPT to do that. Summarization? I could ask ChatGPT to do that. So you can view a chatbot as the most general type of task: you can ask it to perform any possible task, and it should just provide the answer. This is what we call instruction following. And as you might guess, evaluation is very hard in that setting; how you evaluate something like ChatGPT is what we'll talk about later.

Okay, so, types of evaluation methods for text generation, or open-ended tasks. The classical ones are content overlap metrics, which I'll talk about first — that's really just comparing the words between a reference answer, a gold answer that humans wrote, and the actual generation you got from your model. Then there are model-based metrics, where you basically turn evaluation into machine learning: you train a model to become an evaluator. And then there's human evaluation, which is usually seen as the gold standard for open-ended tasks.

So, content overlap metrics. As I just said, this is really just comparing, word by word or by groups of words, the generated sequence and some reference. Here the generated sequence is "the woman went to the hardware store", and the gold reference — the reference written by humans; I actually don't even know what the task is — is "they walked to the grocery store". What you do is compare the two sentences by looking at the lexical similarity between the two texts. This is super fast and efficient, and the way you usually do it is with n-gram overlap metrics. The simplest possible thing is to check, for every word in the generated sequence, whether it appears in the reference sequence, and if it does, you increment your score. N-grams are essentially the same idea, but instead of looking at single words, you look at bigrams, trigrams — multiple words next to one another. The most common overlap metrics are BLEU and ROUGE. "Bleu" means blue and "rouge" means red — that's not what they stand for, though, and I always forget what they actually stand for. Basically, BLEU is an n-gram overlap metric that tries to measure precision, while ROUGE looks at recall.
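A stripped-down sketch of the n-gram overlap idea (not from the lecture slides) — real BLEU uses clipped n-gram precision over 1- to 4-grams plus a brevity penalty, and ROUGE comes in several variants, so this only shows the core precision-versus-recall distinction:

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def overlap(generated, reference, n=1):
    """Clipped n-gram overlap count between two token lists."""
    gen, ref = Counter(ngrams(generated, n)), Counter(ngrams(reference, n))
    return sum(min(count, ref[g]) for g, count in gen.items())

gen = "heck no !".split()
ref = "heck yes !".split()

precision = overlap(gen, ref) / len(ngrams(gen, 1))  # BLEU-like: out of what was generated
recall = overlap(gen, ref) / len(ngrams(ref, 1))     # ROUGE-like: out of the reference
print(precision, recall)  # ~0.67 each -- high overlap, opposite meaning
```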
So, as I alluded to before, what is important — even when you turn everything into this kind of matching problem — is that you have to think about whether you care about precision or recall. These metrics are not ideal, but until, I would say, two years ago, they were the gold standard for translation and summarization. For translation, people use BLEU. Say I'm translating from French to English: I look at the generated sequence in English and the reference sequence in English, and I want to know how many of the bigrams I generated appear in the reference sequence. There's one additional thing, which is that BLEU doesn't only look at precision, because you could get very high precision by generating something very short. For example, if you only ever generated the word "the", you would most likely get very high precision, because "the" appears in pretty much every sentence — or, say, a full stop. So there's also a length penalty. ROUGE is kind of the opposite: it just looks at recall.

So those are the common content overlap metrics. And just to illustrate why they're not ideal — they have many issues, but one of them is that they don't take into account the semantic relatedness between words. Imagine that Chris asks you, "Are you enjoying the CS224N lectures?" Of course, the gold answer is "Heck yes!", so that's the reference. Now say the model just generates "Yes!". If I look at the BLEU score, I get essentially a 67% BLEU score, because two of the unigrams I generated are in the gold reference. If I generate "You know it!", then only a single token in the generated sequence appears in the reference — the exclamation point — so I get a much lower BLEU score. And if I just say "Yup", that doesn't appear at all in the reference sequence, so I get a BLEU score of zero, which is a false negative, because it literally means the same thing as "Heck yes!". So hopefully you see that these metrics really have issues. You can also get false positives: for example, if you say "Heck no", most of the words are the same, so you get a 67% BLEU score, but it means something completely different. Does that make sense? Any questions? Cool.

So, very naturally, now that you know everything about word embeddings, you might ask: why do we compare raw words, when we could compare learned representations, which actually preserve the semantic similarity between words? This is exactly what people did around 2019 — actually even earlier, around 2016. They took word embeddings, associated every word in the reference sequence and every word in the generated sequence with its embedding, and started comparing the embeddings. A very simple way of comparing them is to take the average of the word embeddings in the reference sequence and the average of the word embeddings in the generated sequence, and then look at, say, cosine similarity. There are smarter ways of doing it, but honestly, at this point, that's not that important — you can think of it as simple averaging.
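A minimal sketch of that embedding-averaging idea (not from the lecture), with tiny made-up vectors standing in for real word embeddings such as word2vec or GloVe:

```python
import numpy as np

# Toy 3-d "embeddings" purely for illustration; a real system would load
# pre-trained vectors (or contextual BERT embeddings, as discussed next).
embed = {
    "heck": np.array([0.1, 0.9, 0.0]),
    "yes":  np.array([0.8, 0.2, 0.1]),
    "yep":  np.array([0.7, 0.3, 0.1]),
    "!":    np.array([0.0, 0.1, 0.9]),
}

def sentence_vector(tokens):
    """Mean-pool the word vectors of a sentence."""
    return np.mean([embed[t] for t in tokens], axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# "yep !" has zero lexical overlap with "heck yes !", but its embedding
# similarity is high because the toy vectors for "yep" and "yes" are close.
print(cosine(sentence_vector(["yep", "!"]), sentence_vector(["heck", "yes", "!"])))
```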
Another approach: as you know by now, word embeddings don't take into account the context in which a word appears. So a better way of getting good representations for words is essentially to use BERT. You can take a BERT model, pass the generated sequence through it, and get some embeddings; then you take the same BERT model, pass the reference sequence through it, and get some other embeddings; and then you again do some comparison. In BERTScore — a pretty famous paper — they do a smarter comparison, but it's not that important to understand exactly what they do; what matters is that they do some clever matching and averaging between those word embeddings. Cool. Any questions?

Okay, so that was the simplest type of learned method, which is word matching. Another, slightly more complicated one is called BLEURT, also pretty famous — a mix between BLEU and BERT. The way it works is that they took a pre-trained BERT, then did some continual pre-training where they try to predict the BLEU score and some other metrics, and then they fine-tuned it — that's the important part — they fine-tune their pre-trained model to actually do the evaluation they care about. So say I have a lot of different sequences, and I have human annotations of how they should be rated: I can treat that as a normal machine learning task and fine-tune my BERT to do the evaluation. So that's BLEURT. Any questions? Yes?
speaker 2: I'm curious — if you pre-train on BLEU, wouldn't it cause the same problems as BLEU? If you're just approximating BLEU, why would the model learn to evaluate language semantically in the first place?
speaker 1: Yeah, that's a very good point. Actually, I also find it kind of surprising. They did two things: first they do the real pre-training of BERT, and then they do continual pre-training to predict BLEU. The reason is that, as they say, there are a lot of sequences in the dataset that are unlabeled — we have reference sequences and generated sequences, but no human annotation of whether a generation is good or bad. So they treat that as an unsupervised learning objective, and you have to use something for that objective, so they use BLEU — and they actually also use BERTScore. They use many different objectives and basically do multi-task learning.

Okay. One important issue with all of these methods is that they can only be as good as the references are, and in reality the references are usually not that good. This is a paper that looks at summarization of news. As I said before, most news summarization benchmarks take the reference summary to be the bullet points at the top of an article, and those are usually not that good. Here, on the left, you look at the correlation between the x-axis, which is the human-evaluated performance of every model, and the y-axis, which is ROUGE-L — just a variant of ROUGE — and you ask whether the two are correlated. What you see is that they are essentially not correlated, which means that ROUGE-L computed on standard references is really not correlated with what humans would say is a good summary. That is not to say that ROUGE is a bad score.
It is to say that the references are bad. Because if you look at the exact same thing, but now you ask experts to write very good summaries, you see that the correlation actually increases by a decent amount. Still not perfect — ROUGE is definitely not perfect — but at least it's much better. So the point is that the metric itself is not always perfect, and on top of that, the references are usually not great either. Cool.

That begs a very natural question: can we just move away from reference-based evaluation? As we just said, reference-based evaluations are the ones that compare human-written references to model outputs using different types of metrics. Those used to be the standard way of evaluating NLP tasks, I would say up to two or three years ago. Right now, I think papers still have to show BLEU scores — for example in translation, because reviewers want them — but I don't think anyone in the real world actually uses them; I might be wrong about that. So yeah: BLEU, ROUGE, BERTScore — I was mostly talking about BLEU and ROUGE; BERTScore is actually still decently used and actually pretty good.

Okay, so, reference-free evaluation. Reference-free evaluation is when you have a model and you ask it to give a score without knowing any human reference. The way this used to be done is essentially by taking a model like BERT again, but instead of comparing a reference answer and a generated answer, you just ask it to take the input and predict a score. That's one simple way of doing it, and it used to really not work well. I say "used to" because, since basically ChatGPT and GPT-4, what people do now — and honestly it works super well — is just ask GPT-4 to do the same task you would ask a human to do: you give it the long text, then the generated summary, and you ask, essentially, how good is it? And that works surprisingly well. Common benchmarks here are AlpacaEval and MT-Bench; there are many others. Honestly, most people have started using these kinds of techniques, and we'll talk about AlpacaEval a bit later.

Okay. Let's talk a little bit about human evaluation before looping back to GPT-4. As we saw, the metrics so far all have shortcomings and are definitely not as good as asking humans directly, because they are based on references. Human evaluation is really the gold standard for open-ended tasks. And not only is it the gold standard way of doing evaluation, it's also the gold standard for developing new automatic evaluations: every time you develop a new automatic evaluation, you will want to compare it to what humans would have predicted. Yeah, okay. So, doing human evaluation. At first it might seem very simple: you just ask humans to evaluate the quality of some generated text. Seems simple, right? But actually it's super complicated, it's a real challenge, and it has many issues — I'll talk about those in a second. One additional thing is that you usually don't just ask for a single judgment; you ask humans to evaluate across different axes, for example the fluency of the text, the coherence, common sense, style, grammaticality, redundancy — whatever axes you care about.
Another thing to note is that you should absolutely never compare human evaluations across different papers. If one paper says humans rated the fluency of their text at, I don't know, four out of five, and another paper says three out of five — they used different humans and different ways of instructing those humans, so the numbers are absolutely not comparable.

Okay, so let's go back to the issues. As I said, human judgments are regarded as the gold standard, but they definitely have problems. First, it's super slow: humans are definitely not as fast as automatic metrics. Second, at least in academia, it's still pretty expensive, because when you pay your workers well, human evaluation is costly. Another one is inter-annotator disagreement. If I take two random people in this room and ask them to evaluate the quality of a generated text, I can assure you that you will really not agree. This is especially bad if the task is subjective, but even if you talk for an hour beforehand about how you should be evaluating generations, I can almost guarantee that you will still disagree on many of the evaluations. To give you an example: when we were doing AlpacaFarm last year — a project where we took some inputs, took two models, think ChatGPT and Alpaca and these kinds of models, had both models produce an answer, and then asked humans which answer they prefer; a very simple task, and, as I'll discuss later, this is what a lot of people basically use right now for evaluating models like ChatGPT — a natural question is whether humans are good at doing that. And what we saw is this: we were five researchers doing the annotation. The five of us talked for two or three hours, we wrote extremely detailed rubrics about how to do the evaluations, and still we only agreed 67% of the time — where 50% is random. Labeling independently, we only agreed 67% of the time, and we were really trying to do our best; it's not as if we were rushing through it. So people really do disagree. Of course, if you then allow discussion between the annotators, agreement improves, but then it becomes even slower and more expensive. Then there's intra-annotator disagreement, which is extremely annoying: if I ask myself to evaluate something right now, versus in three hours, or after I have dinner, or after I go for a run, I will actually give different annotations. Yes?
speaker 2: For the samples — how do you decide how many you need to pay for?
speaker 1: You mean for validating? Yeah, this is a very good question, and honestly there's no great answer. The usual way people do it is to look at some statistical test: say I want to compare these two models, so I'll perform a t-test and I want my p-value to be below a certain threshold. What people also usually do when they have human annotations — I unfortunately didn't put a slide on that — is compute metrics of inter-annotator agreement, and they try to reach a certain level of agreement; if they don't, they ask for more annotators or for relabeling.
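As a concrete example of quantifying agreement (a sketch, not from the lecture): a common choice is Cohen's kappa, which corrects raw agreement for chance; the labels below are made up.

```python
from sklearn.metrics import cohen_kappa_score

# Pairwise preferences (0 = prefer output A, 1 = prefer output B) from two
# annotators on the same ten comparisons; data is made up for illustration.
annotator_1 = [0, 1, 1, 0, 0, 1, 0, 1, 1, 0]
annotator_2 = [0, 1, 0, 0, 1, 1, 0, 1, 0, 0]

raw_agreement = sum(a == b for a, b in zip(annotator_1, annotator_2)) / 10
kappa = cohen_kappa_score(annotator_1, annotator_2)  # corrects for chance agreement
print(raw_agreement, kappa)  # 0.7 raw agreement, kappa 0.4
```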
Another issue: human evaluation is not reproducible, and this is partly — mostly, really — because of the two things we just talked about. There's an interesting paper on this; I think it's from 2021, but I'm not sure. I'll read from the abstract: "just 5% of human evaluations are repeatable in the sense that there are no prohibitive barriers to repetition, and sufficient information about experimental design is publicly available for rerunning them." It's a paper that analyzed, I think, 128 different papers published across roughly five years, between 2015 and 2020, and found that essentially only 5% of those human evaluations were reproducible. So honestly, working with humans is hard — that's definitely something to remember. Another point is that humans basically only evaluate precision, not recall. What I mean is that if you show me what the model generated, I can only evaluate that generation; I cannot evaluate all the other possible generations it could have produced, because you would have to sample so many things that it would become way too slow and way too expensive. And finally, the incentives are usually not aligned. What you want is for the humans to do the best possible evaluations; what crowdworkers usually want is to maximize the amount of money they get paid per hour. To give you a concrete example again: when we were doing AlpacaFarm, I think we were paying relatively well, in the sense that we were paying 1.5 times the California minimum wage. We looked at how much time we would spend evaluating a single example as well as we could, and we divided by that time to decide how much to pay per example. What we realized is that the annotators ended up being paid, I think, two or 2.5 times the minimum wage, because they were doing things two or three times faster than us. Maybe we were slow, but I think what was happening is that they were trying to maximize the dollars they were getting per hour, and as a result they were finding shortcuts for doing their evaluations. This is something you really see in a lot of papers. For example, in our case, we saw that humans really preferred longer answers. And of course, if you give me two very long generations and ask me, with minimal effort, to say which one is better, and I see a longer one, I think: there are probably more details, it's probably better. It's not that everyone is like that, but the incentives are definitely misaligned, so you have to be careful with this.

Other challenges: first, you have to decide how to describe your task — you really have to give very detailed rubrics for how the humans should evaluate it. Then there's the question of how you show the task to the humans. For example, the order in which you present examples really matters; in our case we had two examples side by side, and which one is on the left and which one is on the right is also very important. All these things really matter. Of course, you can randomize them away, but it adds challenges. Then, which metrics to use — that's not specific to humans. Selecting the annotators is also very complicated. You might think: okay, I have some money now —
I can just go on Amazon Mechanical Turk and ask some workers to do the annotations. But in reality, you want to have the good annotators. The way it usually works on MTurk is that you say: here's a task, and I want, say, 30 different people to do these annotations. They start annotating, and if they don't reach the quality level you want, you pay for what they have annotated so far and you work with someone else afterwards. Then there's the question of how you decide whether they achieved the performance you want — you probably have to do some gold labeling yourself beforehand, and then look at their accuracy on that and their agreement with you and the other researchers on your team. So it is very complicated. And not only that, you have to monitor it over time. There are different ways to monitor it over time: looking again at accuracy — a typical approach is that in every batch of examples they label, you include a few examples for which you already know the gold label, and you see how well they do on those — or looking at the time they take to annotate. Yeah. Okay, so that was about humans: human evaluation is hard, but it is the gold standard.

Okay, now let's talk about reference-free evaluation and chatbots. I already mentioned this very briefly: how do you evaluate something like ChatGPT? It's extremely complicated, because you could ask it to do any task you want, and it can answer with text that is arbitrarily long, and that makes evaluation extremely hard. As I suggested before, the usual way it's done is that you take two models, put them side by side, ask them the same question, and then ask either some humans, or some model — as we'll see shortly — which one is better. The most common benchmark right now for this kind of human evaluation is called Chatbot Arena, where anyone can go online and play for free with some of the best models out there, and all they ask in return is that you say whether you prefer the output on the right or the one on the left. Once they reach a pretty crazy amount of data — 200,000 human votes, for example — they add the models to a leaderboard. The way they build the leaderboard is — I don't know if you know how chess rankings work — they basically compute Elo ratings. They treat everything as if it were a tournament, so that not every model has to play against every other model, and they get Elo scores from that.
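As a side note, here is a minimal sketch of the basic Elo update that this kind of leaderboard is built on — the actual Chatbot Arena computation is more involved, so treat this only as the core idea:

```python
def elo_update(r_a, r_b, winner, k=32):
    """One Elo update after a model-A vs model-B battle; winner is 'A', 'B', or 'tie'."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))   # predicted win prob. for A
    score_a = {"A": 1.0, "B": 0.0, "tie": 0.5}[winner]
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1 - score_a) - (1 - expected_a))
    return r_a_new, r_b_new

ratings = {"model_a": 1000.0, "model_b": 1000.0}
ratings["model_a"], ratings["model_b"] = elo_update(
    ratings["model_a"], ratings["model_b"], "A")
print(ratings)  # model_a gains rating, model_b loses the same amount
```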
So what's missing with this side-by-side human eval? As I said, it is really the gold standard for evaluating chat LLMs, but there are still challenges. First, it's basically random people online asking random questions and providing their preferences, so that may not be representative — although, arguably, once you have that many examples it becomes pretty representative of what people actually want. So it's probably better than anything else we have, but it's still not ideal. The really big issue is that this takes a huge community effort and a lot of people. It also takes a lot of time to get new models onto the benchmark, and only the notable models — think the OpenAI models, Claude, the Google ones, the Facebook ones — are going to be benchmarked. You will never get 200,000 people willing to annotate your own random model for free. So that's an issue. And, as we discussed on the first slide, even the big companies definitely cannot use this during development of their models; it's something that comes at the end, maybe for model selection.

Okay, so how do we make it faster? One very natural solution is to ask a large language model to do the evaluation for you. Imagine I want to compare GPT with Mistral: I basically ask GPT-4 to evaluate which one is better. This works surprisingly well, and I'll show you some results in a moment. Common versions of this are AlpacaEval and MT-Bench, probably the two most common ones. When we started doing this — around last year, for the problem I told you about — we found that using GPT-4 for evaluation is, at least at current prices, around 100 times faster and 100 times cheaper than human evaluation. But — and this is very surprising — the agreement with humans is actually higher than the agreement of humans with themselves. What I mean is: say I have a pool of four humans. I take out one human and look at the agreement between that human's preferences and the mode of the preferences of the other three, and I do that in a leave-one-out fashion. That agreement is lower than if I ask the model to predict the mode of the humans' preferences. So in some sense, models are more highly correlated with humans than humans are with themselves, which is very surprising — I'll say a bit more about that in a second. When we did this, we actually used it for collecting preferences for RLHF — that's what we call RLAIF, which I think Archit told you about last week.

Going back to this surprising result that models are more highly correlated with humans than humans themselves: the reason is that humans have high inter-annotator disagreement — high variance. Models, on the other hand, are essentially always consistent — maybe not perfectly, there's still some stochasticity — but they will essentially always predict the same label, so they have very little variance. In this plot, on the x-axis, we estimated the variance, and you see that the human has a variance of around 31 or 33, while the red point is what you get if you just ask GPT-4 to do the evaluation: even though the bias is still pretty high — bias, by definition, is zero for humans, and for GPT-4 it's around 32% — the variance is much lower than for humans. So that's why the agreement can actually be higher: there's very little variance in LLMs. Yeah — does that make sense?
speaker 2: So it means GPT-4's agreement with the humans is higher than a single human's?
speaker 1: It means its agreement is higher than that, exactly. Which is actually a good sign, because it makes things much easier for research. The bad sign is that the bias is so high.
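To make the judging setup concrete, here is a minimal sketch of a pairwise LLM judge of the kind these evaluations build on. This is an illustrative prompt, not the actual AlpacaEval or MT-Bench prompt, and `call_judge` is a placeholder for whichever LLM API you use.

```python
import random

JUDGE_TEMPLATE = """You are evaluating two AI assistants.

Instruction:
{instruction}

Output (a):
{output_a}

Output (b):
{output_b}

Which output better follows the instruction? Answer with a single letter, "a" or "b"."""

def judge_pair(instruction, output_1, output_2, call_judge):
    """Return True if output_1 is preferred; `call_judge` is a placeholder
    for an LLM API call that takes a prompt string and returns a string."""
    # Randomize which output goes first to control for position bias
    # (one of the spurious correlations discussed in the lecture).
    swapped = random.random() < 0.5
    a, b = (output_2, output_1) if swapped else (output_1, output_2)
    prompt = JUDGE_TEMPLATE.format(instruction=instruction, output_a=a, output_b=b)
    prefers_a = call_judge(prompt).strip().lower().startswith("a")
    return prefers_a != swapped
```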
Okay, so, things to be careful with when you work with this — and this applies both to humans and to LLMs: there will be spurious correlations. We already talked about spurious correlations, but you will see a lot of them here. One very common example is length. As I told you before, if you ask crowdworkers which output they prefer, they are heavily biased towards longer outputs — the blue here is humans, at around, I think, 70% preference for the longer output — and models have around the same bias. Another example is a preference for lists: if an output contains lists, both models and humans tend to prefer it. Another bias, or spurious correlation, is position — which output you put on the left and which on the right when you ask humans to label; the same thing happens with models. This one is usually pretty easy to control for: you just randomize both. Another issue is GPT-4 self-bias. Very naturally, you might wonder: if I ask GPT-4 to evaluate itself, it will probably be biased and prefer itself over other models. That does happen, but less than you might think; I'll tell you about it later.

Okay — AlpacaEval. Wait, until what time do I have? Oh, thanks. Great. So, AlpacaEval is the benchmark that we developed when we were working on Alpaca. As I told you before, one thing that is very important is what you use for development — basically for hyperparameter tuning. We basically did not trust many of the benchmarks that existed at that point for instruction following, so we developed a very small benchmark for ourselves, and that's what we were using during development; then it kind of became its own thing. AlpacaEval in a few numbers: it has very high correlation with Chatbot Arena — if you look at the correlation between the rankings on Chatbot Arena and on AlpacaEval, it's 98%, so very high — and it takes around three minutes and about $10 to evaluate a model. The way it works — I think I already mentioned it — is that you take an instruction, you generate an output from the model you're evaluating and from a baseline model you're comparing against, and you ask GPT-4 to give the probability that it prefers the model you're evaluating over the baseline. Then you do some reweighting — the reason being that these models, as I said, are very biased towards longer outputs, so you reweight such that longer outputs get a slightly lower preference — and then you average across the entire dataset and you get a win rate. That's how it works. Any questions?

So, system-level correlation. On the x-axis here you have, essentially, AlpacaEval scores — well, a slight transformation of them — and on the y-axis you have Chatbot Arena, which is the gold standard, and you see that the two are pretty highly correlated. On the lower plot you see the correlation between different benchmarks and Chatbot Arena: MT-Bench and AlpacaEval, the two that use LLMs for evaluation, are quite highly correlated with Chatbot Arena, and MMLU, an automated one that doesn't use an LLM, is also very highly correlated. So, I told you very briefly that we had to do some reweighting. I'm not going to tell you how we do it, but I do want to tell you why we do it.
One of the issues that we realized a little bit too late is that if you take something like GPT-4 and you simply prompt it to give much more detailed answers, its win rate — its performance on your benchmark — goes from 50% to 64.3%. That's this one, 64.3. If you ask it to be more concise, it drops to 22.9%. And that really doesn't fit our mental model of what a benchmark should be doing: if I just tweak the prompt a little bit, I don't want my model to completely change its ranking. That's why we have to do some reweighting, and you see that after the reweighting, the performance when you ask the model to be more verbose is very close to the performance without any prompt tuning. Cool.

So, I mentioned self-bias very briefly before. I do want to say that I'm pretty surprised by this result: self-bias exists, but it's not as strong as you might think. Here, on the rows you see the different models being evaluated, and on the columns you see which model is doing the evaluation. And you see that, regardless of which model you evaluate with, the ranking stays the same. So even though, if I look at Mistral evaluated by Mistral, it gives itself a much higher score, it still prefers Claude and GPT-4. So it's not as bad as you might think — it's still bad, though. Cool.

Okay. That leads me to the current evaluation of LLMs. I'd say there are three main ways that people currently evaluate LLMs. The first is perplexity, which is essentially just looking at training or validation losses. The second is averaging over everything, which is actually more common than you might think. And the third is the arena-like setup, where you have comparisons between models and you use either humans or models to do the evaluation. Usually it works like this: pre-trained models — say, when Llama 4 or GPT-5 comes out — mostly report perplexity and averages over everything, while fine-tuned models usually report averages over everything and performance on arena-like benchmarks. The reason is that for fine-tuned models, the log-likelihoods they predict are not calibrated for your dataset. So what do I mean by averaging over everything? The two most common benchmarks that do this are HELM and the Hugging Face Open LLM Leaderboard — each is really just a collection of many different automatically evaluated benchmarks, and you evaluate across all of them. What are some of the common benchmarks used there? One measures math performance: GSM8K, a pretty common one, which is basically grade-school math. MMLU is multiple-choice question answering on math, science, history, and so on. LegalBench is on the legal side, and you have MedQA — I believe this one is in HELM — which is based on medical licensing exams. So you ask many, many different questions that you can automatically evaluate, and you hope that by taking averages you get a sense of how well your model performs. That's kind of the newer version of SuperGLUE.
One benchmark I want to highlight — probably the most widely used one, and the one people trust the most — is MMLU: massive multitask language understanding. I think Archit mentioned it last week; it's basically multiple-choice questions on 57 different tasks. You have tasks like formal logic, conceptual physics, econometrics, and so on. Here's an example: "What is true for a type Ia supernova?" — "This type occurs in binary systems", "This type occurs in young galaxies", and so on — and you basically have to say which answer is correct. That seems very simple — I mean, the task is not simple, but the way you evaluate seems simple. And then, say, high school biology: in a population of giraffes an environmental change occurs, and this is an example of directional selection. So that seems simple, but it's actually also more complicated than you might think — I'll tell you about that later. It's probably the most common benchmark, the one people actually look at. For example, when Mark Zuckerberg announced that Llama 3 was out, he talked about MMLU scores, which I find kind of crazy.

Other capabilities that people look at: coding. Coding is a very common thing to evaluate, for a few reasons. One, if a model performs well on code, it usually also performs well on reasoning, which is actually pretty cool — so it's highly correlated with things people care about. Two, a lot of us are coders, so we like having better models to help us code. And three, it's actually pretty easy to evaluate, because you can write test cases: you ask the model to generate some code — a function to do something — and then you just run the tests and see whether they pass or not (there's a small sketch of this below). Yes?
speaker 2: For the evaluation — some of those benchmarks were short-answer, apparently. Multiple choice makes sense, but if it's short-answer QA, how would you decide that something is correct with an automatic method? I think it applies specifically to the one at the top.
speaker 1: Yeah — I actually don't know. Huh, I actually don't know; I should check, sorry.
speaker 2: I don't know specifically for that one, but HotpotQA and BeerQA are other QA datasets, and they use F1 over the answers. They also have an exact match, which is pretty punitive, because if you say "President Reagan" and the answer is "President Ronald Reagan", it counts against you. But anyway, they use an exact match there.
speaker 1: Yeah, cool. Thanks.
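Going back to the test-case idea for coding: a toy sketch of checking functional correctness by executing generated code against unit tests. Note that real harnesses run this in an isolated sandbox, since executing untrusted model output directly is unsafe.

```python
def passes_tests(generated_code: str, test_code: str) -> bool:
    """Toy check: exec the generated code, then run the asserts.
    Real evaluation harnesses sandbox this; do not run untrusted code like this."""
    namespace = {}
    try:
        exec(generated_code, namespace)   # define the candidate function
        exec(test_code, namespace)        # run the unit tests (asserts)
        return True
    except Exception:
        return False

generated = "def add(a, b):\n    return a + b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
print(passes_tests(generated, tests))  # True
```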
Okay — so, MMLU, coding. Another thing that people are starting to look at is agents. I think Shikhar is going to give a lecture on that, so I'm not going to talk too much about it, but one cool thing LLMs can do right now is call APIs and then take actions in the real world, essentially — or take control of your computer. You should not give them control of your computer. So a natural question is: how do you evaluate these kinds of things? This is a real challenge, because if, for example, I really wanted to evaluate how good a model is at doing things in my terminal, I would need to give it access to my terminal — and I really don't want to give my LLM access to my terminal. So you need sandboxed environments. For the specific case of the terminal, it's pretty easy to sandbox, but once you want to evaluate a model that, I don't know, pings people on Slack or writes things in your email, you have to build an entire sandboxed environment for every application you want your LLM to have access to. So this is actually really complicated, and something that people really have to deal with in the real world — or at least will have to, because right now it's still not in production.

Okay, the penultimate one: perplexity. One thing which is very surprising, at least the first time you see it, is that your performance during pre-training is extremely highly correlated with performance on basically any downstream task, at least for the current types of LLMs. What I mean is that if you just look at your training performance — just predicting the next word — it's extremely highly correlated: here the x-axis is essentially perplexity and the y-axis is the average over many different tasks, and you see that models that do well on perplexity also have higher average scores. As a result, a lot of people, when they're developing, end up just looking at perplexity, and they trust it enough that they don't do the downstream evaluations. I would not recommend that, but if you need something quick and dirty, it usually works pretty well. One thing to be careful with, though, is that perplexities are not comparable across different datasets, so you really have to be careful about which perplexity you're looking at. And second, it depends on the tokenizer: if you take Llama 3 and compare it to Gemini, even on the same dataset, you get different numbers, and they're not comparable. Yes — the easy answer, I mean it's not the only answer, but the easy answer, is that if the vocabulary size changes, then the upper bound is different. Sequence length? Yeah — but I'm not talking about that. I'm talking about the fact that — just think about it — if you have a vocabulary size of one, then I always have to predict the same thing. Your entropy is upper-bounded by the log of the cardinality of your vocabulary, so perplexity is going to depend on that. Cool.
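As a rough illustration of the numbers being discussed (a sketch, not from the lecture): perplexity is just the exponential of the average per-token negative log-likelihood, which is why both the tokenizer and the vocabulary size matter.

```python
import math

def perplexity(avg_nll_per_token: float) -> float:
    """Perplexity = exp(average negative log-likelihood per token), NLL in nats."""
    return math.exp(avg_nll_per_token)

# Two caveats from the lecture: (1) different tokenizers split the same text
# into different numbers of tokens, and (2) the per-token entropy is bounded
# by log(|V|), so models with different vocabulary sizes are not on the same
# scale -- both make perplexities incomparable across setups.
print(perplexity(3.0))    # ~20.1
print(math.log(32_000))   # ~10.4 nats = max per-token entropy for a 32k vocabulary
```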
And the last one is the arena. As I already told you, you basically compare different models, make them fight against each other, and you get Elo ratings at the end. A more general way of saying it is: just let the users decide. That also works pretty well.

Okay, issues and challenges with current evaluations. First, consistency issues. If you look at multiple-choice questions, you see on the top left and top right that if you just change A, B, C, D to random symbols, the generations you get are actually different, and then the rankings between different models are different. So even things that seem very simple, like selecting one of four choices, are very dependent on exactly how you format the choices. And one real example, which is what I was alluding to before, is MMLU. MMLU seems really simple to evaluate: you just ask which option the model prefers. But actually, for a very long time, I think for nearly a year, there were three main implementations of MMLU, and people were comparing across them without realizing that they gave different scores. The main differences were, one, people used different prompts, which clearly gives different answers. But two, they used different ways of extracting the most likely prediction. One implementation said: I have the four choices; say the correct answer is D; to get my prediction I will just look at the most likely answer out of the tokens A, B, C, D. Even if some other token in the vocabulary has a higher likelihood than any of those letters, I will not look at it, because I am basically doing constrained decoding, and with constrained decoding the model gets D right. But if I instead just take the most likely token over the full vocabulary, the model might not produce the correct letter at all. So those were two different implementations. And a third implementation, which is really different, is that instead of generating the answer letter A, B, C, or D, you look at the likelihood the model assigns to the full text of each answer after the question, so the log-likelihood, essentially the perplexity, of each candidate answer. And that gives very different results. If you look at the top right, Llama 65B scored 63.7 on HELM's MMLU and 63.6 on the original MMLU, but 48.8 on the Eval Harness, which is what Hugging Face uses. So that is a huge difference.

speaker 2: Yeah. What are HELM and the Harness? Can we map them to those three implementations?

speaker 1: I can't remember exactly which one does what, but each of them did something different. Actually, it is not the case anymore: the middle column changed what they were doing, so they now match the other two. But at that time, my guess would be that the outlier was the last variant, the log-likelihood of the full answer, though I am not entirely sure. Okay, questions? Cool.
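Here is a minimal sketch of the three scoring conventions just described. The two helpers, `next_token_logprobs(prompt)` (mapping token string to log-probability) and `continuation_logprob(prompt, text)` (total log-probability of `text` after `prompt`), are hypothetical stand-ins for whatever your model API exposes; real harnesses also differ in prompt formatting, few-shot examples, and length normalization.

```python
LETTERS = ["A", "B", "C", "D"]

def score_constrained(prompt: str, next_token_logprobs) -> str:
    """Variant 1: constrained decoding -- argmax restricted to the letter tokens."""
    logprobs = next_token_logprobs(prompt)
    return max(LETTERS, key=lambda letter: logprobs[letter])

def score_unconstrained(prompt: str, next_token_logprobs):
    """Variant 2: free-form -- take the most likely token over the whole vocabulary;
    if it is not one of A/B/C/D, the question is simply counted as wrong."""
    logprobs = next_token_logprobs(prompt)
    best_token = max(logprobs, key=logprobs.get)
    return best_token if best_token in LETTERS else None

def score_answer_likelihood(prompt: str, choices, continuation_logprob) -> str:
    """Variant 3: ignore the letters and compare the (length-normalized)
    log-likelihood of each full answer text after the question."""
    scores = {
        letter: continuation_logprob(prompt, text) / max(len(text.split()), 1)
        for letter, text in zip(LETTERS, choices)
    }
    return max(scores, key=scores.get)
```

The same model can get a question "right" under variant 1 and "wrong" under variant 2, which is exactly how the 63.7 versus 48.8 gap can happen.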
Another issue: contamination. Here you have Horace He; if you don't follow him on Twitter, you should. He was looking at code benchmarks and found that on Codeforces problems from before 2021, GPT-4 was getting ten out of ten, but on more recent problems it was getting zero out of ten, which seems very, very strange. That strongly suggests contamination: the Codeforces problems were probably in the pretraining dataset. And of course, if you essentially train on your test set, you are going to perform really well. Susan Zhang, also worth following, said something similar about phi-1.5, a model from Microsoft. What is challenging here, with closed models, is really two things. One, these models are pretrained on so much data that even if we had access to the data, it would be hard to know whether your test set was in there. But two, these are closed-source models, so you do not even have access to the dataset, and you have no idea whether they were pretrained on that data.

Overfitting issues. This is related, but slightly different. Here you see how much time it took for standard datasets to reach, in scare quotes, human-level performance. What you see is that for the recent ones, in this pretraining era, it takes less than about six months to reach human-level performance. We do not really know whether that is because of contamination or simply because a lot of people are developing against these test sets and essentially doing hyperparameter tuning on them, but either way it is clearly an overfitting issue. So how do you alleviate that? One option is private test sets. There is a paper from, I think, two weeks ago that presented GSM1k, which is the same thing as the GSM8K math dataset we saw before, but recollected from scratch. They then compare how well different models perform on GSM1k versus GSM8K, and what you see is that at least the open-source models perform much worse on the new dataset than on the one people have been able to tune against. That is not true for models like Claude and GPT-4, though. Another option is Dynabench, or dynamic test sets in general: ideally, every so many days you would collect new instructions or new inputs, so your dataset stays dynamic. That is essentially also what Chatbot Arena does, and it definitely helps. Another way of alleviating contamination is to try to estimate whether a model was actually trained on your test set. One very simple way, which actually works relatively well, is to look at the probability the model assigns to different answers: if the model is very sure about a particular answer, it was probably trained on that answer. Another one, which is also really cool, looks at the order of your test set: if a model was pretrained on the test set, it has most likely learned that example two comes after example one. So if you swap example one and example two and you see a drop in log-likelihood, the model was most likely pretrained on that dataset. Cool. Any questions here? Okay.
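Here is a rough sketch of that order-swap test. The helper `sequence_logprob(text)` (total log-probability the model assigns to `text`) is a hypothetical stand-in for your model API, and the concatenation format is made up; the point is just that a memorized benchmark file makes the canonical ordering look unusually likely compared to shuffled orderings.

```python
import random

def order_swap_score(examples, sequence_logprob, trials: int = 20, seed: int = 0) -> float:
    """Average log-likelihood gap between the canonical ordering of the test set
    and random shufflings of the same examples. Large positive gaps are suspicious."""
    rng = random.Random(seed)
    canonical = sequence_logprob("\n\n".join(examples))
    gaps = []
    for _ in range(trials):
        shuffled = list(examples)
        rng.shuffle(shuffled)
        gaps.append(canonical - sequence_logprob("\n\n".join(shuffled)))
    return sum(gaps) / len(gaps)
```

A clean model has no reason to prefer any particular ordering of independent test examples, so the gap should hover around zero; a contaminated one tends to prefer the order it saw during pretraining.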
Okay. Another issue is that there is really a monoculture of NLP benchmarking. What I mean by this is mostly that we all just look at English. This is a paper from 2021 or 2022, I think: they look at ACL 2021, which is probably the main conference in NLP, and specifically at the oral papers. Out of 461 papers, 70% only look at English and about 40% only look at accuracy, so essentially just raw performance. Very few papers look at multilinguality, or at efficiency, interpretability, or fairness. And there is a similar paper analyzing another conference from 2008 with essentially the same finding, so unfortunately it does not seem to improve over time. The thing is, there actually are a lot of benchmarks for multilinguality; I just highlighted a few here, things like MEGA, GlobalBench, and XTREME, which cover at least 30 or 40 languages and many, many different tasks. So it is not that we lack the benchmarks; it is that there is, unfortunately, no incentive in academia to evaluate on them. So if you have the chance, use those benchmarks.

Another issue is that we reduce everything to a single metric. I already told you how the way we aggregate metrics is usually kind of broken in some of these super-benchmarks. But also, we only look at performance. In the real world we really care about computational efficiency too, we care about biases, and we care about many other aspects, and most of these benchmarks do not consider those. Another part is that we usually average across every example, as if every example has the same value, essentially the same weight. That is definitely unfair to minoritized groups. But more than that, think about agents: one example might be how well the agent writes code that will actually be put in production, and another might be answering a daily question about where to buy the best burger. The value you get out of these examples is very different, and right now, when we evaluate, we do not take that into account. That is, I think, a real issue. We also do not take into account that different people have different preferences.

So a few shout-outs. One, on computational efficiency: MLPerf has a great benchmark where, instead of trying to maximize performance on a certain benchmark, you try to achieve a target performance in the least amount of time, so you consider both accuracy and speed, either for training or for inference. For biases, DiscrimEval is a good dataset from Anthropic: they have templates asking questions like whether someone should keep their insurance, they vary the race or the gender of the person in the template, and they look at how the model's decisions change. Unfortunately, but unsurprisingly, you see that some groups are discriminated against much more than others.
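To show the shape of that template-perturbation idea, here is a small sketch. It is my own simplification, not Anthropic's actual prompts or harness: the template text is invented, and `p_yes(prompt)` is a hypothetical helper returning the model's probability of a favorable decision.

```python
# Hypothetical, illustrative template -- not from DiscrimEval itself.
TEMPLATE = (
    "The applicant is a {age}-year-old {race} {gender} with two late payments "
    "last year. Should the insurer keep covering them? Answer yes or no."
)

def decision_gaps(p_yes, ages=(30,), races=("white", "Black", "Asian", "Hispanic"),
                  genders=("man", "woman")):
    """Fill the same template with different demographic attributes and report how far
    each group's favorable-decision probability sits from the overall mean."""
    probs = {}
    for age in ages:
        for race in races:
            for gender in genders:
                prompt = TEMPLATE.format(age=age, race=race, gender=gender)
                probs[(age, race, gender)] = p_yes(prompt)
    mean = sum(probs.values()) / len(probs)
    return {group: p - mean for group, p in probs.items()}
```

Because only the demographic slots change between prompts, any systematic gap in the returned probabilities is attributable to the model's treatment of those attributes rather than to the scenario itself.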
There are other biases in our evaluations too. I already told you a little about multilingual issues, but honestly, this English-centricity is much more prevalent than you would think. For example, BLEU and ROUGE scores really assume you have access to words, that you know how to tokenize the text into words. I used to work with Thai and Vietnamese: in Vietnamese you have spaces between syllables rather than between words, and in Thai you have no spaces between words at all, so you have no idea how to even run BLEU or ROUGE. It is much more than just a data problem; our algorithms themselves are really built around English, or at least Western languages.

Then there are biases in LLM-based evaluation. I told you it is really cool that you can now essentially use GPT-4 to do labeling, but that also means that, given how consistent GPT-4 is, if it has biases, then most of the NLP community inherits those biases, scaled up. One benchmark tries to look at whose opinions LLMs reflect by default; it is pretty cool work that compares the output distribution of LLMs with public opinion surveys, to understand which groups' opinions the models reflect. They find that after pretraining only, the models are actually relatively balanced; they are not overly aligned with a single group. But after fine-tuning, which is shown in red, the models really start being optimized toward certain preferences, which is unsurprising because that is how we actually train them. Typically, these models mostly answer as if they were White or Southeast Asian, and usually also highly educated. I think the Southeast Asian part is pretty interesting: it is probably because a lot of the human data used for supervised fine-tuning and RLHF was labeled by people in Southeast Asia, which would explain why these models hold those kinds of views.

Okay. So this is the main challenge, the challenge of all challenges. We have seen that there are many challenges in evaluation, at least in academic benchmarking, but the biggest one is that there is really no incentive for us to move to anything else. There is a pretty interesting paper that looks at machine translation papers from 2019 to 2020 and finds that 82% of them only evaluated BLEU scores. And as we said, BLEU scores have many, many issues; we know there are many better metrics, but people are still not incentivized to look at anything else. Actually, reviewers will usually ask you to show BLEU scores, so it is not just that you are not incentivized to look at something else; you are incentivized to keep doing the same thing. It kind of makes sense, because you want to be able to compare to methods from two or three years ago, but it also means it is hard for the academic field to move to other benchmarks. This is really specific to academia, though. In the real world, if you know your metric is bad, just switch.

Okay, evaluation takeaways. First, I mentioned that there are different types of evaluation, with different desired properties for each. Then I talked about close-ended tasks and how you evaluate those: it is basically standard machine learning, but you still have to think carefully about how you evaluate them. Then there are open-ended tasks, where you typically look at content-overlap metrics, things like BLEU, ROUGE, and BERTScore. Then there is chatbot evaluation, which is extremely difficult, but which people have started doing with LLM-based evaluators. And then we talked about challenges: consistency, contamination, and biases. In reality, honestly, the best evaluation is to just check your outputs. I think too many people just believe the numbers; never just believe the numbers. I remember when we first did Alpaca, we kind of believed AlpacaEval, but it was once we started playing with the model that we said, okay, this thing is actually, at the time, good. Now it would be a pretty bad model, but at that time we thought, okay, this is actually pretty good, we should do something with it, even though on standard academic benchmarks it looked pretty bad. So yeah, do not rely only on numbers. And, what time is it? I am happy to take any other questions you may have.
speaker 2: A question about this whole issue of bias, which we are really trying to deal with but are kind of sweeping under the rug here. Say we are working in a very specialized domain and we run reference-free evals using, let's say, GPT-4. Is it considered bad practice to check a subset of these GPT-4 evals ourselves, grading them by hand, and thereby inserting ourselves and our own bias into the process by looking at many, many data points?

speaker 1: Just to make sure I understand your question: you are saying that if we look at the answers ourselves, we might be incorporating some of our own biases?

speaker 2: Yes, but we should look at the answers to make sure GPT-4 is not being biased when it grades them. There is a tension here, and I do not know how, in a controlled psychology experiment, you would blind yourself while looking at these answers. How do you deal with this?

speaker 1: Yeah, that is a good question, and I do not quite know. But one thing is that I actually feel less concerned about the biases of a single person. My issue with GPT-4's biases is that they are the same across every model, so things really scale up and it becomes a monoculture. I think that is much worse than everyone incorporating a little bit of their own bias in their own direction. I am not saying that is the best answer, but I think it is slightly better than just going with whatever GPT-4 says.

speaker 2: Following up on that: how does one avoid the situation where someone is trying to solve a problem with a model, evaluates it with GPT-4, starts looking at the outputs and says, okay, this is good, this is great, while everyone else in the world, and GPT-4, thinks it is a terrible model, and it is really just an academic pressuring themselves into publishing something that does not actually work? How does the field structurally avoid situations like that?

speaker 1: Well, I think that is one reason why people want standardized benchmarks, and why every reviewer wants standardized benchmarks: even though everyone knows they are wrong, at least everyone understands how they are wrong. So that is one perspective. Another thing, which does not completely answer your question but could be a potential solution, is this: the way I view GPT-4 is as something that is really good at performing whatever I ask it to perform right now. The problem is that I am not very specific about what I want it to perform, so it fills the gap with its own biases, which come from its pretraining or fine-tuning data. A potentially better way would be to write down exactly what I want. Right now, when we prompt GPT-4, we basically ask a simple question, like "how good is this summary, out of five?" A much better way would probably be to write a very detailed rubric of everything that has to be in the answer for it to be a good answer. If you think about it, that is exactly what professors do when they grade for a class: they say, okay, Jan is an okay TA, but I cannot trust him blindly, so I will write a very detailed rubric, and I trust that he can apply that rubric. I think that is also how we should be thinking about GPT-4. This is not how we currently do it.
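Here is a minimal sketch of that rubric idea. The rubric items are invented for illustration, and `ask_judge(prompt)` is a hypothetical wrapper around whatever judge-model API you use; the point is only that the judge grades explicit, checkable criteria rather than giving an overall vibe score.

```python
import re

# Hypothetical rubric for grading a summary -- spell out what "good" means.
RUBRIC = [
    "States the main finding of the source document in the first sentence.",
    "Mentions the dataset and evaluation metric that were used.",
    "Does not introduce any claim absent from the source document.",
    "Is at most four sentences long.",
]

def rubric_prompt(source: str, summary: str) -> str:
    """Build a grading prompt that asks the judge to check each criterion separately."""
    criteria = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(RUBRIC))
    return (
        "You are grading a summary against a rubric.\n\n"
        f"Source document:\n{source}\n\nSummary:\n{summary}\n\n"
        f"Rubric:\n{criteria}\n\n"
        "For each rubric item, answer PASS or FAIL with a one-sentence justification, "
        "then give the number of PASS items on the last line as 'SCORE: <n>'."
    )

def grade(source: str, summary: str, ask_judge) -> int:
    """Send the rubric prompt to the judge and parse the final SCORE line."""
    reply = ask_judge(rubric_prompt(source, summary))
    match = re.search(r"SCORE:\s*(\d+)", reply, flags=re.IGNORECASE)
    return int(match.group(1)) if match else 0
```

Per-criterion grading also makes the judge's decisions auditable: when a score looks wrong, you can see exactly which rubric item it attributed the failure to.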
Any other questions?