speaker 1: We have about one minute, but since the room is completely full... I guess actually there were two spaces right at the front here. There's a couple of spaces down the front, so do feel free to come in and take the remaining spaces. There's one in the middle here as well. This happened last year: I said, I promise, although it's a sponsored talk, people will come. And they put me in the same size room as last year, and again people are standing at the back. It's a relatively small screen. Can people see that, at least approximately, at the back? Oh, dear. Now that's a question. I have no idea how I do that. I'm going to try light mode and... oh, is that a bit better? Oh, that's even better. Okay. I should be able to do most of it from inside the editor, so we'll hope for the best. Okay. There are still two places here at the front, so if anyone wants to come and take them, I would encourage you; one here. Or the rest of you are just wondering whether my talk is going to be really boring and you want to be able to get out easily. And there's one more if you want it. Cool. Well, thank you so much, all, for being here.

I am Samuel. I'm best known for creating Pydantic, the original library. I assume that if you've got yourself to PyCon and found yourself in this room, you know what Pydantic is, so I'm not going to talk too much about the original Pydantic. I think the one most important thing to say about it is that it was first created back in 2017, so before gen AI. It is downloaded today somewhere around 350 million times a month. Someone pointed out to me that that's about 140 times a second. A lot of that usage (I don't know, I'm making up a number) maybe a third is from gen AI, but a lot of the rest of it is from FastAPI and from many other sorts of general usage within Python. So I think that's why Pydantic is downloaded so much: it's not just gen AI, it's used by a huge range of libraries, but in particular, relevant to what we're talking about today, by all of the gen AI libraries. So that's both the SDKs (OpenAI, Anthropic, Groq, Google, etc.) but also the agent frameworks like LangChain, LlamaIndex, CrewAI, etc.

I started a company around Pydantic back at the beginning of 2023, and we have built two new things as well as continuing to develop Pydantic. They are Pydantic Logfire, which is our developer-first observability platform, which I will show you a bit today (luckily it's also in light mode), and Pydantic AI, our agent framework. But most importantly of all, do come to our booth. We're here for the next few days. We have, I think, a fun demo of Logfire, we have t-shirts, we have stickers, and we also have a prize draw. So yeah, please come along and say hi.

What am I talking about today? Supposedly the title is building AI applications the Pydantic way. Everything is changing incredibly fast in AI. There are points where, even since I first gave a talk like this a couple of months ago, things have fundamentally changed. But at the same time, some fundamental things are not changing at all. We're still trying to build applications that are reliable and scalable, and that is still hard. Arguably, that's actually harder than it was before gen AI: there are so many useful things that gen AI can do, but it is also a complete pig to work with.
And so in some ways those things are getting more difficult, and that's where I think we're trying to help you. So in this talk I will use Pydantic AI and Pydantic Logfire, but most of the principles that I'm going to talk about, I think, transcend those particular tools.

So the first one is type safety. In this context, I think type safety is incredibly important and only getting more important, in Python but also in TypeScript. The number one thing that is changing is that AI is writing more and more of our code, whether that's autocomplete or the full-on Cursor "go implement this view for me". The number one most useful form of feedback that these agents can have is running type checking, because it is side-effect free, it is very fast, and obviously it's going to get even faster with ty, which is about to come out. And yeah, the type safety story in Python is great and only getting better. But if you go and use agent frameworks that have, for their own reasons, decided not to build in a type-safe way, which is basically all other agent frameworks as far as I can tell, you forfeit that type safety. You stop the type checker being able to help you, or help the agent, develop.

I'm also going to talk about the power of MCP, the Model Context Protocol. Just as a quick straw poll: how many people have heard of MCP, and how many people think they really understand what it does? Okay. I'm supposed to be one of the maintainers of the Python MCP SDK, although I don't get much time, so I will try and answer that as well. And then I will try and talk about how evals fit into this; I think we have quite a lot of time today, so hopefully I'll get to that as well. And throughout that, I'll talk about the importance of observability, using Logfire.

So what is an agent? I don't pretend to have anything especially new to say on this subject, but given that we're going to talk about agents a fair bit, it's useful to have an agreed definition of what an agent is. This, as of this year, seems to be relatively well agreed upon. The definition I'll show here is from Anthropic; it's the definition that OpenAI have also adopted in their new agents library, and I think it's the model that Google is using. Some of the legacy agent frameworks, like LangChain, who still have a different definition, are struggling now, because everyone else seems to have kind of agreed upon this definition. So this is how Barry Zhang presented it at AI Engineer back in February: this is the definition of an agent. Now, this is kind of helpful and elegant, but it doesn't actually make very much sense to me. What makes rather more sense is the pseudo-code he showed on the next slide, which I have here. The idea of an agent is: it takes an environment, it takes some tools, which in turn have access to the environment, you take a system prompt (which shows that this is an ancient slide from three months ago, because now everyone talks about instructions instead of system prompt), and then we run the LLM, we get back instructions on what tools to call, we call those tools, we update the state, and we proceed. And even in this tiny bit of pseudo-code there is a bug. I don't know if anyone can see it, but the bug is that the while loop never exits. And sure enough, that actually points at one of the hard and as yet undefined bits of the definition of an agent: when do you stop that loop? How do you know when you're done?
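Roughly, that pseudo-code, with an explicit exit added (the piece the slide leaves out), looks something like this. The names here are illustrative, not any particular framework's API:

```python
# Illustrative agent-loop pseudo-code; `llm`, the tool interface and the
# message shapes are all hypothetical, not a real SDK.
def run_agent(llm, env, tools, instructions, prompt, max_steps=10):
    messages = [
        {"role": "system", "content": instructions},
        {"role": "user", "content": prompt},
    ]
    for _ in range(max_steps):  # hard cap so the loop always terminates
        response = llm(messages, tools=tools)
        if not response.tool_calls:
            # the model produced a plain answer, so treat that as "done"
            return response.text
        for call in response.tool_calls:
            # tools can read and mutate the shared environment/state
            result = tools[call.name](env, **call.arguments)
            messages.append({"role": "tool", "tool_call_id": call.id, "content": str(result)})
    raise RuntimeError("agent did not finish within max_steps")
```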
And sometimes that's obvious, at least to a user, but sometimes it is non-obvious, whether to a human or to an LLM, and exiting can be a tricky thing.

So, enough pseudo-code; I'll show you some actual code. This is a very simple example of Pydantic AI. We have a Pydantic model, Person, with just three fields: name, date of birth (which is a date) and city. And we define our agent here. We're going to use OpenAI GPT-4o here, but we could use... we have support for Anthropic, OpenAI, Google, Groq, a whole bunch of other models. And the number one reason that people like using these agent frameworks is that they give you this model agnosticism: the capacity to switch model in one line of code and see how different models perform. In fact, we have released this week (so it's not in my talk) what we call our direct API, where you basically get a direct interface to make requests to an LLM without any of this agent stuff, where we just provide the model agnosticism and the unification of the API, because there are places where you don't want all this.

Anyway, in this example we're using an agent. We've set the output type to be Person, so when we look at the annotation here, result.output, we will get an instance of Person, or it will fail. We have instructions. And then we have the actual unstructured data that we're trying to extract this Pydantic model from, which is just "Samuel lives in London and was born on the 28th of January '87". And if I go and run that example, and my internet holds up and I haven't fouled anything up, sure enough, we get out the structured data. Nice.

The pedantic and cynical among you, which, since you're developers, I hope is all of you, will notice that this is not actually agentic. There is no loop here, right? We're making one request to an LLM and we're getting back structured data if that's successful. But you don't have to change the example very much to start to need that agentic behaviour. So this is basically the same example, except that we've put a functional validator in the Pydantic model saying that the date of birth of this person needs to be in the nineteenth century. You'll see in the actual prompt it says '87, but it doesn't define the century. And we're being a bit unfair to the model here because we haven't put anywhere in its context "oh, by the way, the person we're looking for was born in the nineteenth century". So what will happen is that validation will fail the first time, and that's where the agentic loop immediately kicks in, because that validation error is then fed back to the model. And based on the validation error alone, it's able to retry and hopefully successfully pass validation. Everything else is the same; actually, we're using Gemini here, but other than that, it's no different. The only other thing we've added is these three lines of code to instrument this example with Logfire, so that I can show you that agentic loop going on.

So if I run this example... it has succeeded, and we immediately get some traces out here, and you can see we had two calls to Gemini. But if I come across to Logfire and we look at this trace, which is from that run, it's very simple. We immediately see (and I'll zoom in here, although it's a bit of a pig to do so) what's going on. So we have the original user message that we sent to the model, which was the unstructured data.
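For reference, a rough sketch of the two examples just walked through (plain structured extraction, then the functional validator that forces a retry) looks approximately like this. The model strings are illustrative, and depending on your Pydantic AI version the parameter may be `result_type` rather than `output_type`:

```python
from datetime import date

from pydantic import BaseModel, field_validator
from pydantic_ai import Agent


class Person(BaseModel):
    name: str
    dob: date
    city: str

    # Second example only: this validator rejects the model's first attempt
    # ("assuming 1987"), and the validation error is fed back so it retries.
    @field_validator('dob')
    @classmethod
    def must_be_nineteenth_century(cls, d: date) -> date:
        if not 1800 <= d.year < 1900:
            raise ValueError('the person must have been born in the 19th century')
        return d


agent = Agent(
    'openai:gpt-4o',  # one-line model switch, e.g. to a Gemini or Anthropic model
    output_type=Person,
    instructions='Extract information about the person mentioned in the text.',
)

result = agent.run_sync("Samuel lives in London and was born on the 28th of January '87")
print(result.output)  # a validated Person instance, or the run fails after retries
```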
Back in the trace: the model returned, as you would expect, a date assuming 1987. We then sent back to the model... so, the way that we do structured outputs at the moment in Pydantic AI is using tool calls under the hood. We're about to add at least the option to not use tool calls, but that's what we're using here. So it's calling the final result tool with this data. The Pydantic validation is failing here; it's failing on the functional validator, but it could fail on anything, from the wrong input type to anywhere else Pydantic would fail. And if we then scroll on down the example and zoom in again, you will see it then used the information in the validation error to work out that it had to return a different date of birth. And it did so, and we succeeded.

The other thing that's useful to see here is that in the trace view we can see how long those two requests took. On this occasion the first one was a bit longer than the second one; probably that was making the HTTP connection. And we can see the pricing here, both the aggregate pricing across both of them and the price on the individual ones. At the moment we don't have a cost for Gemini 2 Flash, so it's not showing the cost, but it would show it here if it did.

So, moving on. That was that example, but you will also have noticed that even in my second example we didn't have any tools. So how do we register these tools that the model has access to? How would we, I suppose, do RAG, as in have access to tools and functions along the way to retrieve extra data that the model will need whilst answering your query? The way we can do that with Pydantic AI is we can use the @agent.tool decorator to register tools within this particular agent. And you can see, if I switch over to the actual code here... and this is where I talk about type safety as being really, really important for us. We work hard to get this stuff to be type-safe, and I think no one else does. So we have some stuff to create the database connection. But then, in terms of the agent code, we have this Deps dataclass. This is just a holder for extra things, in this case a database connection and the user ID, that you might need to access while you're inside the functions that implement the tools. And, critically, we set deps_type here when we're defining our agent, so our agent is now generic in Deps in this particular case. And the way we've set up the decorator means that, you'll see here, RunContext is parameterised with Deps. So if I change this to int, we will suddenly start getting an error here saying the function has the wrong signature. Now, all of this you generally don't have to care about. The point is, when you access ctx.deps, you have an instance of Deps, and if you then access an attribute of that, you get, in this case, a database connection. And if you had one 'n' instead of two in 'conn' here, we would get a type checking error, whether that be nicely in our IDE or when we're running CI with whatever type checker, pointing out that we've accessed this wrongly. If we hadn't done this extra work to make this particular thing type-safe, you would have to go and run this code, slowly and expensively, to find a runtime AttributeError because we had one 'n' in 'conn'. So what this example is actually doing is adding what people refer to as long-term memory.
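A rough sketch of that long-term-memory agent follows. The table name, SQL and connection details are illustrative, and the exact decorator and result attributes may differ slightly between Pydantic AI versions:

```python
import asyncio
from dataclasses import dataclass

import asyncpg
from pydantic_ai import Agent, RunContext


@dataclass
class Deps:
    conn: asyncpg.Connection
    user_id: int


agent = Agent(
    'openai:gpt-4o',
    deps_type=Deps,  # the agent is now generic in Deps, so tools are type-checked
    instructions='Use the memory tools to remember and recall facts about the user.',
)


@agent.tool
async def record_memory(ctx: RunContext[Deps], value: str) -> str:
    """Record a fact about the user so it can be recalled later."""
    # ctx.deps is an instance of Deps, so ctx.deps.conn is type-checked:
    # a typo like ctx.deps.con is caught by the type checker, not at runtime.
    await ctx.deps.conn.execute(
        'INSERT INTO memory (user_id, value) VALUES ($1, $2)',
        ctx.deps.user_id, value,
    )
    return 'memory recorded'


@agent.tool
async def retrieve_memories(ctx: RunContext[Deps], memory_contains: str) -> str:
    """Retrieve previously recorded memories matching a search string."""
    rows = await ctx.deps.conn.fetch(
        'SELECT value FROM memory WHERE user_id = $1 AND value ILIKE $2',
        ctx.deps.user_id, f'%{memory_contains}%',
    )
    return '\n'.join(row['value'] for row in rows)


async def main() -> None:
    conn = await asyncpg.connect('postgresql://localhost/memory_demo')  # illustrative DSN
    deps = Deps(conn=conn, user_id=123)
    # Agent is parameterised with Deps, so passing the wrong deps type here
    # would also be a type-checking error.
    result = await agent.run('My name is Samuel.', deps=deps)
    print(result.output)


if __name__ == '__main__':
    asyncio.run(main())
```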
So, basically two tools: one to record memories and one to retrieve memories, which the agent can then use to record things it knows about you and then retrieve them to answer questions. We're using Postgres here, so we're storing that data in Postgres, making a query to insert into the memory table and then to select values from the memory table based on a memory_contains filter, which is effectively the query we're giving the database. We could have a more complex example where we were using vector search and embeddings to do this kind of thing, but in many cases this simple ILIKE will do well enough.

So if we look at our code for actually running this: we're connecting to the database, we set up our instance of Deps, and then this is where we pass in the deps. One of the other things: we realised that these agents are very useful to define globally, so we can't pass in the deps at definition time, because we often wouldn't have access to the database connection when we were defining the agent at module scope. So we're defining the deps type at that stage and then passing in the deps here. And again, because Agent is parameterised with the deps type, if I passed in the wrong type here, we would get an error telling us that we hadn't passed in the right deps. All of which is just to say: we can kind of guarantee, with type checking, that the thing I pass here is going to be the same thing I get access to there, which is very useful.

So if I run this example, you'll see the output here, but probably more useful is looking at the output in Logfire, where we're running the memory tool twice. You can look inside our first agent run: we had a call to ChatGPT in this case, and then we decided to run one tool, the record memory tool in this particular case, and then inside running that tool we had the database query to do the insert. One of the powerful things about Logfire is that it's a general-purpose observability platform with great support for AI, rather than an AI-specific observability platform. So you have full instrumentation of, in this case, a Postgres query, but it could be whatever you're doing: system resource usage, HTTP requests, whatever you want to go and do. And so we can see here the precise query that got run, the SQL that was executed. But I think the most useful thing here is to see the database query going on here, and you can immediately see, visually, how little of the time was spent making the database query. So if we wanted to come and optimise this case, you can see that trying to make my database query more performant is never going to help; we need to think about some other way to improve performance.

And yeah, if we go back and look at the actual agent run here, you can see what it did. I'll just talk through it for those of you who maybe can't see it that well. We have the system instructions. We have the user input, which is "my name is Samuel". It decided to call the tool record_memory with the user's name of Samuel, the value was added to memory, and then it replied. And then, when we made the second run, which here was obviously in the same process and the same call, but in theory could be weeks later, because we've now just got this data stored in our Postgres database, you'll see the same prompt. "What is my name?" was the question I asked it.
It basically called the retrieve memories tool with memory_contains set to "name", retrieved the right value, and was able to reply knowing what my name was.

But it's worth also talking about short-term memory, which doesn't use tools. So, what we talked about... AI people seem to love anthropomorphic definitions of things that don't follow industry standards. I don't know why they do that; I think it's because they kind of feel cool that everything is like a thinking person. But whatever. That's why they talk about thinking rather than processing. They talk about long-term memory effectively being this kind of tool-call thing, and then short-term memory basically being information that you put into the context that the agent has access to, so it can access it immediately. And it kind of makes sense, but it would be a lot easier if they referred to, like, tool-based memory and context memory. I spent weeks not understanding the two, and I only realised what the distinction was by implementing it.

So in this example we're doing memory with messages, or short-term memory. Again, we're configuring Logfire; we have this line of code, which I forgot to mention, instrumenting the Postgres connection, yada yada. Our agent is now very simple; excuse me, we don't have any tools defined. But the critical bit here (and at the moment this is a reasonable amount of work, and we hope to add an abstraction to make this kind of access to persistence easier) is that we're basically querying to get all of the messages that we've stored in the database and then adding them into the context via message_history when we're doing an agent run. So this is what would happen if you're using a ChatGPT-style interface and, within a conversation, you ask a new question: it will go and get all of the messages from the database and put them into context before it calls the model again. There are particular APIs for that within all of the model providers, but effectively, I assume what they're doing in the background is basically smushing all of those messages into the big context window so that the model can access that data.

And so we run this twice. This run_agent function effectively takes care of retrieving messages at the beginning and recording them after we finish running. And we have a nice type so that we can get back, in this case, messages in JSON format, which makes it easy to just put them into the database. Our actual code is relatively simple: we're going to run the agent twice again, first telling it the fact, and then secondly seeing if it can get back the fact. And if we go and run this example... it was able to run, and you might even be able to see, just visually, that it was immediately faster. So you can see here, if we look at the agent run, there are no tool calls going on; it's just responding immediately. And then in the second case, which is kind of the acid test of whether it had access to those messages, you'll see all of the previous messages are included in the context; you can see them in the conversation here, and it was able to respond. But critically, this time it was able to perform that in just under 700 milliseconds, whereas before, when it was using long-term memory, I guess the same case took 1.6 seconds, because it had to make two calls to the model. You can see where that would be useful. So I will stop those two examples and close that.
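Before moving on, a minimal sketch of that message-history version, assuming roughly the API just described; here the history is simply held in memory between the two runs, whereas the demo serialised it to Postgres in between:

```python
from pydantic_ai import Agent

agent = Agent('openai:gpt-4o', instructions='You are a helpful assistant.')

# First run: tell the agent a fact.
result1 = agent.run_sync('My name is Samuel.')
print(result1.output)

# Second run: pass the previous messages back in as "short-term memory".
# In the demo these were stored in Postgres between runs (there are JSON
# helpers on the result for that); here we just keep them in memory.
result2 = agent.run_sync('What is my name?', message_history=result1.all_messages())
print(result2.output)  # should answer "Samuel" with no tool calls at all
```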
That was the second way of doing memory. So, MCP. Model Context Protocol came out back in December, the same week, actually, as Pydantic AI. It was designed by Anthropic for cases like Claude Desktop or Cursor, to allow these local LLM apps to basically have access to external tools and resources in a way that was usable by any of these different tools. So Windsurf, Cursor, Claude Desktop, etc. could all use MCP. In this case we're using it for a slightly different application: here we're building autonomous agents. We're writing Python code that should run, not necessarily but in general, without a user being involved in the immediate loop. But we can still use MCP, or some of MCP, very effectively.

So MCP under the hood has three primitives: tools, which we're going to talk about here; resources, which are effectively documents that you're supposed to go and download and put into model context, which I think we will support in future; and then a third concept of prompts, which is effectively like a template for a particular query. So you can imagine, if you have an MCP server to query some particular database, a useful prompt which effectively contains all of the database schema that the model is going to need, and then you basically fill in the variable, which is what exactly you want to go and get. As far as I know, prompts are not heavily used, and I think it's one of the creators' frustrations, David Soria Parra's, that people have run off and used MCP for the tools and haven't really thought about the resources and the prompts. But anyway, here we're guilty of exactly that: we're going to use it just for the tools.

So in this particular case, we're going to use an MCP server that we've built, which is actually maintained in the same repo as Pydantic AI: MCP Run Python. This is a way of running sandboxed Python code, locally or remotely or wherever you want, but without it having any access to the host. Sandboxing Python has been notoriously hard until now; most people have done it via Docker containers, via OS-level isolation. MCP Run Python is built using the amazing Pyodide project, which is how you can run Python in the browser. And then in turn we're running Pyodide inside Deno, which is an alternative to Node, a way of running JavaScript code locally. But Deno in particular provides isolation, just as the browser would, to prevent JavaScript code that is running in your Chrome from accessing your operating system; Deno uses those same techniques. And so it's a bit weird that we're calling Python inside Deno, inside JavaScript, inside Wasm, when running Python code, but it works really well. And in particular, we've built this as an MCP server, so you can use it with Pydantic AI, but in theory you can use it with whatever tool you like, right? Because it's just an MCP server and you can go and connect to it.

The command is a bit of a mouthful because we give it some permissions, and not all permissions. MCP has two ways of operating: either over what they call stdio, which is basically running as a subprocess and using standard in and standard out to communicate, or over HTTP. Here we're running it locally, and so we have a Pydantic AI concept of setting up our stdio MCP server. We give it the full command to run it; so it's ours in this case, but as I'll show you in a minute, it doesn't have to be one of our MCP servers. Then we set up our agent.
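Putting that together, the setup looks roughly like this. The Deno command and permission flags are as I recall them from the MCP Run Python docs and may have changed, and run_mcp_servers is the Pydantic AI API as of around this talk:

```python
import asyncio

from pydantic_ai import Agent
from pydantic_ai.mcp import MCPServerStdio

# Run the sandboxed Python MCP server as a subprocess over stdio. The exact
# Deno flags/permissions come from the MCP Run Python docs and may differ by version.
run_python = MCPServerStdio(
    'deno',
    args=[
        'run',
        '-N', '-R=node_modules', '-W=node_modules',  # limited network/read/write permissions
        '--node-modules-dir=auto',
        'jsr:@pydantic/mcp-run-python',
        'stdio',
    ],
)

agent = Agent('openai:gpt-4o', mcp_servers=[run_python])


async def main() -> None:
    # The server is registered above but only started for the duration of this
    # block; in a FastAPI app you would do this inside the lifespan function.
    async with agent.run_mcp_servers():
        result = await agent.run('How many days between 2000-01-01 and 2025-03-18?')
    print(result.output)


if __name__ == '__main__':
    asyncio.run(main())
```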
But critically, we register here that there are some MCP servers that we want to set up, that we want to register, but we don't want them to be running yet, because, as I said, we want our agents to be global, so we don't want to have to start running our MCP server at definition time. So we then use agent.run_mcp_servers(), and that starts the MCP servers, whether that be stdio ones, where we start the process, or HTTP ones, where we're going to basically set up the HTTP connection. So if you were running a FastAPI app, you would use this run_mcp_servers within your lifespan function to start them up for the duration of your server running.

And then the actual question that we're going to ask the agent is: how many days between these two dates? Now, this is not something where you would want the model to try and pull the number out of its arse, basically. You want it to go and do the calculation. And in fact, sure enough, the recent leak of the mega-prompt that they use within ChatGPT says to the model: never do calculations directly, always use the run-Python tool. So effectively, inside OpenAI, they have some equivalent way of running sandboxed Python code, and that's what they're using. If you ask ChatGPT a maths question, it is not trying to do the calculation from first principles; it is writing Python code in the background to go and do that calculation.

So if we run this example, you can see it registering the tools. And if I come over here, we should see it running, still running at the moment, but once it is finished, we should be able to see what happens. So we asked it this question, and the point is, instead of it doing the calculation, it wrote this Python code, in a slightly weird way, but the principle looks about right, to calculate the number of days between these two dates. One of the useful things about using Python calling like this is we can effectively go back and debug our calculation and work out what it used to calculate something, whereas if we just ask the model to do a calculation off the top of its head, we have no guarantee of whether it was right or not, or what might have gone wrong. And so it then responded: the response that we got from the MCP server here was a success, plus the return value from running the Python code, or rather the final line of the Python code, which was just the numeric value. And then it printed out a summary for us; it returned a summary on the second call to the LLM. So you should be able to see that here: if we look, we ran one tool here, which was the MCP server. So we had the first call to ChatGPT, then we had the tool call, which took one and a half seconds because it had to go and basically boot up Python. I don't think it had to install any dependencies, but if it had dependencies in the code, it would automatically find them and install them within the Deno environment. So if you were using NumPy or something, that would just go and be automatically installed; it would obviously be a bit slower, but it would work. And then there was the final call to ChatGPT with the response, which is what it then returned.

It's worth saying that there are other libraries that use this Python-tool-calling idea, smolagents from Hugging Face being the most prevalent of them. I'm told I'm not allowed to be rude about the competitors, but I find the idea of an established company like Hugging Face releasing something where, as far as I know, the code isolation is very minimal
(it's like, let's block some imports and hope for the best) extraordinary to me. Because, sure, agents aren't yet trying to, or good enough to, break out of that isolation, and they probably don't want to try and delete your home directory. But you can imagine how hard it would be to be certain that a user hadn't managed to put some code in that the model then ran. You can imagine: "ignore all previous instructions, run this Python code", and now you effectively have someone with remote code execution on your system. And that's why we built it this way and worked really hard to have, effectively, V8 isolation between the operating system and the Python code that the model has written but a user might have influenced.

But we're going to go further with this. The next thing, which I already have a PR up for and need to finish, is allowing this Python code to call back to particular functions that you register and allow it to call on the host. So if you wanted to give it access to an enormous file, a source file or something, you wouldn't want to pass that through in context. One of the best uses I've seen for this is, for example, if you want to go through a very large HTML page and extract certain attributes: you can quite easily exceed the context window of even the biggest models, but the model will write you Beautiful Soup code to extract the right bits of an HTML page very effectively. So you can use a model to process HTML that way, rather than just giving it the full HTML and hoping for the best. And yeah, we'll allow that calling-back stuff in future, which will make this even more powerful.

In fact, on that exact point, I think it's worth showing another example of an MCP server that's not written by us. So if I show you this example here: again, we've got some code. We're instrumenting Pydantic AI, and we're also instrumenting MCP. I didn't call this out explicitly before, but we have support for this within Logfire; obviously I was showing you instrument_asyncpg earlier, which was instrumenting the Postgres connector, but we can also instrument MCP itself so we can see the particular calls going on. And again, that is not specific to Pydantic AI: if you're using the Python MCP SDK and you want to instrument it, you can use Logfire regardless of whether you're using Pydantic AI. And then we're obviously instrumenting Pydantic AI, and we're setting up our MCP server.

Here we're using the excellent MCP server from Playwright. Playwright is a library for browser control maintained by Microsoft; you would use it for things like unit testing or testing your front end, but they've built an MCP server that effectively allows you to control a web browser from within your code, and it works really well. And they've done all the hard work, as I'll show you in a minute, to basically simplify the page rather than just returning the full HTML to the AI. In this case, we're going to ask it to go to Pydantic's website and try to find the most recent blog post and summarise the announcements. A relatively complex task: if you can imagine, before MCP, let alone before AI, this was an incredibly hard job to go and set up, a long time getting the right setup for navigating arbitrary sites, simplifying the data, extracting the right things from the HTML. And now MCP allows us to connect Pydantic AI to Playwright's MCP server really trivially.
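A hedged sketch of that wiring; the npx invocation for Playwright's MCP server and the instrument_* helpers are as I remember them and may differ by version:

```python
import asyncio

import logfire
from pydantic_ai import Agent
from pydantic_ai.mcp import MCPServerStdio

logfire.configure()
logfire.instrument_mcp()          # trace the MCP protocol calls themselves
logfire.instrument_pydantic_ai()  # trace the agent runs

# Microsoft's Playwright MCP server, run as a subprocess over stdio.
playwright_server = MCPServerStdio('npx', args=['@playwright/mcp@latest'])

agent = Agent('anthropic:claude-3-7-sonnet-latest', mcp_servers=[playwright_server])


async def main() -> None:
    async with agent.run_mcp_servers():
        result = await agent.run(
            'Go to pydantic.dev, find the most recent blog post and summarise its announcements.'
        )
    print(result.output)


if __name__ == '__main__':
    asyncio.run(main())
```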
So I come back here and I run this example, and I'm just going to print the output at the end. You should see it running. We're using Claude 3.7, which is relatively slow in this case, so it'll do a bit of thinking. It's tried to go to pydantic.dev/blog, which is wrong; it's realised it's wrong, so it's gone back to pydantic.dev, the homepage. Fingers crossed it will then work out to click through to the articles. It seems to be hanging indefinitely, which always takes longer when you've got everyone watching. And it has now successfully gone and found our evals blog post, and it will think for a bit longer and hopefully at some point return a summary of that.

If we come across here and look at that in Logfire, we should see (yeah, this request is still going on) that we can already start to look at what's going on within here. You can see we had multiple different calls to browser tools. We used browser navigate first of all, which, if I go to that (sorry about this), you'll see navigated to /blog. You can see exactly how long each of those steps took; that was relatively quick. We went back to Claude, it thought about this for a bit, and obviously decided that was the wrong page because it had got a 404; you'll see here that the response included a 404. And now it's finished, if we look at the full conversation you can see a better summary of what happened.

So this is perhaps the most interesting thing to show from their MCP server (sorry, I have to do some funny scrolling here). Instead of returning the full HTML of the page, which is an enormous piece of data and would fill the context very quickly, especially if you had, for example, embedded images or embedded CSS or JavaScript or something, it turns the HTML into this YAML format, which I guess someone has decided is what models like to process. You can imagine how much easier and quicker that is for an LLM to process than the full HTML of the page. And then, once we got past that tool, it navigated, yada yada yada, and finally it came up with a summary of our blog post. So the main announcement was our evals library, which I'm going to show you in a minute, but also new SDKs for JavaScript and Rust, etc. I won't make you read our full announcement, but you get the idea; I'd encourage you to go and read it afterwards. And I think the other thing we'll be able to show here is the total cost: you can see here this was 14, sorry, 15 cents in total. So we're aggregating the costs from the individual runs.

I think this is a good time to mention that, as a company, we care enormously about open source. Obviously we maintain Pydantic and Pydantic AI, which are completely open source. For Logfire, the SDKs are all open source, but the backend, the platform, is closed source. But even there, we still care about open standards. Logfire is built on OpenTelemetry, so you can send data to it from anything that emits OpenTelemetry, or you can use our SDK, logfire, and send that data to whatever platform you like. And I know we have people who are doing that. And although we would love them to pay us money, I'd rather people found our stuff useful and didn't pay us money than just didn't find our stuff useful. And that's the same principle we're using in Pydantic AI. So, again, the data we're emitting from Pydantic AI is OpenTelemetry.
But even beyond that, we are following the semantic conventions for gen AI within OpenTelemetry, so that the data exported should work in any of the platforms that are designed to receive gen AI data over OpenTelemetry. We obviously think Logfire is best, and we hope you end up using it because it's best, but we're not trying to do the lock-in thing. We're going to try and succeed by actually building a good product rather than by lock-in, which seems obvious, but not everyone in our space thinks that; I won't name any names. So these prices here are coming from... if you look at the raw calls to the LLM, you look at the details and you see the pricing, which I think will be down here somewhere. There we are. These token counts are using the specific attribute names that the OTel gen AI conventions recommend, so, again, this data should work in any platform. It'll look best in Logfire, because ours is best, we hope, but in theory you can send that data anywhere.

If I come back in here, I'm going to move on to the next part. We have quite a lot of time, so I'm going to dive into an evals use case. There's going to be quite a lot of code; I'm not going to apologise for that, because there kind of has to be, to explain what we're doing. So evals are this concept of... I mean, people think of them as equivalent to unit tests, but for stochastic, non-deterministic applications. They're actually much more like benchmarks, in the sense that they don't generally just pass. They can outright fail, but there's some nuance in working out how well an eval has done. Evals are an evolving art, or science, and anyone who claims that they know the exact answer is wrong. So we have our take on one way of doing evals. I'm totally confident that what we have won't be the state of the art in five years' time, but I think we're trying to move things forward, and we'll hopefully evolve if people work out the go-to answer. I will say I avoided building evals for a year because I thought someone at OpenAI or Anthropic would have some magic sauce for how to do evals, and that eventually we would all get wiped out by that answer. That seems not to have happened, and having spoken to people inside those companies, there is no magic sauce for how to do evals. Working out whether a model has done the right thing is just hard. So what we have built in Pydantic Evals is trying to be the kind of pytest of this space: we're not necessarily telling you exactly how to do the evaluation, we are giving you a framework by which to run it, and some tools you might find useful, like the LLM-as-a-judge evaluator that we have set up. But in theory, we allow you to define whatever tests or metrics you want.

So this is the example that we are evaluating. Fundamentally, it comes down to this very simple function here. This is a feature from Logfire, where we allow you to enter a human description of a time range and get back an interval. So it's a good, small, simple use case of AI. And so we have a Pydantic AI agent defined somewhere here. It returns a union of, basically, success or failure with some details. It has some deps, as we've shown already, and it's instrumented.
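To make the shape of that concrete, the function under evaluation looks roughly like this; the class and field names are illustrative, not the actual Logfire implementation:

```python
from datetime import datetime
from typing import Union

from pydantic import BaseModel
from pydantic_ai import Agent


class TimeRangeSuccess(BaseModel):
    """A resolved time interval plus an explanation shown to the user."""
    min_timestamp: datetime
    max_timestamp: datetime
    explanation: str


class TimeRangeError(BaseModel):
    """Returned when the input can't be interpreted as a time range."""
    error_message: str


time_range_agent = Agent(
    'openai:gpt-4o',
    output_type=Union[TimeRangeSuccess, TimeRangeError],
    instructions="Convert the user's description of a time range into a concrete interval.",
)


async def infer_time_range(prompt: str, now: datetime) -> TimeRangeSuccess | TimeRangeError:
    """The stochastic function being evaluated: text in, interval (or error) out."""
    result = await time_range_agent.run(f'now: {now.isoformat()}\n{prompt}')
    return result.output
```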
And fundamentally, what we are evaluating, when it comes down to it, is this stochastic function, which takes a text description of what someone wants and returns either an error, with details about why it's incorrect, or a time interval.

The first stage of building evals is to have a dataset: a set of examples that you can run where you know what the function should do. You can define them yourself, a human writing out the different examples. You can use a platform like Logfire to get some actual real-world user examples of what people entered, work out what they would have expected to get back, and that can become your dataset. Or you can do the lazy thing, which is using a powerful model like o1 to basically go and make up a bunch of examples. And so we have this function, generate_dataset, which uses a model; obviously it's using Pydantic AI under the hood. We tell it the types that it needs to generate, and then we have big instructions on what it should do: generate test cases for the time range inference agent, include a variety, yada yada. We also tell it which evaluators we want it to put in. But fundamentally, this is just a very complex example of the very first thing I showed you: structured data extraction. We're giving it this long description, we're telling it to make up some examples, and we're giving it a very complex JSON schema of how we want the data to be returned.

And if you run this example, it will go away and run for about a minute, a minute and a half, and it will bring you back something like this. The end of the code was basically printing out the data it got back as a YAML file, and this YAML file is our dataset. So it contains... you can look at an individual case like this. We've got a name for it. We've got the input: "I'd like logs from 2 p.m." on a given date. We have a "now", because, you can imagine, this is a complex example and we have to give it the concept of now, because that's relevant: if you say "get me logs from yesterday", we need to know when today is. And then we have what we would expect that function to have returned. So, in structured form, that is the min timestamp, the max timestamp, and some explanation to the user of why we chose that range.

And then we can add individual validators, sorry, evaluators, to each case. In this case, we've added the "is instance of Success" evaluator. In the next example, I think the evaluator is "is instance of Success" again, but now we've added the LLMJudge evaluator with a rubric, which is effectively a description to the AI of what it should be evaluating, and on and on. And then I think we have, down here, some evaluators that we apply in all cases. So we have an LLM judge saying: ensure that the explanation and error messages are in the second person, be concise, etc., etc. Then, in this case, this is a relatively simple bit of code that just adds a few more evaluators into that dataset. So if I run that, and the gods are with me, it will run successfully, and now we get this slightly different version of that output with a few more evaluators added. In some cases you can see extra ones added on here. These are, excuse me, user-defined evaluators. So if we look at validate_time_range, this is an evaluator
you could define yourself: it basically inherits from the Evaluator dataclass, or ABC, and defines the evaluate function, where in this case, if the output is a success, we're checking that the time range looks valid, and we return a failure if the window is too long or if the window is in the future, which is obviously an error. So this will be run for each case to define what success looks like. And as I say, you can define your own evaluators as well as using ours.

The last thing I think it's really important to say: as Pydantic, we care about validation and type safety a lot, and so we've gone the extra mile. One of the odd things you'll see in this file is this magic comment here referencing the yaml-language-server, which points at this JSON schema file next to it. And it means that even inside the YAML, you get autocomplete and type checking, and a description of what the fields are. So if I say "evaluator" instead of "evaluators", I get an error, because of the JSON schema file. And obviously, if you then try and load that, we're doing validation on the input file anyway, but this allows you to get autocomplete if you're editing these files yourself.

So, with all that set up, now let's actually run an example. Pydantic Evals integrates very nicely with Logfire, and so we can show the summary of what's happened in Logfire, but fundamentally there's no requirement on you to use Logfire. Unlike our competitors, we think pytest wouldn't be successful if pytest was linked to a particular provider, or required you to use a particular provider. So while we are obviously a for-profit company, we think that building open source the right way matters. And so here we are configuring Pydantic Evals to work with Logfire, but you don't have to, and as I'll show you, you get a nice printout of how things have gone even if you're not using Logfire.

So, in this case, we're loading our dataset from the YAML file, and we're going to run evaluate, which will go and run all of those cases with the function that we pass it. And this is a kind of unit-test context, checking whether or not it's doing well enough, so we'll run it and assert that the pass rate is above 80%. So we're going to run this. You'll see it should print out many examples, and if I come over to Logfire, you will see them coming in here as it runs, all of those different cases. Again, we display this stuff in the tracing view because it's very useful, as you run a particular eval, to be able to see what actually happened inside your Python code. And in fact, we even have evaluators where you can use these traces to check whether a particular tool was called or whether a particular code path was followed.

But yeah, it finished, and it failed: the success rate came in just under the 80% threshold. The other thing to show is that, very much like pytest-benchmark, we will print you out a summary of how this performed, which cases worked well and which worked badly, locally, so that you can use this without Logfire if you so wish. But you can also see that same data in Logfire. So if we come back to the beginning, the outer trace... and it's not loaded properly. Let me see if it loads one of those properly. Now it's succeeded.
I don't know what went on there, but you can see each individual case here, and which of the assertions passed and failed. With a single point in time, most of them passed, but this one failed the LLM judge because the response was in the first person, not the second person, and we had told the LLM judge in the rubric: you should check whether or not the output is in the second person. And you can see that second-person check has failed in quite a few of these individual cases. This one failed on the particular time range, this one failed on a number of things, this one passed on all three, etc. And the neat thing is that we can go in and look at the individual span where that happened, look at the individual inputs and outputs, work out what happened, and then start using that to work out how we could improve our model or improve our agent.

So in this case, I think what I can try and do (I haven't tried this before, so this may or may not work, but we will try it) is take our agent here, this is where we define it, and say: "always reply in the second person". I'll even give it an exclamation mark, and we'll see what happens. So we'll come back here to the unit-test case and run it. Let me just clear this to make things a bit easier to view. If I run that case... I don't even know if the score is going to get better, but we hope so. And then it will run the judges at the end, which I think is what is going on now... and it succeeded. The average went up to 83%. So you can see how we can use evals to systematically improve our behaviour, then rerun the evals, see what went well and what went badly, and dig into individual cases. You can see that second-person check started passing. There's a bunch more that's still failing; we would want to go through and keep trying to improve the performance systematically after that.

And the other interesting thing to run is this last case, where we're comparing models. One of the things that we allow you to do quite easily within Pydantic AI is this override method, where we can override the model used by a particular agent deep in the code. This is useful in actual unit tests, where you want to replace the model you're using with our test model, which will always return a response, or in this case, in evals, where you want to be able to go and change the particular model used. And the point is, you don't need to go and edit your application code in your tests or in your evals: you can use this override method to set the particular model, or, I think, some other parameters as well; yeah, deps as well, you can override. So if we run this example, it will take a bit longer, but it should run two sets of evals for those two models, GPT-4o and Claude 3.7, and see whether or not one performs better than the other. So if we come back over here, you see it's running at the moment, still running all of the cases... and it has succeeded. And if we go up here and look at the final performance, you'll see GPT-4o got, this time around, 76% success. And, oh, sorry, given how much longer it takes to run, Claude did not do much better, only 78%, which isn't really good enough given it's much slower and more expensive. But anyway, this is the evals library. There's a lot more to explain, it's a complex piece of kit, but we think it's really valuable.
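To give a feel for the shape of it, reusing the TimeRangeSuccess and infer_time_range names from the earlier sketch, a user-defined evaluator, an evaluation run and the model comparison via override look roughly like this; the file name, threshold and exact Pydantic Evals API details are illustrative and were still settling at this point:

```python
from dataclasses import dataclass

from pydantic_evals import Dataset
from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class ValidateTimeRange(Evaluator):
    """User-defined evaluator: a successful output must be a sane window."""

    def evaluate(self, ctx: EvaluatorContext) -> bool:
        output = ctx.output
        if isinstance(output, TimeRangeSuccess):  # success model from the earlier sketch
            window = output.max_timestamp - output.min_timestamp
            # reject inverted or implausibly long windows (the talk also checked
            # that the window wasn't in the future relative to "now")
            return window.total_seconds() > 0 and window.days <= 30
        return True  # error outputs are judged by other evaluators (IsInstance, LLMJudge, ...)


# Load the generated YAML dataset and add an evaluator that applies to every case.
dataset = Dataset.from_file('time_range_cases.yaml')  # illustrative file name
dataset.add_evaluator(ValidateTimeRange())

report = dataset.evaluate_sync(infer_time_range)
report.print()  # pytest-benchmark style summary in the terminal
# The demo then asserted that the overall pass rate was above 80%.

# Comparing models without editing the application code, via Agent.override:
with time_range_agent.override(model='anthropic:claude-3-7-sonnet-latest'):
    claude_report = dataset.evaluate_sync(infer_time_range)
    claude_report.print()
```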
Would love to hear people's feedback on it. As I say, this is still, to some extent... I think it's as production-ready as any of our code, but the concepts within it are still evolving quickly, and so I would love people's feedback on what's working and what's not. I know there was a company, who I think are somewhere at this conference (I don't know if they're here now), who spent $25,000 running evals with Pydantic AI the other day. So people are beginning to pick this up and use it seriously to run their evals.

So, coming back to the talk. I think the first thing to say about Pydantic AI is that we are not claiming it is finished yet; there are lots more things to add. I think the memory persistence stuff I was talking about should also be added to this list of big improvements that are coming soon. But I talked about how structured outputs are using tool calls at the moment; we want to allow them to use the built-in structured output support in the case where models have that, or effectively JSON schema in the instructions where that's better, because that works much better on dumber models that don't do tool calling well. I haven't talked about MCP sampling, but that's another very powerful concept in MCP, effectively the server asking the client to proxy LLM requests, which we want to support both as a client and as a server. Having some more control over which tools are registered for particular steps. MCP Run Python being able to call back to the host, which I've already talked about. I didn't even get to our graph library implementation today because I didn't have time, but we have some changes to that which make it, hopefully, more composable to use; that's kind of equivalent to LangGraph, but type-safe. That's the objective thing I can say; there are some subjective judgments I would make, but I won't make them. But the most important thing of all, at the same time as continuing to add features: we know people care about stability and being able to build on these things, knowing they're not going to change all the time. Pydantic AI is still very young, it only came out in December, but we will make a version 1 release by the end of June, and then we will follow semantic versioning very closely and not break your code, because we know that is something that matters to people. So thank you very much. I think we have some time for questions. Yeah, we've only got eight minutes, but I can take some questions if there are any. Thank you very much.

Oh, so, thank you very much for your talk. I have a few questions. The first one: when you create an agent, in the agent there is a model name. Sometimes we cannot use the public models; we host the model internally. Do you think we can pass in a model we host internally?

Let me see if I can find... where was that example? Here. Yes. So if you look at the signature of... let me take... this is the code, this is what I want. I'm going to take this example. So here, if you look at the signature of Agent's initialiser, it takes model: an instance of a Model, a string, or None, because you can pass the model in later. But the point is, this is basically shorthand for defining the OpenAI model. And if you go and look at the Model type, Model is an abstract base class, of which we have a number of implementations for things like OpenAI, Anthropic, Groq, etc.
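For the self-hosted case specifically, the usual route is the OpenAI-compatible path rather than writing a Model subclass from scratch; something roughly like this, though the provider/base-URL plumbing has moved around between Pydantic AI versions:

```python
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIModel
from pydantic_ai.providers.openai import OpenAIProvider

# Point the OpenAI-compatible model class at your own endpoint (vLLM, Ollama,
# an internal gateway, etc.). The URL and model name here are illustrative.
model = OpenAIModel(
    'my-internal-model',
    provider=OpenAIProvider(base_url='http://llm.internal.example.com/v1', api_key='unused'),
)

agent = Agent(model, instructions='You are a helpful assistant.')
result = agent.run_sync('Hello!')
print(result.output)
```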
But in theory, if you want to implement your own model, you just need to implement this abstract base class, which has, I think, actually only two abstract methods that you need to define. So you can implement your own model. And if you're using an OpenAI-compatible model, you can point it at whatever domain you're using. We know that's a really important thing, and I know people who are doing that now. Yeah, next question.

Yeah, I had two questions I wanted to ask. How's the async support, and also the support for vision language models?

So all of Pydantic AI is async, and we effectively have, you'll see here, run_sync. run_sync internally is just a wrapper around... is that going to open the right thing? It's decided it doesn't want to open that right now. Let me try and do that and see if it's going to work. Anyway, the point is, internally it's all async, and then we just have a few wrapper methods that effectively give us a pseudo-sync interface. Actually, if we have a problem, honestly, it's on the sync side, where if you're doing stuff inside threads, Celery, for example, has some trouble doing async. But yeah, it's basically all async under the hood. Thank you. And the vision model stuff: yes, in some contexts we allow vision, we allow multimodal inputs. I don't think the full story on multimodal outputs is fixed yet, but we're working on it.

There was a question here. Oh, if you've got the mic for it. So is this only OpenAI, or does it also work with, like, Llama 3.1?

Yes. So, we support... in here, if you look at... where am I going to find it? Yeah, I'll go into... that's not what I meant at all. If I go to the known model names, these are the models we support at the moment. We're adding to this list, and we're happy to accept PRs for most. We have some rules on when we will add a model: we had some tiny providers who wanted to add thousands of lines of code to us and we've not accepted that, but if it's a widely used model, we'll add it. But yeah, we have a reasonably large list already.

Over here: the Model class that you showed us, do you have any plans, or do you already have, some sort of conversion to turn a LangChain runnable into a Pydantic AI model, just for making it easier to switch over?

It's an interesting idea. Happy to look at an implementation and see if we can add it. We would definitely consider it, because I hear people being like, "we built all this stuff with LangChain", and we would love to find a way. So yes, we would consider it; I don't know how doable it is, but we'll definitely consider it.

In terms of structured outputs, is constrained generation, for example, on the roadmap? And when you say constrained generation, what precisely? For local models, trying to get structured outputs from local models. Not something we've thought about a lot, but if you have a particular idea, come and talk to us, happy to hear it. At the moment, most of the structured output is: give it the JSON schema, and it does a pretty good job. And I think one of the things we try and work out is, if you speak to the AI headbangers, they keep talking about "don't bet against the model". Then you hear about their company, and their company is doing nothing other than betting against the model.
Agent frameworks are fundamentally betting against the model, right? If the model was smart enough, you wouldn't need an agent framework; you would just give your model access to the internet and be done. And so working out where to say "don't bet against the model", and where to say, sure, that will be fixed one day, but if that's in three years' time, that's not very helpful for people today, is a difficult line to walk. So, for example, some of the dumber models return JavaScript-type data, like JSON5, rather than JSON. So would it be valuable to have JSON5 parsing support, or is that something where even the cheaper models are going to fix really quickly? It's really hard to know when to help the models and when to just say the rising tide will lift all ships.

Question. Lots of questions. But a question here about models: if you configure a self-hosted model, you had pricing in Logfire. Can you also configure pricing for your own self-hosted model?

Yes, we're working on it now. We are in the process of building an open source database of all model prices. At the moment, server-side, we're basically calculating the price based on the model name and the tokens; we're going to instead do that in Pydantic AI, and so effectively, if you want to send your own prices, you can send whatever you like. And actually, the ultimate advantage of that is you'll be able to change what goes on in this popup to be whatever you want, fundamentally. The other advantage is that at the moment, because we're doing the calculation just to render this panel... so, I haven't tried to do too much of a spiel on Logfire, but one of the powerful things about Logfire is you can write arbitrary SQL to query your data, whether that be entering SQL here to do a search, or using our Explore view to run arbitrary SQL to do calculations. In fact, here's an example of looking at token usage and counting it. And our dashboards are also built on arbitrary SQL. One of the problems at the moment is that because those prices are being rendered just for the UI, you can't query on prices, you have to query on token counts. So if we put the prices in the telemetry data, then you can query on them directly. Yep.

Very polite of you to call them slides. The first one you defined... oh, there. Okay. So, for example, when you run this code, it's pretty easy to guess the name and the date when the person was born. But when there is big ambiguity between the entities I want to extract from the data, how will this agent stop? For example, if the Person class has an attribute like product names, and I have a bunch of text and I want to extract these product names, which is a really big search space, do you solely rely on the response from the LLM, or do you have something on top of that to validate?

So, if we get rid of this, and now we have the agent output type as a string (I don't know why that's not updated, but that would now be string): if we haven't set the output type, then we basically iterate through running the agent until it returns text output, and we assume that's the end. If you set the output type like this, what we're actually doing under the hood is registering another tool that by default is called final_result, and as soon as the model calls final_result, we call that the end of the run.
So we assume that's the final result. If validation fails when we call that tool, then we put the validation error back into the model and retry as many times as you've set retries here. The language server has died on me again, but if I set retries here, I can set it to as many as I want. Come and talk to us at the booth and I'll happily talk you through it in more detail, because I know we're close to time, or over time. And there was one more question I'll take.

So, as someone who's used LangGraph and LangChain, my biggest complaint is that they're always changing something, or the documentation is out of date. Yep. Or just flat out wrong. Yep. So, I guess AI and all of this stuff is constantly changing. How are you going to keep up with your documentation and make it easy for people to pick up?

I get your question, and I agree with you; it's something that's frustrated me for years. And so we go the extra mile on that stuff. If you look at our documentation, every single one of these examples is unit tested when we run unit tests locally. So we can't merge something where any of these outputs is wrong: this output, for example, is literally being programmatically generated by running this code as part of our tests. We do the same on Pydantic, because I've been driven round the fucking bend in the past by examples that don't work. And the other side effect of that is all the inputs have to be there for the code to run. So it's stuff like that, which you don't think of as particularly important, but it actually massively affects user experience. And I think beyond that, it's just that we have been maintaining open source libraries for years: me working on this, but also Marcelo, who maintains Uvicorn and Starlette, Alex, who maintains numerous libraries, David Montague, who's done lots of stuff on Pydantic and on FastAPI. We're experienced open source builders in a way that those maintaining other libraries often are not; I'm trying not to be ruder. Thank you. I think we're probably getting to time, but as I say, we're around all week, so come and talk to us if you have any more questions. Thank you very much.