2025-05-21 | PyCon 2025 | Building AI Applications the Pydantic Way (Sponsor: Pydantic)
Building AI Applications the Pydantic Way
Tags
Media details
- Upload date
- 2025-06-05 22:19
- Source
- https://www.youtube.com/watch?v=zJm5ou6tSxk
- Processing status
- Completed
- Transcription status
- Completed
- Latest LLM Model
- gemini-2.5-pro-exp-03-25
Transcript
speaker 1: We have about one minute, but since the room is completely full... actually there are a couple of spaces right down the front here, so do feel free to come in and take the remaining seats; there's one in the middle as well. This happened last year: I said, I promise, even though it's a sponsored talk, people will come, and they put me in the same size room as last year, and again people are standing at the back. It's a relatively small screen; can people at the back see it at least approximately? Oh dear. Let me try light mode. Is that a bit better? That's even better, okay. I should be able to do most of it from inside the editor, so we'll hope for the best. There are still two places here at the front if anyone wants them, or maybe the rest of you are just wondering whether my talk is going to be really boring and you want to be able to get out easily. Cool. Well, thank you all so much for being here. I'm Samuel, and I'm best known for creating Pydantic, the original library. I assume that if you've got yourself to PyCon and found yourself in this room, you know what Pydantic is, so I'm not going to talk too much about the original library. The one most important thing to say about it is that it was first created back in 2017, so before GenAI, and it is downloaded today somewhere around 350 million times a month. Someone pointed out to me that that's about 140 times a second. How much of that usage comes from GenAI I don't know; I'm making up a number, maybe a third, but a lot of the rest is from FastAPI and from many other sorts of general usage within Python. That's why Pydantic is downloaded so much: it's not just GenAI. But, particularly relevant to what we're talking about today, it's used by all of the GenAI libraries, both the SDKs (OpenAI, Anthropic, Groq, Google, and so on) and the agent frameworks (LangChain, LlamaIndex, CrewAI, and so on). I started a company around Pydantic at the beginning of 2023, and we have built two new things as well as continuing to develop Pydantic: Pydantic Logfire, our developer-first observability platform, which I'll show you a bit of today (luckily it also has a light mode), and PydanticAI, our agent framework. But most importantly of all, do come to our booth; we're here for the next few days. We have, I think, a fun demo of Logfire, we have t-shirts, we have stickers, and we also have a prize draw, so please come along and say hi. What am I talking about today? The title is Building AI Applications the Pydantic Way. Everything is changing incredibly fast in AI; some things have fundamentally changed even since I first gave a talk like this a couple of months ago. But at the same time, some fundamental things are not changing at all. We're still trying to build applications that are reliable and scalable, and that is still hard; arguably it's harder than it was before GenAI. There are so many useful things that GenAI can do, but it is also a complete pig to work with.
And so in some ways those things are getting more difficult, and that's where I think we're trying to help you. In this talk I will use PydanticAI and Pydantic Logfire, but most of the principles I'm going to talk about transcend those particular tools. The first one is type safety. In this context I think type safety is incredibly important and only getting more important, in Python but also in TypeScript. The number one thing that is changing is that AI is writing more and more of our code, whether that's autocomplete or the full-on "Cursor, go and implement this view for me". The single most useful form of feedback these agents can get is running type checking, because it is side-effect free, it is very fast, and it's obviously going to get even faster with ty, which is about to come out. The type-safety story in Python is great and only getting better. But if you use agent frameworks that have, for their own reasons, decided not to build in a type-safe way, which is basically all other agent frameworks as far as I can tell, you forfeit that type safety. You stop the type checker being able to help you, or help the agent, develop. I'm also going to talk about the power of MCP, the Model Context Protocol. As a quick straw poll: how many people have heard of MCP, and how many think they really understand what it does? Okay. I'm supposed to be one of the maintainers of the Python MCP SDK, although I don't get much time for it, so I will try to answer that as well. Then I'll talk about how evals fit into this; I think we have quite a lot of time today, so hopefully I'll get to that. And throughout, I'll talk about the importance of observability, using Logfire. So, what is an agent? I don't pretend to have anything especially new to say on this subject, but given that we're going to talk about agents a fair bit, it's useful to have an agreed definition. As of this year it seems to be relatively well agreed upon. The definition I'll show here is from Anthropic; it's the definition that OpenAI have also adopted in their new agents library, and I think it's the model Google is using. Some of the legacy agent frameworks, LangChain for example, still have a different definition, and I think they're struggling now because everyone else seems to have agreed on this one. Barry Zhang presented this at AI Engineer back in February as the definition of an agent. It's helpful and elegant, but it doesn't actually make very much sense to me. What makes rather more sense is the pseudocode he showed on the next slide, which I have here. The idea of an agent is: it takes an environment; it takes some tools, which in turn have access to the environment; you take a system prompt (which shows this is an ancient slide from three months ago, because now everyone talks about instructions instead of system prompts); then we run the LLM, we get back instructions on which tools to call, we call those tools, we update the state, and we proceed. Even in this tiny bit of pseudocode there is a bug. I don't know if anyone can see it, but the bug is that the while loop never exits. And sure enough, that points at one of the hard and as-yet-undefined parts of the definition of an agent: when do you stop that loop? When do you know you're done?
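To make the slide's pseudocode concrete, here is a rough Python rendering of that loop. Everything in it (call_llm, LLMReply, Tool, Environment) is an illustrative stand-in rather than a real API, and the exit condition shown is just one common convention; the stub model call always returns a final answer so the sketch actually terminates.

```python
# Rough rendering of the agent-loop pseudocode described above.
# call_llm, LLMReply and Tool are placeholder names, not a real library API.
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Environment:
    state: dict[str, Any] = field(default_factory=dict)


@dataclass
class Tool:
    name: str
    fn: Callable[[Environment, dict[str, Any]], str]


@dataclass
class ToolCall:
    name: str
    args: dict[str, Any]


@dataclass
class LLMReply:
    text: str = ''
    tool_calls: list[ToolCall] | None = None


def call_llm(history: list[dict[str, str]], tools: list[Tool]) -> LLMReply:
    # Stand-in for a real model call; always returns a final answer here.
    return LLMReply(text='(model answer)')


def run_agent(env: Environment, tools: list[Tool], instructions: str, prompt: str) -> str:
    history = [
        {'role': 'system', 'content': instructions},
        {'role': 'user', 'content': prompt},
    ]
    # The slide's `while True:` bug: deciding when to stop is the hard part.
    while True:
        reply = call_llm(history, tools)
        if not reply.tool_calls:          # one common convention: plain text means "done"
            return reply.text
        for call in reply.tool_calls:     # otherwise run the requested tools...
            tool = next(t for t in tools if t.name == call.name)
            result = tool.fn(env, call.args)   # ...which may read or update the environment
            history.append({'role': 'tool', 'content': f'{call.name}: {result}'})
```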
And sometimes that's obvious, at least to a user, but sometimes it is non-obvious, whether to a human or to an LLM, and exiting can be a tricky thing. So, enough pseudocode; I'll show you some actual code. This is a very simple example of PydanticAI. We have a Pydantic model, Person, with just three fields: name, date of birth (which is a date) and city. We define our agent here; we're going to use OpenAI GPT-4o, but we could use Anthropic, Google, Groq or a whole bunch of other models. One of the main reasons people like using these agent frameworks is that they give you model agnosticism: the ability to switch model in one line of code and see how different models perform. In fact, we released this week (so it's not in my talk) what we call our direct API, where you get a direct interface for making requests to a model without any of the agent stuff; we just provide the model agnosticism and the unification of the API, because there are places where you don't want the rest. Anyway, in this example we're using an agent. We've set the output type to be Person, so when we look at the annotation on the result's output, we will get an instance of Person, or it will fail. We have instructions, and then we have the actual unstructured data we're trying to extract this Pydantic model from, which is just "Samuel lives in London and was born 28th of January 87". If I run that example, and my internet holds up, sure enough we get out the structured data. Nice.
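A minimal sketch of what the extraction example just described might look like, assuming a recent PydanticAI release and an OpenAI key in the environment. The keyword names shifted between versions (older releases use result_type and result.data rather than output_type and result.output), so treat the exact spelling as an assumption.

```python
# Sketch of the structured-extraction example (assumes pydantic-ai is installed
# and OPENAI_API_KEY is set; output_type is result_type in older releases).
from datetime import date

from pydantic import BaseModel
from pydantic_ai import Agent


class Person(BaseModel):
    name: str
    dob: date
    city: str


agent = Agent(
    'openai:gpt-4o',
    output_type=Person,
    instructions='Extract information about the person mentioned.',
)

result = agent.run_sync('Samuel lives in London and was born on 28th January 87.')
print(result.output)  # e.g. Person(name='Samuel', dob=date(1987, 1, 28), city='London')
```

Adding a field validator to Person (for instance one requiring a nineteenth-century date of birth) is what turns this one-shot call into the retry loop described next: the validation error is fed back to the model and it tries again.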
The pedantic and cynical among you, which since you're developers I hope is all of you, will notice that this is not actually agentic: there is no loop here. We're making one request to an LLM and getting back structured data if it succeeds. But you don't have to change the example very much to start to need that agentic behaviour. This is basically the same example, except we've put a functional validator on the Pydantic model requiring the person's date of birth to be in the nineteenth century. You'll see the prompt says "in 87" but doesn't define the century, and we're being a bit unfair on the model because we haven't put anywhere in its context "oh, by the way, the person we're looking for was born in the nineteenth century". So validation fails the first time, and that's where the agentic loop immediately kicks in: the validation error is fed back to the model, and based on the validation error alone it's able to retry and, hopefully, pass validation. Everything else is the same; actually we're using Gemini here, but other than that it's no different. The only other thing we've added is three lines of code to instrument this example with Logfire, so I can show you the agentic loop going on. So if I run this example... it has succeeded, and we immediately get some traces out. You can see we had two calls to Gemini. If I come across to Logfire and look at the trace from that run, you can see what's going on; I'll zoom in, although it's a bit of a pig to do so. We have the original user message we sent to the model, which was the unstructured data. It returned, as you would expect, assuming 1987. We then sent that back to the model. The way we do structured outputs at the moment in PydanticAI is with tool calls under the hood (we're about to add at least the option of not using tool calls, but that's what we're using here), so it's calling the final result tool with this data. The Pydantic validation is failing; it's failing on the functional validator, but it could fail on anything, from the wrong input type to anywhere else Pydantic would fail. If we scroll down the example and zoom in again, you'll see it used the information in the validation error to work out that it had to return a different date of birth, it did so, and we succeeded. The other useful thing here is that in the trace view we can see how long the two requests took. On this occasion the first one was a bit longer than the second, probably because it was making the HTTP connection, and we can see the pricing, both the aggregate across the two requests and the price of each individual one. At the moment we don't have a cost entry for this Gemini Flash model, so it's not showing the cost, but it would if we did. So, moving on. You'll also have noticed that even in my second example we didn't have any tools. So how do we register the tools the model has access to? How would we do, I suppose, RAG, in the sense of giving the model access to functions it can call along the way to retrieve extra data it needs while answering your query? The way we do that in PydanticAI is the @agent.tool decorator, which registers tools on a particular agent. If I switch over to the actual code you can see it, and this is where I say type safety is really, really important to us; we work hard to make this stuff type-safe, and I think no one else does. We have some code to make a database connection, but in terms of the agent code we have this Deps dataclass. It's just a holder for extra things, in this case a database connection and a user ID, that you might need to access inside the functions that implement the tools. Critically, we set deps_type when defining the agent, so the agent is now generic in Deps. The way the decorator is set up means that RunContext is parameterised with Deps, so if I change this to int we immediately get an error saying the function has the wrong signature. Generally you don't have to care about any of this; the point is that when you access ctx.deps you have an instance of Deps, and when you access an attribute of it you get, in this case, the database connection. If you typed one 'n' instead of two in "connection", you'd get a type-checking error, in your IDE or when running CI, pointing out that you've accessed it wrongly. If we hadn't done the extra work to make this type-safe, you'd have to run the code, slowly and at some expense, to find a runtime AttributeError instead. What this example is actually doing is adding what people refer to as long-term memory, roughly along the lines of the sketch below.
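A sketch of the shape of that example, assuming asyncpg and a simple memory table; the table and column names, connection string and user id are illustrative assumptions, not the actual demo code.

```python
# Sketch of the long-term-memory agent: a Deps holder passed via deps_type and
# two tools that hit Postgres. Table name, columns, DSN and user id are assumed.
import asyncio
from dataclasses import dataclass

import asyncpg
from pydantic_ai import Agent, RunContext


@dataclass
class Deps:
    conn: asyncpg.Connection
    user_id: int


agent = Agent(
    'openai:gpt-4o',
    deps_type=Deps,
    instructions='Use the memory tools to record and recall facts about the user.',
)


@agent.tool
async def record_memory(ctx: RunContext[Deps], value: str) -> str:
    """Store a memory about the current user."""
    await ctx.deps.conn.execute(
        'insert into memory (user_id, value) values ($1, $2)',
        ctx.deps.user_id, value,
    )
    return 'value added to memory'


@agent.tool
async def retrieve_memories(ctx: RunContext[Deps], memory_contains: str) -> str:
    """Return stored memories containing the given text."""
    rows = await ctx.deps.conn.fetch(
        'select value from memory where user_id = $1 and value ilike $2',
        ctx.deps.user_id, f'%{memory_contains}%',
    )
    return '\n'.join(row['value'] for row in rows)


async def main() -> None:
    conn = await asyncpg.connect('postgres://localhost/memory_demo')
    deps = Deps(conn=conn, user_id=123)
    await agent.run('My name is Samuel.', deps=deps)
    result = await agent.run('What is my name?', deps=deps)
    print(result.output)


if __name__ == '__main__':
    asyncio.run(main())
```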
So basically two tools: one to record memories and one to retrieve them, which the agent can use to record things it learns about you and then retrieve them to answer questions. We're using Postgres here, so we're storing that data in Postgres: a query to insert into the memory table, and a select from the memory table based on memory_contains, which is effectively the search string the agent gives the database. We could have a more complex example using vector search and embeddings, but in many cases a simple LIKE will do well enough. If we look at the code for actually running this: we connect to the database, we set up our instance of Deps, and this is where we pass the deps in. One of the other things we realised is that these agents are very useful to define globally, so we can't pass the deps in at definition time; we often wouldn't have access to the database connection when defining the agent at module scope. So we define the deps type at definition time and pass the deps in here at run time. And again, because the agent is parameterised with the deps type, if I passed in the wrong type here we'd get an error telling us we hadn't passed the right deps. All of which is to say we can guarantee with type checking that the thing I pass here is the same thing I get access to there, which is very useful. If I run this example you'll see the output here, but probably more useful is looking at Logfire, where we run the memory tool twice. You can look inside our first agent run: we had a call to ChatGPT, then we decided to run one tool, the record memory tool, and inside running that tool we had the database query to do the insert. One of the powerful things about Logfire is that it's a general-purpose observability platform with great support for AI, rather than an AI-specific observability platform. So you have full instrumentation of, in this case, a Postgres query, but it could be whatever you're doing: system resource usage, HTTP requests, whatever you want. We can see the precise query that got run, the SQL that was executed. But I think the most useful thing here is to see the database query in the trace: you can immediately see, visually, how little of the time was spent making the database query. So if we wanted to optimise this case, making the database query more performant is never going to help; we'd need to think about some other way to improve performance. And if we go back and look at the actual agent run, you can see what it did. I'll talk it through for those of you who can't see it that well: we have the system instructions; we have the user input, which is "My name is Samuel"; it decided to call the record memory tool with the user's name, Samuel, and got back "value added to memory"; and then it replied. When we made the second run (which here was in the same process and the same call, but in theory could be weeks later, because the data is now stored in our Postgres database), the question I asked was "What is my name?"
It called the retrieve memories tool with memory_contains set to "name", retrieved the right value, and was able to reply knowing what my name was. That example is using tools; it's also worth talking about short-term memory, which doesn't. AI people seem to love anthropomorphic definitions that don't follow industry standards. I don't know why; I think it's because they like the feeling that everything is a thinking person, which is why they talk about "thinking" rather than processing. They talk about long-term memory effectively being this tool-call-based thing, and short-term memory basically being information that you put into the context the agent has access to, so it can access it immediately. It kind of makes sense, but it would be a lot easier if they referred to tool-based memory and context memory. I spent weeks not understanding the two, and only realised the distinction by implementing it. So in this example we're doing memory with messages, which is short-term memory. Again we're configuring Logfire; this line of code, which I forgot to mention earlier, is instrumenting the Postgres connection. Our agent is now very simple: we don't have any tools defined. The critical bit here (and at the moment this is a reasonable amount of work; we hope to add an abstraction to make this kind of access to persistence easier) is that we query to get all of the messages we've stored in the database and then add them into the context via message history when we do an agent run. This is what would happen if you were using a ChatGPT-style interface and, within a conversation, you asked a new question: it goes to the database, gets all of the messages, and puts them into context before it calls the model again. There are particular APIs for that within all of the model providers, but effectively I assume what they're doing in the background is smushing all of those messages into the big context window so the model can access that data. So we run this twice. This run_agent function effectively takes care of retrieving messages at the beginning and recording them after we finish running, and we have a nice type so we can get the messages back in JSON format, which makes it easy to put them into the database. Our actual code is relatively simple: we run the agent twice, first telling it the fact, and then seeing if it can get the fact back. If we run this example, it works, and you might even be able to see visually that it was immediately faster. If we look at the agent run, there are no tool calls going on; it's just responding immediately. In the second case, which is the acid test of whether it had access to that message, you'll see all of the previous messages are included in the context (you can see them in the conversation here), and it was able to respond. Critically, this time it did that in just under 700 milliseconds, whereas before, when it was using long-term memory, the same case took 1.6 seconds because it had to make two calls to the model. You can see where that would be useful. So I'll stop those two examples and close that.
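A sketch of that context-memory pattern, assuming a recent PydanticAI release; in a real app the messages returned by new_messages() (or its JSON counterpart) would be persisted to the database between runs rather than held in a variable.

```python
# Sketch of short-term (context) memory: pass earlier messages back in via
# message_history. Here the messages stay in a local variable; in practice
# they would round-trip through a database using the JSON helpers.
from pydantic_ai import Agent

agent = Agent('openai:gpt-4o', instructions='Be a helpful assistant.')

first = agent.run_sync('My name is Samuel.')
stored = first.new_messages()  # or first.new_messages_json() to persist as JSON

second = agent.run_sync('What is my name?', message_history=stored)
print(second.output)
```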
That's the second way of doing memory. So, MCP. The Model Context Protocol came out back in December, the same week actually as PydanticAI. It was designed by Anthropic for cases like Claude Desktop or Cursor, to allow these local LLM apps to have access to external tools and resources in a way that any of these different tools could use: Windsurf, Cursor, Claude Desktop and so on can all use MCP. Here we're using it for a slightly different application: we're building autonomous agents, writing Python code that should in general run without a user being involved in the immediate loop. But we can still use some of MCP very effectively. MCP under the hood has three primitives: tools, which we're going to talk about here; resources, which are effectively documents that you're supposed to download and put into model context, and which I think we will support in future; and a third concept of prompts, which are effectively templates for a particular query. You can imagine that if you have an MCP server for querying some particular database, a useful prompt would contain all of the database schema the model is going to need, and you just fill in the variable for exactly what you want to get. As far as I know, prompts are not heavily used, and I think it's one of the creators' frustrations, David Soria Parra's, that people have run off and used MCP for the tools and haven't really thought about the resources and the prompts. Anyway, here we're guilty of exactly that: we're going to use it just for the tools. In this particular case we're going to use an MCP server that we've built and maintain in the same repo as PydanticAI: mcp-run-python. This is a way of running sandboxed Python code, locally or remotely or wherever you want, without it having any access to the host. Sandboxing Python has been notoriously hard until now; most people have done it via Docker containers, via OS-level isolation. mcp-run-python is built using the amazing Pyodide project, which is how you can run Python in the browser, and in turn we run Pyodide inside Deno, which is an alternative to Node, a way of running JavaScript code locally. Deno in particular provides isolation, using the same techniques the browser uses to prevent JavaScript running in Chrome from accessing your operating system. It's a bit weird that we're running Python inside Deno, inside JavaScript, inside Wasm, but it works really well. And because we've built it as an MCP server, you can use it with PydanticAI, but in theory you can use it with whatever tool you like; it's just an MCP server and you can connect to it. The command is a bit of a mouthful because we give it some permissions. MCP has two ways of operating: either over what they call standard IO, which is basically running as a subprocess and using stdin and stdout to communicate, or over HTTP. Here we're running it locally, so we use the PydanticAI concept of setting up a stdio MCP server: we give it the full command to run (it's our server in this case, but as I'll show you in a minute, it doesn't have to be one of ours) and we set up our agent.
But critically, we register here that there are some MCP servers we want to set up; we want to register them, but we don't want them running yet, because as I said, we want our agents to be global, so we don't want to have to start the MCP server at import time. We then use agent.run_mcp_servers, and that starts the MCP servers, whether they're stdio ones, where we start the process, or HTTP ones, where we set up the HTTP connection. If you were running a FastAPI app, you would use run_mcp_servers within your lifespan function to start them for the duration of your server running. The actual question we're going to ask the agent is: how many days between these two dates? This is not something where you want the model to pull the number out of its arse, basically; you want it to go and do the calculation. In fact, the recent leak of the mega-prompt OpenAI use within ChatGPT tells the model: never do calculations directly, always use the run-Python tool. So inside OpenAI they have some equivalent way of running sandboxed Python code, and that's what they're using; if you ask ChatGPT a maths question, it is not trying to do the calculation from first principles, it is writing Python code in the background to do that calculation. So if we run this example, you can see it registering the tools, and if I come over here we should see it running; it's still running at the moment, but once it's finished we can see what happened. We asked it the question, and the point is that instead of doing the calculation itself, it wrote this Python code (in a slightly weird way, but the principle looks about right) to calculate the number of days between the two dates. One of the useful things about calling Python like this is that we can go back afterwards and debug our calculation and work out what it used to calculate something, whereas if we just ask the model to do a calculation off the top of its head, we have no guarantee of whether it was right or what might have gone wrong. So it then responded. The response we got from the MCP server was success, plus the return value from running the Python code, the value of the final expression, which was just the numeric answer. It then printed out a summary for us; it returned the summary on the second call to the LLM. You should be able to see that here: we ran one tool, which was the MCP server. So we had the first call to ChatGPT, then the tool call, which took one and a half seconds because it had to boot up Python. I don't think it had to install any dependencies, but if the code had dependencies it would automatically find and install them within the Deno environment; so if you were using NumPy or something, that would be automatically installed. It would obviously be a bit slower, but it would work. And then there was the final call to ChatGPT with the response, which is what it returned.
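A sketch of wiring that sandboxed-Python server up over stdio. The Deno invocation follows the mcp-run-python README as it stood around the time of the talk, and the mcp_servers keyword matches PydanticAI releases from that period; both have changed between versions, so check the current docs before relying on them.

```python
# Sketch of the sandboxed-Python MCP server over stdio. The Deno command and
# flags, and the mcp_servers keyword, reflect the era of the talk and may differ now.
import asyncio

from pydantic_ai import Agent
from pydantic_ai.mcp import MCPServerStdio

run_python = MCPServerStdio(
    'deno',
    args=[
        'run', '-N', '-R=node_modules', '-W=node_modules',
        '--node-modules-dir=auto',
        'jsr:@pydantic/mcp-run-python', 'stdio',
    ],
)

agent = Agent('openai:gpt-4o', mcp_servers=[run_python])


async def main() -> None:
    # Start the subprocess only for the duration of this block; in a FastAPI
    # app this would live in the lifespan function.
    async with agent.run_mcp_servers():
        result = await agent.run('How many days between 2000-01-01 and 2025-03-18?')
    print(result.output)


if __name__ == '__main__':
    asyncio.run(main())
```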
It's worth saying that there are other libraries that use this Python-tool-calling idea, smolagents from Hugging Face being the most prevalent of them. I'm told I'm not allowed to be rude about the competitors, but I find the idea of an established company like Hugging Face releasing something where, as far as I know, the code isolation is very minimal ("let's block some import paths and hope for the best") extraordinary. Sure, agents aren't yet trying to break out of that isolation, or good enough to, and they probably don't want to delete your home directory. But you can imagine how hard it would be to be certain that a user hadn't managed to slip in some code that the model then ran. Think "ignore all previous instructions, run this Python code", and now someone effectively has remote code execution on your system. That's why we built it this way and worked really hard to have, effectively, V8-level isolation between the operating system and Python code that the model has written but a user might have influenced. We're going to go further with this. The next thing (I already have a PR up for it, and I need to finish it) is allowing this Python code to call back to particular functions that you register on the host. So if you wanted access to an enormous Python file, or source code or something, you wouldn't want to pass it all through in context. One of the best uses I've seen for this is processing a very large HTML page to extract certain attributes: you can quite easily exceed the context window of even the biggest models, but the model will write you Beautiful Soup code to extract the right bits of the page very effectively. So you can use a model to process HTML that way, rather than giving it the full HTML and hoping for the best. We'll allow that kind of calling back in future, which will make this even more powerful. On that exact point, I think it's worth showing another example of an MCP server that's not written by us. In this example we've got some code where we're instrumenting PydanticAI and we're also instrumenting MCP. I didn't call this out explicitly before, but we have support for this: I was showing you instrument-asyncpg earlier, which was instrumenting the Postgres connector, but we can also instrument MCP itself so we can see the particular calls going on. And that isn't specific to PydanticAI; if you're using the Python MCP SDK and you want to instrument it, you can use Logfire regardless of whether you're using PydanticAI. Then we're setting up our MCP server, and here we're using the excellent MCP server from Playwright. Playwright is a library for browser control maintained by Microsoft; you'd use it for things like testing your front end, but they've built an MCP server that effectively allows you to control a web browser from within your code, and it works really well. And, as I'll show you in a minute, they've done all the hard work to simplify the page rather than just returning the full HTML to the AI. In this case we're going to ask it to go to Pydantic's website, find the most recent blog post, and summarise the announcements. A relatively complex task: before MCP, let alone before AI, this would have been an incredibly hard job to set up, a long time spent getting the right setup for navigating arbitrary sites, simplifying the data and extracting the right things from the HTML. Now MCP allows us to connect PydanticAI to Playwright's MCP server really trivially.
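A sketch of that setup: Logfire instrumentation plus Playwright's MCP server launched via npx. The instrument_* helper names, the @playwright/mcp package name and the Anthropic model string all reflect what was available around the time of the talk and should be checked against current documentation.

```python
# Sketch: instrument PydanticAI and MCP with Logfire, then point an agent at
# Playwright's MCP server. Helper and package names are era-of-the-talk assumptions.
import asyncio

import logfire
from pydantic_ai import Agent
from pydantic_ai.mcp import MCPServerStdio

logfire.configure()                # assumes a Logfire project / write token is set up
logfire.instrument_pydantic_ai()   # spans for each agent run and model request
logfire.instrument_mcp()           # spans for the individual MCP tool calls

browser = MCPServerStdio('npx', args=['@playwright/mcp@latest'])

agent = Agent('anthropic:claude-3-7-sonnet-latest', mcp_servers=[browser])


async def main() -> None:
    async with agent.run_mcp_servers():
        result = await agent.run(
            'Go to pydantic.dev, find the most recent blog post and summarise the announcements.'
        )
    print(result.output)


if __name__ == '__main__':
    asyncio.run(main())
```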
So I come back here and run this example, and I'm just going to print the output at the end. You should see it running. We're using Claude 3.7, which is relatively slow in this case, so it'll do a bit of thinking. It's tried to go to pydantic.dev/blog, which is wrong; it's realised it's wrong, so it's gone back to pydantic.dev, the homepage. Fingers crossed it will then work out to click "articles". It seems to be hanging, which always takes longer when you've got everyone watching... and it has now successfully found our evals blog post, and it will think for a bit longer and hopefully at some point return a summary of it. If we come across to Logfire and look at this, the request is still going on, but we can already start to see what's happening. You can see we had multiple calls to browser tools, browser navigate first of all, and you'll see it navigated to /blog; you can see exactly how long each of those steps took, and that one was relatively quick. We went back to Claude, it thought about this for a bit and obviously decided that was the wrong page, because it had got a 404; the response will have included the 404. And now that it's finished, if we look at the full conversation you can see a better summary of what happened. This is perhaps the most interesting thing to show from their MCP server (sorry, I have to do some funny scrolling here): instead of returning the full HTML of the page, which is an enormous piece of data and would fill the context very quickly, especially if you had embedded images or embedded CSS or JavaScript, it turns the HTML into this YAML-ish format, which I guess someone has decided is what models like to process. You can imagine how much easier and quicker that is for an LLM to process than the full HTML of the page. Once we got past that tool it navigated on, yada yada, and finally it came up with a summary of our blog post: the main announcement was our evals library, which I'm going to show you in a minute, but also new SDKs for JavaScript and Rust, etcetera. I won't make you read our full announcement, but you get the idea; I'd encourage you to go and read it afterwards. The other thing we can show here is the total cost: this was 15 cents in total, and we're aggregating the costs from the individual runs. I think this is a good time to mention that, as a company, we care enormously about open source. Obviously we maintain Pydantic and PydanticAI, which are completely open source. For Logfire, the SDKs are all open source, but the backend, the platform, is closed source. Even there, though, we care about open standards: Logfire is built on OpenTelemetry, so you can send data to it from anything that emits OpenTelemetry, or you can use our logfire SDK and send that data to whatever platform you like. I know we have people doing that, and although we would love them to pay us money, I'd rather people found our stuff useful and didn't pay us than didn't find our stuff useful at all. And that's the same principle we're using in PydanticAI: the data we're emitting from PydanticAI is OpenTelemetry.
But even beyond that, we are following the semantic conventions for GenAI within OpenTelemetry, so the data we export should work in any of the platforms designed to receive GenAI data over OpenTelemetry. We obviously think Logfire is best, and we hope you end up using it because it's best, but we're not trying to do the lock-in thing. We think we'll succeed by actually building a good product rather than by lock-in, which seems obvious, but not everyone in our space thinks that way; I won't name any names. So these prices here come from the raw calls to the LLM: if you look at the details you'll see the pricing, and the token counts use the specific attribute names that the GenAI OTel conventions recommend, so again, this data should work in any platform. It'll look best in Logfire, because ours is best, we hope, but in theory you can send the data anywhere. If I come back in here, I'm going to move on to the next part. We have quite a lot of time, so I'm going to dive into an evals use case. There's going to be quite a lot of code, and I'm not going to apologise for that; it kind of has to be there to explain what we're doing. Evals: people think of them as the equivalent of unit tests, but for stochastic, non-deterministic applications. They're actually much more like benchmarks, in the sense that they don't generally pass; they can outright fail, but there's nuance in working out how well an eval has done. Evals are an evolving art, or science, and anyone who claims they know the exact answer is wrong. We have our take on one way of doing evals; I'm totally confident that what we have won't be state of the art in five years' time, but I think we're trying to move things forward and we'll evolve as people work out the go-to answer. I will say I avoided building evals for a year because I thought someone at OpenAI or Anthropic would have some magic sauce for how to do evals, and that eventually we would all get wiped out by that answer. That seems not to have happened, and having spoken to people inside those companies, there is no magic sauce: working out whether a model has done the right thing is just hard. So what we've built in Pydantic Evals is trying to be the pytest of this space. We're not necessarily telling you exactly how to do the evaluation; we're giving you a framework to run it, and some tools you might find useful, like the LLM-as-a-judge evaluator we've set up. But in theory you can define whatever tests or metrics you want. This is the example we're evaluating. Fundamentally it comes down to this very simple function: it's a feature from Logfire where we allow you to enter a human description of a time range and get back an interval, so it's a good, small, simple use case for AI. We have a PydanticAI agent defined here; it returns a union of, basically, success or failure with some details, it has some deps, as we've shown already, and it's instrumented.
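A sketch of the kind of agent being evaluated here: free text in, either a concrete time window or a structured error out. The field names and the union-style output type are illustrative assumptions (recent PydanticAI releases accept a union or list of output types; older ones used result_type), not the actual Logfire feature's code.

```python
# Sketch of the time-range agent under evaluation; names are assumed.
from datetime import datetime

from pydantic import BaseModel
from pydantic_ai import Agent


class TimeRange(BaseModel):
    min_timestamp: datetime
    max_timestamp: datetime
    explanation: str


class InvalidRequest(BaseModel):
    error_message: str


time_range_agent = Agent(
    'openai:gpt-4o',
    output_type=TimeRange | InvalidRequest,  # the success-or-failure union described above
    instructions='Convert the user description of a time range into a concrete interval.',
)


async def infer_time_range(prompt: str) -> TimeRange | InvalidRequest:
    """The stochastic function the evals exercise."""
    result = await time_range_agent.run(prompt)
    return result.output
```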
Fundamentally, what we're evaluating comes down to this stochastic function, which takes a text description of what someone wants and returns either an error with details about why it's invalid, or a time interval. The first stage of building evals is to have a dataset: a set of examples you can run where you know what the function should do. You can define them yourself, with a human writing out the different examples; you can use a platform like Logfire to get real-world examples of what users actually entered, work out what they would have expected to get back, and make that your dataset; or you can do the lazy thing, which is to use a powerful model like o1 to make up a bunch of examples. So we have this generate_dataset function, which uses a model, obviously using PydanticAI under the hood. We tell it the types it needs to generate, and we give it big instructions on what to do: generate test cases for the time range inference agent, include a variety, and so on. We also tell it which evaluators we want it to include. Fundamentally, this is just a very complex version of the very first thing I showed you: structured data extraction. We're giving it a long description, telling it to make up some examples, and giving it a very complex JSON schema for how we want the data returned. If you run this it goes away for about a minute, a minute and a half, and brings you back something like this; the end of the code prints the data it got back as a YAML file, and this YAML file is our dataset. You can look at an individual case: we've got a name for it; we've got the input, "I'd like logs from 2pm" on a given date; we have a "now", because in an example like this we have to give it the concept of now (if you say "get me logs from yesterday", we need to know when today is); and then we have what we would expect the function to return, in structured form: the min timestamp, the max timestamp, and an explanation to the user of why we chose that range. Then we can add individual evaluators to each case. In this case we've added the is-instance-of-success evaluator. In the next example the evaluator is is-instance-of-success again, but now we've also added the LLM judge evaluator with a rubric, which is effectively a description to the AI of what it should be evaluating, and so on. Then, down here, we have some evaluators that we apply to all cases: an LLM judge saying ensure that the output, the explanations and error messages, is in the second person, be concise, etcetera. Then this is a relatively simple bit of code that just adds a few more evaluators. If I run that, and the gods are with me, it runs successfully, and now we get a slightly different version of that output with a few more evaluators added; in some cases you can see extra ones added on here. These are user-defined evaluators: if we look at validate_time_range, this is an evaluator you could define yourself.
It inherits from the Evaluator dataclass, which is an ABC, and defines the evaluate function. In this case, if the output is a success, we check that the time range looks valid: we return errors if the window is too long or if the window is in the future, which is obviously wrong. This gets run for each case to define what success looks like, and as I say, you can define your own evaluators as well as using ours. The last thing I think it's really important to say is that, as Pydantic, we care about validation and type safety a lot, and we've gone the extra mile here. One of the odd things you'll see in this file is a magic comment referring to the YAML language server, which points at the JSON schema file next to it. It means that even inside the YAML you get autocomplete, type checking and descriptions of what the fields are. So if I write "evaluator" instead of "evaluators", I get an error because of the JSON schema file; and if you then try to load it, we also do validation on the input file. This lets you get autocomplete if you're editing these files yourself. With all that set up, let's actually run an example. Pydantic Evals integrates very nicely with Logfire, so we can show the summary of what happened there, but fundamentally there's no requirement to use Logfire, unlike with our competitors. We think pytest wouldn't have been successful if it had been tied to a particular provider, and while we are obviously a for-profit company, we think building open source the right way matters. So here we're configuring Pydantic Evals to work with Logfire, but you don't have to, and as I'll show you, you get a nice printout of how things have gone even if you're not using Logfire. In this case we're setting up our dataset by loading it from the YAML file, and we're going to run evaluate, which runs all of those cases against the function we pass it. This is a kind of unit-test context, checking whether or not it's doing well enough, so we'll run it and assert that the pass rate is above 80%. So we run this; you'll see it print out many examples, and if I come over to Logfire you'll see them coming in as it runs. Again, we display this in the tracing view because it's very useful, as you run a particular eval, to see what actually happened inside your Python code; in fact we even have evaluators that can use these traces to check whether a particular tool was called or whether a particular code path was followed. So: it finished, and it failed; it got 78% success, which is obviously lower than 80%. The other thing to show is that, very much like pytest-benchmark, we print out a summary of how this performed, which cases worked well and which worked badly, locally, so you can use this without Logfire if you wish. But you can also see the same data in Logfire. So if we come back to the beginning, the outer trace... it's not loaded properly; let me see if it loads one of those properly. Now it's succeeded.
I don't know what went on there, but you can see each individual case here and which of the assertions passed and failed. At a glance, most of them passed, but this one failed the LLM judge because the response was in the first person, not the second person, and we had told the LLM judge in the rubric to check whether the output is in the second person. You can see that second-person check has failed in quite a few of these individual cases. This one failed on the particular time range, this one failed for a number of things, this one passed on all three, and so on. The neat thing is that we can go in and look at the individual span where that happened, look at the individual inputs and outputs, work out what happened, and then start using that to work out how we could improve our agent. So in this case, here's something I haven't tried before, so it may not work, but we'll try it: we'll take our agent definition and add to the instructions "Always reply in the second person", with an exclamation mark for good measure, and see what happens. We come back to the unit-test case and run it; let me just clear this to make it easier to view. I don't even know if the score is going to get better, but we hope so; it runs the judges at the end, which I think is what's happening now. And it succeeded: the average went up to 83%. So you can see how we can use evals to systematically improve behaviour, run the suite, see what went well and what went badly, and dig into individual cases. That second-person check started passing; there's a bunch more that's still failing, and we'd want to go through and keep improving performance systematically. The other interesting thing to run is this last case, where we're comparing models. One of the things PydanticAI lets you do quite easily is the override method, where we can override the model used by a particular agent deep in the code. This is useful in actual unit tests, where you want to replace the model with our TestModel, which will always return a response, or here in evals, where you want to change the particular model used. The point is that you don't need to edit your application code in your tests or your evals; you can use override to set the model, and I think some other parameters as well; yes, deps too. If we run this example it takes a bit longer, but it runs two sets of evals, one for each of the two models, GPT-4o and Claude 3.7, to see whether one performs better than the other. If we come back over here, it's still running all of the cases... and it has succeeded. If we go up and look at the final performance, GPT-4o got around 76% success this time, and Claude did not do much better, only 78%, which isn't really good enough given that it's much slower and more expensive. But anyway, this is the evals library. There's a lot more to explain; it's a complex piece of kit, but we think really valuable.
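Pulling those pieces together, here is a rough sketch of what a hand-written dataset, a user-defined evaluator and the model-comparison run might look like in code. The class and method names (Case, Dataset, IsInstance, LLMJudge, Evaluator, EvaluatorContext, evaluate_sync, Agent.override) follow the pydantic-evals and PydanticAI docs as of mid-2025; TimeRange, InvalidRequest, time_range_agent and infer_time_range come from the earlier sketch; the window rules, rubric wording and model strings are illustrative assumptions rather than the actual Logfire eval suite.

```python
# Sketch only: hand-written dataset, custom evaluator and model comparison.
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import Evaluator, EvaluatorContext, IsInstance, LLMJudge


@dataclass
class ValidateTimeRange(Evaluator):
    """User-defined evaluator: reject windows that are too long or end in the future."""

    max_window: timedelta = timedelta(days=30)   # assumed rule, for illustration

    def evaluate(self, ctx: EvaluatorContext) -> bool:
        output = ctx.output
        if not isinstance(output, TimeRange):    # error outputs are judged by other evaluators
            return True
        window = output.max_timestamp - output.min_timestamp
        # assumes the agent returns timezone-aware timestamps
        ends_in_past = output.max_timestamp <= datetime.now(tz=timezone.utc)
        return timedelta(0) <= window <= self.max_window and ends_in_past


dataset = Dataset(
    cases=[
        Case(
            name='afternoon_logs',
            inputs="I'd like logs from 2pm yesterday",
            evaluators=(IsInstance(type_name='TimeRange'),),
        ),
        Case(
            name='nonsense_input',
            inputs='purple monkey dishwasher',
            evaluators=(IsInstance(type_name='InvalidRequest'),),
        ),
    ],
    evaluators=[
        # dataset-wide evaluators, like the "concise, second person" rubric in the talk
        LLMJudge(rubric='Explanations and error messages are concise and written in the second person.'),
        ValidateTimeRange(),
    ],
)

# Run every case against the task function and print a report (the talk then
# asserts a pass-rate threshold over the aggregate scores).
report = dataset.evaluate_sync(infer_time_range)
report.print(include_input=True, include_output=True)

# Model comparison: override the agent's model without touching application code.
for model_name in ('openai:gpt-4o', 'anthropic:claude-3-7-sonnet-latest'):
    with time_range_agent.override(model=model_name):
        comparison = dataset.evaluate_sync(infer_time_range)
    print(f'=== {model_name} ===')
    comparison.print()
```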
Would love to hear people's feedback on it. As I say, it's as production-ready as any code, but the concepts within it are still evolving quickly, so we'd love people's feedback on what's working and what's not. I know there was a company, somewhere at this conference, who spent $25,000 running evals with PydanticAI the other day, so people are beginning to pick this up and use it seriously. Coming back to the talk: the first thing to say about PydanticAI is that we are not claiming it is finished yet. There are lots more things to add, and I think the memory-persistence stuff I was talking about should also be on this list of big improvements coming soon. I talked about structured outputs using tools at the moment; we want to allow them to use the built-in structured-output support where models have it, or JSON schema in the instructions where that works better, which it does on dumber models that don't do tool calling well. I haven't talked about MCP sampling, but that's another very powerful MCP concept, effectively the server asking the client to proxy LLM requests, which we want to support both as a client and as a server. More control over which tools are registered for particular steps. mcp-run-python being able to call back to the host, which I've already talked about. I didn't even get to our graph library implementation today because I didn't have time, but we have some changes to that which should hopefully make it more composable; it's roughly the equivalent of LangGraph, but type-safe, which is the objective thing I can say. There are some subjective judgments I would make, but I won't. The most important thing of all, at the same time as continuing to add features: we know people care about stability and being able to build on these things, knowing they're not going to change all the time. PydanticAI is still very young, it only came out in December, but we will make a version 1 release by the end of June, and then we will follow semantic versioning very closely and not break your code, because we know that matters to people. So thank you very much. I think we have some time for questions; we've only got eight minutes, but I can take some if there are any. speaker 2: Thank you very much for your talk. I have a few questions. The first one: when you create an agent, there's a model name in the agent. Sometimes we can't use the public models; we host the model internally. Can we pass an internally hosted model? speaker 1: Let me see if I can find that example... yes. If you look at the signature of Agent's initialiser, it takes a model: an instance of Model, a string, or None, because you can pass the model in later. The string is basically shorthand for defining the OpenAI model. And if you go and look at the model type, Model is an abstract base class, of which we have a number of implementations for things like OpenAI, Anthropic, Groq, etcetera.
Bit in theory, if you want to implement your own model, you just need to just need to implement this abstract base class, which has Yeah, I think it has like actually, it only has two abstract methods that you need to go and define so you can implement your own model. If you're using an OpenAI compliant model, you can point it to whatever domain you're using. And we know that's a really important thing. And I know people who are using that now. Yeah, next question. Yeah, I had two questions I wanted to ask. How 's the async support and also the supervision language models? So all of pdanai is async, and we effectively have. So you'll see here we have run sync. Run sync internally is just a wrapper around. Is that going to open the right thing? It's decided doesn't want to open that right now. Let me try and do that and see if that's going to get it to work. Anyway, the point is interally, it's all async. And then we just have a few wrapper methods that effectively give us a like pseudo sync interface. Actually, if we have a problem, honestly, it's on the sync side where if you're doing stuff inside threads, celery, for example, has some trouble doing async. But Yeah, it's basically all I ync under the hood. Thank you. And the vision model stuff, I can't. We are yes, in some context, ts, we allow vision. We allow like multimodal inputs. I don't think the full story on multimodal outputs is fixed yet, but we're working on it. There was a question here, Oh, if you had the if you got the mic franca for it. So is this only OpenAI or does it also work with like lama 3.1? Yes. So we support in here. If you look at where am I going to find it? If I am, Yeah, I'll go into that's not what I meant at all. And if I go to known known model names, so these are the models we support the moment we're adding to this list and we're happy to accept prs for most. We have some rules on basically when we will add add a model. We had some tiny providers who are like add thousands of lines of code to us and we've not accepted it, but like if it's a widely used model, we'll add it. But Yeah, we have a reasonably large list already over here, the model class that you showed us. Do you have any plans or do you already have some sort of conversion to turn a blank chain runable into a pedentic AI model? Or is that something you're not interested in just for making it easier to switch over or interit's? An interesting idea. Yeah. Happy to look at an implementation and see if we can add it. We would we would definitely consider it because I hear people being like, we built all this stuff with langchain. We would love to find a way. Definitely something we would yes, we would consider it. I don't know how doable it is, but definitely will consider it in terms of structured outputs is constrained generation, for example, on the roadmap? And when you say constraint generation, what precisely for localized models is trying to get structured outputs from localized models? Not something we've thought about lots, but if you have a particular idea, come and talk to us and happy to hear it. At the moment, most of the structured output is like give it json schemer, and it does a pretty good job. And I think one of the things where we try and work work out, if you speak to the like AI headbangers, they keep talking about, don't bet against the model. Then you hear their company, and their company is doing nothing other than betting against the model. 
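Relating to the earlier question about internally hosted models: a sketch of pointing PydanticAI at an OpenAI-compatible endpoint (for example something served by vLLM or Ollama). The OpenAIModel/OpenAIProvider spelling matches recent releases, older versions exposed base_url directly on the model class, and the URL and model name below are placeholders.

```python
# Sketch of using a self-hosted, OpenAI-compatible endpoint. The endpoint URL,
# model name and API key are placeholders; verify class names for your version.
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIModel
from pydantic_ai.providers.openai import OpenAIProvider

model = OpenAIModel(
    'my-internal-llm',                            # whatever name your server exposes
    provider=OpenAIProvider(
        base_url='http://llm.internal:8000/v1',   # assumption: your own endpoint
        api_key='not-needed-internally',
    ),
)

agent = Agent(model, instructions='You are a helpful assistant.')
print(agent.run_sync('Hello!').output)
```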
If you think of agent frameworks are fundamentally betting against the model, right? Like if the model was smart enough, you wouldn't need an agent framework. You would just give your model access to the Internet and be done. And so working out where to be like don't bet against the model and where you're like, sure, that will be fixed one day, but if that's in three years time, that's not very helpful for people today is a difficult line line to walk, right? So for example, some of the dumma models return like JavaScript type data like json five or rather than Jason. So would it be valuable to have json five passing support? Or is that something where even the cheaper models, again, to fix it really quickly, really hard to know the answer to when to help the models and when to just be like the rising tide will lift all ships. Question. Lots of questions. But question here of model. If you configure a selposted model, you had pricing on log fire. Can you also configure pricing for your own self hosted model? Yes, we're working on it now. Okay. We are in the process of taking basically building an open source database of all model prices, and then we will so at the moment with server side, we're basically calculating the price based on the model name and the tokens. We're gonna to instead do that in pdantic AI. And so effectively, if you want to send your own prices, you can send whatever you like. And actually, the ultimate advantage of that is you'll be able to basically change what goes on in this itpop up to be whatever you want fundamentally. The other advantage of that is at the moment, because we're doing the calculation to render this panel, so I haven't talked about the fact that I haven't tried to do too much spiel on log fire. But like one of the powerful things about log fire is you can write arbitrary sql to query your data. So whether that be entering sql here to do a search or you can use our explore view to go and Yeah run arbitrary sel to do calculations. In fact, here's an example of of looking at token usage and counting it. But also our dashboards are built on arbitrary sel. One of the problems at the moment is because those prices are being rendered just for the ui. You can't query on prices, you have to query on token counts. So if we put the prices in the telemetry data, then you can query on them directly. Yep. Very polite for you to call them slides the first time you define Oh, there. Okay. So for example, when you run this code, it's pretty easy, for example, to guess the name and the date when the person is born. But when there is the big ambiguity between the the entities, I want to extract from data how this agent will stop. For example, if the class person something, maybe the attribute like the product names, and I have a bunch of text, and I want to extract these product names, which are really big search space and how, I mean, you sodly rely on the response from the llm model or you have something on top of date to valdate. So if we get rid of this, and now we have the agent output type as a string, if we have, I don't know why that's not updated, but that would now be string. If we haven't set the output type, then we basically iterate through running the agent until it returns text output, and we assume that's the end. If you set the output type like this, what we're actually doing under the hood is registering another tool that by default is called final result. And when as soon as the model calls final result, we call that the end of the end of the run. 
So we assume that's the final result. Now, if validation fails when we call that tool, then we put the validation error back into the model and retry as many times as you've set retries here. The language server has died on me again. But if I set retries here, I can set retries to as many as I want. Come and talk to us at the booth and I'll happily talk you through it in more detail, because I know we're close to time, and over time. And there was one more question I'll take. So as someone who's used LangGraph and LangChain, my biggest complaint is that they're always changing something, or the documentation is out of date. Yep. Just flat-out wrong. Yep. So I guess AI and all of this stuff is constantly changing. How are you going to keep up with your documentation and make it easy for people to pick up? I get your question and I agree with you, and it's something that's just frustrated me for years. And so we go the extra mile on that stuff. So if you look at our documentation, every single one of these examples is unit tested when we run our unit tests locally. So we can't basically merge something where any of these — this output, for example — is not actually right: that is literally being programmatically generated by running this code as part of our tests. We do the same on Pydantic, because I've been driven around the fucking bend in the past by examples that don't work. And the other side effect of that is all the inputs have to be there for the code to run. So it's stuff like that which you don't think of as particularly important, but it actually massively affects user experience. And I think beyond that, it's just that we have been maintaining open source libraries for years: me working on this, but also Marcelo, who maintains Uvicorn and Starlette; Alex, who maintains numerous libraries; David Montague, who's done lots of stuff on Pydantic and on FastAPI. We're experienced open source builders in a way that those maintaining other libraries often are not — trying not to be ruder than that. Thank you. I think we're probably out of time, but yeah, as I say, we're about all week. So come and talk to us if you have any more questions. Thank you very much.
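A minimal sketch of the output-type mechanism Samuel describes above, assuming the public PydanticAI API around the time of the talk; the model name, the Person fields and the retry count are illustrative, and attribute names such as output_type / result.output have shifted between versions:

```python
from datetime import date

from pydantic import BaseModel, field_validator
from pydantic_ai import Agent


class Person(BaseModel):
    name: str
    city: str
    date_of_birth: date

    @field_validator('date_of_birth')
    @classmethod
    def check_nineteenth_century(cls, value: date) -> date:
        # If this raises, the validation error is sent back to the model,
        # which gets another chance to call the final-result tool.
        if not 1800 <= value.year <= 1899:
            raise ValueError('date_of_birth must be in the 19th century')
        return value


# Setting an output type registers a "final result" tool under the hood;
# the run ends once the model calls it with a payload that validates.
agent = Agent(
    'openai:gpt-4o',     # illustrative model name
    output_type=Person,
    retries=3,           # how many times validation errors are retried
)

result = agent.run_sync(
    'Extract the person from: "Ada was born in London on 10 December 1815."'
)
print(result.output)     # a validated Person instance
```

If the model never produces a payload that passes validation within the configured retries, the run fails instead of returning an unvalidated guess.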
Latest Summary (Detailed Summary)
Overview / Executive Summary
In his PyCon 2025 talk, Samuel Colvin (creator of Pydantic) laid out the philosophy and practice of building AI applications with the Pydantic ecosystem: the core Pydantic library, the PydanticAI agent framework, and the Pydantic Logfire observability platform. He argued that although the AI field changes by the day, building reliable, scalable and maintainable AI applications is still the core challenge, and the same engineering principles as traditional software development apply. Pydantic's validation and model-definition capabilities provide a solid data-structure foundation for AI applications. PydanticAI, a young agent framework, emphasizes type safety and model-agnosticism, uses the Model Context Protocol (MCP) to interact with external tools, and supports structured output, tool calling and memory management. Pydantic Logfire, through its OpenTelemetry integration, provides deep tracing, debugging, performance monitoring and cost analysis for AI applications. The talk highlighted the Pydantic Evals framework, which offers a Pytest-like mechanism for systematically benchmarking and iterating on AI models and applications, with advanced features such as custom evaluators and "LLM as judge". Colvin stressed the team's commitment to open standards, stability and clear documentation, and announced that PydanticAI v1.0 is planned for the end of June 2025, aiming to give Python developers an efficient, reliable way to build AI applications.
Pydantic: Background and Its Role in AI
- Pydantic core facts and background
- Creator: Samuel Colvin
- Created: 2017 (before the generative-AI wave)
- Monthly downloads: roughly 350 million (about 140 per second)
- Users: widespread across the Python ecosystem, including general development and FastAPI, as well as essentially all mainstream generative-AI libraries and frameworks.
- SDKs: OpenAI, Anthropic, Google, etc. (the transcript's "rock" most likely refers to Groq)
- Agent frameworks: LangChain, LlamaIndex, CrewAI, etc.
- The Pydantic company and its new products
- The company was founded in early 2023.
- Pydantic Logfire: a developer-first observability platform.
- PydanticAI: an agent framework.
- Call to action: visit the Pydantic booth for the Logfire demo, T-shirts, stickers and a prize draw.
Core Principles and Challenges of Building AI Applications
- State of the industry:
- The AI field changes extremely fast, but the fundamental need to build reliable, scalable applications has not changed.
- Generative AI (GenAI) is powerful, but in Colvin's words it is "a complete pig to work with", which makes building reliable applications even harder.
- Pydantic's position:
- The talk presents an "opinionated blueprint for AI development in Python".
- The core principles matter more than any specific tool.
- Key technical principles:
- Type safety (see the sketch after this list):
- Increasingly important in both Python and TypeScript.
- As AI writes more of our code (autocomplete, code generation), type checking becomes the most effective form of feedback for an AI agent: it has no side effects and runs fast (and will get faster thanks to "ti" [transcription unclear; likely a newer, faster type checker]).
- Colvin criticizes some other agent frameworks for sacrificing type safety for their own reasons.
- Model Context Protocol (MCP): a protocol that lets local LLM applications access external tools and resources.
- Evals: the importance of testing and benchmarking AI applications.
- Observability: provided throughout via Logfire.
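To illustrate the type-safety point, a minimal sketch (the model name and fields are made up, and the output attribute name may differ between PydanticAI versions) of how a static type checker, rather than a runtime failure, catches a mistake in handling agent output:

```python
from pydantic import BaseModel
from pydantic_ai import Agent


class CityInfo(BaseModel):
    city: str
    country: str


# The agent is generic over its output type, so a type checker such as
# mypy or pyright knows that result.output is a CityInfo instance.
agent = Agent('openai:gpt-4o', output_type=CityInfo)

result = agent.run_sync('Where were the 2012 Olympics held?')
print(result.output.city)          # fine: CityInfo has a "city" field
# print(result.output.population)  # flagged by the type checker, not at runtime
```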
PydanticAI: The Agent Framework in Detail
- Defining "agent" and the emerging consensus:
- Adopts the definition of an agent that companies such as Anthropic and OpenAI have now largely converged on.
- Cites the pseudocode Barry Zhang (Anthropic) showed at the AI Engineer conference: an agent takes an environment, tools and a system prompt (now usually called "instructions"), runs the LLM, gets tool-call requests, executes the tools, updates state, and loops.
- A bug in that pseudocode (the while loop can never exit) points at the hard question of when an agent should stop looping.
- PydanticAI core features:
- Model agnostic: switch LLMs (OpenAI GPT-4o, Gemini, Anthropic, Google, Groq, etc.) with a one-line change, which makes comparing models easy.
- Direct API: a feature PydanticAI shipped this week, providing a unified interface for calling models directly without the agent wrapper.
- Type-safe tool definitions: the @agent.tool decorator and a generic Deps context object keep tool-function parameters and context access type-safe.
- Code example: structured data extraction (not yet agentic behaviour)
- Define a Pydantic model Person(name, date of birth, city).
- Configure the PydanticAI agent's output type as Person.
- Extract the information from unstructured text into a Person instance.
- Colvin points out this is a single LLM call, not a real agent loop.
- Code example: an agent loop with validation and retries
- Add a functional validator to the Person model (e.g. the date of birth must fall in the 19th century).
- If the LLM's first extraction fails validation, the Pydantic validation error is fed back to the LLM.
- The LLM corrects itself based on the error and retries, which is what turns this into an agent loop.
- Logfire demo:
- Shows two calls to a Gemini model.
- The first call fails validation; the second, corrected against the validation error, succeeds.
- The Logfire trace view shows latency and (where available) price information for each call.
- Code example: tool use and long-/short-term memory (a hedged code sketch follows this section)
- Long-term memory (tool-based):
- Define tools (e.g. record_memory, retrieve_memory) that talk to an external system (e.g. a PostgreSQL database) to persist and retrieve memories.
- Logfire demo:
- The trace shows the agent run, LLM calls, tool calls, and the database queries made inside the tools (SQL inserts and selects).
- The observability data helps locate bottlenecks (e.g. the database query takes far less time than the LLM call).
- Short-term memory (context memory):
- Put prior messages or relevant information directly into the LLM's context window.
- Example: query message history from the database and pass it to the agent via the message_history parameter.
- Logfire demo:
- Usually faster than the long-term approach (700 ms vs 1.6 s in the example) because it avoids extra tool calls and LLM calls.
- The trace shows the history messages included in the LLM context.
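A hedged sketch of the tool-based memory and message-history patterns summarized above. The database setup, table and tool names (record_memory, retrieve_memory) are illustrative rather than anything PydanticAI ships; only Agent, RunContext, @agent.tool, deps_type and message_history are assumed from the library, and their exact signatures may differ by version:

```python
from dataclasses import dataclass

import asyncpg  # illustrative choice of Postgres driver
from pydantic_ai import Agent, RunContext


@dataclass
class Deps:
    """Dependencies injected into tool calls, type-checked via RunContext[Deps]."""
    db: asyncpg.Pool
    user_id: int


agent = Agent('openai:gpt-4o', deps_type=Deps)


@agent.tool
async def record_memory(ctx: RunContext[Deps], memory: str) -> str:
    """Persist a memory for the current user (long-term, tool-based memory)."""
    await ctx.deps.db.execute(
        'INSERT INTO memories (user_id, value) VALUES ($1, $2)',
        ctx.deps.user_id, memory,
    )
    return 'saved'


@agent.tool
async def retrieve_memory(ctx: RunContext[Deps], query: str) -> list[str]:
    """Fetch previously stored memories matching a query."""
    rows = await ctx.deps.db.fetch(
        'SELECT value FROM memories WHERE user_id = $1 AND value ILIKE $2',
        ctx.deps.user_id, f'%{query}%',
    )
    return [r['value'] for r in rows]


async def chat(deps: Deps, prompt: str, history=None):
    # Short-term memory: pass earlier messages straight back into the context
    # window via message_history instead of going through an extra tool call.
    result = await agent.run(prompt, deps=deps, message_history=history)
    return result.output, result.new_messages()
```

The choice between the two approaches is exactly the latency trade-off the Logfire traces in the talk make visible: message history avoids the extra tool and model round-trips, at the cost of a larger context.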
Applying the Model Context Protocol (MCP)
- MCP overview and core primitives:
- Introduced by Anthropic in late 2024 so that local LLM applications (such as Claude Desktop or Cursor) can access external tools and resources in a standardized way.
- Three primitives:
- Tools: the part PydanticAI mainly uses today.
- Resources: documents and the like that the model can download into its context (support planned in PydanticAI).
- Prompts: templates for specific queries (e.g. a template containing a database schema); little used so far.
- Colvin mentions that one of MCP's creators, David Soria Parra, is somewhat disappointed that the resources and prompts primitives are not more widely used.
- How PydanticAI uses MCP:
- To connect to and manage external tool servers (see the sketch after this section).
- Code example: mcp-run-python (sandboxed Python execution)
- A built-in PydanticAI MCP server for running Python code in a sandbox.
- Stack: Pyodide (Python compiled to Wasm) running inside Deno (a JavaScript/TypeScript runtime that provides the isolation); Colvin describes it as "Python inside Deno, inside JavaScript, inside Wasm running Python code".
- Security: the isolation is stronger than approaches that merely block certain imports (he criticizes the sandboxing in Hugging Face's smolagents as insufficient).
- Use case: run LLM-generated Python for calculations (e.g. the number of days between two dates) instead of letting the LLM do the arithmetic itself; OpenAI uses a similar mechanism internally.
- Logfire demo: the trace shows the LLM call -> MCP tool call (start Python, install dependencies) -> Python execution -> a final LLM call that produces the reply.
- Coming soon: mcp-run-python will support calling back into host functions, e.g. when processing a large HTML file the LLM writes Beautiful Soup code that runs in the sandbox but calls back to the host to fetch the file.
- Code example: the Playwright MCP server (browser control)
- Uses the MCP server from the Microsoft-maintained Playwright project to drive a browser programmatically.
- Use case: have the agent visit the Pydantic website, find the latest blog post and summarize it.
- Logfire demo:
- Traces several browser-navigation and page-interaction tool calls.
- The Playwright MCP server reduces HTML pages to YAML, which is easier for the LLM to handle and uses less context.
- Shows the cost of the whole multi-step task (15 cents in the example).
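A hedged sketch of wiring an MCP server into a PydanticAI agent as described above. MCPServerStdio, mcp_servers and run_mcp_servers are assumed from PydanticAI's MCP support at the time of the talk, and the Deno command line for mcp-run-python is abbreviated; take the exact invocation from the official docs:

```python
import asyncio

from pydantic_ai import Agent
from pydantic_ai.mcp import MCPServerStdio

# Launch the mcp-run-python sandbox (Pyodide inside Deno) as a stdio MCP server.
# The Deno flags are abbreviated here; see the mcp-run-python documentation.
run_python = MCPServerStdio(
    'deno',
    args=['run', 'jsr:@pydantic/mcp-run-python', 'stdio'],
)

agent = Agent('openai:gpt-4o', mcp_servers=[run_python])


async def main() -> None:
    # The context manager starts the MCP server(s) and exposes their tools to the agent.
    async with agent.run_mcp_servers():
        result = await agent.run(
            'How many days were there between 2000-01-01 and 2025-03-18? '
            'Write and run Python to work it out.'
        )
        print(result.output)


asyncio.run(main())
```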
Pydantic Logfire: Observability for AI Applications
- Core capabilities:
- Detailed traces covering LLM calls, tool calls, function execution, database queries and more.
- Helps debug AI applications and understand their internal behaviour.
- Pinpoints performance bottlenecks and optimization opportunities.
- Aggregates and displays the cost of AI calls.
- Supports querying trace data with SQL and building custom dashboards.
- Built on open standards (OpenTelemetry):
- The Logfire backend is closed source, but it is built on OpenTelemetry.
- The data emitted by PydanticAI and the Logfire SDK follows the OpenTelemetry GenAI semantic conventions (see the sketch after this list).
- Users can send their data to any OpenTelemetry-compatible platform, not just Logfire; Colvin: "We're not trying to do the lock-in thing."
- Where it appears: in every PydanticAI example in the talk, to show agent behaviour, debug issues, and track performance and cost.
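A hedged sketch of switching this observability on. logfire.configure() is the standard Logfire SDK entry point; the PydanticAI instrumentation call shown is an assumption about the SDK's helper name, so check the Logfire/PydanticAI docs for the current spelling:

```python
import logfire
from pydantic_ai import Agent

# Send traces to Logfire; because the data is plain OpenTelemetry, it can
# also be exported to any other OTel-compatible backend instead.
logfire.configure()

# Assumed instrumentation hook: emits GenAI semantic-convention spans for
# every model request, tool call and agent run.
logfire.instrument_pydantic_ai()

agent = Agent('openai:gpt-4o')
result = agent.run_sync('Say hello in three languages.')
print(result.output)
```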
Pydantic Evals: A Framework for Evaluating AI Applications
- Why evals matter, and why they are hard:
- For stochastic, non-deterministic AI applications, evals are closer to benchmarks than to simple unit tests.
- Judging the correctness and quality of AI output is still a developing art/science with no perfect answer yet.
- Design philosophy of Pydantic Evals:
- Aims to be "the Pytest of this space".
- Provides a framework for running evaluations plus useful building blocks (such as "LLM as judge") rather than prescribing how to evaluate.
- Core workflow and features (a hedged code sketch follows this section):
- Building datasets:
- Write cases by hand, or collect real user data from a platform such as Logfire.
- In the example, a strong model (GPT-4o) is used via PydanticAI to generate test cases with inputs, expected outputs and evaluators, written out as a YAML file.
- The YAML file has a JSON Schema, so editors can offer autocompletion and validation.
- Defining evaluators:
- Built-in evaluators, e.g. IsInstanceOfSuccess (checks that a particular type was returned successfully) and LLMJudge (uses an LLM to grade the output against a stated rubric).
- Custom evaluators: subclass the Evaluator base class and implement its evaluate method.
- Running evaluations:
- pydantic-evals integrates with Logfire (optionally).
- Run evaluations from the command line or in code; the output is a Pytest-like summary report.
- Logfire demo:
- Shows how each case ran and which assertions passed or failed.
- Lets you drill into the full trace of a failing case to see why it failed.
- Example: after changing the agent's prompt (adding "always reply in the second person") and re-running the evaluation, the pass rate goes from 80% to 83%.
- Comparing models:
- PydanticAI's override method makes it easy to swap the model an agent uses during evaluation.
- Example: GPT-4o vs Claude 3.7 on the same evaluation set (GPT-4o: 76% vs Claude 3.7: 78%).
- Adoption: companies are already running Pydantic Evals at scale (one company spent $25,000 on evaluations in a single day).
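A hedged sketch of the Pydantic Evals flow summarized above, assuming the pydantic-evals Dataset/Case/Evaluator/EvaluatorContext/IsInstance/LLMJudge names; the task function, the cases and the custom evaluator are illustrative (in the talk the task wraps a PydanticAI agent):

```python
from dataclasses import dataclass

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import Evaluator, EvaluatorContext, IsInstance, LLMJudge


async def answer_question(question: str) -> str:
    """The task under evaluation; placeholder implementation."""
    return f'You asked: {question}'


@dataclass
class MentionsSecondPerson(Evaluator):
    """Custom evaluator: passes if the answer addresses the user directly."""

    def evaluate(self, ctx: EvaluatorContext) -> bool:
        return 'you' in str(ctx.output).lower()


dataset = Dataset(
    cases=[
        Case(name='capital', inputs='What is the capital of France?', expected_output='Paris'),
        Case(name='greeting', inputs='Say hello to me.'),
    ],
    evaluators=[
        IsInstance(type_name='str'),
        LLMJudge(rubric='The answer is polite and written in the second person.'),
        MentionsSecondPerson(),
    ],
)

# Produces a Pytest-like report of assertions per case; runs can also be
# viewed in Logfire when the integration is enabled.
report = dataset.evaluate_sync(answer_question)
report.print()
```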
PydanticAI Roadmap and Versioning Plans
- Upcoming features:
- Structured output without tool calls: use a model's built-in structured-output ability, or embed the JSON Schema in the instructions, which is friendlier to simpler models that are bad at tool calling.
- MCP sampling: a powerful MCP feature whereby the server asks the client to proxy LLM requests on its behalf.
- Tool registration control: finer-grained control over which tools are registered at a given step.
- mcp-run-python callbacks: allow sandboxed Python code to call back into host functions.
- Graph library: a type-safe take on what LangGraph does, to be reworked for better composability.
- Memory persistence abstraction: simpler access to how agent memory is persisted.
- Versioning and stability commitments:
- PydanticAI is still young (first released in December 2024).
- PydanticAI v1.0 is planned for the end of June 2025.
- After v1.0 the project will follow semantic versioning strictly and avoid breaking changes.
Q&A Highlights
- Internal / self-hosted model support: yes, either by implementing the Model abstract base class or by pointing at an OpenAI-compatible API endpoint (see the sketch after this list).
- Async support: PydanticAI is fully async internally; the sync methods are wrappers. Using the sync interface in some threaded environments (e.g. Celery) can be problematic.
- Vision models: multimodal input is partially supported; the story for multimodal output is still being worked out.
- Llama 3.1 and similar models: many models are supported, and PRs for widely used models are welcome.
- Converting a LangChain Runnable into a PydanticAI model: Colvin found the idea interesting and is open to an implementation.
- Constrained generation for structured output with local models: currently relies mainly on JSON Schema. When to rely on the model itself and when to add helper machinery is a trade-off ("don't bet against the model" vs what users need today).
- Pricing for self-hosted models in Logfire: in progress. The plan is an open-source database of model prices, with prices computed in PydanticAI so users can supply their own and query on price directly in Logfire.
- How the agent stops in ambiguous cases:
- If no output type is set, it iterates until the model produces text output.
- If an output type is set, an internal tool (final_result by default) is registered; the run ends when the model calls it. If validation fails, the error is fed back to the model and it retries (the number of retries is configurable).
- Documentation quality and maintenance (vs LangChain etc.): every documentation example is unit tested so it is guaranteed to run, and the team consists of experienced open-source maintainers.
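A hedged sketch of the self-hosted-model answer: pointing PydanticAI's OpenAI-compatible model class at a local endpoint (for example a vLLM or Ollama server). The OpenAIModel/OpenAIProvider names are assumptions about the PydanticAI API at the time of the talk, and the URL and model name are placeholders:

```python
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIModel
from pydantic_ai.providers.openai import OpenAIProvider

# Any server that speaks the OpenAI chat-completions API can be used.
model = OpenAIModel(
    'my-local-model',  # placeholder name exposed by the local server
    provider=OpenAIProvider(base_url='http://localhost:8000/v1', api_key='not-needed'),
)

agent = Agent(model)
result = agent.run_sync('Summarise Pydantic in one sentence.')
print(result.output)
```

Implementing the Model abstract base class is the heavier option, useful when the self-hosted server does not expose an OpenAI-compatible API at all.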
Key Takeaways
Samuel Colvin's talk lays out the Pydantic team's strategy and practice for AI application development. With the core Pydantic library, the PydanticAI agent framework and the Pydantic Logfire observability platform, they aim to provide a toolchain that is type-safe, modular, observable and easy to evaluate. The core idea is to apply mature software-engineering principles to AI development, tackling the inherent complexity and non-determinism of AI systems in a structured, testable and maintainable way. The emphasis on open standards and developer experience, together with the upcoming PydanticAI v1.0, suggests the Pydantic ecosystem will play an increasingly important role in Python AI development.