DjangoCon Europe 25 | KEYNOTE: Django for Data Science: Deploying Machine Learning Models with Django

Deploying Machine Learning Models with Django for Data Science

Video · Tech
Data Science, Machine Learning, Model Deployment, Django, Web Development, Python, Full-Stack Data Science, Scikit-learn, Jupyter Notebook, Web Applications +6

Media details

Upload date
2025-06-21 18:42
Source
https://www.youtube.com/watch?v=XJLvovUVlhw
Processing status
Completed
Transcription status
Completed
LLM provider/model
openai/gemini-2.5-pro

Transcript

speaker 1: Good morning. Thank you. Thank you all for coming. I want to talk today about Django and data science. So I started a new job this year as a developer advocate at JetBrains — you can see here with the swag — working on the PyCharm IDE, and this means I get to focus on web tooling, which is fun, but also on data science, which I'm not going to say is less fun, but it's certainly less familiar to me. And like many of us, I've been head down in the web for years now, and there's plenty to keep us occupied, as we've seen with the talks today — lots of things going on with Django, double-digit PRs every day, new advancements. But it's clear that in the wider Python world, data science has taken over and is where the momentum is. So it's been very eye-opening for me and really good to be surrounded by colleagues who focus more on data science, and to not just have an audience like this where we all agree with each other that Django is what we should focus on. JetBrains started with the Java IDE, IntelliJ, and that's still our biggest product. PyCharm is one of the top ones, but it's definitely second fiddle. And so I come to this talk having spent time without a web focus, and without Python even being the main focus, so hopefully that's a bit of an outside perspective. The TL;DR version of this keynote is that it's surprisingly easy and quite fun to train our own machine learning model — I'll walk you through that today in a Jupyter notebook. And then, while deploying a production-level machine learning model like ChatGPT takes a lot of resources and engineers, we can do it in Django very easily, and I'll show you how. And I have a GitHub repo, so you don't have to take notes or anything, but I'll do it all in one talk just to prove I'm not doing the typical hand-waving thing. So I will do a bunch of talking, but there's also real code involved as well.
And then I want to talk about what data science even means these days, because it can often seem like everything but web. You know, is it statistics? Is it AI? Is it machine learning? Is it data analysis? Is it scientific computing? We'll get into that. At the end of the day, it's all Django — excuse me, all Python — under the hood, so it's not that foreign to us. So very briefly, this is me: a slide of three books, each in their fifth edition. If you go on LibGen, the pirated-book database used to train all the LLMs, there are double-digit versions of them, so I take that as some validation — there are more than three, more like 15 out there. For the last six years, I've co-hosted the Django Chat podcast alongside Carlton Gibson, as well as co-written the Django News newsletter with Jeff Triplett, who's on the board of the Django Software Foundation. I spent much of last year building out LearnDjango.com, an online site for all my resources, and I want to do a future talk on building a payments site in Django from scratch. And then since January, I've been a developer advocate at JetBrains, focusing on the PyCharm IDE. We just launched a ton of AI features last week, so I'm happy to talk to any of you outside this talk about that. But this morning the focus is on Django and data science. So let's try to define these terms a little bit. All right, I want to do a quick show of hands. Who here considers themselves a web developer? If you could raise your hand — okay, almost everyone. All right, how about a data scientist? Okay, three. That's pretty standard. For the three of you: do you consider yourselves equally data scientists and web developers? Okay, I want to talk to you after the talk — I think that's pretty rare. Most of the time we have web developers and we have data scientists; you throw information back and forth over a wall, but there's very little actual overlap.
You know, data science can seem scary — and we'll talk about why, because there's lots of data and lots of maths — but data scientists, in turn, are terrified of the web and Django. I mean, I gave a version of this talk in Boston before to mainly data scientists, and I think we sometimes forget how intimidating websites are and how much knowledge there actually is in building them and using Django. So we'll talk about that. JetBrains has run an annual Python survey for many years now, and if you look at the top two results — you may not be able to read it — the top one, at 44%, says data analysis for what Python people do, followed by web development at 42%. And these trends have continued over five or six years. So it's pretty clear, even back in 2017 when they started this survey: this is what Python people are doing. They're doing data science or they're doing web development. But again, what even is data science? This is an old Twitter account I followed back in the day — because it's easy to feel like data science is everything but the web, but in some sense it is kind of just statistics on a Mac. And it is — I don't know, nobody seems to use Windows these days. Just one more: what do you actually do? 80% of the time, prep data; 20% of the time, complain about preparing data. This is kind of the real-world reality. If you study at university, you have these beautiful algorithms and these nice clean data sets; then you go into the real world and you spend all your time cleaning data that doesn't fit. And I saw, even back ten years ago, people with PhDs go into the real world, and it's fairly frustrating — they wanted to use all their academic mind, and they're just cleaning data the whole time. But the basic point is: we have lots of maths, we have big data which requires cleaning, and in a hand-wavy sense, that's kind of what data science is. But the interesting thing is, again, the amount of data is just hard to conceptualize.
So we're just going to use LLMs as an example here, which is just one form of data science and AI, but this shows how much data they have to be trained on, and how much public text is available in the world. GPT-3, which came out a couple of years ago, used on the order of 10^11 tokens. A token — simple explanation — is a chunk of characters, roughly a word or part of a word. We're now at around 10^15 tokens for all human-generated public text ever. Let me say that again: that's 1 quadrillion tokens. You know, Google estimated in 2010, 15 years ago, that there were 129 million books published; that's probably at least double now. There are over 1 billion websites and tens of trillions of indexed pages. Add in social media content, emails, forums, newspapers, and it's easily one to two quadrillion tokens available to train on. And the thing is, that's just text — we're not even talking about audio, video, or real-world information such as from a self-driving car. It's hard for us to conceptualize when we talk about millions of rows in a database for Django. But ultimately, this is kind of what you're doing in data science, right? You're taking unimaginable amounts of data and trying to focus it and extract insights you can use, using statistics, machine learning, and computer science. And there are common examples we see in everyday life: spam filters in email; mapping technologies with Google Maps and Apple Maps; recommendation systems; healthcare and early detection; LLMs, like we talked about; finance and fraud detection; weather prediction for farming and crop yields; on and on. And I do want to make the point that today Python is almost the dominant programming language, and certainly very dominant in the web space and the data science space. But this was not always the case — it has risen in the last ten to fifteen years. If we go back all the way to 2010:
Django was only five years old. Flask had just been released on April 1st as an April Fools' joke about how small you can make a framework. Django REST Framework wasn't released until 2011. Starlette, the lightweight ASGI framework, also from Tom Christie, didn't arrive until 2018. FastAPI, which uses Starlette, didn't come out until 2019, and Django, starting around then, also rolled out its own asynchronous support. And then Django Ninja, which is quite popular for APIs, only came out in 2020. So it seems like Python is everywhere now, but certainly when I started programming, Python and Django were not the dominant choices. And the same is true in data science, actually. Back in your time machine to 2010, R and MATLAB were much more dominant. pandas, which is now a default for data manipulation, only hit its 0.1 release in 2010. NumPy, for numerical computing — large arrays and matrices — became mainstream only really in recent years. Seaborn, used for visualization, which I'll show in the demo: 2013. Jupyter notebooks only spun off from IPython in 2014. And the machine learning tools that are common now, and that we're going to use, such as scikit-learn, only first came out in 2010; TensorFlow in 2015; PyTorch in 2016. So this talk assumes, you know, everyone uses Python for everything, but it's important to say that even in my not-super-long career, that has not been the case. All right, so let's train a model. The source code is available on GitHub, so you'll see it all at the end — don't take notes or anything. I want to talk you through how you do this if you've never done it before. This is an intentionally simple example, but the general process applies to training any machine learning model. And again, machine learning means we give the computer the inputs and the outputs, and through algorithms it comes up with the reasoning on its own. That's what machine learning means. All right?
So if you start with machine learning, you learn there are basically two big data sets for classification problems. There's the Titanic data set, which — it's a bit morbid, but it's literally who lived and died — and then the Iris data set, for which species of iris flower a sample is. Titanic is often considered the hello world of machine learning because it's clean enough not to overwhelm beginners, but messy enough that you have to use a lot of core machine learning concepts: regressions, decision trees, random forests, those types of things. For Titanic, there's a total of 1,309 passengers, with 891 survival outcomes in the training set, and you get information like passenger class (first, second, third), sex (male, female), age, number of siblings, etcetera, and then you can make predictions and parse out the data from there. But the Iris data set, which is the one we're going to use, is even simpler. In this case, there are only 150 rows — 50 sets of measurements for each of three species — and there's no missing data, so we don't have to do any preprocessing and can focus just on building the model. In short, if you're starting out, I would recommend beginning with Iris and then moving to Titanic. We're skipping all of the data cleaning, which is a big part of machine learning, just to focus on the model itself. Okay, first step: a Jupyter notebook. There are multiple ways to do this: you can do it on the web at jupyter.org, you could use Anaconda, or you could use a text editor — PyCharm and VS Code have their own versions as well. We're just going to use PyCharm here, but it doesn't really matter where you run your Jupyter notebook. So I know you can't see this, but this is what you get if you create a new Jupyter notebook project in PyCharm. It comes with data and models folders; there's a requirements.txt file, a README file, and a sample .ipynb file — that's where the notebook is. It's not uncommon to train and retrain multiple models; that's why you have a models folder.
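As a quick aside — and as the talk notes a moment later — iris is so standard that it ships inside scikit-learn itself, so you can peek at it without downloading any CSV. A minimal sketch, assuming scikit-learn is installed:

```python
# Inspect the bundled iris data set: 150 rows, 4 feature columns, 3 species.
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)           # (150, 4)
print(list(iris.target_names))   # ['setosa', 'versicolor', 'virginica']
print(iris.feature_names)        # sepal/petal length and width, in cm
```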
But we're just going to focus on one model here for simplicity. So this is what we're working with. This is what the iris flower looks like: there are three different species — setosa, versicolor, virginica — and they have different petal and sepal measurements for width and length. It's a balanced data set, which makes things a lot simpler for us. The goal is to train a model so that we can enter our own petal and sepal measurements, and it will predict for us which flower that is. This is another look at the CSV file: on the far left you have the Id, and then you have sepal length in centimeters, sepal width, petal length, petal width, and then species. I'd mention, if you do this yourself, that there are actually different versions of this data set online, so you get slightly different CSV files — which blew up a couple of days of my life. So just be aware: this one comes from Kaggle, but they're all slightly different for some reason. The data set is so common that it's included by default in machine learning libraries like scikit-learn, R, and MATLAB, because again, it's sort of where you start when you're getting your feet wet. All right, let's do a little code. So what do we do? Install two packages. We're using pandas, a library for data manipulation and analysis using DataFrames, that kind of thing, and then scikit-learn, a machine learning library for data mining, preprocessing, and model building. We'll use both here, and again, I'm making this as simple as I can. Okay, I'm going to try not to dryly talk through code, but there's a little bit of code. This is just one Jupyter notebook — again, the source code is on GitHub — but just to talk you through how you train a model: we import the libraries at the start — pandas to load and manipulate the data set, and train_test_split from scikit-learn to split the data set into training and testing sets.
That's something you do in machine learning. Then accuracy_score to evaluate our model and see how well it performs; SVC — that's support vector classification — to train an SVM, a support vector machine classifier, which I'll talk about in a sec; and then joblib to save and load our model. A joblib file is a binary file that stores a serialized Python object; the library comes along with scikit-learn. The end result of what we're going to do is a joblib file that we can move over to Django and show to people who view the web page. So, an SVM classifier: supervised means trained on labeled data, which is what we have here. In the real world, you often have unlabeled data, which would be unsupervised learning. So: we load the data set, creating a variable df here to read the iris CSV file into a pandas DataFrame. We extract the features as columns and rows — there are four feature columns and three species labels — and then we split it into training and testing data: 80% training, 20% testing. 80/20 is a common split; you could use a different split, and in the real world you would, depending on your needs. 70/30 is more conservative if you really want to be accurate; 90/10 would be for large data sets, where we don't need as many tests. Then we train the SVM model — this is where all the action happens. We create an SVM classifier, setting gamma to auto. That's the kernel coefficient: how tightly our model fits around the data. A lower value is smoother and might underfit; a higher value is wigglier and might overfit. We can just use auto here and not worry about that. Then model.fit() trains the SVM model on our training data. Then we make predictions to test the model and evaluate its accuracy, and finally we save the model using joblib and reload it. So we've trained it once and then we have it — we don't have to retrain it every time we want to use it.
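The steps just described can be sketched as one short script. This is not the talk's exact notebook: the talk reads a Kaggle CSV with pd.read_csv("Iris.csv"), while this sketch rebuilds an equivalent DataFrame from the copy bundled with scikit-learn so that it runs anywhere; everything else follows the description above.

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Rebuild the iris table (the talk uses pd.read_csv("Iris.csv") instead).
iris = load_iris(as_frame=True)
df = iris.frame                          # 150 rows: 4 features + target

X = df[iris.feature_names]               # sepal/petal length and width
y = iris.target_names[df["target"]]      # map 0/1/2 to species names

# Split into 80% training and 20% testing data, as in the talk.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Support vector classifier with the kernel coefficient left on auto.
model = SVC(gamma="auto")
model.fit(X_train, y_train)

# Evaluate on the held-out test set.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Accuracy: {accuracy:.2f}")

# Serialize the trained model so the Django app can load it later.
joblib.dump(model, "iris.joblib")
```

The joblib.dump call at the end is the whole bridge to the web app: the resulting file is what gets copied into the Django project.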
All right, this is the last little bit. We get the user input for predictions — I'll show you what this looks like in just a sec. There are prompts to enter the four values — sepal length and width, petal length and width — and then we try to predict the species based on the user input. All right, let me just show you. Whoops. I had a live demo — where'd it go? Well, that's too bad, huh? That's two hours of my life I don't get back. You'll just have to trust me on this — you can load it yourself: if you hit run in the Jupyter notebook, you can enter the inputs and it will show a prediction. That is deeply unsatisfying. You know, when I gave an earlier version of this talk, I was flipping between screens to the live Jupyter notebook, but from Tim's example and others, that never worked, so I thought loading a video would work. But okay — trust. Oh — it's not playing on mine. Oh, wild. Yeah. So you can see, this is in the Jupyter notebook: we're entering our four values, and if you scroll down to the bottom, in this case it says versicolor, 97% accuracy. Thank you, Adam. Okay, so it does work, but it doesn't show here — that's very odd. Okay, but now we can do something cool: we want to visualize our data and our model, so we install seaborn and matplotlib to help us do that. Then we add a new cell to our Jupyter notebook, import both, and run a basic pair plot. There could be a whole talk on what a pair plot is, but it's basically an easy enough default to get some visualizations — I'll show you in a second. It creates a grid of scatter plots and histograms. This is a very basic visualization, but you can see the clumping of the data, and this is actually important. In the top left is the Id, so you can ignore that top row. Then you have visualizations for each of the four features. Some are nicely separated, but look at sepal length and sepal width here.
And you can see that orange and green are clumped together a bit, whereas blue is separate. That's actually good, because it creates a challenge for our model — if they were all cleanly separated, there wouldn't be much for our machine learning model to do. It makes our classifier work a little bit harder. And then these are the bottom two plots, for petal length and width. Again, you can see clumpings for orange and green — versicolor and virginica — whereas setosa is on its own. The important thing is that we now have a trained model. It exists as an iris.joblib file, and now we can deploy it in Django as a web app. So now we get to the comfortable space. This is our game plan: we're going to create a new Django project, load the joblib file, add forms so the user can make predictions, store user info in the database, and then maybe I'll show you deployment. In an earlier version of this talk I waited until the end, but if you pull out your phone or laptop right now, you can see what we're going to build at djangofordatascience.com, and that may help as we talk through the code that's coming — so I recommend you check that out. Is it going to work? This is what you will see: a web page where you can enter your measurements and make a prediction, and it varies depending on the type of flower. Again, this is basically just our joblib file thrown into a Django website — and we're storing the information too. In a different version of this talk I would show you the live admin as you're typing things in, to prove this all works, but hopefully you trust me that it actually works. So that's what we're going to build, and this process will apply to any model that you make with a relatively basic Django website. Yep — let me get my thing over here. Okay, so I'm going to walk through the process for a new Django project.
I'm going to go a little bit fast, but I do want to show all the steps. Many people in this audience are very familiar with Django, and this part is not as interesting as the more technical deep dives, but for anyone watching or new to Django — I remember being so frustrated when someone waved their hands and skipped a step. So I'm going to go through the steps. I might go a little fast, but all the code is in the repo, and I want to show how you do it, because it's not that many steps, and it's the same thing again and again. So we're going to create a new Django project from scratch. If you did it in the terminal, you would create a new directory on your computer, a new Python virtual environment, run django-admin startproject, then startapp to create a predict app, update INSTALLED_APPS in settings.py, and create a Git repo. If you're in PyCharm Pro, you can do it all from one screen — but again, it doesn't really matter how you do it. All right, so this is the layout of our new Django project, and it's one I would recommend in general for your projects. We have django_project — that's our project folder. People can call this anything; I like to just call it django_project, but that's a whole separate forum discussion. predict — this is our app, and it's where we're going to put our focus. templates — for our template files. And I think I sometimes forget, but all of this structure exists for us as Django developers: Django doesn't care, the computer doesn't care. You can have one app or no apps — people in this room can and do do very different things with this. But if you're new to Django, or you're just doing a kind of vanilla version, I think this is about as safe as it gets. Again, this is just one approach; Django doesn't care how we structure things, but I like to take advantage of the project and app separation, and we're going to do it that way.
So you go along, you run your project — python manage.py runserver — and you get the Django welcome page, which we all know and love. And then we just want the iris.joblib file, so we copy it over — you can see it in the middle there — to the project level, into our application. That's all we have to do. Again, if we had multiple models, we'd have a models directory, and those are more complicated real-world setups, but for demonstration purposes we just pull over the one model we want to work with. You'll probably want to run python manage.py migrate to get rid of all the warnings about unapplied migrations. And then, as ever, we need URLs, views, and templates. The order really doesn't matter — and that really trips up beginners — but I like to start with URLs, so we'll do that. We just have the project-level urls.py file here: the path is an empty string because we're just going to have it at the home page, and we're including the predict app's URLs, which we'll see down below. Again, I gave this talk previously to a data science crowd, so I focused a lot on the Django piece, but I'll go a bit faster because this is a Django crowd. Then we add a view — a function-based view. There are whole talks on function-based versus class-based views, but we'll just do a function-based view, call it predict, and render a template called predict.html. Then we create the simple template. To step through this iteratively, we don't do anything other than have "hello" in it for right now. Run the server — Bob's your uncle, you're good. Okay, so now we get to the interesting stuff. This is the views file, and it's not that scary — this is really where things happen. At the top, we install joblib and NumPy, and then we load the model from our base directory.
We handle the POST request from the form, with four inputs to match the four features, make a prediction using a NumPy array, and then return it as a variable, prediction, that we send to our template. I should note that we also need to install scikit-learn if we want to load the joblib file we created with it, so separately, in the new requirements.txt, you're going to need scikit-learn — again, that code is in the repo. Okay, we update our template file. This is a little bit hacky — some basic CSS — and then here's the form where the user can put in their guesses, or their measurements. And then we get this: our basic form. Enter in predictions, and here are the results — if you entered one, two, three, four, you would get Iris virginica. Now, one, two, three, four are terrible values — that's not actually what the widths and lengths are for some of these measurements, which is why on the live site I put in some boundaries — but it's fine for demonstration. All right, let's keep going in the view. Now we add a dictionary called form_data to store the user inputs, because it's nice to keep what people entered. This is often how it goes when you build a machine learning model: you test it on users, you see how it works, and then you iterate on it. For example, if you had a recommendation engine, you would build it, test it on users, store that information, retrain the model, and repeat the feedback cycle. This is why adding storage in the database is important — and I thought it was kind of cool how easy it is to do; I'll show you in a second. The dictionary is populated with values from the form in the request, and we store them because Django clears form fields by default. Then we pass this form_data to the template context at the bottom of the file so it can be rendered on the page. And then finally, we add the inputs and the form data — nope, that's not correct. Why am I showing this again?
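Pulled together, the view described above might look roughly like this. It is a sketch, not the talk's exact code: the template name predict.html and the use of joblib plus a NumPy array come from the talk, while the input field names, the model path, and the helper function are assumptions.

```python
# predict/views.py (sketch)
from pathlib import Path

import joblib
import numpy as np

# The talk copies iris.joblib to the project level; adjust to your layout.
MODEL_PATH = Path("iris.joblib")

# Hypothetical form field names, matching the four features.
FIELDS = ("sepal_length", "sepal_width", "petal_length", "petal_width")


def predict_species(model, sepal_length, sepal_width, petal_length, petal_width):
    """Run the four measurements through the model as a 1x4 NumPy array."""
    features = np.array([[sepal_length, sepal_width, petal_length, petal_width]])
    return model.predict(features)[0]


def predict(request):
    # Imported here so the sketch stays importable without Django configured.
    from django.shortcuts import render

    model = joblib.load(MODEL_PATH)
    prediction = None
    form_data = {}
    if request.method == "POST":
        # Django clears form fields on render, so keep the raw inputs around
        # and pass them back to the template alongside the prediction.
        form_data = {name: request.POST.get(name) for name in FIELDS}
        prediction = predict_species(
            model, *(float(form_data[name]) for name in FIELDS)
        )
    return render(
        request, "predict.html", {"prediction": prediction, "form_data": form_data}
    )
```

The predict_species helper is split out purely so the prediction step can be exercised without a running Django server.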
We'll just skip that. All right — very ugly, but here's where we are. This is cool. I feel safer. Now let's talk about the models. Obviously, if we want to store data in a database, we need a models.py file. We create one model here called IrisPrediction: just FloatFields for the four inputs, and we also store the prediction. And just for the heck of it, we'll add a created_at date, and then a __str__ method to show the prediction and the date and time it was made. If you were doing this step by step, you'd run makemigrations here, then migrate. And then — this is almost the last slide with code, I promise — we update the view to save the prediction: we import the model at the top, and then we call IrisPrediction.objects.create to save the prediction to our database. And this is the last bit of code: we update the admin so we can view it. Again, if you were doing this from scratch, you'd create a superuser account and log into the admin; it would look something like this — very vanilla, but very functional, and you can customize it as you want. All right. I had a version of this where I was going to show you how to do deployment, but then I realized that's probably a 40-minute talk on its own, so I just want to give you the short version, which is this deployment checklist you can and should use. For example, I was able to take the live site and, in 15 minutes, put up the version you have now on a custom domain, because I've done this a jillion times. So I'll quickly talk through it — you can do it differently, but I think this is pretty much the bare basics for a not-wildly-insecure site. Configure static files. Environment variables — a lot of people like django-environ; I'm partial to environs; it really doesn't matter as long as you have environment variables. Create a .env file. Update your .gitignore file to ignore the .env file — otherwise, what's the point? Update your settings, right?
So: DEBUG, ALLOWED_HOSTS, SECRET_KEY, CSRF_TRUSTED_ORIGINS; update DATABASES to run Postgres in production; install psycopg. If you install environs with its Django extras, there's a configuration that pulls in dj-database-url and extra goodies that handle that for you. Then a production WSGI server — Gunicorn — and a Procfile, because this is Heroku, though that varies depending on your hosting provider. Update the requirements.txt file, and then create a quick Heroku project, push the code, and start a dyno process. It seems like a lot — I can never remember any of this — but that's why we have checklists, and I feel pretty strongly about recommending this for a basic, not-wildly-insecure setup. Of course, you could do a million more things. This is the last slide, I promise — here are the takeaways. Django is great for deploying machine learning models. I think most data scientists just want what I showed you: storage in a database, forms — all the basic features Django gives you out of the box. There's often this sense that Django is hard to use, and so they use Flask, or maybe FastAPI, just because they think Django is difficult. No disrespect to those frameworks — they have their uses, and if you know them, use them — but Django is built for this use case. Just take a model and forms: Django gives you everything you need out of the box, and you can more or less follow the code here and apply it to almost any basic machine learning model. Iris is a great data set, and Titanic too. It's really fun to train machine learning models — you don't have to know all the maths to do it. To use it, you don't really need any of the maths; to understand it, you do. But you can go a long way just playing around and following tutorials. And then deploy it in the real world, right? If you have a machine learning model, there's no sense having it sit in a Jupyter notebook locally.
You can easily share it with friends, colleagues, and others in a real-world setting. This is what data scientists want to do: take their model, expose it to users, and do that iterative loop of retraining it. Okay. Thank you for your patience. Happy to take any questions.
speaker 2: Thanks, Will — super talk. Can I just pick up on that point at the end: do you think we're failing to market Django to the data science community?
speaker 1: 1000%. Yeah. I mean, I don't know that other web frameworks are doing a better job, but again, for us in this room Django doesn't seem so scary and difficult — and I'm telling you, people with PhDs in machine learning are scared of web development and Django. And it's, you know, batteries included — it could be presented better. The whole point of this talk was to show that it's really not that much code to train a model or to do the Django bit, and the process is the same. So hopefully this helps market Django a bit better for that. And you don't have to install a third-party forms library — Django comes with most of what you need in a basic setting to deploy your model. Thanks, Will. [Audience question] In the example here, you trained the model and then provided it to users. Are there any gotchas you can think of if you wanted your users to be able to train their own models, to then show or provide to other users? Is there anything different you would do? — Probably, but I can't speak to it off the top of my head.
speaker 2: Well, you went through all the steps to set up the, you know, showing off of your model in Django, and showed how accessible that actually is — you didn't skip through anything. But then in the deployment checklist — yeah,
speaker 1: as you say, you know that's .
speaker 2: that's a quick process for you. You've done it a lot. But I think to someone new to web development, that process has actually got a lot of watts and a lot of hairs. And is there anything to make that more accessible for new people?
speaker 1: I mean, there are books — I've written a few. I would like to have a more in-depth, step-by-step guide. The thing is, it depends on the project you have. Because you have so much flexibility with Django, you have to know what the project is before the steps you mention apply, and if one thing is off for a newcomer, they're going to get totally frazzled. So, for example, in my Django for Beginners book, I show you how to do a bunch of projects and I show you all the steps. But yeah, I do think about this. I think Django has a great deployment checklist; I would like to make it more accessible. But I have seen that beginners' projects are different enough that they get tripped up, so unfortunately it's difficult to say "this is exactly how you do it" unless I know exactly what your project is. But I completely agree. You know — read a book and you've got it. Thank you. [Audience] Great talk. Just a wild idea slash suggestion: perhaps this could be a great official Django tutorial, to live alongside the existing one, that you could send to data science folks and say, hey, it's not that hard — and that would help push Django to more people. That's all. [Speaker] And we could have a hello-world tutorial that's simpler than the polls tutorial while we're at it as well. Yeah — I mean, I'm not on the board or important anymore, so I'm happy to give it to Django
speaker 2: if they want it. I realize you glossed over it in the talk, but you spoke briefly about this django_project thing. Where should I look to learn more about that particular convention, or that idea? Because I've seen it in a few places, and it looks like it solves one of the continual problems I come across myself.
speaker 1: Any tutorial or book I've written on learndjango.com has that pattern. I mean, I feel less comfortable saying everyone should use it. I use it because I see that people name their Django project different things, and to me, I just want to know what the project and what the apps are. You know, in a real-world repo the structure is a bit different, but if you have six or ten apps, I just like to see the name and project. So that's more of a personal thing. I used to call it config because my friend Jeff Triplett likes that approach, but other things are config too; in projects I often have a config directory. So yeah, I'm partial to it. I don't feel like that's the one way to do it, but for me, just having it called something-project is helpful.
speaker 2: Thank you. Thanks, Will. Great talk. This was a small model and you committed it in the repo.
speaker 1: Yeah, most models are big, and we don't want to commit them and make our Git repo giant, and we have to push updates. What would be the next steps you'd take for larger models? Yeah, so I have the same question. I mean, I would love to know where that limit is with a Jupyter notebook, because I think it's actually a lot bigger than we think. So in the broader world, data scientists think of themselves as not great programmers, like below web developers, and I think we're relatively low on the spectrum, not like nuclear submarine programmers. I don't know exactly. I would love to know at what point, like, what is the limit of a Jupyter notebook? And then when do you write Python scripts, and how do you do all those other things? So yeah, if I do another version of the talk, or if somebody knows, please tell me. But that's a very good question. I have the same one. Okay, thanks. We actually have an online question, so I'll read it out. If Django were to be marketed better to data scientists, can you imagine it becoming one framework to rule them all for data science and web development? I'm old enough to say no, I don't think there's ever going to be one to rule them all. But I do think what data scientists want is CRUD with auth, with guardrails, that just works. And so I would say, if you're not comfortable with web development, Django is great for you because it just gives you batteries. It gives you things to do. It doesn't require you to be an expert. It doesn't ask you to make some of the decisions that other web frameworks do. So I think it should be the top default. I think it often is not, and that's probably related to just not having tutorials or ways to do it. I think also, again, there is this perception that Django, being batteries included, is really hard to learn, whereas Flask is considered simpler. And you know, the first part of Flask is simpler.
And if you need more advanced stuff, maybe you want all the flexibility of Flask. But if you're right in the middle and you want CRUD and auth and forms and stuff that just works, I think Django, yeah, should be more prominent. But of course, I'm biased. So, thanks for the talk. Can you comment a bit about real-world, let's say, examples of using the model? If the resulting file size is big, what about the performance of the whole thing? How much time would it take to actually train it, and how much time would it take to query the thing and get the results back? Thank you. Yeah, that's a great question. I don't have a great answer because I'm not a data scientist, but I'm spending this year learning a lot about data science, so hopefully maybe next year I'll have a better answer for that. Yeah, that's a very good question. Sorry, I don't have the answer. Thank you. Going a bit more in depth for long-term project maintenance: you used joblib here, and joblib uses pickle under the hood, or a replacement, or something like that. How do you ensure that the model you trained once will run on whatever future Python or scikit-learn version? So, I didn't mention that you could use pickle or joblib, and joblib is preferred for larger data sets. I don't know the answer to that question. That's a really good one. I mean, there's still so much of data science that's mysterious to me, to be honest. Most of what I know is up here. But yeah, I want to find out. I haven't found resources of people talking about deploying models outside of, you know, massive, massive scale, because I think people just don't do it that much. But yeah, that's a great question. I'll research it, but I don't know. Thank you.
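One hedged mitigation for the versioning question above, a sketch rather than an established best practice: record the library versions next to the serialized model and compare them at load time. The helper names and the `.meta.json` sidecar file are illustrative conventions, not anything from the talk.

```python
import json
from pathlib import Path

import joblib
import sklearn


def dump_with_versions(model, path):
    """Serialize the model and record the library versions used to train it."""
    path = Path(path)
    joblib.dump(model, path)
    meta = {"sklearn": sklearn.__version__, "joblib": joblib.__version__}
    # Sidecar file, e.g. iris.joblib -> iris.meta.json
    path.with_suffix(".meta.json").write_text(json.dumps(meta))


def load_with_check(path):
    """Reload the model, warning loudly if scikit-learn has since changed."""
    path = Path(path)
    meta = json.loads(path.with_suffix(".meta.json").read_text())
    if meta["sklearn"] != sklearn.__version__:
        print(f"warning: model trained under scikit-learn {meta['sklearn']}, "
              f"now running {sklearn.__version__}")
    return joblib.load(path)
```

This doesn't solve the underlying pickle-compatibility problem, but at least it makes a version mismatch visible instead of silently loading a possibly broken model.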
speaker 2: Thank you for the talk. In this demo, you used a CSV file as the source of data for training the model. In a Django project, you usually have a lot of data in your database. How easy is it to use something like a queryset instead of a CSV file as the source of data?
speaker 1: I don't know exactly. I could make predictions, but I haven't done it myself. So yeah, that's another great question. Again, even for me presenting this, I proposed this talk with the idea of, how hard is it? How hard can it be? I mean, to data scientists, the idea that you can just move the joblib file over is sort of mind-blowing, because they're used to big production scale. So that's a great question I have. I want to push it further and see where that limit is, how much we can put just within a standard Django structure. I would love to do a demo that retrains the model, because that was feedback I got from an earlier version: hey, that's what we do in the real world. We have the model, users, the data from the database, and we retrain the model. Maybe next year I'll
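As a hedged sketch of the queryset idea (not something demonstrated in the talk): a QuerySet's `.values()` yields plain dicts, which pandas can consume directly, so the notebook workflow barely changes. `Flower` and its field names are hypothetical; the list below simulates what `.values()` would return.

```python
import pandas as pd

# In a real project this would come from the ORM, e.g.:
#   rows = Flower.objects.values(
#       "sepal_length", "sepal_width", "petal_length", "petal_width", "species")
# Here we simulate what .values() yields: an iterable of dicts.
rows = [
    {"sepal_length": 5.1, "sepal_width": 3.5,
     "petal_length": 1.4, "petal_width": 0.2, "species": "setosa"},
    {"sepal_length": 6.7, "sepal_width": 3.0,
     "petal_length": 5.2, "petal_width": 2.3, "species": "virginica"},
]
df = pd.DataFrame(rows)

# From here the training workflow is unchanged: features vs. label.
X = df.drop(columns=["species"])
y = df["species"]
```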
speaker 2: have a demo showing that. You started by saying these datasets are the hello world of data science. Can you tell us a little bit more about what people in data science see as not just the hello world dataset, but the hello world problem? For us, we know what it is: get a web page up that's displaying something we want from a database. How should we be thinking about data, and data we could use, or data we could try and construct for purposes like this? So Kaggle is like
speaker 1: a big one, one of the big places with tons and tons of datasets that you can use. I think it depends what you're trying to do. I mean, if you're a data scientist, again, Iris is good here just because it's easy; you don't have to preprocess it or clean it, and so much of what you do is around that. So I hope I'm answering your question correctly. A lot of what you're doing is cleaning data and trying to get the prediction accuracy up: you know, how is the data clumped? Which classifier do you use? That's a lot of what data scientists do. I feel like I'm not answering your question directly, though. That's, you know, for me to go further. I mean, there are a number of books. The thing is, it can be overwhelming if you read an entire book on, you know, pandas, and an entire book on scikit-learn. I would recommend people go to Kaggle and follow tutorials and learn the basic cleaning and training steps.
speaker 2: Yeah, I just realized that actually hello world is the wrong metaphor. The right metaphor is the first thing we do in a tutorial, which is, for example, make a to-do app in Django, or make a polls app. So what's
speaker 1: the data science equivalent?
speaker 2: Yes.
speaker 1: That's what I mean, it's basically what I showed. And again, I played around with a few; that's as simple as it is. We actually do something. And the fact that two of the three species are clumped together means your model has to do some work, because of course, if the data were completely separable, you wouldn't need a model for it. So yeah, Iris is as simple as it gets. And then most people focus on Titanic, because it's busy enough and big enough that you can get a taste of the problems you encounter as a data scientist without it being overwhelming. But it's so morbid. Really. Thank you for the talk. It was a great talk. I'm going to put you a little bit on the spot, and I blame Carlton, who said we should do that today. You showed the results of the Django Developers Survey for the past few years. Do you know when we are going to get the results for 2024? Soon. I've seen it and reviewed it months ago. JetBrains does a lot of work, and a lot of teams, to put it out. But soon, I hope, and I hope that cycle improves; it's there, it's just inching its way through the process. But yeah, that's a good question. Thank you. I will say there wasn't anything crazy that changed. If there was, maybe we'd have put out a code red, you know, in the responses. But for next year, I think we'll probably ask about uv for package management. And maybe, actually, on the forum, I want to get some more community involvement around the questions. I mean, the board runs it now, but I'm trying to help make sure we ask the right questions, because it does matter. It does help. We saw that Redis had a lot of support, so there was work done to make that official for caching. It is kind of our only public way to get feedback.
speaker 2: Thanks again for the talk. So this presentation used a bunch of pre-made data, and obviously it's data science. But do you see value in there being some sample datasets for Django itself, so people can use them, maybe in the books, with examples of real data to play with, and things like that, to demonstrate parts of the ORM a bit more easily?
speaker 1: Yes. I mean, I think the challenge is, you could just make up data; it's nicer if it's actually real world. But yes, especially if we had a tutorial, sort of a get-your-feet-wet thing, something beyond just Iris and Titanic would be great. Yeah, I don't know why we don't. Yes, we should do that. Okay, there are no more questions. Thank you, Will. Thank you, everyone.

Detailed Summary

Generated 2025-06-21 18:56

Executive Summary

This talk, delivered by William Vincent, a developer advocate at JetBrains, argues that deploying machine learning (ML) models with Django is a surprisingly simple and efficient process. It aims to bridge the knowledge gap between web developers and data scientists, showing that Django's batteries-included features, such as built-in forms, the ORM, and the admin, offer enormous value to data scientists unfamiliar with web development.

The talk walks through a complete example in two steps. First, in a Jupyter Notebook, a support vector machine (SVM) classifier is trained on the classic Iris dataset using Pandas and scikit-learn and saved as a .joblib file. This process is deliberately simplified, skipping complex data cleaning to focus on the core train-and-deploy workflow. Second, the talk shows how to create a Django project from scratch, integrate the trained model file, load the model in a view, handle user form input, make real-time predictions, store the user's inputs and the predictions in the database via Django models, and finally inspect them through the admin.

The conclusion stresses that although data scientists may find web development intimidating, Django's batteries-included nature makes it an ideal choice for deploying ML models. The talk also notes, however, that the Django community has done too little to market itself to the data science field, a market with huge untapped potential.


Introduction: Bridging the Gap Between Django and Data Science

  • Speaker background: William Vincent, currently a developer advocate for the PyCharm IDE at JetBrains, focusing on web tooling and data science.
  • Core observation: in the wider Python world, data science has become the dominant force. Yet a clear wall stands between web developers and data scientists; many data scientists with PhDs are "terrified" of web development and Django.
  • Historical context: the speaker notes that Python's dominance in web and data science did not happen overnight. Around 2010, R and MATLAB held the advantage in data science, while many of today's core Python libraries (Pandas, scikit-learn, TensorFlow, PyTorch) and web frameworks (such as FastAPI) were either unreleased or far from mature. This background matters for understanding the current ecosystem and bridging the knowledge gap.
  • Goal of the talk: prove, through a complete end-to-end example, that training a basic machine learning model and deploying it with Django is both simple and fun, and thereby break down the barrier.

Part 1: Training the Machine Learning Model

This part focuses on training a simple classification model in a Jupyter Notebook.

Tools and Dataset

  • Environment and libraries
    • Jupyter Notebook: for writing and executing code interactively.
    • Pandas: for data handling and analysis.
    • Scikit-learn: for building, training, and evaluating machine learning models.
    • Joblib: for serializing (saving) the trained model to a file so it can be loaded later in other applications.
  • Dataset choice: Iris
    • The speaker chose the classic Iris dataset because the data is clean (150 rows, no missing values), which lets beginners focus on model training itself rather than complex preprocessing.
    • Practical tip: several versions of the Iris CSV circulate online with slightly different contents, so pay attention to the file's source to avoid problems.

Core Training Workflow

  1. Load and prepare the data: load the CSV with Pandas.
  2. Split the data: call scikit-learn's train_test_split function to divide the dataset into an 80% training set and a 20% test set.
  3. Train the model: choose a Support Vector Machine classifier (SVC) and call .fit() on the training data.
  4. Evaluate and predict: according to the speaker, the trained model reached 97% accuracy on the test set.
  5. Save the model: use joblib.dump() to save the trained model object as a binary file named iris.joblib.
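The five steps above can be sketched roughly as follows. For self-containment this uses scikit-learn's bundled copy of the Iris data rather than the CSV from the talk, plus default SVC hyperparameters, so the exact accuracy may differ from the 97% quoted.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# 1. Load the data (bundled dataset standing in for the CSV).
X, y = load_iris(return_X_y=True)

# 2. 80/20 split, with a fixed seed so the run is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Train an SVM classifier.
model = SVC()
model.fit(X_train, y_train)

# 4. Evaluate on the held-out test set.
accuracy = model.score(X_test, y_test)

# 5. Serialize the fitted model for the Django app to load later.
out = Path(tempfile.mkdtemp()) / "iris.joblib"
joblib.dump(model, out)
```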

Data Visualization and Extending the Model

  • Visual analysis: a pairplot built with seaborn and matplotlib shows the relationships between the features. Some data points overlap, which gives the classifier real work to do and justifies using a machine learning model at all.
  • Extension discussion (from the Q&A)
    • Large models: asked how to handle models too large to commit to a Git repository, the speaker acknowledged this as an important practical problem but offered no definitive best practice, and said he wants to explore the upper limits of what a Jupyter Notebook can handle.
    • Long-term compatibility: on the compatibility of joblib (which uses pickle under the hood) across Python or library versions, the speaker likewise acknowledged its importance but said he does not yet know the best practice.

Part 2: Deploying the Model with Django

This part details how to integrate the trained model into a freshly created Django web application.

Project Setup and Model Integration

  • Goal: build a site where a user enters the four petal and sepal measurements and the site returns the predicted iris species.
  • Django project setup: follows the standard project-creation flow (startproject, startapp).
  • Integrating the model: copy the generated iris.joblib file directly into the Django project and load it in views.py with joblib.load().

Building the Web Interface and Logic

  1. URL, view, template: a standard URL route, a function-based view named predict, and a predict.html template.
  2. View logic (views.py): handle the POST request from the HTML form, collect the user's input, convert it to a NumPy array, pass it to the loaded model for prediction, and render the result in the template.
  3. Data persistence:
    • Define an IrisPrediction model in models.py to store the user's input and the model's prediction.
    • Update the view so that each successful prediction is saved to the database with IrisPrediction.objects.create().
  4. Admin: register IrisPrediction with the Django admin so all historical predictions can be browsed easily.
  5. Data-source extension (from the Q&A): asked about using a Django QuerySet rather than a CSV as the training data source, the speaker called it a promising direction he has not yet tried, leaving it open for future exploration.
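A rough sketch of the view logic summarized in steps 2 and 3 above. The form field names and the IrisPrediction fields are assumptions, and the Django-dependent parts are shown as comments so the data-handling core stays self-contained and testable.

```python
import numpy as np

# Field names are assumed to match both the HTML form and the model's
# training-time feature order; order matters to the classifier.
FEATURE_FIELDS = ("sepal_length", "sepal_width", "petal_length", "petal_width")


def features_from_form(form_data):
    """Convert submitted strings into the 2-D float array SVC.predict expects."""
    return np.array([[float(form_data[name]) for name in FEATURE_FIELDS]])


# In views.py (sketch; IrisPrediction and the template name are assumptions):
#
# MODEL = joblib.load(BASE_DIR / "iris.joblib")  # loaded once at import time
#
# def predict(request):
#     prediction = None
#     if request.method == "POST":
#         prediction = MODEL.predict(features_from_form(request.POST))[0]
#         IrisPrediction.objects.create(
#             **{f: request.POST[f] for f in FEATURE_FIELDS},
#             predicted_species=prediction,
#         )
#     return render(request, "predict.html", {"prediction": prediction})
```

Loading the model at import time rather than per request avoids re-reading the joblib file on every prediction.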

Production Deployment Checklist

The speaker offered a concise checklist for a production setup that is "not wildly insecure":

  • Configure static files (STATIC_ROOT).
  • Manage secrets with environment variables (e.g. via django-environ).
  • Update settings.py: set DEBUG=False, ALLOWED_HOSTS, SECRET_KEY, CSRF_TRUSTED_ORIGINS, and so on.
  • Use a production-grade database (such as PostgreSQL) and install its driver (such as psycopg2).
  • Use a production-grade WSGI server (such as Gunicorn).
  • Create a Procfile (for platforms like Heroku).
  • Maintain a requirements.txt file.
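A minimal sketch of what the settings.py side of this checklist might look like, using os.environ directly rather than django-environ; all variable names, defaults, and the example domain are illustrative, not from the talk.

```python
import os
from pathlib import Path

BASE_DIR = Path(__file__).resolve().parent

# Read deployment values from the environment; defaults keep local dev working.
DEBUG = os.environ.get("DJANGO_DEBUG", "false").lower() == "true"
SECRET_KEY = os.environ.get("DJANGO_SECRET_KEY", "dev-only-not-for-production")

# Comma-separated list, e.g. DJANGO_ALLOWED_HOSTS="example.com,www.example.com"
ALLOWED_HOSTS = [
    h for h in os.environ.get("DJANGO_ALLOWED_HOSTS", "").split(",") if h
]
CSRF_TRUSTED_ORIGINS = ["https://example.com"]  # replace with your real domain

# Target for `collectstatic` in production.
STATIC_ROOT = BASE_DIR / "staticfiles"
```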

Key Takeaways and Discussion

  • Main conclusion: Django is an excellent tool for deploying ML models; its batteries-included features (forms, ORM, admin) are exactly what data scientists need, and the whole deployment process is more straightforward than expected.
  • Under-marketing to data scientists: the speaker fully agreed with the audience that the Django community has failed to promote itself effectively to the data science world, a sizable missed opportunity.
  • Official tutorial suggestion: an audience member suggested turning this material into an official Django tutorial to attract data scientists. The speaker was open to the idea and joked that a Hello World tutorial simpler than the polls app could be built while they're at it.