2025-06-03 | AI Engineer | The Future of Qwen: A Generalist Agent Model — Junyang Lin, Alibaba Qwen
Qwen3 Release: New Hybrid Thinking Mode and Upgraded Multimodal Capabilities
Tags
Media details
- Upload date
- 2025-06-10 12:49
- Source
- https://www.youtube.com/watch?app=desktop&v=b0xlsQ_6wUQ
- Processing status
- Completed
- Transcription status
- Completed
- Latest LLM Model
- gemini-2.5-pro-preview-06-05
Transcript
speaker 1: Hey everyone, I'm Junyang from the Qwen team at Alibaba Group, very happy to be here at AI Engineer World's Fair 2025, and I'm excited to share some progress about Qwen. I think you all know about Qwen, and many of you are developers, so I'm happy to share a bit more with you. Qwen is a series of large language models and large multimodal models, and our dream is to build a generalist model and agent. Before I start, I'd like to share some important links. You may know our product, our chat interface QwenChat, at chat.qwen.ai. It's very easy to use: you can try our latest models, use our multimodal models by uploading images and videos, and interact with our omni models through voice chat and video chat. There are also important features like Web Dev and Deep Research; welcome to try them. If you'd like more technical details, check our blog at qwen.github.io. Whenever we release something new, we publish a blog post, so you can follow the technical details there as we keep open-sourcing. Our code is on GitHub and our checkpoints are on Hugging Face, so you can download the checkpoints and play with the models. Feel free to check out our websites and enjoy them.

This year, just before the Spring Festival, we released a very good instruction-tuned model, Qwen2.5-Max. I think it's a very strong base among large language models; it's a very large MoE model, and across multiple benchmarks it achieves very competitive performance against the state-of-the-art models of that time, including Claude 3.5 Sonnet, GPT-4o, and DeepSeek V3. But we believe a large language model has more potential than just becoming an instruction-tuned model: it can get smarter and smarter with reinforcement learning. So we dived into reinforcement learning research, and it is really amazing to see that RL can increase performance, especially on reasoning tasks like math and coding, and quite consistently. On AIME 2024, for example, a 32-billion-parameter model's score starts from around 65 and keeps increasing until it reaches 80. So it was really exciting to build a reasoning model like QwQ. On Chatbot Arena its performance is also very competitive, even against larger models, and it stayed in the top 15 for a long time. So we decided to combine all our research and development efforts to build a stronger next-generation model.

Very recently we released Qwen3, our latest generation of large language models, in multiple sizes of dense and MoE models. First, the flagship model: a 235-billion-parameter MoE model that activates only 22 billion parameters per token. It is efficient, yet very effective in comparison with top-tier models like o3-mini, lagging only a little behind Gemini 2.5 Pro. Our largest dense model is very competitive as well. We also have a very fast MoE model; it is relatively small, with 30 billion total parameters but only 3 billion activated.
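[Editor's note] As a quick illustration of "download our checkpoints and play with our models", here is a minimal sketch using Hugging Face transformers. The repo id matches the released Qwen3-30B-A3B MoE model, but any size can be substituted; the dtype/device settings are assumptions for a typical GPU setup.

```python
# Minimal sketch: load a Qwen3 checkpoint from Hugging Face and run one chat turn.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-30B-A3B"  # substitute any released size
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "Give me a short introduction to LLMs."}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=512)
# Print only the newly generated tokens, not the prompt.
print(tokenizer.decode(output_ids[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))
```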
If you compare its performance with QwQ-32B, this MoE model, which activates only 3 billion parameters, can even outcompete QwQ-32B on some tasks. There is an even smaller model, only 4 billion parameters. This time we applied a lot of distillation techniques, distilling knowledge from large models into small models, and we ended up with a very good small model: with thinking enabled it even shows competitiveness against the flagship of our last iteration, Qwen2.5-72B. So the 4-billion-parameter model is really worth a try, and you can even deploy it on mobile devices.

Qwen3 has some important features, and the most important one this time is the hybrid thinking mode. What is hybrid thinking mode? It means you can use thinking and non-thinking in a single model; we combine the two behaviors into one model. What is thinking mode? Before answering a question with a detailed answer, the model starts thinking: it reflects on itself, explores possibilities, and once it finds it is ready, it provides the answer. Models like o1 and DeepSeek R1 have this thinking behavior. Non-thinking mode is the traditional instruction-tuned model, just like a chatbot: no thinking, no lag, it simply provides the answer near-instantly. This might be the first time in the open-source community that the two modes are combined into a single model, and you can use prompts or hyperparameters to control the behavior as you like.

Once we dived into the hybrid thinking mode, we found we could create another feature: the dynamic thinking budget. What is the dynamic thinking budget? The thinking budget is the maximum number of thinking tokens. For example, if your thinking budget is 32,000 tokens and a task requires thinking, and the model finishes within, say, 8,000 tokens, that's below 32,000, so fine: it finishes thinking and provides the answer. But if your budget is only 4,000 tokens while the thinking actually requires 8,000, the model stops at 4,000 tokens, which means the thinking process is truncated. We measured performance with larger and larger thinking budgets, and performance increases quite steadily as the budget grows: on AIME 2024, a very small budget achieves just over 40, but a large budget like 32,000 tokens achieves more than 80. This is really amazing. I hope you enjoy the hybrid thinking mode: you own a single model that can both think and not think, and you can find the setting that works for you. For example, if your task only requires 95% accuracy and you find that an 8,000-token budget already achieves over 95%, that's good enough; you don't need to waste more tokens on thinking, so you can keep the budget at 8,000 tokens. That's just one example; we'd like to explore the usage further.
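[Editor's note] A minimal sketch of the hybrid thinking switch, reusing the `tokenizer` from the loading example above. It follows the usage published on the Qwen3 model cards; treat the exact `enable_thinking` keyword and the `/no_think` soft switch as assumptions to verify against your installed chat template.

```python
# Hybrid thinking switch, per the Qwen3 model-card usage.
messages = [{"role": "user", "content": "How many prime numbers are below 50?"}]

# Thinking mode: the template lets the model emit a <think>...</think> block
# before the final answer.
text_think = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)

# Non-thinking mode: traditional chatbot behavior, near-instant answer.
text_direct = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)

# Reportedly a "/no_think" soft switch inside the prompt toggles the same
# behavior turn by turn in multi-turn conversations.
messages_soft = [{"role": "user",
                  "content": "How many prime numbers are below 50? /no_think"}]
```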
The next important feature is that Qwen3 supports 119 languages and dialects. Qwen2.5 supported only 29 languages, but this time we cover 119 languages and dialects, and we publish the detailed list of supported languages and dialects, so you can check it. I think this is really good for global applications. Previously, many people using open-weight models found that those models didn't support their languages very well; now more people will be able to use large language models in their own domains and languages.

We have also specifically increased capabilities in agents and coding, and in particular we enhanced support for MCP (Model Context Protocol), which has become really popular recently. Here are two examples showing how our models use tools during thinking. You will find that the model can think while it makes function calls: it uses the tools, gets feedback from the environment, and keeps thinking. That is a behavior we really like: the model is capable of thinking, but it is also capable of interacting with the environment and continuing to think, which is really good for inference-time scaling. The other example has the model organize a desktop: given access to the file system, it thinks about which tools to use, uses them, gets the feedback, continues thinking once the tool calls finish, completes the task, and tells you the desktop has been organized. These are two simple examples showing that we keep providing better support for agent capabilities. We don't want our models to be just simple chatbots; we want them to be really productive agents in your work and life.

Those are the three key features of Qwen3. We open-weighted a lot of sizes, including two MoE models: a small one with 30 billion total parameters that activates only 3 billion, and the 235-billion-parameter one that activates 22 billion. We also have six dense models. The smaller ones you can use for testing or as draft models for speculative decoding; the 4-billion-parameter model you can deploy on mobile devices; and the 32-billion-parameter model is the strong one many of you may prefer, which you can deploy in your local environment as well. So we open-weighted dense models too, but we believe that this year, and maybe in the coming years, the future trend belongs to MoE models. Later we will release more MoE models for you to use, and there will be better support for them in the open-source community, such as third-party frameworks.

Besides building large language models, we are also building multimodal models, with a strong focus on vision-language models. Many of you may have used Qwen2-VL and are now using Qwen2.5-VL, released this January, which achieves very competitive performance on vision-language benchmarks: understanding benchmarks like MMMU, reasoning benchmarks like MathVista, and a lot of general VQA benchmarks.
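[Editor's note] To make the "think, call a tool, observe, keep thinking" loop concrete, here is a minimal framework-agnostic sketch. `call_model` and `run_tool` are hypothetical helpers standing in for your inference backend and tool runtime; the message roles mirror common function-calling APIs rather than any specific Qwen interface.

```python
# Generic agent loop: the model may think, emit a tool call, observe the
# result, and keep thinking until it produces a final answer.
import json

def agent_loop(messages, tools, max_steps=10):
    for _ in range(max_steps):
        reply = call_model(messages, tools=tools)   # may think, then request a tool
        if reply.get("tool_call") is None:
            return reply["content"]                 # final answer, thinking finished
        call = reply["tool_call"]
        observation = run_tool(call["name"], json.loads(call["arguments"]))
        # Feed both the assistant turn and the tool observation back in,
        # so the model can continue thinking with environment feedback.
        messages.append({"role": "assistant",
                         "content": reply.get("content", ""),
                         "tool_calls": [call]})
        messages.append({"role": "tool", "content": json.dumps(observation)})
    raise RuntimeError("agent did not converge within max_steps")
```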
We also explored thinking capabilities for vision-language models, so we built QVQ as well. We see inference-time scaling with a larger maximum thinking length, which is equivalent to the thinking budget I talked about before: with a larger thinking budget, it achieves better and better performance on reasoning tasks, especially mathematics. Even vision-language models show these behaviors.

For multimodal models, what we really want to build is an omni model: one that accepts multiple modalities as input and is capable of generating multiple modalities, like text, vision, and audio. Right now this is not a perfect state, but I think it's good. It's a relatively small model, and we are really proud of this attempt. It is based on a 7-billion-parameter large language model, and it accepts three input modalities: text, vision (including images and videos), and audio; this time it can generate text and audio. Maybe in the future our models will be capable of generating images, high-quality images, and videos; that would be a truly omni model. This omni model can be used in voice chat, video chat, and text chat. It achieves state-of-the-art performance on audio tasks among models of the same size, around 7 billion parameters. What surprised us a little is that it can even achieve better performance on vision-language understanding tasks than Qwen2.5-VL-7B, which means an omni model can reach very good performance on vision-language tasks. One thing we have not done well, but believe we can fix, is recovering the performance drop on language tasks, especially its intelligence and its agent tasks. I think we can recover it by improving our data quality and our training methods, but for now there is still room to improve the model's capabilities across domains and tasks. So that's the omni model.

Whatever models we build, we keep open-sourcing them. We love open sourcing because it really helps us: developers give us feedback that helps us improve our models, and the interaction with the open-source community makes us happy and encourages us to build more good models for all of you. We have a lot of popular models in the open-source community, including LLMs and also coders; Qwen2.5-Coder is something many people use for local development, and I can tell you we are now building Qwen3-Coder. We offer many model sizes because we believe each size has a lot of users, and there actually are: whether it's a very small model like the 0.6-billion-parameter one, or the large ones, previously the 72-billion dense model and now the 235-billion MoE model, a lot of people use them, and they need quantized models. So we provide models in different formats, including GGUF, GPTQ, AWQ, and MLX for Apple devices. We try to use Apache 2.0 for most models, so you can use them freely, including in your business: you don't need to worry too much or ask for permission, you just use them directly.
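[Editor's note] To illustrate what "accepting three modalities" looks like at the API level, here is a hedged sketch of the typed-content message format used by Qwen's multimodal model cards. The file paths are placeholders, and the exact processor classes for Qwen2.5-Omni should be checked on its Hugging Face model card.

```python
# Sketch of a multimodal chat message with typed content parts (text, video,
# audio). This mirrors the Qwen-VL-style message format; verify keys against
# the Qwen2.5-Omni model card before use.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "demo_clip.mp4"},      # placeholder path
            {"type": "audio", "audio": "question.wav"},       # placeholder path
            {"type": "text",
             "text": "Describe what you see and answer the spoken question."},
        ],
    }
]
```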
We hope that large language models, large multimodal models, all our foundation models, can help you create good applications; that is what we would like to do. As we have become more and more popular over these two years, Qwen models are now supported by most of the relevant third-party frameworks and API platforms.

We are also building products for you to interact with our models, and we are building agents as well. In QwenChat, as I mentioned before, we have some very good features. One I really like is called Web Dev. With Web Dev, you just input a very simple prompt, something like "create a Twitter website", and it generates the code and renders it as an artifact, so you can see how the website looks. You can also deploy it, get the URL, and share it with your friends to show how creative you are. You can create a website for your product too: with a very simple prompt like "create a sunscreen product introduction website", you get a very good website, and you can even click the buttons. It's not just websites; it can make cards as well. I really like the cards; we often use them on our Twitter. Our prompt is simple: we give it a link, and based on the link it creates a nice-looking card; we provide more information, and it builds the card for us. Web Dev makes me more creative and helps me a lot in showing our work to people all over the world.

We also have Deep Research. With Deep Research, you just ask it to write a report on whatever you are interested in, maybe the healthcare industry, maybe artificial intelligence. Give it a prompt and it will ask what you want to focus on; you can tell it, or just say "as you like". It will start doing research by making a plan first, then searching step by step, writing each part step by step, and it keeps searching until it finally gives you a comprehensive report, which you can download as a PDF. We are still improving its quality by doing reinforcement learning to build a fine-tuned model specifically for deep research, and we believe there is still much room in this field. It is really hard to do reinforcement learning for this at the start, but once you have built a good model for this product, it will make people really productive in their working life.

In the future there are still many things for us to do to achieve AGI, to build a really good foundation model and foundation agent for all of you. The first thing may differ from what other people think: we still believe there is much room in pre-training. I'm happy that you have shown your preference for our pre-trained models, but we still find there is a lot of good data we have not yet put in, and a lot of data we have not cleaned well enough. We find we can use multimodal data to make the models more capable across different domains, and we also have synthetic data.
And maybe we will eventually try really different training methods for pre-training, not just next-token prediction; maybe later we will use reinforcement learning in pre-training as well. So there is still much room in pre-training to build a very good basis for the chatbots and agents; that is the first thing. The second thing is the scaling laws: there are some changes here. Previously we scaled model size and pre-training data, but now we need to scale the compute in reinforcement learning, and we have a focus on long-horizon reasoning with environment feedback. If you train a model capable of interacting with the environment while it keeps thinking, that will be something really competitive: it gets feedback from the environment, keeps thinking, and becomes smarter and smarter with inference-time scaling.

You will find the model generating very long contexts, and you will feed it very long contexts as input, especially when you have memory, so we are going to scale the context. Maybe eventually we will move toward infinite context, but first we need to solve 1 million tokens really well, then march toward 10 million tokens, and then infinite context. We are going to scale the context to at least 1 million tokens this year for most of our models.

We are also going to scale modalities. Maybe scaling modalities doesn't by itself increase intelligence, but it makes the models more capable and more productive, especially with vision-language understanding. With vision-language understanding you can build something like a GUI agent; without vision capability it is almost impossible to build a GUI agent and do things like computer use. There is still much room for scaling modalities on both the input and output sides. We are also going to unify understanding and generation, for example image understanding and image generation at the same time, just like GPT-4o, which generates very interesting, high-quality images. That is something we are going to do as well.

So, based on the four things I mentioned, if you'd like me to summarize what we are going to do this year and next year: I think we are moving from the era of training models to the era of training agents. We are actually training agents, not only scaling pre-training but also scaling RL, especially with the environment. So I think we can say we are now entering the era of agents. That's all; thank you very much for listening to my talk. If you are interested in Qwen, shoot me an email and come talk to me. Thanks a lot.
Latest Summary (Detailed)
Executive Summary
Junyang Lin of Alibaba's Qwen team presented the latest progress and roadmap of the Qwen model family at the AI Engineer World's Fair 2025. Qwen's ambition is to build a generalist agent model. Reinforcement learning (RL) markedly improved reasoning performance in math and coding; for example, the 32B-parameter QwQ reasoning model's AIME 2024 score climbed from about 65 to 80 during RL training. The newly released Qwen3 introduces an innovative "hybrid thinking mode" that combines thinking and non-thinking behavior in a single model, controllable via prompts or hyperparameters, and a "dynamic thinking budget" under which performance rises as the thinking length grows. Qwen3 supports 119 languages and dialects, greatly extending its global reach, and strengthens agent and coding capabilities, notably with enhanced support for MCP (Model Context Protocol).
In multimodal work, Qwen2.5-VL performs strongly on vision-language understanding benchmarks, and thinking capabilities were explored with QVQ. Going further, Qwen2.5-Omni (7B parameters) accepts text, vision (images and video), and audio as input and outputs text and audio; it reaches state-of-the-art audio performance for its size and even beats the dedicated vision-language model on some vision-language understanding tasks, though language-task performance still needs to be recovered.
The team remains committed to open source: dense and MoE models in many sizes are released under open weights, with multiple quantized variants. On the product side, QwenChat offers innovative features such as Web Dev (generating websites from a simple prompt) and Deep Research (producing in-depth reports).
Looking ahead, Qwen will focus on: 1) improving pre-training (data quality, multimodal and synthetic data, new training methods such as RL during pre-training); 2) evolving scaling laws, shifting toward scaling RL compute and long-horizon reasoning with environment feedback; 3) scaling context length (at least 1M tokens for most models this year, then 10M and eventually "infinite context"); 4) scaling modalities and unifying understanding with generation (e.g., high-quality image/video generation). The core vision: moving from "training models" to "training agents", emphasizing interaction with the environment and continual learning.
Qwen Model Family: Overview and Vision
- Speaker: Junyang Lin (Qwen team, Alibaba Group)
- Core goal: build a generalist model and a generalist agent model.
- Key links:
- Product and chat interface: chat.qwen.ai (QwenChat), offering the latest models, multimodal interaction (image/video upload), omni models (voice and video chat), plus Web Dev and Deep Research.
- Technical blog: qwen.github.io, where technical details accompany each release.
- Open-source code: GitHub
- Model weights: Hugging Face
- Open-source philosophy: the team keeps open-sourcing because developer feedback helps improve the models and encourages the team to build better ones.
Latest Progress in Qwen LLMs: Qwen3
Just before the Spring Festival, the Qwen team released the instruction-tuned model Qwen2.5-Max, whose performance across multiple benchmarks rivaled the top models of the time (Claude 3.5 Sonnet, GPT-4o, DeepSeek V3). The team believes an LLM's potential goes beyond instruction tuning: reinforcement learning (RL) can make it smarter.
- Applying reinforcement learning:
- RL significantly improves performance on reasoning tasks such as math and coding.
- On the AIME 2024 benchmark, a 32B-parameter model's score rose from about 65 to 80 through RL (the QwQ reasoning model).
- On Chatbot Arena it was competitive with much larger models, staying in the top 15 for a long time.
Combining these research and development efforts, the team recently released the next-generation LLM family: Qwen3.
Model Sizes and Performance Highlights
Qwen3 ships dense models and Mixture-of-Experts (MoE) models in multiple sizes:
- Flagship MoE model (235B parameters):
- 235B total parameters, with only 22B activated per inference step.
- Efficient yet effective: comparable to o3-mini and only slightly behind Gemini 2.5 Pro.
- Largest dense model: similarly competitive.
- Fast MoE model (30B parameters):
- 30B total parameters, only 3B activated.
- On some tasks it even outperforms QwQ-32B, the 32B dense reasoning model.
- Small dense model (4B parameters):
- Built via distillation, transferring knowledge from large models to small ones (a generic sketch of the idea follows this list).
- With thinking enabled, it even rivals the previous generation's flagship, Qwen2.5-72B.
- Suitable for mobile deployment.
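To make the distillation point concrete, here is a minimal, generic sketch of a knowledge-distillation objective in PyTorch. This is the textbook formulation (soft targets matched with KL divergence), offered only as an illustration of the idea; it is not Qwen's actual training recipe, whose details the talk does not cover.

```python
# Generic knowledge distillation: match the student's next-token distribution
# to the teacher's softened distribution.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions with a temperature, then measure KL divergence.
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2
```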
Key Feature: Hybrid Thinking Mode
- Definition: Qwen3 integrates a "thinking mode" and a "non-thinking mode" in a single model, possibly a first in the open-source community.
- Thinking mode: before giving a detailed answer, the model reflects and explores possibilities, similar to o1 and DeepSeek R1.
- Non-thinking mode: traditional instruction-tuned chatbot behavior; it answers directly and near-instantly, with no thinking delay.
- Control: behavior is switched via prompts or hyperparameters (see the mode-switch sketch in the transcript above).
Key Feature: Dynamic Thinking Budget
- Definition: the "thinking budget" is the maximum number of tokens the model may spend thinking.
- Mechanism (a sketch of enforcing the cap follows this list):
- If thinking completes within the budget (e.g., 8,000 tokens used against a 32,000-token budget), the model finishes thinking normally and answers.
- If the required tokens exceed the budget (e.g., 8,000 tokens needed but only a 4,000-token budget), thinking is truncated at the cap.
- Performance impact:
- Performance rises markedly as the thinking budget grows.
- On AIME 2024, a small budget (e.g., 4,000 tokens) yields a score just over 40, while a 32,000-token budget pushes it past 80.
- Practical value: tune the budget to trade accuracy against token cost; e.g., if an 8,000-token budget already reaches a required 95% accuracy, there is no need to spend more tokens on thinking.
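Below is a hedged sketch of how such a cap could be enforced at inference time with a two-phase generate: think up to the budget, force-close the think block if truncated, then produce the answer. The `<think>`/`</think>` markers follow the Qwen3 template; the approach is an illustration, not the official implementation.

```python
# Sketch: cap thinking at `budget` tokens, then let the model answer.
import torch

def generate_with_budget(model, tokenizer, prompt_ids, budget=4000, answer_tokens=1024):
    # Assumes "</think>" encodes to a single token, as in Qwen3; verify with
    # tokenizer.encode before relying on this.
    close_id = tokenizer.encode("</think>", add_special_tokens=False)[0]

    # Phase 1: think for at most `budget` tokens, stopping early at </think>.
    thought = model.generate(prompt_ids, max_new_tokens=budget, eos_token_id=close_id)
    new = thought[0, prompt_ids.shape[-1]:]
    if new[-1].item() != close_id:                       # budget hit: thought truncated
        new = torch.cat([new, torch.tensor([close_id], device=new.device)])

    # Phase 2: answer conditioned on the (possibly truncated) thought.
    full = torch.cat([prompt_ids[0], new]).unsqueeze(0)
    return model.generate(full, max_new_tokens=answer_tokens)
```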
Key Feature: Enhanced Multilingual Support
- Coverage: Qwen3 supports 119 languages and dialects, up from 29 in Qwen2.5.
- Significance: greatly expands global applicability, letting more users work with LLMs in their own domains and languages.
Key Feature: Agent and Coding Improvements
- Agent capabilities: stronger tool use and function calling.
- The model can invoke tools during thinking, take environment feedback, and continue thinking, which benefits inference-time scaling.
- Example: given file-system access, the model organizes a desktop on command, demonstrating the loop of thinking, picking a tool, executing, observing, and continuing to think until the task is done.
- Protocols: dedicated support for MCP (Model Context Protocol), which has recently become very popular; see the Qwen-Agent sketch after this list.
- Goal: beyond chatbots, toward genuinely productive agents for work and life.
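For reference, here is a sketch of wiring MCP servers into an agent using the open-source Qwen-Agent framework, following the pattern shown in the Qwen3 release materials. The config keys (`mcpServers`, `model_server`), the model name, and the MCP server commands are assumptions to verify against the Qwen-Agent version you install.

```python
# Sketch: an agent with MCP servers plus a built-in tool via Qwen-Agent.
from qwen_agent.agents import Assistant

llm_cfg = {
    "model": "qwen3-235b-a22b",                 # assumed model name
    "model_server": "http://localhost:8000/v1", # assumed local OpenAI-style endpoint
}
tools = [
    {"mcpServers": {                            # MCP servers the agent may call
        "time": {"command": "uvx", "args": ["mcp-server-time"]},
        "fetch": {"command": "uvx", "args": ["mcp-server-fetch"]},
    }},
    "code_interpreter",                         # built-in tool
]
bot = Assistant(llm=llm_cfg, function_list=tools)

messages = [{"role": "user", "content": "What time is it in Hangzhou?"}]
for chunk in bot.run(messages=messages):
    pass  # chunk holds streaming responses, including any tool calls
```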
Progress on Qwen Multimodal Models
Beyond LLMs, the Qwen team invests heavily in multimodal models.
Vision-Language Models (Qwen2.5-VL)
- Released: January 2025.
- Performance: very competitive on vision-language understanding benchmarks (e.g., MMMU, MathVista, and various VQA benchmarks).
- Thinking for VLMs: the team built QVQ and observed that reasoning performance (especially math) improves as the maximum thinking length (equivalent to the thinking budget) grows, mirroring the language models.
Omni Model (Qwen2.5-Omni)
- Positioning: an "omni" model that accepts multimodal input and generates multimodal output (text, vision, audio).
- Current version (7B parameters, built on an LLM):
- Input modalities: text, vision (images, video), audio.
- Output modalities: text, audio.
- Outlook: future versions may generate high-quality images and video.
- Use cases: voice chat, video chat, and text chat.
- Performance:
- State-of-the-art on audio tasks among models of comparable size (around 7B parameters).
- A pleasant surprise: it even outperforms the dedicated Qwen2.5-VL 7B on some vision-language understanding tasks.
- Known gap: language-task performance (especially intelligence and agent tasks) regressed; the team expects to recover it through better data quality and training methods.
Open-Source Philosophy and Ecosystem
- Open-weight releases cover many sizes, including:
- Two MoE models: a small one (30B total, 3B activated) and a large one (235B total, 22B activated).
- Six dense models, from small ones (the 4B model suits mobile deployment) up to 32B (strong performance, well suited to local deployment).
- Future trend: the team sees MoE as the direction for the coming years and will release more MoE models, expecting better support from third-party frameworks.
- Model lines: many popular Qwen models in the open-source community, including the LLMs and the Coder series (Qwen2.5-Coder is widely used for local development).
- Preview: Qwen3-Coder is in development.
- Multi-size strategy: every size has its audience, from the tiny 0.6B model to the 235B MoE.
- Quantization and formats: quantized models ship in multiple formats, including GGUF, GPTQ, AWQ, and MLX for Apple devices; a local GGUF example follows this list.
- License: most models use Apache 2.0, allowing free use and commercialization without asking permission.
- Ecosystem compatibility: Qwen models are supported by most relevant third-party frameworks and API platforms.
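As an illustration of using the quantized releases, here is a minimal sketch with llama-cpp-python; the GGUF file name is a placeholder for whichever quant you download from Hugging Face.

```python
# Sketch: run a local GGUF quant with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(model_path="qwen3-4b-q4_k_m.gguf", n_ctx=8192)  # placeholder file
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello, who are you?"}]
)
print(out["choices"][0]["message"]["content"])
```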
Qwen Product Examples
The team also builds products for interacting with the models, including agent applications.
Web Dev
- Feature: in QwenChat, a simple prompt (e.g., "create a Twitter website" or "create a sunscreen product introduction website") generates website code and renders a preview as an artifact.
- Highlights: the generated site can be deployed and shared via URL.
- Use cases: quickly building product pages and social cards; the speaker says it makes him more creative and helps show the team's work to people worldwide.
Deep Research
- Feature: given a research topic (e.g., the healthcare industry or artificial intelligence), the model drafts a plan, searches step by step, writes each section, and delivers a comprehensive report, downloadable as a PDF.
- Ongoing work: the team is using reinforcement learning to fine-tune a model dedicated to Deep Research and raise report quality.
- Challenge and opportunity: such RL is hard to bootstrap, but once a good model exists, it makes users far more productive in their working life.
Future Directions and Outlook
Junyang Lin noted that much work remains to reach AGI and to build excellent foundation models and foundation agents.
- Improving pre-training:
- The team believes pre-training still has large headroom, contrary to what some others think.
- Data: plenty of high-quality data has not yet been used, and much existing data is under-cleaned.
- Multimodal data: can strengthen the model across domains and tasks.
- Synthetic data: will also be applied.
- Training methods: may move beyond next-token prediction, for example by introducing reinforcement learning into pre-training.
- Evolving scaling laws:
- Scaling used to mean model size and pre-training data.
- Now the focus shifts to scaling compute in reinforcement learning.
- Emphasis on long-horizon reasoning with environment feedback: by interacting with the environment and continuing to think, the model scales at inference time and becomes smarter.
- Scaling context:
- Models will generate very long outputs and consume very long inputs, especially once they have memory.
- Target: at least 1M tokens of context for most models this year, then 10M tokens, and eventually "infinite context".
- Scaling modalities:
- Scaling modalities may not directly increase intelligence, but it makes models more capable and productive.
- Vision-language understanding is key: with it you can build a GUI agent and do computer use; without vision it is almost impossible.
- Large headroom remains on both the input and output sides.
- Unifying understanding and generation: e.g., image understanding and generation in one model, like GPT-4o's high-quality image generation; this is a Qwen goal as well.
Conclusion
Junyang Lin concluded that the Qwen team's core direction for the next year or two is moving "from training models to training agents": scaling not only pre-training but also reinforcement learning, especially RL that interacts with the environment; in effect, training agents that interact with their environment and keep learning. As he put it, "we are now in the era of agents."