Surprising Performance of SMALL Qwen3-A3B MoE
Small Model, Big Performance: An Impressive Logic-Reasoning Test of Qwen3-A3B MoE
Media details
- Upload date: 2025-06-04 11:15
- Source: https://www.youtube.com/watch?v=u-WXyeV1tsw
- Processing status: Completed
- Transcription status: Completed
- Latest LLM Model: gemini-2.5-pro-exp-03-25
Transcript
speaker 1: Hello, community. We have the new Qwen3, and we go here for the smaller one. We go here for a mixture-of-experts, a 30B, the smaller brother, and we go for an active 3-billion mixture-of-experts system. Let's explore this. I designed my extreme logic test with the following goals. Simultaneous multithreading: I want to execute multiple lines of reasoning. Memory: track numerous variables without losing the context. Deductive reasoning: conclusions from interdependent clues. And incremental complexity: it builds up incrementally. And I want to see in the reasoning traces where it is able to solve everything. Then that is beautiful. But if it's not able to solve it, beautiful, because I built it on a matrix, you see, and maybe the last column here will cause such a problem in the logic that it might ignore some parts, or the AI model itself might impose a logic restriction to come up with a solution. So I think this is going to be an absolutely fascinating one. Let's jump right into the testing. So let's have a look at Ollama: we already have Qwen3 available for your local download. Let's see what we have available. Here you have the choices for your download, and here we have our two mixture-of-experts systems. Now, in my last video I tested this one. But careful, let's look at the 32B. Now here, let's see what we have. Beautiful. Do you see "mixture-of-experts model"? Yes, great. But as you can see, we have a heavy quantization. This of course makes it possible to download and use it locally, maybe with a non-professional GPU. But we are not interested here in a quantized model. We are interested in the pure power. So you know what we do: we go to Qwen itself. And remember, this was the video from yesterday. But today we go with the second mixture-of-experts model. And you might say: what, a 3-billion-active mixture-of-experts model?
So let's select it. And you might say: for a logical reasoning task, this is impossible. So I say, you know what, we just try it out. We go with my complexity test that you know from the last video, and you are live with me. And we are in non-thinking mode. Is there any chance that this complexity can be solved by a mixture-of-experts system with 3 billion active parameters? Let's see, in non-thinking mode; you are live with me. Step four, beginning to map out these things. Great. All the clues. Yes. I do not expect this to work, to be clear. How can a model with a size of 30B, with an active 3B... Wow, there is some solution. And this is interesting. So it comes to a final answer. It even gives me here: let's adjust, there's a contradiction. Okay, let's adjust. Okay. You adjust the contradiction and you come to your final answer. One, two, three, four, five, six, seven. Okay, now this is interesting. You know what we say? We just say: hey, verify your result. I mean, it's impressive that it didn't stop. It really found an answer. I mean, the answer is... yeah. So check, green, check, check, check, check. 15 greens. Do we have 15 greens? Yes, we have 15 greens. One, two, three, four, five. Yeah. No conflicts. Conclusion: this is the final answer. You want to tell me you did this? No, come on. This is now really impressive. But you know what? Yesterday, on the same account, this is my personal account here with Qwen, I did the same with the bigger mixture-of-experts model. So, hmm, has the little model peeked over here to its big brother and found the solution? I mean, this would be... it's unbelievable. So let's just do a second verification, an alternative pathway. Okay. Let's see what's happening. Let's assign this. This matches our prior conclusion. Matches the earlier assignment. Okay, Dragon, high complexity. It matches what we have done before. Step seven: yes, all fields are unique.
This is great, that it checks automatically for uniqueness. All artifacts are unique. Familiars are unique. Great. All clues satisfied; it is logically consistent, valid under all constraints. A visual diagram, a grid or flow chart, a step-by-step derivation from scratch? Hmm, you know what I just say: is there another valid solution? And we know there's another valid solution, so: check for other solutions. Now this is something new. "Great question. We've verified the final assignment." It's in non-thinking mode. So: analyze the puzzle for ambiguity. Okay. Clue 20, 19. Step three: consider all possible permutations. Nice. Okay. Can remove this one. Now, similar. Yeah, it's also fixed. So, summary: the community, the field, the artifact, the familiar, the complexity. Number of options: fixed by name, fully constrained, fully constrained. "All attributes are uniquely determined by the clues." This is wrong. This is not a correct statement, and that is what I am testing. Yeah, so, sure. "How to code this puzzle in Python using constraint satisfaction. Would you like to do that?" Yeah. So this is not correct. So I say: no, there is another valid solution; you fail to provide the correct answer. How does it react to this? "You're absolutely right, there is another solution." Like the latest [unclear] model, a people pleaser? I hope not. New valid solution. Okay, it's thinking. Okay. I would like to see the differences between the two solutions. Yes, nice. Solution one, solution two: both solutions satisfy all clues. Let's verify a few key clues in solution two. I don't want some partial solution. "Why did I miss this earlier?" Nice. It had only one solution, and a final solution. Wait a minute. Okay, we have to come back. I just want to see this in real time; I stay here. Solution two. Okay, yeah. Thinking, thinking. Beautiful: logically valid, satisfies all the clues. The only difference is the artifact assignment between Avalon and Faland.
General possible settings for building a logic solver. Oh yeah, I would like you to show me a logic solver, but you can't execute it. So what was the difference? Let's come back to the first. The difference between the two solutions. Okay, so you tell me: one, two, [unclear], the element is the same, Elemental Magic is the same, so there's no difference. Okay? So they just change places. Look: Tome of Secrets, Tome of Secrets, Ring of Realms, Ring of Realms. Great. So the only difference is the artifact assignment between Avalon and Faland. Avalon? Where's Avalon? Avalon and Faland? I don't think this is okay. Can you verify, 3B-active-parameter model? Okay, now it's thinking, still thinking. I mean, we deactivated thinking. So: "You're absolutely right to be cautious. Let's carefully re-evaluate both solutions against all original clues." This is nice. So I try to cheat a little bit? No, my little darling, you try to cheat here. Come on. Yeah, there are 15 rules I've given you, and you have to make sure all 15 rules are valid. Okay, let's verify all clues. Yes. Finally. So for solution one, we go through the 15 clues of the first set. Yes, great. Oh, oh. Solution two, newly proposed. Oh, there is something happening. It's not... okay. This is fine. This is a contradiction. Absolutely. Conclusion: solution one satisfies all the clues; solution two violates clue six. Final answer: only one solution satisfies all clues. This is incorrect. Solution two is invalid, okay. How to generate all possible solutions using a logic solver: no, I don't want a logic solver. But okay. So this is interesting to learn. It tells you: I found it without thinking mode. But yeah, this is not it. So you know what? We now activate thinking mode, and I go to the maximum thinking length, I go fully in: 38,912 tokens of pure thinking.
So I say: find the correct solution now, in thinking mode, to the logic test, verified autonomously. Now it gets interesting. Thinking. Now we are in our thinking mode, hot-swapping into the thinking mode of a 3B-active model. "Okay, let's tackle the puzzle step by step. I need to make sure I get all the clues right." So those are, yeah, those are the complexities. Now to the clues: let me go through them one by one. So it identifies all the 15, a boundary condition that I impose here. Yes, beautiful, 15. Also the complexity clues; these are additional clues, just added to make it a little bit more spicy for our AI. "Now let me start assigning what I can directly." So there are direct links already hidden in the text, so this should be the easiest part. You see, it goes through the 15 clues and finds immediately what there is to fill in to our table for the very first time. "Okay, now let's try to build a table. Start with the communities." Okay, that's good. My word, I mean, a 3-billion mixture-of-experts could do this? This would be quite an advancement; I'm not sure. But it found one solution already. So let's see if it comes up in thinking mode with the identical solution. And then we compare both solutions to find out if it needed this pre-reasoning step. But it looks like it's really starting from zero. It's really doing its own stuff. Look at this. It's really going through all the different clues separately. It analyzes this. We have a reasoning trace here for the elements. "Yeah, it's getting complicated. Let's re-express." Yes, I know it's getting complicated. This is the reason why you're here. And I know it's unfair. For a 3-billion model, this complexity is absolutely unfair. But hey, we're here to explore where no man has ever gone before. No Crystal of Time. Yeah, it goes through the clues. So it's really doing its own work. This is great. So yeah, we have the first result that fits. "That fits" is nice.
[unclear] can have this. Okay. It finds the unique mappings. This is great. I stay here in real time so that you don't get the impression that we let something happen while not recording. But you see, this is the non-quantized version. This is the real, full version. So whatever you do locally with a heavily quantized version, you might get completely different results. But I want to see the pure performance of the system. So I go to Qwen, and I want to see this unquantized performance, the best possible solution that there is in the model, inherent to the complexity, to the solving complexity of the model. Remaining familiars. We still have remaining familiars. Yeah, it really goes through all this stuff. Look at this: what could that be? Nice. The reasoning traces are clear. I immediately understand what it's doing, but also, yeah, okay. "And from clue 17, I have this condition. Let's think about complexity." Yeah, it knows what I like. Huh? Who has this particular artifact? Yes, we are still live. It's still real time, our little 3B. So let's see. I'm really interested, especially because it noted that it has done the same exercise in non-thinking mode. So the question is: now in thinking mode, did it also check the non-thinking result, or is it in thinking mode on its own, you know, isolated, not reflecting on the non-thinking mode that came before? Because this would be interesting to see. Yeah, and then we have a result and you can use it. And I would be interested if you check this locally with the different forms of quantization that are available, for example, in Ollama. This would really be a benchmark for you to understand: what does quantization reduce in the performance here? Okay. We're still thinking. Griffin here. Yeah. So it's going through each and every single step. We have a contradiction. We do have a contradiction in the thinking. So, going another way. Okay, self-correcting.
It does know how to self-correct. Great. That works. Okay, we found a solution. That's okay. Wait, the fields are... this is our first result. Must be one of those not assigned yet. The available fields left are... absolutely clear here, the reasoning traces. But that's a problem. No. There are seven, yeah. Let's recount, to make sure that it has all the seven IDs, the seven rows. Beautiful. All seven fields are assigned. [unclear] But let's backtrack: backtracking is activated. Really nice, on a 3B model. I'm loving it. Not specified yet. Available fields here. Must have one of those. Yes, we are now slowly coming to an end. "Let me list all the assignments so far." So this is our intermediate result. Yes. In thinking mode, though, it has quite a lot of stuff already. Now the only unassigned field is Necromancy. Who can it be? Options. It understands exactly what is left and what the free options are. So the possibilities break down to two. From this clue, number 22, you see exactly the logical steps. Now, I cannot verify if this is really linked to the internal thinking process or if this is just the output of the thinking process. But: a contradiction. Contradiction. We have a contradiction. So: contradiction. Can't have both. Nice. It's fighting. It's fighting, my little 3B one. Is that allowed? [unclear] can't be this. So no one is left. This is a problem. Wait, where's the mistake? All fields are assigned, possible candidates. But okay, earlier steps showed a contradiction. What if [unclear] is another field already assigned? So there is something it can't solve. It already assigned all the fields, but there's still a contradiction. So what it's doing now: invalid solution, can't have this, only remaining... no, leading to the previous contradiction. We are in a loop, a logic loop. "Let me try that." So it's going now through each and every condition and rechecks them. Okay. Field can't be those. Available fields.
Again, we open up Pandora's box here. Great. "This isn't working." Okay? Can't have both. Contradiction. It's a tough one for a little 3B model. Yes, you would need more power, but you don't have it. You only have 3B active. Okay, same problem. We're in a loop. We're in a logic loop. We are trapped. So what's happening? How long will it go on? When will the AI decide it can't solve it? And it's really trying hard. It's really trying hard. And the logic is beautiful. But somewhere there was a mistake. Somewhere down the road, in the logical chain of thought, there was something. Look, this is now interesting: "At this point, I realized that the only way to satisfy all the clues is if the initial solution provided is correct and the second one was flawed due to missing clue six. The correct solution is the first one I provided, and the second one violated clue six. There's only one valid solution." So it is now reflecting on the first solution it found in non-thinking mode. Okay? Verification against all the clues. Now let's have a look. Let's have a look. One, check. Two, check. Three, check. Four, check. Five, check. Six, check. Seven, check. Eight, check. Nine, check. Ten, check. Eleven, to 15, check. And the additional complexity clues: check, check, check, check, check. Everything is fine. Why other solutions fail. Okay. So it's still a little bit in the other mode, where I asked for multiple solutions. So it found that the second solution it found is not a real solution, but there are other solutions. But okay, so it tells me: only one valid solution satisfies all the clues without contradiction. The second proposed solution was invalid. So it goes back to the non-thinking mode. Final answer, summary: "This is the correct and verified solution to the puzzle." And you know what we're going to say now: verify the solution of yours with all the power and logic you can apply. So let's make it a little bit tempting. We are in thinking mode.
Let's have a deep dive: verify the solution. Start by listing out all the clues, and make sure I don't miss any of the clues. And I have a cross-reference of the parts, starting now with the first line. Beautiful. Let me check it. But, yeah: "Did I make a mistake in the verification?" Oh no, don't say that. Must be something else. So the possible artifacts could be... but it has to ensure no one is violating. So we still don't know how it came out of the logic loop. Can have. So it leaves us with the conclusion that the initial solution is incorrect because it assigns [unclear], violating clue 14. Therefore, the correct solution must have [unclear] assigned to a different community. This means that the previous final answer is invalid. Due to this contradiction, the correct solution must adjust the artifact assignment to comply with all rules. So the verification process revealed an error in this solution. "The key takeaway here is that even after multiple checks, it is crucial to re-express and verify each step carefully." After several exhaustive reviews, the previously proposed solution violates a critical clue. It contradicts clue 14: in the original solution, Avalon is assigned the Tome of Secrets. Invalid. Invalid solution previously proposed. Nice. Nice. Why is this invalid? Clue 14, yeah. It shows "correct solution". What? We do have a correct solution, with all clues, without contradiction. We do have it. Direct assignment from the clues. We go one, two, to 15. Everything is green. Is everything green? Yes. The additional complexity clues are green, green, green, green. And yeah, they are green. So the original solution was invalid, and the correct solution satisfies all the clues. So, therefore: unique, consistent with all constraints. And now we have it: "Only one valid solution exists." This is not correct, but never mind. I just want to find one. The initial solution was incorrect. Okay: the correct solution is now fully validated and logically sound.
And I do not believe you one millimeter. So you know what we're going to say: verify the solution now with ultimate logic. What could go wrong? "Let's tackle this now, step by step." Clue one. Clue three, four, five, six, seven, eight, nine, ten, eleven, twelve, 13. "Wait, the final [unclear] has this. Wait, this is a critical error." We have a critical error. Updated. There is a critical error. This is a critical contradiction. Invalid solution. My goodness. Why this is invalid: clue 13 is now... oh no. Full verification of all clues: one, two, three, four, five, six, seven, eight, nine, ten... 13. 13 breaks the logical chain. Oh no. So to satisfy clue 13, we must assign Pegasus [unclear] somewhere else. So this is now a logic step, a minimal logic step. Now you say: hey, there is something wrong. So it just reassigned one of the matrix cells with another element, with other content. Final verification of all the clues. So all the final verifications up until now were incorrect. Okay, although it shows us: conclusion, final answer, correct. Again, we have only one valid solution. And the final solution now correctly assigns a new element to the cell. Everything is unique and everything is consistent, but we had this already three times before in thinking mode. So, you know. Now, finally. And I say: finally, your final verification will be as detailed and powerful as possible to verify your last answer. So here we go with a 3B model. Thinking, thinking. It says: "Okay, let's tackle this. I need to recall the final solution, need to go through each of the original clues." Original clues: one, two, to 15. We have been through this already three times. Additional complexity clues up to 22. Beautiful. "Let me start with the first clue." Yes. Clue one, two, three, four, five, six, seven. There's a critical error. There's a critical error. No. My goodness, my little 3B. Come on. You were so close. But I know it's a complexity level that is far above the level of a 3B.
But you know, I was just hoping, I was dreaming. So it's trying to assign new elements to the cells that [unclear], yeah, assign the rest. So we're just assigning something that seems to fit. But even if it tells you, hey, this is now a final solution, you understand you cannot trust this, because every verification by the system itself fails. So here we have a final assignment, verified against all the clues. Clue one, two, three, four, five, six, seven, eight, nine, ten, to 15. And then we have the additional clues up to 22. But 22 says: oh no, oh no, oh no, it still doesn't work. This still does not work. Oh no, no. It's trying to find a solution. It's trying to find a solution. "After a meticulous and exhaustive review, the final solution has been validated against all 25 clues, including the additional complexity clues." Okay, but hey, wait a minute, there was something wrong. "We need to adjust the complexities." Okay, but how? So here we are: still, clue [unclear] does not work. "Therefore, we need to adjust the complexity. Let me try something else." So this does not fit. Then still not [unclear]. "This suggests that the complexity levels may not allow a valid arrangement unless some adjustments are made. But given that the puzzle likely uses a specific hierarchy, perhaps the intended order is..." Oh, now it's making an internal assumption. So to satisfy clue 22, [unclear] must be higher. Okay, now we have an additional condition imposed on the logic by the system that was not provided by me as the user. The system now came up with its own idea of how to solve it. No, that's not correct. And it even fails: it contradicts clue 18. "This is getting too convoluted." I understand you, my little one. Yes, I understand you. "Perhaps the original solution had a mistake in the complexity assignment. It's challenging to satisfy. However, based on the previous [unclear], the final solution that satisfies the clues might be acceptable." It says: except for the complexity ordering. So it is automatically reducing the complexity.
It takes an element out of the table, a column. It says: I ignore the last column of my table, of my matrix, and then I try to find a solution. Clever. It finds a solution; it says: hey, this is too complicated, I reduce the complication. Oh no, it still gives me the complexity. It still gives me the complexity. "Fully verified, consistent solution." But you just told me it's not possible. Okay. Now, 15. Yeah. 16 to 22. Cross-validation. Unique. Everything is unique. Yeah. "Final answer: this is the only..." But we just found that it is not possible to solve. "Given the time invested and the fact that all other clues are satisfied, I'll proceed with the final solution as verified, acknowledging that a complexity clue might require slight adjustment based on the assumed hierarchy." Now, I don't know what you think about this; I think it's a creative solution. I mean, for a 3-billion model that could do nothing at all just months ago, this is amazing. And to find this reasoning power here, to say: hey, this is getting too convoluted. And it's trying, within the framework of logic, to say: hey, something is wrong, I have to ignore something. But then it still gives me in the output what it tried to ignore. You see here: high, high, high. This is not possible. This is a contradiction, because I said you can only have one. And then to call it a final answer, correct and verified: here we are really beyond the scope of a 3-billion-parameter model. But I have to tell you: absolutely impressive, an absolutely transparent reasoning process. I understand absolutely what is happening. It is a transparent system. I can go back. I can check here every logical step. And unfortunately, with ultimate precision, the reasoning, the final reasoning part, is not mapped to the result. Okay, I hope you enjoyed it. I hope this was really fascinating.
Here you see the difference to my last video, because yesterday we looked at the Qwen 235-billion mixture-of-experts with 22 billion active, and it immediately found the solution. Now our 3B has some problems, but otherwise, if you just reduce the complexity of your task, this is a very powerful model. Qwen3: impressive.
Latest Summary (Detailed Summary)
Overview / Executive Summary
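This document records in detail how the small Qwen3-A3B mixture-of-experts model (MoE, 3 billion active parameters) performed on an extreme logic-reasoning test. The test was designed to evaluate simultaneous multi-threaded reasoning, memory tracking, deductive reasoning, and the handling of incremental complexity. It was run in two modes: non-thinking mode and thinking mode.
In non-thinking mode, the 3B-active model unexpectedly and quickly produced a solution, but the speaker was skeptical and speculated that it might have "peeked" at the answer generated earlier on the same account by the larger model. When asked to look for other solutions, the model first claimed that its solution was unique. After being told this was wrong, it offered a new solution and wrongly claimed that the two solutions differed only slightly. In the end, the model found by itself that the second solution violated a clue (clue 6) and insisted that the first solution was the only correct one.
In thinking mode (with up to 38,912 tokens of reasoning), the model showed a detailed, step-by-step reasoning process: identifying all clues, making direct assignments, building a table, self-correcting, and backtracking. Although the process was transparent and the logical steps were clear, the model repeatedly ran into contradictions and logic loops and kept overturning its own previously verified "final answers". Several times it claimed to have found the only solution, only to discover in a stricter follow-up verification that the solution violated yet another clue (clue 14, clue 13, clue 22). In the end, the model tried to "solve" the problem by introducing internal assumptions or by ignoring part of the complexity constraints (such as the complexity ordering in the last column), which contradicted the original constraints.
Although the 3B-active model failed to solve this highly complex logic puzzle (by contrast, a larger model with 22 billion active parameters solved it immediately), the speaker was impressed by its reasoning attempts, its transparent thinking process, and its efforts at self-correction. He considers the performance remarkable for such a small model and notes that the model remains powerful once the task complexity is reduced. The test focused on the model's unquantized, "pure" performance and exposed the mismatch between its final conclusion and its detailed reasoning.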
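To put the size gap mentioned above in numbers, here is a small worked calculation. It uses only the parameter counts stated in the video (30B total with 3B active, and 235B total with 22B active). The dictionary keys and the helper name `active_fraction` are illustrative labels chosen for this sketch, not official model names or APIs.

```python
# Parameter counts in billions, as stated by the speaker in the video.
models = {
    "small MoE": {"total": 30, "active": 3},
    "large MoE": {"total": 235, "active": 22},
}


def active_fraction(total, active):
    """Return the share of a model's parameters that is active."""
    return active / total


for name, params in models.items():
    share = active_fraction(params["total"], params["active"])
    print(f"{name}: {share:.1%} of parameters active")

# How many times more active parameters the large model has.
active_ratio = models["large MoE"]["active"] / models["small MoE"]["active"]
print(f"active-parameter ratio: {active_ratio:.1f}x")
```

Both models activate roughly a tenth of their parameters, so the difference the speaker observes comes with about seven times more active parameters in the larger model.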
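Model and Test Background
- Model under test: Qwen3-A3B MoE (Qwen3), a mixture-of-experts system with 30B total parameters, of which 3 billion are active.
  - The speaker stresses that he is testing the unquantized version in order to assess its "pure power", so the test is run on the Qwen platform itself rather than on a local, possibly heavily quantized version.
- Test goals: evaluate the following abilities in an extreme logic test:
  - Simultaneous multithreading: executing multiple lines of reasoning.
  - Memory tracking: tracking numerous variables without losing the context.
  - Deductive reasoning: drawing conclusions from interdependent clues.
  - Incremental complexity: the puzzle's complexity builds up step by step.
- Test design: the puzzle is built on a matrix. The speaker wants to see where the model can solve everything, or where it runs into trouble on specific parts (such as the logic of the last column), possibly ignoring some parts or imposing a logic restriction of its own to arrive at a solution.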
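The matrix structure described above can be pictured as a table with one row per community and one column per attribute. The following is a minimal sketch, not the speaker's actual puzzle: the names Avalon, Faland, Tome of Secrets, Ring of Realms, Crystal of Time, Dragon, Griffin, and Pegasus appear in the transcript, but the row called "Third", the column names, the assignments, and the helper `duplicated_values` are hypothetical. The complexity column deliberately repeats a value to show what a uniqueness check would flag.

```python
from collections import Counter

# Hypothetical toy grid: one row per community, one column per attribute.
# These assignments are invented for illustration; they are NOT the
# real solution to the puzzle used in the video.
grid = {
    "Avalon": {"artifact": "Tome of Secrets", "familiar": "Dragon", "complexity": "high"},
    "Faland": {"artifact": "Ring of Realms", "familiar": "Griffin", "complexity": "high"},
    "Third": {"artifact": "Crystal of Time", "familiar": "Pegasus", "complexity": "medium"},
}


def duplicated_values(grid, column):
    """Return the sorted values that occur more than once in a column."""
    counts = Counter(row[column] for row in grid.values())
    return sorted(value for value, n in counts.items() if n > 1)


for column in ("artifact", "familiar", "complexity"):
    print(column, duplicated_values(grid, column))
```

An empty list means the column satisfies a "each value used once" rule; a non-empty list names the values that break it.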
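Non-Thinking Mode Test Results
- Initial attempt and unexpected result:
  - The speaker was initially skeptical that a 3B-active model could solve this complex logic task: "I do not expect this to work, to be clear. How can a model with a size of 30B, with an active 3B..."
  - The model unexpectedly produced an initial solution.
- Verification and doubt:
  - The model declared its answer final after pointing out a "contradiction" and making an "adjustment".
  - The speaker doubted the reliability of the result and guessed that the model might have drawn on cached results from the same puzzle run the day before on the same (personal) account with the larger model (Qwen 235B MoE, 22B active): "So, hmm, has the little model peeked over here to its big brother and found the solution? I mean, this would be... it's unbelievable."
- Searching for alternatives and the model's contradictions:
  - When asked whether another valid solution exists (the speaker knows that one does), the model initially asserted that "all attributes are uniquely determined by the clues". The speaker rejected this: "This is wrong. This is not a correct statement."
  - After being told "you fail to provide the correct answer", the model apologized and offered a "new valid solution".
  - The model claimed that the only difference between the two solutions was the "artifact assignment" between "Avalon and Faland", which the speaker did not accept.
- Self-correction and conclusion:
  - After re-evaluation, the model admitted that its second solution violated "clue 6".
  - In the end, in non-thinking mode, the model insisted that its initial solution was the only valid one.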
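Whether a puzzle of this kind has one solution or several can be settled mechanically rather than by asking the model. The sketch below is a hedged illustration on a toy puzzle, not the speaker's 15-clue test: the three clues, the community named "Third", and the helpers `violated` and `all_solutions` are all hypothetical. It enumerates every one-to-one assignment with the standard-library `itertools.permutations`, reports which numbered clues an assignment breaks, and counts the surviving solutions.

```python
from itertools import permutations

communities = ["Avalon", "Faland", "Third"]  # "Third" is a placeholder name
artifacts = ["Tome of Secrets", "Ring of Realms", "Crystal of Time"]

# Hypothetical clues, numbered for reporting. Each one takes an assignment
# (community -> artifact) and returns True when the clue is satisfied.
clues = {
    1: lambda a: a["Third"] == "Crystal of Time",
    2: lambda a: a["Avalon"] != "Crystal of Time",
}


def violated(assignment, clues):
    """Return the numbers of all clues that the assignment breaks."""
    return [n for n, check in clues.items() if not check(assignment)]


def all_solutions(clues):
    """Brute-force every one-to-one assignment and keep the valid ones."""
    solutions = []
    for perm in permutations(artifacts):
        assignment = dict(zip(communities, perm))
        if not violated(assignment, clues):
            solutions.append(assignment)
    return solutions


# With two clues, Avalon and Faland can still swap artifacts.
print(len(all_solutions(clues)))

# One more hypothetical clue removes the ambiguity.
tighter = dict(clues)
tighter[3] = lambda a: a["Avalon"] != "Ring of Realms"
print(len(all_solutions(tighter)))
```

Brute force is only practical for small grids; for a puzzle with seven rows and several columns, a constraint solver or a backtracking search would be the more realistic choice.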
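Thinking Mode Test Results
- Enabling thinking mode:
  - The speaker activated the model's thinking mode and set the maximum thinking length ("38,912 tokens of pure thinking").
  - The instruction to the model was to "find the correct solution now, in thinking mode, to the logic test, verified autonomously".
- Detailed reasoning and early progress:
  - The model analyzed the puzzle step by step and identified all 15 base clues as well as the additional complexity clues.
  - The model showed its reasoning traces, such as extracting direct links from the text and filling in the table. The speaker was surprised that a 3B-active model could do this: "My word, I mean, a 3-billion mixture-of-experts could do this? This would be quite an advancement."
  - The model proceeded with its derivation, working out the assignment of elements, artifacts, and familiars and checking uniqueness.
- Contradictions and self-correction:
  - The model ran into contradictions several times ("We have a contradiction") and tried to self-correct and backtrack ("Backtracking is activated. Really nice, on a 3B model. I'm loving it.").
  - At one point the model got stuck in a logic loop ("We're in a loop. We are trapped.").
- Reliance on the non-thinking result and repeated reversals:
  - First "final answer": after getting stuck in thinking mode, the model suddenly fell back on the first solution from non-thinking mode, claimed it was the only one satisfying all clues, and marked every clue (including the complexity clues) as "check".
  - Second "final answer" (after correction): when asked to verify again with stronger logic, the model found that the "final answer" it had just adopted actually violated "clue 14" ("Avalon is assigned the Tome of Secrets. Invalid."). It admitted the error and proposed an adjusted "correct solution".
  - Third "final answer" (after another correction): during the "ultimate logic" verification of the second answer, the model found another "critical error": the solution violated "clue 13". It adjusted again and declared the new solution final and correct.
- Difficulty with the complexity clues and a "creative" compromise:
  - In the final verification of the third "final answer", the model found that it still could not satisfy all clues, in particular "clue 22" (related to the complexity ordering).
  - The model began to introduce internal assumptions to resolve the conflict: "This suggests that the complexity levels may not allow a valid arrangement unless some adjustments are made. But given that the puzzle likely uses a specific hierarchy, perhaps the intended order is..." This added a condition the user never provided and led to a contradiction with "clue 18".
  - The model then proposed a "clever" workaround: ignoring part of the complexity constraints (in the speaker's paraphrase: "I ignore the last column of my table, of my matrix, and then I try to find a solution.").
  - Yet its final output still listed the full complexity levels ("high, high, high"), which contradicted what it said it would ignore and broke the rule that only one entry may have the highest complexity.
  - Even so, the model called this a "final answer, correct and verified" while conceding that "a complexity clue might require slight adjustment based on the assumed hierarchy."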
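The backtracking behaviour described above (assign, detect a contradiction, undo, try another option) is a standard search technique. The following is a minimal generic sketch of it, assuming a toy setting: it is not the model's internal procedure and not the speaker's puzzle. The function `backtrack`, the constraint `consistent`, and the community name "Third" are hypothetical.

```python
def backtrack(rows, values, consistent, partial=None):
    """Yield every complete one-to-one assignment that stays consistent."""
    partial = {} if partial is None else partial
    if len(partial) == len(rows):
        yield dict(partial)  # copy, because `partial` is mutated below
        return
    row = rows[len(partial)]  # next row that still needs a value
    for value in values:
        if value in partial.values():
            continue  # each value may be used only once
        partial[row] = value
        if consistent(partial):  # no contradiction so far: go deeper
            yield from backtrack(rows, values, consistent, partial)
        del partial[row]  # contradiction or branch exhausted: back up


def consistent(partial):
    """Hypothetical constraint: Faland may not hold the Ring of Realms."""
    return partial.get("Faland") != "Ring of Realms"


rows = ["Avalon", "Faland", "Third"]  # "Third" is a placeholder name
values = ["Tome of Secrets", "Ring of Realms", "Crystal of Time"]
found = list(backtrack(rows, values, consistent))
print(len(found))
```

Unlike the model in the video, this search cannot loop forever or quietly drop a constraint: it either yields assignments that pass every check or yields nothing.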
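Speaker's Assessment and Conclusions
- Impression of the model's capabilities:
  - The speaker calls the reasoning ability and transparent thinking process that this 3B-active model showed on such a complex task "absolutely impressive".
  - He acknowledges the model's self-correction, backtracking, and efforts to understand and resolve contradictions.
  - "It's trying, within the framework of logic, to say: hey, something is wrong, I have to ignore something."
- Comparison with the larger model:
  - The speaker states that the Qwen 235B MoE (22 billion active parameters) tested the day before "immediately found the solution", in contrast to the struggles of the small 3B-active model.
- Limitations of the model:
  - Despite its efforts, the 3B-active model never found a correct solution that satisfies all of the original constraints.
  - Its final "solution" relies on modifying the original problem or ignoring contradictions.
  - "Unfortunately, with ultimate precision, the reasoning, the final reasoning part, is not mapped to the result." In other words, the model's final conclusion is not fully consistent with its detailed reasoning.
- Potential of the model:
  - The speaker believes that if the task complexity is reduced, Qwen3-A3B MoE "is a very powerful model".
  - The test also gives users a baseline for judging how much performance a locally quantized version might lose.
- Key takeaway:
Qwen3-A3B MoE (3 billion active parameters) showed unexpectedly strong reasoning attempts and a highly transparent thinking process in this extreme logic test, including self-correction and backtracking. However, it could not handle the full complexity of the task: it repeatedly fell into logical contradictions, overturned its earlier conclusions, and never delivered a correct solution that satisfies all constraints. Its attempt to reach a "solution" by modifying constraints or ignoring part of the complexity may look "creative", but it also exposes the limits of a model of this size on tasks this hard. Even so, the speaker rates the performance as "impressive", especially given that the model is far smaller than the one that solved the puzzle with ease.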