Bibliographic Details
Main Author: 秀吉
Format: Digital resource
Language:
Published in: Zenodo 2026
Online Access: https://doi.org/10.5281/zenodo.20247229
Table of Contents:
  • <h2>Headline</h2> <p>Same-hardware A/B on dual RTX 3090 PCIe (vLLM 0.19.1, AWQ-Marlin). 10 prompts × 3 trials × 2 models = 60 samples. Workload mirrors the production <code>robot_brain.py</code> voice agent prompt distribution: 4 chat + 6 tool prompts, <code>tool_choice=auto</code>, <code>enable_thinking=False</code>.</p>
<table>
<thead>
<tr> <th>Metric</th> <th>MoE 35B-A3B + MTP k=3 + TP=2 (production)</th> <th>Dense 27B, no-spec, TP=1</th> <th>MoE advantage</th> </tr>
</thead>
<tbody>
<tr> <td>TTFT mean (ms)</td> <td><strong>178</strong></td> <td>771</td> <td><strong>4.34×</strong></td> </tr>
<tr> <td>e2e mean (ms)</td> <td><strong>274</strong></td> <td>1684</td> <td><strong>6.13×</strong></td> </tr>
<tr> <td>tok/s mean</td> <td><strong>88.0</strong></td> <td>16.2</td> <td><strong>5.42×</strong></td> </tr>
<tr> <td>Tool accuracy</td> <td><strong>30/30 (100 %)</strong></td> <td>23/30 (77 %)</td> <td>+23.3 pp</td> </tr>
<tr> <td>Chat false-fires</td> <td><strong>0/12</strong></td> <td>7/12</td> <td>n/a</td> </tr>
</tbody>
</table>
<h2>Decision</h2> <p><strong>No production swap.</strong> Keep MoE + MTP k=3 + TP=2 + <code>--no-enable-prefix-caching</code> (the v4.0 stack). Qwen3.6-27B-AWQ is not a free upgrade on this hardware for this workload.</p> <p>The intuition that "dense 27B is smaller, fits TP=1, should be cheaper to serve" is <strong>falsified here for this hardware × workload</strong>. Dense loses on TTFT (4×), throughput (5×), and tool-call discrimination (it over-fires <code>play_emotion</code> on greetings and small talk in 7/12 chat samples).</p> <p>The Dense win on cleaner raw zh-TW (5/5 TRAD vs MoE c1 leaking SIMP on "你好" in all 3 trials) is partly an artifact of Dense refusing to chat, and is independently solved by the <code>OpenCC s2t</code> post-processor shipped in the <a href="https://github.com/thc1006/reachy-mini-agent">consumer repo</a> commit <code>a7912c7</code>.</p> <h2>Scope and caveats</h2> <ul> <li><strong>N=3 trials per cell.</strong> Sufficient for the 4× / 5× / 23 pp gaps shown; not sufficient for close calls. 
Read the <a href="https://github.com/thc1006/qwen3.6-vllm-2x3090/blob/master/v5_2026_05_17/README.md">v5 README</a> before generalizing.</li> <li><strong>TP confound.</strong> Dense ran at TP=1 (single 3090 budget; 13.5 GB weights fit cleanly). A TP=2 Dense follow-up is out of scope; textbook scaling suggests ≤ 2× tok/s (at most about 32 tok/s from the measured 16.2), which still does not close the 5.4× throughput gap.</li> <li><strong>Spec-decode asymmetry.</strong> MoE ran with MTP k=3 (v4.0 winner); Dense ran no-spec (no public MTP draft head for Qwen3.6-27B yet). This is <strong>the right</strong> comparison for "what should I serve in production"; it is the wrong comparison for "is dense base decode faster than MoE base decode", and the latter is moot, since nothing in production runs base.</li> <li><strong>Single hardware.</strong> 2× RTX 3090 PCIe, no NVLink, SM 8.6. NVLink / HBM / H100 would change every absolute number; the direction of "MoE+MTP wins by a wide margin on voice-agent shape" is unlikely to flip but is unverified here.</li> </ul> <h2>Where the chat false-fires came from</h2> <p>Six of Dense's 7 chat false-fires fall on c1 ("你好,我是瑞奇") and c2 ("你今天好嗎?"), each of which fired <code>play_emotion</code> in 3/3 trials. MoE replies to those same prompts with text (e.g. c2 → "我很好,謝謝!你呢?"). 
A system-prompt tweak ("only call a tool when the user names an action verb") would probably narrow this gap; that ablation is out of scope for a "production-config A/B".</p> <h2>Artifacts</h2> <ul> <li><a href="https://github.com/thc1006/qwen3.6-vllm-2x3090/blob/master/v5_2026_05_17/README.md"><code>v5_2026_05_17/README.md</code></a> — full writeup.</li> <li><a href="https://github.com/thc1006/qwen3.6-vllm-2x3090/tree/master/v5_2026_05_17/data"><code>v5_2026_05_17/data/</code></a> — 60 raw bench rows.</li> <li><a href="https://github.com/thc1006/qwen3.6-vllm-2x3090/blob/master/v5_2026_05_17/bench/v5_voice_bench.py"><code>v5_2026_05_17/bench/v5_voice_bench.py</code></a> — bench harness.</li> <li><a href="https://github.com/thc1006/qwen3.6-vllm-2x3090/tree/master/v5_2026_05_17/analysis"><code>v5_2026_05_17/analysis/</code></a> — analyzer + aggregate.json.</li> </ul> <h2>Related</h2> <ul> <li><a href="https://github.com/thc1006/qwen3.6-vllm-2x3090/releases/tag/v4.0">v4.0</a> — 9-phase factorial sweep that fixed MoE + MTP k=3 + TP=2 + cache-OFF.</li> <li><a href="https://github.com/thc1006/qwen3.6-speculative-decoding-rtx3090/releases/tag/v3.0">v3.0</a> — original MTP +27.5 % cache-OFF measurement.</li> </ul>
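<p>The headline ratios and the TP confound argument can be re-derived from the rounded table means. The sketch below is not part of the published analyzer; the variable names are illustrative, and it assumes ideal 2× scaling as the most optimistic case for a TP=2 Dense run. Ratios computed from rounded means differ slightly from the published ones (for example 4.33× vs 4.34× on TTFT), presumably because the published figures use unrounded data.</p>

```python
# Rounded means copied from the Headline table (not the raw bench rows).
MOE = {"ttft_ms": 178.0, "e2e_ms": 274.0, "tok_s": 88.0}
DENSE = {"ttft_ms": 771.0, "e2e_ms": 1684.0, "tok_s": 16.2}

# Latency metrics: lower is better, so the MoE advantage is dense / moe.
ttft_ratio = DENSE["ttft_ms"] / MOE["ttft_ms"]
e2e_ratio = DENSE["e2e_ms"] / MOE["e2e_ms"]

# Throughput: higher is better, so the MoE advantage is moe / dense.
tok_s_ratio = MOE["tok_s"] / DENSE["tok_s"]

# TP confound: assume (optimistically) that TP=2 doubles Dense throughput.
# This is an upper bound for illustration, not a measurement.
dense_tp2_upper_bound = 2.0 * DENSE["tok_s"]
residual_gap = MOE["tok_s"] / dense_tp2_upper_bound

# Tool accuracy gap in percentage points, from the 30/30 and 23/30 counts.
accuracy_gap_pp = (30 / 30 - 23 / 30) * 100.0

if __name__ == "__main__":
    print(f"TTFT advantage:  {ttft_ratio:.2f}x")
    print(f"e2e advantage:   {e2e_ratio:.2f}x")
    print(f"tok/s advantage: {tok_s_ratio:.2f}x")
    print(f"Dense TP=2 upper bound: {dense_tp2_upper_bound:.1f} tok/s")
    print(f"Residual MoE advantage at that bound: {residual_gap:.2f}x")
    print(f"Tool accuracy gap: {accuracy_gap_pp:.1f} pp")
```

<p>Even under the ideal-scaling assumption, the Dense upper bound stays well below the measured MoE throughput, which is the point the TP confound caveat makes.</p>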