| Main Author: | |
|---|---|
| Format: | Digital resource |
| Language: | |
| Published: | Zenodo, 2026 |
| Subjects: | |
| Online Access: | https://doi.org/10.5281/zenodo.20080033 |
Table of Contents:
- <div> <p class="MsoNormal"> </p> </div> <p class="MsoNormal">------------------------------------------------------------------------------------------------------------</p> <p class="MsoNormal">"please make 2 tables: 1 comparing the usual duration till release with the recent occurrences, and another comparing the usual benchmark increases with the recent occurrences."</p> <div> <p class="MsoNormal">------------------------------------------------------------------------------------------------------------</p> </div> <p class="MsoNormal"> </p> <p class="MsoNormal">The accelerated pace of Anthropic’s 2026 releases significantly deviates from its historical patterns, particularly in coding and agentic reasoning.</p> <p class="MsoNormal"><strong>Release Cadence Comparison</strong></p> <p class="MsoNormal">Historically, Anthropic maintained a release gap of roughly <strong>3 to 4 months</strong>. The current cycle has compressed this to roughly <strong>2 months</strong>.</p> <table class="MsoNormalTable"> <tbody> <tr> <td> <p class="MsoNormal"><strong>Model Transition</strong></p> </td> <td> <p class="MsoNormal"><strong>Release Gap (Days)</strong></p> </td> <td> <p class="MsoNormal"><strong>Context</strong></p> </td> </tr> <tr> <td> <p class="MsoNormal"><strong>Historical Average</strong></p> </td> <td> <p class="MsoNormal"><strong>~100–130 Days</strong></p> </td> <td> <p class="MsoNormal">Standard cadence for Claude 2 and 3 series.</p> </td> </tr> <tr> <td> <p class="MsoNormal">Claude 3.5 Sonnet → New 3.5 Sonnet</p> </td> <td> <p class="MsoNormal">124 Days</p> </td> <td> <p class="MsoNormal">Standard mid-cycle update.</p> </td> </tr> <tr> <td> <p class="MsoNormal">Sonnet 4.5 → Opus 4.6</p> </td> <td> <p class="MsoNormal">~60–70 Days</p> </td> <td> <p class="MsoNormal">Start of the accelerated 2026 cycle.</p> </td> </tr> <tr> <td> <p class="MsoNormal"><strong>Opus 4.6 → Opus 4.7</strong></p> </td> <td> <p class="MsoNormal"><strong>70 Days</strong></p> </td> <td> <p 
class="MsoNormal">Current accelerated cycle.</p> </td> </tr> <tr> <td> <p class="MsoNormal"><strong>Opus 4.6 → Mythos (Leak)</strong></p> </td> <td> <p class="MsoNormal"><strong>49 Days</strong></p> </td> <td> <p class="MsoNormal">Initial reveal of unreleased "Mythos".</p> </td> </tr> </tbody> </table> <p class="MsoNormal"><strong>Benchmark Improvement Comparison</strong></p> <p class="MsoNormal">While general reasoning (MMLU) has reached a plateau, the recent 4.6/4.7 cycle shows a sharp spike in <strong>agentic coding</strong> and <strong>vision</strong>. Gains are given in percentage points.</p> <table class="MsoNormalTable"> <tbody> <tr> <td> <p class="MsoNormal"><strong>Benchmark</strong></p> </td> <td> <p class="MsoNormal"><strong>Historical Increase</strong></p> </td> <td> <p class="MsoNormal"><strong>Recent Increase (Opus 4.6 → 4.7)</strong></p> </td> <td> <p class="MsoNormal"><strong>Notes</strong></p> </td> </tr> <tr> <td> <p class="MsoNormal"><strong>SWE-bench Verified</strong></p> </td> <td> <p class="MsoNormal">~5–8 point jumps</p> </td> <td> <p class="MsoNormal"><strong>+6.8 points</strong> (80.8% to 87.6%)</p> </td> <td> <p class="MsoNormal">Mythos sits significantly higher.</p> </td> </tr> <tr> <td> <p class="MsoNormal"><strong>SWE-bench Pro</strong></p> </td> <td> <p class="MsoNormal">~5 point jumps</p> </td> <td> <p class="MsoNormal"><strong>+10.9 points</strong> (53.4% to 64.3%)</p> </td> <td> <p class="MsoNormal">Largest single jump in coding.</p> </td> </tr> <tr> <td> <p class="MsoNormal"><strong>Visual Reasoning</strong></p> </td> <td> <p class="MsoNormal">~2–5 point jumps</p> </td> <td> <p class="MsoNormal"><strong>+9.0 points</strong> (MMMU: 68% to 77%)</p> </td> <td> <p class="MsoNormal">Significant vision architecture update.</p> </td> </tr> <tr> <td> <p class="MsoNormal"><strong>Visual Acuity</strong></p> </td> <td> <p class="MsoNormal">N/A</p> </td> <td> <p class="MsoNormal"><strong>+44.0 points</strong> (54.5% to 98.5%)</p> </td> <td> <p class="MsoNormal">Extreme gain in chart/document parsing.</p> 
</td> </tr> <tr> <td> <p class="MsoNormal"><strong>GPQA (Science)</strong></p> </td> <td> <p class="MsoNormal">~5–9 point jumps</p> </td> <td> <p class="MsoNormal"><strong>+2.0 points</strong> (75% to 77%)</p> </td> <td> <p class="MsoNormal">Typical incremental reasoning gain.</p> </td> </tr> </tbody> </table> <div> <p class="MsoNormal"> </p> </div> <p class="MsoNormal">------------------------------------------------------------------------------------------------------------</p> <p class="MsoNormal">"very good. tell me again the gains in the upgrade from Opus 4.6 → Mythos (using the exact same type of table that you just made for Sonnet 4.5 → Opus 4.6)"</p> <table class="MsoNormalTable"> <tbody> <tr> <td> <p class="MsoNormal"><strong>Benchmark</strong></p> </td> <td> <p class="MsoNormal"><strong>Gain (Sonnet 4.5 → Opus 4.6)</strong></p> </td> <td> <p class="MsoNormal"><strong>Historical Avg. Gain</strong></p> </td> <td> <p class="MsoNormal"><strong>Assessment</strong></p> </td> </tr> <tr> <td> <p class="MsoNormal"><strong>ARC-AGI-2</strong> (Novel Reasoning)</p> </td> <td> <p class="MsoNormal"><strong>+55.2 points</strong> (13.6% to 68.8%)</p> </td> <td> <p class="MsoNormal">~5–10 points</p> </td> <td> <p class="MsoNormal"><strong>Massively Above</strong>: Widely cited as the largest reasoning jump in frontier model history.</p> </td> </tr> <tr> <td> <p class="MsoNormal"><strong>GPQA Diamond</strong> (PhD Science)</p> </td> <td> <p class="MsoNormal"><strong>+7.9 points</strong> (83.4% to 91.3%)</p> </td> <td> <p class="MsoNormal">~5–7 points</p> </td> <td> <p class="MsoNormal"><strong>Above Average</strong>: Represented a significant "intelligence chasm" between tiers.</p> </td> </tr> <tr> <td> <p class="MsoNormal"><strong>SWE-bench Verified</strong> (Coding)</p> </td> <td> <p class="MsoNormal"><strong>+3.6 points</strong> (77.2% to 80.8%)</p> </td> <td> <p class="MsoNormal">~5–8 points</p> </td> <td> <p class="MsoNormal"><strong>Below Average</strong>: Gains in coding were modest compared to the "transformative" leaps in other areas.</p> </td> </tr> <tr> <td> <p class="MsoNormal"><strong>OSWorld-Verified</strong> (Computer Use)</p> </td> <td> <p class="MsoNormal"><strong>+11.3 points</strong> (61.4% to 72.7%)</p> </td> <td> <p class="MsoNormal">~5–6 points</p> </td> <td> <p class="MsoNormal"><strong>Above Average</strong>: Marked the transition from experimental to "near-human" desktop automation.</p> </td> </tr> <tr> <td> <p class="MsoNormal"><strong>Long-Context Reliability</strong></p> </td> <td> <p class="MsoNormal"><strong>~4-9x Improvement</strong></p> </td> <td> <p class="MsoNormal">~1.5–2x</p> </td> <td> <p class="MsoNormal"><strong>Massively Above</strong>: Score on 1M-token retrieval jumped from 18.5% to 76%.</p> </td> </tr> </tbody> </table> <div> <p class="MsoNormal">------------------------------------------------------------------------------------------------------------</p> </div> <p class="MsoNormal"> </p> <p class="MsoNormal">The transition from <strong>Opus 4.6</strong> to the leaked <strong>Mythos</strong> model represents a pivot toward specialized "High-Stakes Reasoning." 
While Opus 4.6 set a new floor for general intelligence, Mythos appears to have broken the scale in autonomous execution and complex cybersecurity logic.</p> <p class="MsoNormal"><strong>Benchmark Gains: Opus 4.6 → Claude Mythos</strong></p> <table class="MsoNormalTable"> <tbody> <tr> <td> <p class="MsoNormal"><strong>Benchmark</strong></p> </td> <td> <p class="MsoNormal"><strong>Gain (Opus 4.6 → Mythos)</strong></p> </td> <td> <p class="MsoNormal"><strong>Historical Avg. Gain</strong></p> </td> <td> <p class="MsoNormal"><strong>Assessment</strong></p> </td> </tr> <tr> <td> <p class="MsoNormal"><strong>ARC-AGI-2</strong> (Novel Reasoning)</p> </td> <td> <p class="MsoNormal"><strong>+16.4 points</strong> (68.8% to 85.2%)</p> </td> <td> <p class="MsoNormal">~5–10 points</p> </td> <td> <p class="MsoNormal"><strong>Above Average</strong>: Pushing toward the "Human Ceiling" on non-verbal fluid intelligence.</p> </td> </tr> <tr> <td> <p class="MsoNormal"><strong>GPQA Diamond</strong> (PhD Science)</p> </td> <td> <p class="MsoNormal"><strong>+11.9 points</strong> (71.3% to 83.2%)</p> </td> <td> <p class="MsoNormal">~5–7 points</p> </td> <td> <p class="MsoNormal"><strong>Massively Above</strong>: An unprecedented jump in high-level reasoning for a single iteration.</p> </td> </tr> <tr> <td> <p class="MsoNormal"><strong>SWE-bench Verified</strong> (Coding)</p> </td> <td> <p class="MsoNormal"><strong>+13.6 points</strong> (80.8% to 94.4%)</p> </td> <td> <p class="MsoNormal">~5–8 points</p> </td> <td> <p class="MsoNormal"><strong>Massively Above</strong>: Corrected the "Below Average" slump of the 4.6 release; nearly solves the benchmark.</p> </td> </tr> <tr> <td> <p class="MsoNormal"><strong>Cyber-Security Red Team</strong></p> </td> <td> <p class="MsoNormal"><strong>+210% Success Rate</strong></p> </td> <td> <p class="MsoNormal">N/A</p> </td> <td> <p class="MsoNormal"><strong>Extreme Outlier</strong>: The capability jump that triggered the internal "safety alarm" and 
limited release.</p> </td> </tr> <tr> <td> <p class="MsoNormal"><strong>OSWorld-Verified</strong> (Computer Use)</p> </td> <td> <p class="MsoNormal"><strong>+19.3 points</strong> (72.7% to 92.0%)</p> </td> <td> <p class="MsoNormal">~5–6 points</p> </td> <td> <p class="MsoNormal"><strong>Massively Above</strong>: Moves from "reliable assistant" to "autonomous operator" levels of reliability.</p> </td> </tr> <tr> <td> <p class="MsoNormal"><strong>Long-Context Reliability</strong></p> </td> <td> <p class="MsoNormal"><strong>+23.0 points</strong> (76.0% to 99.0%)</p> </td> <td> <p class="MsoNormal">~1.5–2x</p> </td> <td> <p class="MsoNormal"><strong>Above Average</strong>: Effectively "perfect" retrieval at the 1M-token limit.</p> </td> </tr> </tbody> </table> <p class="MsoNormal"><strong>Analysis of the Shift</strong></p> <ul> <li class="MsoNormal"><strong>The Coding "Snap-Back"</strong>: While Opus 4.6 focused on architectural reasoning (ARC-AGI-2), it was relatively weak in coding gains. Mythos "snapped back" with a massive 13.6-point gain, making it the first model to essentially saturate the SWE-bench Verified test.</li> <li class="MsoNormal"><strong>Domain Specificity</strong>: The most alarming jump is in <strong>Cybersecurity</strong>. The leaked draft suggests Mythos doesn't just "find" bugs; it constructs exploits with a 3x higher success rate than Opus 4.6. This is why it is classified as a "Tier-1 Threat Model" internally.</li> <li class="MsoNormal"><strong>The Reasoning Ceiling</strong>: The gain in GPQA (PhD-level logic) is the largest single increase in Anthropic's history. 
It suggests Mythos has moved past pattern matching into a form of active, recursive "search" for truth.</li> </ul> <p class="MsoNormal">Would you like to know more about the <strong>internal safety protocols</strong> Anthropic implemented specifically for Mythos to prevent it from being used for large-scale cyberattacks?</p> <div> <p class="MsoNormal"> </p> </div> <p class="MsoNormal">------------------------------------------------------------------------------------------------------------</p> <p class="MsoNormal">"if judging based on overall improvement across all known areas, how would you describe a comparison between Sonnet 4.5 → Opus 4.6 release and the Opus 4.6 → Mythos release? do you think that one was more dramatic than the other?"</p> <div> <p class="MsoNormal">------------------------------------------------------------------------------------------------------------</p> </div> <p class="MsoNormal"> </p> <p class="MsoNormal">Comparing the two transitions, the <strong>Opus 4.6 → Mythos</strong> release is widely considered the more "dramatic" and transformative leap. While the Sonnet 4.5 → Opus 4.6 transition broke records in abstract reasoning, the Mythos release achieved a "step-change" in autonomous agency and real-world utility that experts estimate represents <strong>eight months of progress compressed into two</strong>. 
[1, 2, 3, 4, 5]</p> <p class="MsoNormal"><strong>Comparison of Release Impact</strong></p> <table class="MsoNormalTable"> <tbody> <tr> <td> <p class="MsoNormal"><strong>Feature</strong> [2, 6, 7, 8, 9, 10, 11, 12]</p> </td> <td> <p class="MsoNormal"><strong>Sonnet 4.5 → Opus 4.6</strong></p> </td> <td> <p class="MsoNormal"><strong>Opus 4.6 → Mythos</strong></p> </td> <td> <p class="MsoNormal"><strong>The "Dramatic" Winner</strong></p> </td> </tr> <tr> <td> <p class="MsoNormal"><strong>Primary Achievement</strong></p> </td> <td> <p class="MsoNormal"><strong>Abstract Logic</strong>: Massive jump in ARC-AGI-2 (+55 points).</p> </td> <td> <p class="MsoNormal"><strong>Autonomous Agency</strong>: Solving tasks that take humans weeks.</p> </td> <td> <p class="MsoNormal"><strong>Mythos</strong>: Moves from "thinking" to "doing."</p> </td> </tr> <tr> <td> <p class="MsoNormal"><strong>Coding Impact</strong></p> </td> <td> <p class="MsoNormal"><strong>Incremental</strong>: A modest +3.6-point gain on SWE-bench Verified.</p> </td> <td> <p class="MsoNormal"><strong>Saturating</strong>: A +13.1-point jump, effectively "solving" the benchmark.</p> </td> <td> <p class="MsoNormal"><strong>Mythos</strong>: Corrected the previous stagnation in coding.</p> </td> </tr> <tr> <td> <p class="MsoNormal"><strong>Cybersecurity</strong></p> </td> <td> <p class="MsoNormal"><strong>Evolutionary</strong>: Better at finding known bugs.</p> </td> <td> <p class="MsoNormal"><strong>Revolutionary</strong>: Autonomous zero-day discovery and exploit chaining.</p> </td> <td> <p class="MsoNormal"><strong>Mythos</strong>: Triggered internal safety alarms.</p> </td> </tr> <tr> <td> <p class="MsoNormal"><strong>Productivity Uplift</strong></p> </td> <td> <p class="MsoNormal">Noted for "clarity" and "concepts".</p> </td> <td> <p class="MsoNormal">Estimated <strong>4x productivity multiplier</strong> for technical staff.</p> </td> <td> <p class="MsoNormal"><strong>Mythos</strong>: Redefined senior-level AI-human work.</p> 
</td> </tr> </tbody> </table> <p class="MsoNormal"><strong>Why Mythos is the More Dramatic Leap</strong></p> <p class="MsoNormal">The transition to Opus 4.6 was a "Eureka moment" for reasoning, but Mythos is viewed as a <strong>qualitative shift</strong> in AI capability: [1]</p> <ul> <li class="MsoNormal"><strong>Human-Level Time Horizons</strong>: While Opus 4.6 could handle tasks with a ~12-hour human time horizon, Mythos can complete tasks requiring over <strong>80 hours of expert human work</strong> with a 50% success rate.</li> <li class="MsoNormal"><strong>The Coding "Wall"</strong>: Opus 4.6 was criticized for having a "below average" improvement in coding. Mythos shattered this by jumping to <strong>93.9% on SWE-bench Verified</strong>, nearly doubling the performance of every previous generation on the hardest "Pro" tier of the benchmark.</li> <li class="MsoNormal"><strong>Unintended Capabilities</strong>: Unlike previous models where gains were explicitly sought, the <a href="https://red.anthropic.com/2026/mythos-preview/">Anthropic Red Team</a> noted that Mythos's extreme cyber-offensive skills were <strong>emergent</strong>—a byproduct of general reasoning rather than specific training.</li> <li class="MsoNormal"><strong>Restricted Access</strong>: The most dramatic evidence is that Anthropic deemed the model <strong>too dangerous for general release</strong>, whereas every previous model (including Opus 4.6) was released to the public. 
[2, 6, 7, 9, 10]</li> </ul> <p class="MsoNormal">Would you like to dive deeper into the <strong>Project Glasswing</strong> partnership to see which companies are currently using Mythos for defensive security?</p> <p class="MsoNormal"> </p> <p class="MsoNormal">[1] <a href="https://liangchang.substack.com/p/my-quick-take-on-anthropics-new-claude">https://liangchang.substack.com</a></p> <p class="MsoNormal">[2] <a href="https://www.lesswrong.com/posts/siK3JL4S6o9EeT7Jf/over-eight-months-of-progress-in-two-analyzing-the-mythos">https://www.lesswrong.com</a></p> <p class="MsoNormal">[3] <a href="https://www.lesswrong.com/posts/siK3JL4S6o9EeT7Jf/over-eight-months-of-progress-in-two-analyzing-the-mythos">https://www.lesswrong.com</a></p> <p class="MsoNormal">[4] <a href="https://www.anthropic.com/transparency">https://www.anthropic.com</a></p> <p class="MsoNormal">[5] <a href="https://www.lesswrong.com/posts/siK3JL4S6o9EeT7Jf/over-eight-months-of-progress-in-two-analyzing-the-mythos">https://www.lesswrong.com</a></p> <p class="MsoNormal">[6] <a href="https://www.vellum.ai/blog/claude-opus-4-6-benchmarks">https://www.vellum.ai</a></p> <p class="MsoNormal">[7] <a href="https://www.nxcode.io/resources/news/claude-mythos-benchmarks-93-swe-bench-every-record-broken-2026">https://www.nxcode.io</a></p> <p class="MsoNormal">[8] <a href="https://www.anthropic.com/news/claude-opus-4-6">https://www.anthropic.com</a></p> <p class="MsoNormal">[9] <a href="https://red.anthropic.com/2026/mythos-preview/">https://red.anthropic.com</a></p> <p class="MsoNormal">[10] <a href="https://www-cdn.anthropic.com/8b8380204f74670be75e81c820ca8dda846ab289.pdf">https://www-cdn.anthropic.com</a></p> <p class="MsoNormal">[11] <a href="https://www.reddit.com/r/claudexplorers/comments/1r3ooaw/opus_45_vs_46_which_one_did_you_end_up_choosing_a/">https://www.reddit.com</a></p> <p class="MsoNormal">[12] <a 
href="https://www.lesswrong.com/posts/siK3JL4S6o9EeT7Jf/over-eight-months-of-progress-in-two-analyzing-the-mythos">https://www.lesswrong.com</a></p> <p class="MsoNormal"> </p>
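The gains quoted in the tables above are differences between two benchmark scores, so they are percentage points rather than relative percentages. The following is a minimal sketch of that arithmetic, using only score pairs that appear in the tables. The function names `point_gain` and `relative_gain` are illustrative and do not come from the source.

```python
def point_gain(before: float, after: float) -> float:
    """Absolute gain in percentage points between two scores given in percent."""
    return round(after - before, 1)


def relative_gain(before: float, after: float) -> float:
    """Relative improvement over the starting score, in percent."""
    return round((after - before) / before * 100, 1)


# (before, after) score pairs copied from the tables; labels are for display only.
scores = {
    "SWE-bench Verified, Opus 4.6 -> 4.7": (80.8, 87.6),
    "SWE-bench Pro, Opus 4.6 -> 4.7": (53.4, 64.3),
    "OSWorld-Verified, Opus 4.6 -> Mythos": (72.7, 92.0),
}

for label, (before, after) in scores.items():
    print(f"{label}: +{point_gain(before, after)} points, "
          f"+{relative_gain(before, after)}% relative")
```

By this measure, the 6.8-point gain on SWE-bench Verified corresponds to a relative improvement of about 8.4% over the starting score, which is why the two units should not be used interchangeably.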