| Main Author: | |
|---|---|
| Format: | Digital resource |
| Language: | |
| Published: | Zenodo, 2026 |
| Subjects: | |
| Online Access: | https://doi.org/10.5281/zenodo.20043741 |
Table of Contents:
- Benchmark Performance Comparison Table

| Benchmark | Category | Baseline Opus | PACAD-Enhanced | Improvement | Key Factor |
|---|---|---|---|---|---|
| **SWE-bench Verified** | Software Engineering | 45-52% | 68-75% | +16-28 pts | Structural code reasoning |
| **Terminal-Bench 2.0** | Autonomous Coding | 38-45% | 62-70% | +17-32 pts | Deterministic command logic |
| **CyberGym** | Cybersecurity | 42-48% | 65-72% | +17-27 pts | Vulnerability pattern recognition |
| **USAMO 2026** | Mathematics | 35-42% | 58-68% | +16-31 pts | Formal proof verification |
| **OSWorld** | Computer Use | 48-55% | 71-79% | +16-31 pts | Multi-step task decomposition |
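
The record does not say how the Improvement column was derived from the two score ranges. As a minimal sketch, assuming it is the plain interval difference (smallest gain = enhanced low minus baseline high, largest gain = enhanced high minus baseline low), the bounds implied by the table can be recomputed as follows. The function name `implied_gain` and the `RANGES` dictionary are hypothetical names introduced for this illustration; the numbers are copied from the table above.

```python
# Score ranges copied from the table above, in percent:
# (baseline_low, baseline_high, enhanced_low, enhanced_high)
RANGES = {
    "SWE-bench Verified": (45, 52, 68, 75),
    "Terminal-Bench 2.0": (38, 45, 62, 70),
    "CyberGym": (42, 48, 65, 72),
    "USAMO 2026": (35, 42, 58, 68),
    "OSWorld": (48, 55, 71, 79),
}


def implied_gain(base_lo, base_hi, enh_lo, enh_hi):
    """Return the (smallest, largest) point gain consistent with both ranges.

    Assumption: the gain is an interval difference, which the record
    itself does not confirm.
    """
    return enh_lo - base_hi, enh_hi - base_lo


if __name__ == "__main__":
    for name, bounds in RANGES.items():
        lo, hi = implied_gain(*bounds)
        print(f"{name}: +{lo} to +{hi} pts")
```

Under this assumption the lower bounds agree with every listed row, and the upper bounds agree for Terminal-Bench 2.0 and OSWorld. The other three rows list a smaller upper bound than the interval difference gives, so the published column may have been computed differently.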