| Main Author: | |
|---|---|
| Format: | Digital resource |
| Language: | |
| Published: | Zenodo 2026 |
| Subjects: | |
| Online Access: | https://doi.org/10.5281/zenodo.20043741 |
| _version_ | 1866901284619026432 |
|---|---|
| author | Brown, Cameron |
| author_facet | Brown, Cameron |
| contents | <h2>Benchmark Performance Comparison Table</h2> <table> <thead> <tr> <th>Benchmark</th> <th>Category</th> <th>Baseline Opus</th> <th>PACAD-Enhanced</th> <th>Improvement</th> <th>Key Factor</th> </tr> </thead> <tbody> <tr> <td><strong>SWE-bench Verified</strong></td> <td>Software Engineering</td> <td>45-52%</td> <td>68-75%</td> <td>+16-28 pts</td> <td>Structural code reasoning</td> </tr> <tr> <td><strong>Terminal-Bench 2.0</strong></td> <td>Autonomous Coding</td> <td>38-45%</td> <td>62-70%</td> <td>+17-32 pts</td> <td>Deterministic command logic</td> </tr> <tr> <td><strong>CyberGym</strong></td> <td>Cybersecurity</td> <td>42-48%</td> <td>65-72%</td> <td>+17-27 pts</td> <td>Vulnerability pattern recognition</td> </tr> <tr> <td><strong>USAMO 2026</strong></td> <td>Mathematics</td> <td>35-42%</td> <td>58-68%</td> <td>+16-31 pts</td> <td>Formal proof verification</td> </tr> <tr> <td><strong>OSWorld</strong></td> <td>Computer Use</td> <td>48-55%</td> <td>71-79%</td> <td>+16-31 pts</td> <td>Multi-step task decomposition</td> </tr> </tbody> </table> |
| format | Digital resource |
| id | zenodo_https___doi_org_10_5281_zenodo_20043741 |
| institution | Zenodo |
| language | |
| publishDate | 2026 |
| publisher | Zenodo |
| record_format | zenodo |
| title | PACAD-Enhanced Opus: Specialized Benchmark Performance Estimates – 2-3 Week Cycle (14-21 Days, Median ~17 Days) |
| topic | Artificial Intelligence; Artificial Intelligence/standards; Artificial Intelligence/trends; Artificial Intelligence/classification; Artificial Intelligence/statistics & numerical data |
| url | https://doi.org/10.5281/zenodo.20043741 |
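
The record does not say how the "Improvement" column was derived from the "Baseline Opus" and "PACAD-Enhanced" ranges. As a minimal sketch, assuming plain interval subtraction (smallest gain = enhanced low minus baseline high, largest gain = enhanced high minus baseline low), the bounds can be recomputed from the ranges in the table. The rule and the name `improvement_bounds` are illustrative assumptions, not something the record states.

```python
# Ranges copied from the record's benchmark table, in percent:
# (baseline_low, baseline_high, enhanced_low, enhanced_high)
RANGES = {
    "SWE-bench Verified": (45, 52, 68, 75),
    "Terminal-Bench 2.0": (38, 45, 62, 70),
    "CyberGym": (42, 48, 65, 72),
    "USAMO 2026": (35, 42, 58, 68),
    "OSWorld": (48, 55, 71, 79),
}


def improvement_bounds(base_lo, base_hi, enh_lo, enh_hi):
    """Return (min_gain, max_gain) in percentage points.

    Assumed rule (not stated in the record): the smallest gain pairs the
    best baseline score with the worst enhanced score, and the largest
    gain pairs the worst baseline score with the best enhanced score.
    """
    return enh_lo - base_hi, enh_hi - base_lo


if __name__ == "__main__":
    for name, bounds in RANGES.items():
        lo, hi = improvement_bounds(*bounds)
        print(f"{name}: +{lo}-{hi} pts")
```

Under this assumption the lower bounds agree with every row of the table, and the upper bounds agree for Terminal-Bench 2.0 and OSWorld. The upper bounds for the other three rows come out differently from the listed figures, so the record may have used another rule for those.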