Bibliographic Details
Main Author: Brown, Cameron
Format: Digital resource
Language:
Published: Zenodo 2026
Subjects:
Online Access: https://doi.org/10.5281/zenodo.20043741
_version_ 1866901284619026432
author Brown, Cameron
author_facet Brown, Cameron
contents <p class="MdHeading2">Benchmark Performance Comparison Table</p>
<table class="MsoNormalTable">
<tbody>
<tr> <td><p class="MdTableHeader">Benchmark</p></td> <td><p class="MdTableHeader">Category</p></td> <td><p class="MdTableHeader">Baseline Opus</p></td> <td><p class="MdTableHeader">PACAD-Enhanced</p></td> <td><p class="MdTableHeader">Improvement</p></td> <td><p class="MdTableHeader">Key Factor</p></td> </tr>
</tbody>
<tbody>
<tr> <td><p class="MdTableCell"><span class="MdStrong">SWE-bench Verified</span></p></td> <td><p class="MdTableCell">Software Engineering</p></td> <td><p class="MdTableCell">45-52%</p></td> <td><p class="MdTableCell">68-75%</p></td> <td><p class="MdTableCell">+16-28 pts</p></td> <td><p class="MdTableCell">Structural code reasoning</p></td> </tr>
<tr> <td><p class="MdTableCell"><span class="MdStrong">Terminal-Bench 2.0</span></p></td> <td><p class="MdTableCell">Autonomous Coding</p></td> <td><p class="MdTableCell">38-45%</p></td> <td><p class="MdTableCell">62-70%</p></td> <td><p class="MdTableCell">+17-32 pts</p></td> <td><p class="MdTableCell">Deterministic command logic</p></td> </tr>
<tr> <td><p class="MdTableCell"><span class="MdStrong">CyberGym</span></p></td> <td><p class="MdTableCell">Cybersecurity</p></td> <td><p class="MdTableCell">42-48%</p></td> <td><p class="MdTableCell">65-72%</p></td> <td><p class="MdTableCell">+17-27 pts</p></td> <td><p class="MdTableCell">Vulnerability pattern recognition</p></td> </tr>
<tr> <td><p class="MdTableCell"><span class="MdStrong">USAMO 2026</span></p></td> <td><p class="MdTableCell">Mathematics</p></td> <td><p class="MdTableCell">35-42%</p></td> <td><p class="MdTableCell">58-68%</p></td> <td><p class="MdTableCell">+16-31 pts</p></td> <td><p class="MdTableCell">Formal proof verification</p></td> </tr>
<tr> <td><p class="MdTableCell"><span class="MdStrong">OSWorld</span></p></td> <td><p class="MdTableCell">Computer Use</p></td> <td><p class="MdTableCell">48-55%</p></td> <td><p class="MdTableCell">71-79%</p></td> <td><p class="MdTableCell">+16-31 pts</p></td> <td><p class="MdTableCell">Multi-step task decomposition</p></td> </tr>
</tbody>
</table>
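The Improvement column in the table above is a range derived from two other ranges. As a rough sketch of that arithmetic, the Python below computes the smallest and largest point gain that the Baseline Opus and PACAD-Enhanced ranges allow. The interval-difference reading (smallest gain = enhanced low minus baseline high, largest gain = enhanced high minus baseline low) is an assumption, not a method stated in the record, and `improvement_bounds` is a hypothetical helper name. Under this reading the bounds match the published column for Terminal-Bench 2.0 and OSWorld, while the published upper bounds for SWE-bench Verified, CyberGym, and USAMO 2026 are narrower, which may reflect a different calculation by the author.

```python
# Ranges copied from the comparison table:
# (baseline_low, baseline_high, enhanced_low, enhanced_high), in percent.
RANGES = {
    "SWE-bench Verified": (45, 52, 68, 75),
    "Terminal-Bench 2.0": (38, 45, 62, 70),
    "CyberGym": (42, 48, 65, 72),
    "USAMO 2026": (35, 42, 58, 68),
    "OSWorld": (48, 55, 71, 79),
}


def improvement_bounds(base_lo, base_hi, enh_lo, enh_hi):
    """Return (smallest, largest) point gain consistent with both ranges.

    Assumption: gains are plain interval differences; the record does not
    say how its Improvement column was computed.
    """
    return enh_lo - base_hi, enh_hi - base_lo


if __name__ == "__main__":
    for name, bounds in RANGES.items():
        low, high = improvement_bounds(*bounds)
        print(f"{name}: +{low} to +{high} pts")
```

Running the script prints one line per benchmark, which can be compared against the Improvement column by eye.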
format Digital resource
id zenodo_https___doi_org_10_5281_zenodo_20043741
institution Zenodo
language
publishDate 2026
publisher Zenodo
record_format zenodo
spellingShingle PACAD-Enhanced Opus: Specialized Benchmark Performance Estimates – 2-3 Week Cycle (14-21 Days, Median ~17 Days)
Brown, Cameron
Artificial Intelligence
Artificial intelligence
Artificial Intelligence
Artificial Intelligence/standards
Artificial Intelligence/trends
Artificial Intelligence/classification
Artificial Intelligence/statistics & numerical data
title PACAD-Enhanced Opus: Specialized Benchmark Performance Estimates – 2-3 Week Cycle (14-21 Days, Median ~17 Days)
topic Artificial Intelligence
Artificial intelligence
Artificial Intelligence
Artificial Intelligence/standards
Artificial Intelligence/trends
Artificial Intelligence/classification
Artificial Intelligence/statistics & numerical data
url https://doi.org/10.5281/zenodo.20043741