| Main Authors: | Sun, Simeng, Hsieh, Cheng-Ping, Ladhak, Faisal, Arakelyan, Erik, Serano, Santiago Akle, Ginsburg, Boris |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2503.22832 |
Similar Items
SWAN-GPT: An Efficient and Scalable Approach for Long-Context Language Modeling
by: Puvvada, Krishna C., et al.
Published: (2025)
RULER: What's the Real Context Size of Your Long-Context Language Models?
by: Hsieh, Cheng-Ping, et al.
Published: (2024)
How much do contextualized representations encode long-range context?
by: Sun, Simeng, et al.
Published: (2024)
Reasoning Inconsistencies and How to Mitigate Them in Deep Learning
by: Arakelyan, Erik
Published: (2025)
SynDARin: Synthesising Datasets for Automated Reasoning in Low-Resource Languages
by: Ghazaryan, Gayane, et al.
Published: (2024)
Incorporating Human Explanations for Robust Hate Speech Detection
by: Chen, Jennifer L., et al.
Published: (2024)
Aligning Large Language Models via Fine-grained Supervision
by: Xu, Dehong, et al.
Published: (2024)
From SWE-ZERO to SWE-HERO: Execution-free to Execution-based Fine-tuning for Software Engineering Agents
by: Ludwig, Nikolai, et al.
Published: (2026)
STORYSUMM: Evaluating Faithfulness in Story Summarization
by: Subbiah, Melanie, et al.
Published: (2024)
Extending Automatic Machine Translation Evaluation to Book-Length Documents
by: Wang, Kuang-Da, et al.
Published: (2025)
Scoring Verifiers: Evaluating Synthetic Verification for Code and Reasoning
by: Ficek, Aleksander, et al.
Published: (2025)
Semantic Sensitivities and Inconsistent Predictions: Measuring the Fragility of NLI Models
by: Arakelyan, Erik, et al.
Published: (2024)
FLARE: Faithful Logic-Aided Reasoning and Exploration
by: Arakelyan, Erik, et al.
Published: (2024)
OpenCodeReasoning-II: A Simple Test Time Scaling Approach via Self-Critique
by: Ahmad, Wasi Uddin, et al.
Published: (2025)
ISO-Bench: Benchmarking Multimodal Causal Reasoning in Visual-Language Models through Procedural Plans
by: Sadana, Ananya, et al.
Published: (2025)
HYBRIDMIND: Meta Selection of Natural Language and Symbolic Language for Enhanced LLM Reasoning
by: Han, Simeng, et al.
Published: (2024)
From Output to Evaluation: Does Raw Instruction-Tuned Code LLMs Output Suffice for Fill-in-the-Middle Code Generation?
by: Ahmad, Wasi Uddin, et al.
Published: (2025)
Procedural Dilemma Generation for Evaluating Moral Reasoning in Humans and Language Models
by: Fränken, Jan-Philipp, et al.
Published: (2024)
Chain-of-Procedure: Hierarchical Visual-Language Reasoning for Procedural QA
by: Chen, Guanhua, et al.
Published: (2026)
Symbolic Execution for Quantum Error Correction Programs
by: Fang, Wang, et al.
Published: (2023)
SiMing-Bench: Evaluating Procedural Correctness from Continuous Interactions in Clinical Skill Videos
by: Huang, Xiyang, et al.
Published: (2026)
Evaluating Legal Reasoning Traces with Legal Issue Tree Rubrics
by: Lee, Jinu, et al.
Published: (2025)
nGPT: Normalized Transformer with Representation Learning on the Hypersphere
by: Loshchilov, Ilya, et al.
Published: (2024)
Evaluating Human-Language Model Interaction
by: Lee, Mina, et al.
Published: (2022)
ExecRepoBench: Multi-level Executable Code Completion Evaluation
by: Yang, Jian, et al.
Published: (2024)
An empirical study on the limitation of Transformers in program trace generation
by: Sun, Simeng
Published: (2025)
RAD-Bench: Evaluating Large Language Models Capabilities in Retrieval Augmented Dialogues
by: Kuo, Tzu-Lin, et al.
Published: (2024)
CounterBench: Evaluating and Improving Counterfactual Reasoning in Large Language Models
by: Chen, Yuefei, et al.
Published: (2025)
ProcBench: Benchmark for Multi-Step Reasoning and Following Procedure
by: Fujisawa, Ippei, et al.
Published: (2024)
Scheherazade: Evaluating Chain-of-Thought Math Reasoning in LLMs with Chain-of-Problems
by: Miner, Stephen, et al.
Published: (2024)
DP-Bench: A Benchmark for Evaluating Data Product Creation Systems
by: Chowdhury, Faisal, et al.
Published: (2025)
Many-Turn Jailbreaking
by: Yang, Xianjun, et al.
Published: (2025)
The Path Not Taken: Duality in Reasoning about Program Execution
by: Hasanov, Eshgin, et al.
Published: (2026)
QuantCode-Bench: A Benchmark for Evaluating the Ability of Large Language Models to Generate Executable Algorithmic Trading Strategies
by: Khoroshilov, Alexey, et al.
Published: (2026)
Learning to Reason via Mixture-of-Thought for Logical Reasoning
by: Zheng, Tong, et al.
Published: (2025)
MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes
by: Chiu, Yu Ying, et al.
Published: (2025)
EquiBench: Benchmarking Large Language Models' Reasoning about Program Semantics via Equivalence Checking
by: Wei, Anjiang, et al.
Published: (2025)
EffiReason-Bench: A Unified Benchmark for Evaluating and Advancing Efficient Reasoning in Large Language Models
by: Huang, Junquan, et al.
Published: (2025)
TTT-Bench: A Benchmark for Evaluating Reasoning Ability with Simple and Novel Tic-Tac-Toe-style Games
by: Mishra, Prakamya, et al.
Published: (2025)
When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models
by: Panda, Sailesh, et al.
Published: (2026)