| Main Authors: | Kohli, Harsh, Kumar, Sachin, Sun, Huan |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2404.04237 |
Similar Items
Loop, Think, & Generalize: Implicit Reasoning in Recurrent-Depth Transformers
by: Kohli, Harsh, et al.
Published: (2026)
CE-Bench: Towards a Reliable Contrastive Evaluation Benchmark of Interpretability of Sparse Autoencoders
by: Gulko, Alex, et al.
Published: (2025)
BaziQA-Benchmark: Evaluating Symbolic and Temporally Compositional Reasoning in Large Language Models
by: Chen, Jiangxi, et al.
Published: (2026)
Continually Adding New Languages to Multilingual Language Models
by: Owodunni, Abraham Toluwase, et al.
Published: (2025)
Meta-Tool: Efficient Few-Shot Tool Adaptation for Small Language Models
by: Kumar, Sachin
Published: (2026)
CocoaBench: Evaluating Unified Digital Agents in the Wild
by: CocoaBench Team, et al.
Published: (2026)
Reasoning Up the Instruction Ladder for Controllable Language Models
by: Zheng, Zishuo, et al.
Published: (2025)
ScholarEval: Research Idea Evaluation Grounded in Literature
by: Moussa, Hanane Nour, et al.
Published: (2025)
DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models
by: Zou, Chengke, et al.
Published: (2024)
BASS: Benchmarking Audio LMs for Musical Structure and Semantic Reasoning
by: Jang, Min, et al.
Published: (2026)
Compositional Generalization with Grounded Language Models
by: Wold, Sondre, et al.
Published: (2024)
Benchmarking Open-Source Safety Guard Models: A Comprehensive Evaluation
by: Harsh, Reetu Raj, et al.
Published: (2026)
CEI: A Benchmark for Evaluating Pragmatic Reasoning in Language Models
by: Chun, Jon, et al.
Published: (2026)
FLEXITOKENS: Flexible Tokenization for Evolving Language Models
by: Owodunni, Abraham Toluwase, et al.
Published: (2025)
Mathfish: Evaluating Language Model Math Reasoning via Grounding in Educational Curricula
by: Lucy, Li, et al.
Published: (2024)
Overriding Safety protections of Open-source Models
by: Kumar, Sachin
Published: (2024)
A Comparative Empirical Study of Catastrophic Forgetting Mitigation in Sequential Task Adaptation for Continual Natural Language Processing Systems
by: Abrahamyan, Aram, et al.
Published: (2026)
Compositional Causal Reasoning Evaluation in Language Models
by: Maasch, Jacqueline R. M. A., et al.
Published: (2025)
SocialMaze: A Benchmark for Evaluating Social Reasoning in Large Language Models
by: Xu, Zixiang, et al.
Published: (2025)
Steering Large Language Models between Code Execution and Textual Reasoning
by: Chen, Yongchao, et al.
Published: (2024)
TESS 2: A Large-Scale Generalist Diffusion Language Model
by: Tae, Jaesung, et al.
Published: (2025)
EffiReason-Bench: A Unified Benchmark for Evaluating and Advancing Efficient Reasoning in Large Language Models
by: Huang, Junquan, et al.
Published: (2025)
VLegal-Bench: Cognitively Grounded Benchmark for Vietnamese Legal Reasoning of Large Language Models
by: Dong, Nguyen Tien, et al.
Published: (2025)
CEB: Compositional Evaluation Benchmark for Fairness in Large Language Models
by: Wang, Song, et al.
Published: (2024)
Metric-Dependent Annotation Saturation for Learning from Label Distributions
by: Kohli, Guneet
Published: (2026)
Role-Conditioned Refusals: Evaluating Access Control Reasoning in Large Language Models
by: Klisura, Đorđe, et al.
Published: (2025)
How Lexical is Bilingual Lexicon Induction?
by: Kohli, Harsh, et al.
Published: (2024)
MentalBench: A DSM-Grounded Benchmark for Evaluating Psychiatric Diagnostic Capability of Large Language Models
by: Song, Hoyun, et al.
Published: (2026)
VCB Bench: An Evaluation Benchmark for Audio-Grounded Large Language Model Conversational Agents
by: Hu, Jiliang, et al.
Published: (2025)
Paloma: A Benchmark for Evaluating Language Model Fit
by: Magnusson, Ian, et al.
Published: (2023)
TangramPuzzle: Evaluating Multimodal Large Language Models with Compositional Spatial Reasoning
by: Liu, Daixian, et al.
Published: (2026)
RUPBench: Benchmarking Reasoning Under Perturbations for Robustness Evaluation in Large Language Models
by: Wang, Yuqing, et al.
Published: (2024)
DRAGON: A Benchmark for Evidence-Grounded Visual Reasoning over Diagrams
by: Iyengar, Anirudh Iyengar Kaniyar Narayana, et al.
Published: (2026)
Pause or Fabricate? Training Language Models for Grounded Reasoning
by: Qiu, Yiwen, et al.
Published: (2026)
Benchmarking Distilled Language Models: Performance and Efficiency in Resource-Constrained Settings
by: Wani, Sachin Gopal, et al.
Published: (2026)
Evaluating Large Language Models on the Frame and Symbol Grounding Problems: A Zero-shot Benchmark
by: Oka, Shoko
Published: (2025)
SANSKRITI: A Comprehensive Benchmark for Evaluating Language Models' Knowledge of Indian Culture
by: Maji, Arijit, et al.
Published: (2025)
Evaluation of Deontic Conditional Reasoning in Large Language Models: The Case of Wason's Selection Task
by: Abe, Hirohiko, et al.
Published: (2026)
Reasoning-Grounded Natural Language Explanations for Language Models
by: Cahlik, Vojtech, et al.
Published: (2025)
ValueGround: Evaluating Culture-Conditioned Visual Value Grounding in MLLMs
by: Wang, Zhipin, et al.
Published: (2026)