:: Library Catalog

Bibliographic Details
Main Authors:	Guo, Dadi; Liu, Jiayu; Fan, Zhiyuan; He, Zhitao; Li, Haoran; Li, Yuxin; Wang, Yumeng; Fung, Yi R.
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2506.17114

Similar Items

Diversity-Enhanced Reasoning for Subjective Questions
by: Wang, Yumeng, et al.
Published: (2025)

MATP-BENCH: Can MLLM Be a Good Automated Theorem Prover for Multimodal Problems?
by: He, Zhitao, et al.
Published: (2025)

MMBoundary: Advancing MLLM Knowledge Boundary Awareness through Reasoning Step Confidence Calibration
by: He, Zhitao, et al.
Published: (2025)

MedEBench: Diagnosing Reliability in Text-Guided Medical Image Editing
by: Liu, Minghao, et al.
Published: (2025)

CultureCLIP: Empowering CLIP with Cultural Awareness through Synthetic Images and Contextualized Captions
by: Huang, Yuchen, et al.
Published: (2025)

Unveiling the Lack of LVLM Robustness to Fundamental Visual Variations: Why and Path Forward
by: Fan, Zhiyuan, et al.
Published: (2025)

ClinTutor-R1: Advancing Scalable and Robust One-to-Many Alignment in Clinical Socratic Education
by: He, Zhitao, et al.
Published: (2025)

RebuttalAgent: Strategic Persuasion in Academic Rebuttal via Theory of Mind
by: He, Zhitao, et al.
Published: (2026)

SELF-REDRAFT: Eliciting Intrinsic Exploration-Exploitation Balance in Test-Time Scaling for Code Generation
by: Chen, Yixiang, et al.
Published: (2025)

Lean4Physics: Comprehensive Reasoning Framework for College-level Physics in Lean4
by: Li, Yuxin, et al.
Published: (2025)

Let's Reason Formally: Natural-Formal Hybrid Reasoning Enhances LLM's Math Capability
by: Wang, Ruida, et al.
Published: (2025)

LitmusKt: Concurrency Stress Testing for Kotlin
by: Lochmelis, Denis, et al.
Published: (2025)

Towards A Litmus Test for Common Sense
by: Latapie, Hugo
Published: (2025)

CALM: Unleashing the Cross-Lingual Self-Aligning Ability of Language Model Question Answering
by: Wang, Yumeng, et al.
Published: (2025)

Towards Self-Evolving Benchmarks: Synthesizing Agent Trajectories via Test-Time Exploration under Validate-by-Reproduce Paradigm
by: Guo, Dadi, et al.
Published: (2025)

LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes in LLM Mathematical Reasoning
by: Agarwal, Shradha, et al.
Published: (2026)

Large Language Models and Mathematical Reasoning Failures
by: Boye, Johan, et al.
Published: (2025)

MAC-Tuning: LLM Multi-Compositional Problem Reasoning with Enhanced Knowledge Boundary Awareness
by: Huang, Junsheng, et al.
Published: (2025)

Reasoning Path Divergence: A New Metric and Curation Strategy to Unlock LLM Diverse Thinking
by: Ju, Feng, et al.
Published: (2025)

EconEvals: Benchmarks and Litmus Tests for Economic Decision-Making by LLM Agents
by: Fish, Sara, et al.
Published: (2025)

Advancing Language Multi-Agent Learning with Credit Re-Assignment for Interactive Environment Generalization
by: He, Zhitao, et al.
Published: (2025)

CREATOR: Tool Creation for Disentangling Abstract and Concrete Reasoning of Large Language Models
by: Qian, Cheng, et al.
Published: (2023)

Federated Domain-Specific Knowledge Transfer on Large Language Models Using Synthetic Data
by: Li, Haoran, et al.
Published: (2024)

To Learn or Not to Learn: A Litmus Test for Using Reinforcement Learning in Control
by: Schulte, Victor, et al.
Published: (2026)

On Stable Long-Form Generation: Benchmarking and Mitigating Length Volatility
by: He, Zhitao, et al.
Published: (2026)

MARS-SQL: A multi-agent reinforcement learning framework for Text-to-SQL
by: Yang, Haolin, et al.
Published: (2025)

Code2Math: Can Your Code Agent Effectively Evolve Math Problems Through Exploration?
by: Guo, Dadi, et al.
Published: (2026)

Revealing Interpretable Failure Modes of VLMs
by: Chaudhary, Isha, et al.
Published: (2026)

Litmus: Fair Pricing for Serverless Computing
by: Pei, Qi, et al.
Published: (2024)

Look Before You Leap: Problem Elaboration Prompting Improves Mathematical Reasoning in Large Language Models
by: Liao, Haoran, et al.
Published: (2024)

Enhancing Advanced Visual Reasoning Ability of Large Language Models
by: Li, Zhiyuan, et al.
Published: (2024)

THINK-Bench: Evaluating Thinking Efficiency and Chain-of-Thought Quality of Large Reasoning Models
by: Li, Zhiyuan, et al.
Published: (2025)

LIBERO-X: Robustness Litmus for Vision-Language-Action Models
by: Wang, Guodong, et al.
Published: (2026)

ARISE: An Adaptive Resolution-Aware Metric for Test-Time Scaling Evaluation in Large Reasoning Models
by: Yin, Zhangyue, et al.
Published: (2025)

Rethinking Prospect Theory for LLMs: Revealing the Instability of Decision-Making under Epistemic Uncertainty
by: Wang, Rui, et al.
Published: (2025)

Higher Affinity Enables More Accurate Detection of SARS‐CoV‐2 in Human Saliva Using Aptamer‐Based Litmus Test
by: Liu, Rudi, et al.
Published: (2024)

Failure Modes of LLMs for Causal Reasoning on Narratives
by: Yamin, Khurram, et al.
Published: (2024)

Gradient Variance Reveals Failure Modes in Flow-Based Generative Models
by: Reu, Teodora, et al.
Published: (2025)

Empowering Reliable Visual-Centric Instruction Following in MLLMs
by: He, Weilei, et al.
Published: (2026)

When Do Symbolic Solvers Enhance Reasoning in Large Language Models?
by: He, Zhiyuan, et al.
Published: (2025)