| Main Authors: | Guo, Dadi, Liu, Jiayu, Fan, Zhiyuan, He, Zhitao, Li, Haoran, Li, Yuxin, Wang, Yumeng, Fung, Yi R. |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2506.17114 |
Similar Items
Diversity-Enhanced Reasoning for Subjective Questions
by: Wang, Yumeng, et al.
Published: (2025)
MATP-BENCH: Can MLLM Be a Good Automated Theorem Prover for Multimodal Problems?
by: He, Zhitao, et al.
Published: (2025)
MMBoundary: Advancing MLLM Knowledge Boundary Awareness through Reasoning Step Confidence Calibration
by: He, Zhitao, et al.
Published: (2025)
MedEBench: Diagnosing Reliability in Text-Guided Medical Image Editing
by: Liu, Minghao, et al.
Published: (2025)
CultureCLIP: Empowering CLIP with Cultural Awareness through Synthetic Images and Contextualized Captions
by: Huang, Yuchen, et al.
Published: (2025)
Unveiling the Lack of LVLM Robustness to Fundamental Visual Variations: Why and Path Forward
by: Fan, Zhiyuan, et al.
Published: (2025)
ClinTutor-R1: Advancing Scalable and Robust One-to-Many Alignment in Clinical Socratic Education
by: He, Zhitao, et al.
Published: (2025)
RebuttalAgent: Strategic Persuasion in Academic Rebuttal via Theory of Mind
by: He, Zhitao, et al.
Published: (2026)
SELF-REDRAFT: Eliciting Intrinsic Exploration-Exploitation Balance in Test-Time Scaling for Code Generation
by: Chen, Yixiang, et al.
Published: (2025)
Lean4Physics: Comprehensive Reasoning Framework for College-level Physics in Lean4
by: Li, Yuxin, et al.
Published: (2025)
Let's Reason Formally: Natural-Formal Hybrid Reasoning Enhances LLM's Math Capability
by: Wang, Ruida, et al.
Published: (2025)
LitmusKt: Concurrency Stress Testing for Kotlin
by: Lochmelis, Denis, et al.
Published: (2025)
Towards A Litmus Test for Common Sense
by: Latapie, Hugo
Published: (2025)
CALM: Unleashing the Cross-Lingual Self-Aligning Ability of Language Model Question Answering
by: Wang, Yumeng, et al.
Published: (2025)
Towards Self-Evolving Benchmarks: Synthesizing Agent Trajectories via Test-Time Exploration under Validate-by-Reproduce Paradigm
by: Guo, Dadi, et al.
Published: (2025)
LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes in LLM Mathematical Reasoning
by: Agarwal, Shradha, et al.
Published: (2026)
Large Language Models and Mathematical Reasoning Failures
by: Boye, Johan, et al.
Published: (2025)
MAC-Tuning: LLM Multi-Compositional Problem Reasoning with Enhanced Knowledge Boundary Awareness
by: Huang, Junsheng, et al.
Published: (2025)
Reasoning Path Divergence: A New Metric and Curation Strategy to Unlock LLM Diverse Thinking
by: Ju, Feng, et al.
Published: (2025)
EconEvals: Benchmarks and Litmus Tests for Economic Decision-Making by LLM Agents
by: Fish, Sara, et al.
Published: (2025)
Advancing Language Multi-Agent Learning with Credit Re-Assignment for Interactive Environment Generalization
by: He, Zhitao, et al.
Published: (2025)
CREATOR: Tool Creation for Disentangling Abstract and Concrete Reasoning of Large Language Models
by: Qian, Cheng, et al.
Published: (2023)
Federated Domain-Specific Knowledge Transfer on Large Language Models Using Synthetic Data
by: Li, Haoran, et al.
Published: (2024)
To Learn or Not to Learn: A Litmus Test for Using Reinforcement Learning in Control
by: Schulte, Victor, et al.
Published: (2026)
On Stable Long-Form Generation: Benchmarking and Mitigating Length Volatility
by: He, Zhitao, et al.
Published: (2026)
MARS-SQL: A multi-agent reinforcement learning framework for Text-to-SQL
by: Yang, Haolin, et al.
Published: (2025)
Code2Math: Can Your Code Agent Effectively Evolve Math Problems Through Exploration?
by: Guo, Dadi, et al.
Published: (2026)
Revealing Interpretable Failure Modes of VLMs
by: Chaudhary, Isha, et al.
Published: (2026)
Litmus: Fair Pricing for Serverless Computing
by: Pei, Qi, et al.
Published: (2024)
Look Before You Leap: Problem Elaboration Prompting Improves Mathematical Reasoning in Large Language Models
by: Liao, Haoran, et al.
Published: (2024)
Enhancing Advanced Visual Reasoning Ability of Large Language Models
by: Li, Zhiyuan, et al.
Published: (2024)
THINK-Bench: Evaluating Thinking Efficiency and Chain-of-Thought Quality of Large Reasoning Models
by: Li, Zhiyuan, et al.
Published: (2025)
LIBERO-X: Robustness Litmus for Vision-Language-Action Models
by: Wang, Guodong, et al.
Published: (2026)
ARISE: An Adaptive Resolution-Aware Metric for Test-Time Scaling Evaluation in Large Reasoning Models
by: Yin, Zhangyue, et al.
Published: (2025)
Rethinking Prospect Theory for LLMs: Revealing the Instability of Decision-Making under Epistemic Uncertainty
by: Wang, Rui, et al.
Published: (2025)
Higher Affinity Enables More Accurate Detection of SARS‐CoV‐2 in Human Saliva Using Aptamer‐Based Litmus Test
by: Rudi Liu, et al.
Published: (2024)
Failure Modes of LLMs for Causal Reasoning on Narratives
by: Yamin, Khurram, et al.
Published: (2024)
Gradient Variance Reveals Failure Modes in Flow-Based Generative Models
by: Reu, Teodora, et al.
Published: (2025)
Empowering Reliable Visual-Centric Instruction Following in MLLMs
by: He, Weilei, et al.
Published: (2026)
When Do Symbolic Solvers Enhance Reasoning in Large Language Models?
by: He, Zhiyuan, et al.
Published: (2025)