| Main Authors: | Ma, Zhiming, Xiao, Xiayang, Dong, Sihao, Wang, Peidong, Wang, HaiPeng, Pan, Qingyun |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2502.08168 |
Similar Items
Accurate Risk Prediction Model for Surgical Site Infection After Lower Third Molar Surgery
by: HaiPeng Yu
Published: (2024)
OSAD: Open-Set Aircraft Detection in SAR Images
by: Xiao, Xiayang, et al.
Published: (2024)
Language Models as Continuous Self-Evolving Data Engineers
by: Wang, Peidong, et al.
Published: (2024)
HI-TransPA: Hearing Impairments Translation Personal Assistant
by: Ma, Zhiming, et al.
Published: (2025)
oMeBench: Towards Robust Benchmarking of LLMs in Organic Mechanism Elucidation and Reasoning
by: Xu, Ruiling, et al.
Published: (2025)
MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models
by: Wang, Shengkang, et al.
Published: (2024)
PALM-Bench: A Comprehensive Benchmark for Personalized Audio-Language Models
by: Wang, Yuwen, et al.
Published: (2026)
MTU-Bench: A Multi-granularity Tool-Use Benchmark for Large Language Models
by: Wang, Pei, et al.
Published: (2024)
TaskBench: Benchmarking Large Language Models for Task Automation
by: Shen, Yongliang, et al.
Published: (2023)
Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks
by: Jia, Mengzhao, et al.
Published: (2024)
CArtBench: Evaluating Vision-Language Models on Chinese Art Understanding, Interpretation, and Authenticity
by: Wei, Xuefeng, et al.
Published: (2026)
Beyond Prompting: Efficient and Robust Contextual Biasing for Speech LLMs via Logit-Space Integration (LOGIC)
by: Wang, Peidong
Published: (2026)
SeedBench: A Multi-task Benchmark for Evaluating Large Language Models in Seed Science
by: Ying, Jie, et al.
Published: (2025)
MVISU-Bench: Benchmarking Mobile Agents for Real-World Tasks by Multi-App, Vague, Interactive, Single-App and Unethical Instructions
by: Huang, Zeyu, et al.
Published: (2025)
PsychiatryBench: A Multi-Task Benchmark for LLMs in Psychiatry
by: Fouda, Aya E., et al.
Published: (2025)
Struct-Bench: A Benchmark for Differentially Private Structured Text Generation
by: Wang, Shuaiqi, et al.
Published: (2025)
PM4Bench: Benchmarking Large Vision-Language Models with Parallel Multilingual Multi-Modal Multi-task Corpus
by: Gao, Junyuan, et al.
Published: (2025)
VL-RouterBench: A Benchmark for Vision-Language Model Routing
by: Huang, Zhehao, et al.
Published: (2025)
EmoBench-M: Benchmarking Emotional Intelligence for Multimodal Large Language Models
by: Hu, He, et al.
Published: (2025)
CE-Bench: Towards a Reliable Contrastive Evaluation Benchmark of Interpretability of Sparse Autoencoders
by: Gulko, Alex, et al.
Published: (2025)
Vision-Language Interpreter for Robot Task Planning
by: Shirai, Keisuke, et al.
Published: (2023)
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents
by: Yang, Rui, et al.
Published: (2025)
EmbSpatial-Bench: Benchmarking Spatial Understanding for Embodied Tasks with Large Vision-Language Models
by: Du, Mengfei, et al.
Published: (2024)
CharacterBench: Benchmarking Character Customization of Large Language Models
by: Zhou, Jinfeng, et al.
Published: (2024)
Self-Correction is More than Refinement: A Learning Framework for Visual and Language Reasoning Tasks
by: He, Jiayi, et al.
Published: (2024)
MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence
by: Yang, Sihan, et al.
Published: (2025)
ISA-Bench: Benchmarking Instruction Sensitivity for Large Audio Language Models
by: Li, Bohan, et al.
Published: (2025)
BIOME-Bench: A Benchmark for Biomolecular Interaction Inference and Multi-Omics Pathway Mechanism Elucidation from Scientific Literature
by: Wei, Sibo, et al.
Published: (2025)
CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models
by: Zhang, Alexander, et al.
Published: (2025)
FB-Bench: A Fine-Grained Multi-Task Benchmark for Evaluating LLMs' Responsiveness to Human Feedback
by: Li, Youquan, et al.
Published: (2024)
MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly
by: Wang, Zhaowei, et al.
Published: (2025)
TOD-ProcBench: Benchmarking Complex Instruction-Following in Task-Oriented Dialogues
by: Ghazarian, Sarik, et al.
Published: (2025)
FlowBench: Revisiting and Benchmarking Workflow-Guided Planning for LLM-based Agents
by: Xiao, Ruixuan, et al.
Published: (2024)
MTR-Bench: A Comprehensive Benchmark for Multi-Turn Reasoning Evaluation
by: Li, Xiaoyuan, et al.
Published: (2025)
Electromagnetic Scattering Kernel Guided Reciprocal Point Learning for SAR Open-Set Recognition
by: Xiao, Xiayang, et al.
Published: (2024)
AlignBench: Benchmarking Chinese Alignment of Large Language Models
by: Liu, Xiao, et al.
Published: (2023)
SAFE-QAQ: End-to-End Slow-Thinking Audio-Text Fraud Detection via Reinforcement Learning
by: Wang, Peidong, et al.
Published: (2026)
Conceptual and Unbiased Reasoning in Language Models
by: Zhou, Ben, et al.
Published: (2024)
BattleAgentBench: A Benchmark for Evaluating Cooperation and Competition Capabilities of Language Models in Multi-Agent Systems
by: Wang, Wei, et al.
Published: (2024)
MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations
by: Ma, Yubo, et al.
Published: (2024)