| Main Authors: | Pu, Sophia Xiao, Cheng, Sitao, Wang, Xin Eric, Wang, William Yang |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2510.19005 |
Similar Items
MOSSBench: Is Your Multimodal Language Model Oversensitive to Safe Queries?
by: Li, Xirui, et al.
Published: (2024)
Understanding the Interplay between Parametric and Contextual Knowledge for Large Language Models
by: Cheng, Sitao, et al.
Published: (2024)
RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios
by: Zhou, Ruiwen, et al.
Published: (2024)
Disentangling Memory and Reasoning Ability in Large Language Models
by: Jin, Mingyu, et al.
Published: (2024)
Exploring Safety Alignment Evaluation of LLMs in Chinese Mental Health Dialogues via LLM-as-Judge
by: Cai, Yunna, et al.
Published: (2025)
THOUGHTTERMINATOR: Benchmarking, Calibrating, and Mitigating Overthinking in Reasoning Models
by: Pu, Xiao, et al.
Published: (2025)
Atomic Skills are the Prerequisite: When Reinforcement Learning Synthesizes Compositional Reasoning, and When It Only Amplifies
by: Cheng, Sitao, et al.
Published: (2025)
From Large to Super-Tiny: End-to-End Optimization for Cost-Efficient LLMs
by: Ni, Jiliang, et al.
Published: (2025)
How Far Are LLMs from Believable AI? A Benchmark for Evaluating the Believability of Human Behavior Simulation
by: Xiao, Yang, et al.
Published: (2023)
LEDOM: Reverse Language Model
by: Yin, Xunjian, et al.
Published: (2025)
MMAFFBen: A Multilingual and Multimodal Affective Analysis Benchmark for Evaluating LLMs and VLMs
by: Liu, Zhiwei, et al.
Published: (2025)
Break the Checkbox: Challenging Closed-Style Evaluations of Cultural Alignment in LLMs
by: Kabir, Mohsinul, et al.
Published: (2025)
Can LLMs Solve longer Math Word Problems Better?
by: Xu, Xin, et al.
Published: (2024)
TARGA: Targeted Synthetic Data Generation for Practical Reasoning over Structured Data
by: Huang, Xiang, et al.
Published: (2024)
Designing and Evaluating Dialogue LLMs for Co-Creative Improvised Theatre
by: Branch, Boyd, et al.
Published: (2024)
On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey
by: Long, Lin, et al.
Published: (2024)
WorldMemArena: Evaluating Multimodal Agent Memory Through Action-World Interaction
by: Liu, Chengzhi, et al.
Published: (2026)
Compass-v3: Scaling Domain-Specific LLMs for Multilingual E-Commerce in Southeast Asia
by: Maria, Sophia
Published: (2025)
Towards Dynamic Theory of Mind: Evaluating LLM Adaptation to Temporal Evolution of Human States
by: Xiao, Yang, et al.
Published: (2025)
LLMs vs. Chinese Anime Enthusiasts: A Comparative Study on Emotionally Supportive Role-Playing
by: Qiu, Lanlan, et al.
Published: (2025)
Differentiable Evolutionary Reinforcement Learning
by: Cheng, Sitao, et al.
Published: (2025)
DITING: A Multi-Agent Evaluation Framework for Benchmarking Web Novel Translation
by: Zhang, Enze, et al.
Published: (2025)
MCEval: A Dynamic Framework for Fair Multilingual Cultural Evaluation of LLMs
by: Huang, Shulin, et al.
Published: (2025)
PFID: Privacy First Inference Delegation Framework for LLMs
by: Yang, Haoyan, et al.
Published: (2024)
Understanding and Mitigating Gender Bias in LLMs via Interpretable Neuron Editing
by: Yu, Zeping, et al.
Published: (2025)
Understanding Multimodal LLMs: the Mechanistic Interpretability of Llava in Visual Question Answering
by: Yu, Zeping, et al.
Published: (2024)
Bridging the Knowledge-Action Gap by Evaluating LLMs in Dynamic Dental Clinical Scenarios
by: Ma, Hongyang, et al.
Published: (2026)
Call Me When Necessary: LLMs can Efficiently and Faithfully Reason over Structured Environments
by: Cheng, Sitao, et al.
Published: (2024)
CS3-Bench: Evaluating and Enhancing Speech-to-Speech LLMs for Mandarin-English Code-Switching
by: Liu, Heyang, et al.
Published: (2025)
Semantics-Adaptive Activation Intervention for LLMs via Dynamic Steering Vectors
by: Wang, Weixuan, et al.
Published: (2024)
Understanding the Effects of Domain Finetuning on LLMs
by: Tanwar, Eshaan, et al.
Published: (2025)
Locate-then-Merge: Neuron-Level Parameter Fusion for Mitigating Catastrophic Forgetting in Multimodal LLMs
by: Yu, Zeping, et al.
Published: (2025)
Dynamic Depth Decoding: Faster Speculative Decoding for LLMs
by: Brown, Oscar, et al.
Published: (2024)
How Good are LLMs at Relation Extraction under Low-Resource Scenario? Comprehensive Evaluation
by: Jinensibieke, Dawulie, et al.
Published: (2024)
MolViBench: Evaluating LLMs on Molecular Vibe Coding
by: Li, Jiatong, et al.
Published: (2026)
Evaluating Role-Consistency in LLMs for Counselor Training
by: Rudolph, Eric, et al.
Published: (2026)
XCR-Bench: A Multi-Task Benchmark for Evaluating Cultural Reasoning in LLMs
by: Kabir, Mohsinul, et al.
Published: (2026)
EvoWiki: Evaluating LLMs on Evolving Knowledge
by: Tang, Wei, et al.
Published: (2024)
LLM-based NLG Evaluation: Current Status and Challenges
by: Gao, Mingqi, et al.
Published: (2024)
Unveiling the Competitive Dynamics: A Comparative Evaluation of American and Chinese LLMs
by: Jiang, Zhenhui, et al.
Published: (2024)