| Main Authors: | Ma, Shijian; Lin, Yan; Yang, Yi |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2601.09142 |
Similar Items
Adversarial Evasion Attack Efficiency against Large Language Models
by: Vitorino, João, et al.
Published: (2024)
DIESEL -- Dynamic Inference-Guidance via Evasion of Semantic Embeddings in LLMs
by: Ganon, Ben, et al.
Published: (2024)
OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models
by: Yan, Yuping, et al.
Published: (2025)
DetoxBench: Benchmarking Large Language Models for Multitask Fraud & Abuse Detection
by: Chakraborty, Joymallya, et al.
Published: (2024)
SG-UniBuc-NLP at SemEval-2026 Task 6: Multi-Head RoBERTa with Chunking for Long-Context Evasion Detection
by: Stefan, Gabriel, et al.
Published: (2026)
Can AI Freelancers Compete? Benchmarking Earnings, Reliability, and Task Success at Scale
by: Noever, David, et al.
Published: (2025)
Characterizing, Detecting, and Predicting Online Ban Evasion
by: Niverthi, Manoj, et al.
Published: (2022)
MiMIC: Multi-Modal Indian Earnings Calls Dataset to Predict Stock Prices
by: Ghosh, Sohom, et al.
Published: (2025)
Struct-Bench: A Benchmark for Differentially Private Structured Text Generation
by: Wang, Shuaiqi, et al.
Published: (2025)
Instruction-Guided Bullet Point Summarization of Long Financial Earnings Call Transcripts
by: Khatuya, Subhendu, et al.
Published: (2024)
DebateBench: A Challenging Long Context Reasoning Benchmark For Large Language Models
by: Tiwari, Utkarsh, et al.
Published: (2025)
VeriSoftBench: Repository-Scale Formal Verification Benchmarks for Lean
by: Xin, Yutong, et al.
Published: (2026)
CriticBench: Benchmarking LLMs for Critique-Correct Reasoning
by: Lin, Zicheng, et al.
Published: (2024)
SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models
by: Li, Lijun, et al.
Published: (2024)
TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators
by: Li, Jianling, et al.
Published: (2025)
LingVarBench: Benchmarking LLMs on Entity Recognitions and Linguistic Verbalization Patterns in Phone-Call Transcripts
by: Mohammadi, Seyedali, et al.
Published: (2025)
AssertBench: A Benchmark for Evaluating Self-Assertion in Large Language Models
by: Lee, Jaeho, et al.
Published: (2025)
DramaBench: A Six-Dimensional Evaluation Framework for Drama Script Continuation
by: Ma, Shijian, et al.
Published: (2025)
Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies
by: Tang, Zirui, et al.
Published: (2026)
AlignBench: Benchmarking Chinese Alignment of Large Language Models
by: Liu, Xiao, et al.
Published: (2023)
VL-RouterBench: A Benchmark for Vision-Language Model Routing
by: Huang, Zhehao, et al.
Published: (2025)
ALHD: A Large-Scale and Multigenre Benchmark Dataset for Arabic LLM-Generated Text Detection
by: Khairallah, Ali, et al.
Published: (2025)
CliBench: A Multifaceted and Multigranular Evaluation of Large Language Models for Clinical Decision Making
by: Ma, Mingyu Derek, et al.
Published: (2024)
T$^3$Bench: Benchmarking Current Progress in Text-to-3D Generation
by: He, Yuze, et al.
Published: (2023)
CS-Bench: A Comprehensive Benchmark for Large Language Models towards Computer Science Mastery
by: Song, Xiaoshuai, et al.
Published: (2024)
QuantumBench: A Benchmark for Quantum Problem Solving
by: Minami, Shunya, et al.
Published: (2025)
WirelessMathBench: A Mathematical Modeling Benchmark for LLMs in Wireless Communications
by: Li, Xin, et al.
Published: (2025)
ControBench: An Interaction-Aware Benchmark for Controversial Discourse Analysis on Social Networks
by: Thuy, Ta Thanh, et al.
Published: (2026)
AntiLeakBench: Preventing Data Contamination by Automatically Constructing Benchmarks with Updated Real-World Knowledge
by: Wu, Xiaobao, et al.
Published: (2024)
UserSumBench: A Benchmark Framework for Evaluating User Summarization Approaches
by: Wang, Chao, et al.
Published: (2024)
NLP-ADBench: NLP Anomaly Detection Benchmark
by: Li, Yuangang, et al.
Published: (2024)
AMNESIA: A Large Scale Medical Unlearning Benchmark Suite with Disease-Informed Analysis
by: Davoudi, Saeedeh, et al.
Published: (2026)
Bench to the Future: A Pastcasting Benchmark for Forecasting Agents
by: FutureSearch, et al.
Published: (2025)
Two Calls, Two Moments, and the Vote-Accuracy Curve of Repeated LLM Inference
by: Liu, Yi
Published: (2026)
Fast and the Furious: Hot Starts in Pursuit-Evasion Games
by: Smithline, Gabriel, et al.
Published: (2025)
LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs
by: Zhou, Yujun, et al.
Published: (2024)
FusionBench: A Unified Library and Comprehensive Benchmark for Deep Model Fusion
by: Tang, Anke, et al.
Published: (2024)
Dental-TriageBench: Benchmarking Multimodal Reasoning for Hierarchical Dental Triage
by: He, Ziyi, et al.
Published: (2026)
Q-Sparse: All Large Language Models can be Fully Sparsely-Activated
by: Wang, Hongyu, et al.
Published: (2024)
TopBench: A Benchmark for Implicit Prediction and Reasoning over Tabular Question Answering
by: Ji, An-Yang, et al.
Published: (2026)