| Main Authors: | Jia, Qi, Yue, Xiang, Huang, Shanshan, Qin, Ziheng, Liu, Yizhu, Lin, Bill Yuchen, You, Yang, Zhai, Guangtao |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2410.01733 |
Similar Items
SimulBench: Evaluating Language Models with Creative Simulation Tasks
by: Jia, Qi, et al.
Published: (2024)
ArtPerception: ASCII Art-based Jailbreak on LLMs with Recognition Pre-test
by: Yang, Guan-Yan, et al.
Published: (2025)
ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs
by: Jiang, Fengqing, et al.
Published: (2024)
Information Density Principle for MLLM Benchmarks
by: Li, Chunyi, et al.
Published: (2025)
Testing the Depth of ChatGPT's Comprehension via Cross-Modal Tasks Based on ASCII-Art: GPT3.5's Abilities in Regard to Recognizing and Generating ASCII-Art Are Not Totally Lacking
by: Bayani, David
Published: (2023)
Boosting LLM via Learning from Data Iteratively and Selectively
by: Jia, Qi, et al.
Published: (2024)
VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?
by: Liu, Junpeng, et al.
Published: (2024)
Trial and Error: Exploration-Based Trajectory Optimization for LLM Agents
by: Song, Yifan, et al.
Published: (2024)
TIT-Score: Evaluating Long-Prompt Based Text-to-Image Alignment via Text-to-Image-to-Text Consistency
by: Wang, Juntong, et al.
Published: (2025)
Movie101v2: Improved Movie Narration Benchmark
by: Yue, Zihao, et al.
Published: (2024)
One Battle After Another: Probing LLMs' Limits on Multi-Turn Instruction Following with a Benchmark Evolving Framework
by: Jia, Qi, et al.
Published: (2025)
TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks
by: Jiang, Dongfu, et al.
Published: (2023)
MEMO-Bench: A Multiple Benchmark for Text-to-Image and Multimodal Large Language Models on Human Emotion Analysis
by: Zhou, Yingjie, et al.
Published: (2024)
EvolMem: A Cognitive-Driven Benchmark for Multi-Session Dialogue Memory
by: Shen, Ye, et al.
Published: (2026)
Meta-Chunking: Learning Text Segmentation and Semantic Completion via Logical Perception
by: Zhao, Jihao, et al.
Published: (2024)
LitVISTA: A Benchmark for Narrative Orchestration in Literary Text
by: Lu, Mingzhe, et al.
Published: (2026)
Sycophancy under Pressure: Evaluating and Mitigating Sycophantic Bias via Adversarial Dialogues in Scientific QA
by: Zhang, Kaiwei, et al.
Published: (2025)
LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition
by: Huang, Chengsong, et al.
Published: (2023)
Evading Toxicity Detection with ASCII-art: A Benchmark of Spatial Attacks on Moderation Systems
by: Berezin, Sergey, et al.
Published: (2024)
SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
by: Xu, Zhangchen, et al.
Published: (2024)
SafetyFlow: An Agent-Flow System for Automated LLM Safety Benchmarking
by: Zhu, Xiangyang, et al.
Published: (2025)
QoNext: Towards Next-generation QoE for Foundation Models
by: Guo, Yijin, et al.
Published: (2025)
User-centric Subjective Leaderboard by Customizable Reward Modeling
by: Jia, Qi, et al.
Published: (2025)
OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation
by: Wang, Zilong, et al.
Published: (2024)
Q-Mirror: Unlocking the Multi-Modal Potential of Scientific Text-Only QA Pairs
by: Wang, Junying, et al.
Published: (2025)
CULTURE-GEN: Revealing Global Cultural Perception in Language Models through Natural Language Prompting
by: Li, Huihan, et al.
Published: (2024)
A Multi-To-One Interview Paradigm for Efficient MLLM Evaluation
by: Shen, Ye, et al.
Published: (2025)
Evaluating from Benign to Dynamic Adversarial: A Squid Game for Large Language Models
by: Chen, Zijian, et al.
Published: (2025)
Statistical Analysis of Sentence Structures through ASCII, Lexical Alignment and PCA
by: Sahdev, Abhijeet
Published: (2025)
Stateful Evidence-Driven Retrieval-Augmented Generation with Iterative Reasoning
by: Dong, Qi, et al.
Published: (2026)
Teaching LMMs for Image Quality Scoring and Interpreting
by: Zhang, Zicheng, et al.
Published: (2025)
OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement
by: Zheng, Tianyu, et al.
Published: (2024)
The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism
by: Song, Yifan, et al.
Published: (2024)
Are AI-Generated Text Detectors Robust to Adversarial Perturbations?
by: Huang, Guanhua, et al.
Published: (2024)
Affordance Benchmark for MLLMs
by: Wang, Junying, et al.
Published: (2025)
Redundancy Principles for MLLMs Benchmarks
by: Zhang, Zicheng, et al.
Published: (2025)
Knowledge Fusion via Bidirectional Information Aggregation
by: Zhai, Songlin, et al.
Published: (2025)
MiniLongBench: The Low-cost Long Context Understanding Benchmark for Large Language Models
by: Huang, Zhongzhan, et al.
Published: (2025)
LOVE: Benchmarking and Evaluating Text-to-Video Generation and Video-to-Text Interpretation
by: Wang, Jiarui, et al.
Published: (2025)
AssoCiAm: A Benchmark for Evaluating Association Thinking while Circumventing Ambiguity
by: Liu, Yifan, et al.
Published: (2025)