:: Library Catalog

Bibliographic Details
Main Authors:	Jia, Qi; Yue, Xiang; Huang, Shanshan; Qin, Ziheng; Liu, Yizhu; Lin, Bill Yuchen; You, Yang; Zhai, Guangtao
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2410.01733

Similar Items

SimulBench: Evaluating Language Models with Creative Simulation Tasks
by: Jia, Qi, et al.
Published: (2024)

ArtPerception: ASCII Art-based Jailbreak on LLMs with Recognition Pre-test
by: Yang, Guan-Yan, et al.
Published: (2025)

ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs
by: Jiang, Fengqing, et al.
Published: (2024)

Information Density Principle for MLLM Benchmarks
by: Li, Chunyi, et al.
Published: (2025)

Testing the Depth of ChatGPT's Comprehension via Cross-Modal Tasks Based on ASCII-Art: GPT3.5's Abilities in Regard to Recognizing and Generating ASCII-Art Are Not Totally Lacking
by: Bayani, David
Published: (2023)

Boosting LLM via Learning from Data Iteratively and Selectively
by: Jia, Qi, et al.
Published: (2024)

VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?
by: Liu, Junpeng, et al.
Published: (2024)

Trial and Error: Exploration-Based Trajectory Optimization for LLM Agents
by: Song, Yifan, et al.
Published: (2024)

TIT-Score: Evaluating Long-Prompt Based Text-to-Image Alignment via Text-to-Image-to-Text Consistency
by: Wang, Juntong, et al.
Published: (2025)

Movie101v2: Improved Movie Narration Benchmark
by: Yue, Zihao, et al.
Published: (2024)

One Battle After Another: Probing LLMs' Limits on Multi-Turn Instruction Following with a Benchmark Evolving Framework
by: Jia, Qi, et al.
Published: (2025)

TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks
by: Jiang, Dongfu, et al.
Published: (2023)

MEMO-Bench: A Multiple Benchmark for Text-to-Image and Multimodal Large Language Models on Human Emotion Analysis
by: Zhou, Yingjie, et al.
Published: (2024)

EvolMem: A Cognitive-Driven Benchmark for Multi-Session Dialogue Memory
by: Shen, Ye, et al.
Published: (2026)

Meta-Chunking: Learning Text Segmentation and Semantic Completion via Logical Perception
by: Zhao, Jihao, et al.
Published: (2024)

LitVISTA: A Benchmark for Narrative Orchestration in Literary Text
by: Lu, Mingzhe, et al.
Published: (2026)

Sycophancy under Pressure: Evaluating and Mitigating Sycophantic Bias via Adversarial Dialogues in Scientific QA
by: Zhang, Kaiwei, et al.
Published: (2025)

LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition
by: Huang, Chengsong, et al.
Published: (2023)

Evading Toxicity Detection with ASCII-art: A Benchmark of Spatial Attacks on Moderation Systems
by: Berezin, Sergey, et al.
Published: (2024)

SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
by: Xu, Zhangchen, et al.
Published: (2024)

SafetyFlow: An Agent-Flow System for Automated LLM Safety Benchmarking
by: Zhu, Xiangyang, et al.
Published: (2025)

QoNext: Towards Next-generation QoE for Foundation Models
by: Guo, Yijin, et al.
Published: (2025)

User-centric Subjective Leaderboard by Customizable Reward Modeling
by: Jia, Qi, et al.
Published: (2025)

OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation
by: Wang, Zilong, et al.
Published: (2024)

Q-Mirror: Unlocking the Multi-Modal Potential of Scientific Text-Only QA Pairs
by: Wang, Junying, et al.
Published: (2025)

CULTURE-GEN: Revealing Global Cultural Perception in Language Models through Natural Language Prompting
by: Li, Huihan, et al.
Published: (2024)

A Multi-To-One Interview Paradigm for Efficient MLLM Evaluation
by: Shen, Ye, et al.
Published: (2025)

Evaluating from Benign to Dynamic Adversarial: A Squid Game for Large Language Models
by: Chen, Zijian, et al.
Published: (2025)

Statistical Analysis of Sentence Structures through ASCII, Lexical Alignment and PCA
by: Sahdev, Abhijeet
Published: (2025)

Stateful Evidence-Driven Retrieval-Augmented Generation with Iterative Reasoning
by: Dong, Qi, et al.
Published: (2026)

Teaching LMMs for Image Quality Scoring and Interpreting
by: Zhang, Zicheng, et al.
Published: (2025)

OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement
by: Zheng, Tianyu, et al.
Published: (2024)

The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism
by: Song, Yifan, et al.
Published: (2024)

Are AI-Generated Text Detectors Robust to Adversarial Perturbations?
by: Huang, Guanhua, et al.
Published: (2024)

Affordance Benchmark for MLLMs
by: Wang, Junying, et al.
Published: (2025)

Redundancy Principles for MLLMs Benchmarks
by: Zhang, Zicheng, et al.
Published: (2025)

Knowledge Fusion via Bidirectional Information Aggregation
by: Zhai, Songlin, et al.
Published: (2025)

MiniLongBench: The Low-cost Long Context Understanding Benchmark for Large Language Models
by: Huang, Zhongzhan, et al.
Published: (2025)

LOVE: Benchmarking and Evaluating Text-to-Video Generation and Video-to-Text Interpretation
by: Wang, Jiarui, et al.
Published: (2025)

AssoCiAm: A Benchmark for Evaluating Association Thinking while Circumventing Ambiguity
by: Liu, Yifan, et al.
Published: (2025)