| Main Authors: | Saxon, Michael, Jahara, Fatima, Khoshnoodi, Mahsa, Lu, Yujie, Sharma, Aditya, Wang, William Yang |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2404.04251 |
Similar Items
Evaluating Implicit Biases in LLM Reasoning through Logic Grid Puzzles
by: Jahara, Fatima, et al.
Published: (2025)
Losing Visual Needles in Image Haystacks: Vision Language Models are Easily Distracted in Short and Long Contexts
by: Sharma, Aditya, et al.
Published: (2024)
Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models Aligned with Human Cognitive Principles
by: Budagam, Devichand, et al.
Published: (2024)
TIT-Score: Evaluating Long-Prompt Based Text-to-Image Alignment via Text-to-Image-to-Text Consistency
by: Wang, Juntong, et al.
Published: (2025)
CAIRe: Cultural Attribution of Images by Retrieval-Augmented Evaluation
by: Yayavaram, Arnav, et al.
Published: (2025)
VideoScore2: Think before You Score in Generative Video Evaluation
by: He, Xuan, et al.
Published: (2025)
CrossScore: Towards Multi-View Image Evaluation and Scoring
by: Wang, Zirui, et al.
Published: (2024)
A Comprehensive Survey of Accelerated Generation Techniques in Large Language Models
by: Khoshnoodi, Mahsa, et al.
Published: (2024)
TypeScore: A Text Fidelity Metric for Text-to-Image Generative Models
by: Sampaio, Georgia Gabriela, et al.
Published: (2024)
Evaluating the Evaluators: Metrics for Compositional Text-to-Image Generation
by: Kasaei, Seyed Amir, et al.
Published: (2025)
Alignment Scores: Robust Metrics for Multiview Pose Accuracy Evaluation
by: Lee, Seong Hun, et al.
Published: (2024)
Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings
by: Wiles, Olivia, et al.
Published: (2024)
DiffScore: Text Evaluation Beyond Autoregressive Likelihood
by: Lai, Wen, et al.
Published: (2026)
Good Scores, Bad Data: A Metric for Multimodal Coherence
by: Srinivasan, Vasundra
Published: (2026)
RadReason: Radiology Report Evaluation Metric with Reasons and Sub-Scores
by: Li, Yingshu, et al.
Published: (2025)
FedDriveScore: Federated Scoring Driving Behavior with a Mixture of Metric Distributions
by: Lu, Lin
Published: (2024)
In-Situ Behavioral Evaluation for LLM Fairness, Not Standardized-Test Scores
by: Tang, Zeyu, et al.
Published: (2026)
MSD-Score: Multi-Scale Distributional Scoring for Reference-Free Image Caption Evaluation
by: Kan, Shichao, et al.
Published: (2026)
Lost in Translation? Translation Errors and Challenges for Fair Assessment of Text-to-Image Models on Multilingual Concepts
by: Saxon, Michael, et al.
Published: (2024)
ContrastScore: Towards Higher Quality, Less Biased, More Efficient Evaluation Metrics with Contrastive Evaluation
by: Wang, Xiao, et al.
Published: (2025)
TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation
by: Feng, Weixi, et al.
Published: (2024)
Towards Evaluating Robustness of Prompt Adherence in Text to Image Models
by: Vemishetty, Sujith, et al.
Published: (2025)
Prompt Stability Scoring for Text Annotation with Large Language Models
by: Barrie, Christopher, et al.
Published: (2024)
SPECS: Specificity-Enhanced CLIP-Score for Long Image Caption Evaluation
by: Chen, Xiaofu, et al.
Published: (2025)
World Consistency Score: A Unified Metric for Video Generation Quality
by: Rakheja, Akshat, et al.
Published: (2025)
Image Score: Learning and Evaluating Human Preferences for Mercari Search
by: Oinar, Chingis, et al.
Published: (2024)
Evaluation Metrics for Text Data Augmentation in NLP
by: Amadeus, Marcellus, et al.
Published: (2024)
Image Interpolation with Score-based Riemannian Metrics of Diffusion Models
by: Saito, Shinnosuke, et al.
Published: (2025)
A Quantitative Evaluation of Score Distillation Sampling Based Text-to-3D
by: Fei, Xiaohan, et al.
Published: (2024)
OpenFActScore: Open-Source Atomic Evaluation of Factuality in Text Generation
by: Lage, Lucas Fonseca, et al.
Published: (2025)
LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores
by: Liu, Yiqi, et al.
Published: (2023)
Scendi Score: Prompt-Aware Diversity Evaluation via Schur Complement of CLIP Embeddings
by: Ospanov, Azim, et al.
Published: (2024)
Mean Opinion Score as a New Metric for User-Evaluation of XAI Methods
by: Yu, Hyeon, et al.
Published: (2024)
Concept-Guided Chain-of-Thought Prompting for Pairwise Comparison Scoring of Texts with Large Language Models
by: Wu, Patrick Y., et al.
Published: (2023)
Evaluating Scoring Bias in LLM-as-a-Judge
by: Li, Qingquan, et al.
Published: (2025)
Ran Score: a LLM-based Evaluation Score for Radiology Report Generation
by: Zhang, Ran, et al.
Published: (2026)
Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense?
by: Fu, Xingyu, et al.
Published: (2024)
PEEM: Prompt Engineering Evaluation Metrics for Interpretable Joint Evaluation of Prompts and Responses
by: Hong, Minki, et al.
Published: (2026)
TIAM -- A Metric for Evaluating Alignment in Text-to-Image Generation
by: Grimal, Paul, et al.
Published: (2023)
Culture is Everywhere: A Call for Intentionally Cultural Evaluation
by: Oh, Juhyun, et al.
Published: (2025)