:: Library Catalog

Bibliographic Details
Main Authors:	Khatore, Manas; Sridharan, Sumana; Sulahian, Kevork; Smith, Benjamin J.; Feng, Shi
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2601.08849

Similar Items

CFMatch: Aligning Automated Answer Equivalence Evaluation with Expert Judgments For Open-Domain Question Answering
by: Li, Zongxia, et al.
Published: (2024)

Automated Answer Validation using Text Similarity
by: Ganesan, Balaji, et al.
Published: (2024)

How Long Reasoning Chains Influence LLMs' Judgment of Answer Factuality
by: Tu, Minzhu, et al.
Published: (2026)

Human Bias in the Face of AI: Examining Human Judgment Against Text Labeled as AI Generated
by: Zhu, Tiffany, et al.
Published: (2024)

The Veln(ia)s is in the Details: Evaluating LLM Judgment on Latvian and Lithuanian Short Answer Matching
by: Kostiuk, Yevhen, et al.
Published: (2025)

LexMatcher: Dictionary-centric Data Collection for LLM-based Machine Translation
by: Yin, Yongjing, et al.
Published: (2024)

Automated Long Answer Grading with RiceChem Dataset
by: Sonkar, Shashank, et al.
Published: (2024)

Full Triple Matcher: Integrating all triple elements between heterogeneous Knowledge Graphs
by: Yamamoto, Victor Eiti, et al.
Published: (2025)

SoftMatcha: A Soft and Fast Pattern Matcher for Billion-Scale Corpus Searches
by: Deguchi, Hiroyuki, et al.
Published: (2025)

JSTR: Judgment Improves Scene Text Recognition
by: Fujitake, Masato
Published: (2024)

BioACE: An Automated Framework for Biomedical Answer and Citation Evaluations
by: Gupta, Deepak, et al.
Published: (2026)

Reasons to Reject? Aligning Language Models with Judgments
by: Xu, Weiwen, et al.
Published: (2023)

"My Answer is C": First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models
by: Wang, Xinpeng, et al.
Published: (2024)

REPA: Russian Error Types Annotation for Evaluating Text Generation and Judgment Capabilities
by: Pugachev, Alexander, et al.
Published: (2025)

LLMs Provide Unstable Answers to Legal Questions
by: Blair-Stanek, Andrew, et al.
Published: (2025)

Is this Idea Novel? An Automated Benchmark for Judgment of Research Ideas
by: Schopf, Tim, et al.
Published: (2026)

SoftMatcha 2: A Fast and Soft Pattern Matcher for Trillion-Scale Corpora
by: Yoneda, Masataka, et al.
Published: (2026)

Focusing on Students, not Machines: Grounded Question Generation and Automated Answer Grading
by: Meyer, Gérôme, et al.
Published: (2025)

AERA Chat: An Interactive Platform for Automated Explainable Student Answer Assessment
by: Li, Jiazheng, et al.
Published: (2024)

Quality-Conditioned Agreement in Automated Short Answer Scoring: Mid-Range Degradation and the Impact of Task-Specific Adaptation
by: Schleifer, Abigail Victoria Gurin, et al.
Published: (2026)

Reverse Question Answering: Can an LLM Write a Question so Hard (or Bad) that it Can't Answer?
by: Balepur, Nishant, et al.
Published: (2024)

Depression Detection at the Point of Care: Automated Analysis of Linguistic Signals from Routine Primary Care Encounters
by: Chen, Feng, et al.
Published: (2026)

JUSTICE: Judicial Unified Synthesis Through Intermediate Conclusion Emulation for Automated Judgment Document Generation
by: Wu, Binglin, et al.
Published: (2026)

A Hybrid Approach to Information Retrieval and Answer Generation for Regulatory Texts
by: Rayo, Jhon, et al.
Published: (2025)

Deontological Keyword Bias: The Impact of Modal Expressions on Normative Judgments of Language Models
by: Park, Bumjin, et al.
Published: (2025)

What Goes Into a LM Acceptability Judgment? Rethinking the Impact of Frequency and Length
by: Tjuatja, Lindia, et al.
Published: (2024)

Monte Carlo Planning with Large Language Model for Text-Based Game Agents
by: Shi, Zijing, et al.
Published: (2025)

From Fact to Judgment: Investigating the Impact of Task Framing on LLM Conviction in Dialogue Systems
by: Rabbani, Parisa, et al.
Published: (2025)

Human-LLM Hybrid Text Answer Aggregation for Crowd Annotations
by: Li, Jiyi
Published: (2024)

Enhancing Security and Strengthening Defenses in Automated Short-Answer Grading Systems
by: Yarmohammadtoosky, Sahar, et al.
Published: (2025)

Real Images, Worse Judgments: Evaluating Vision-Language Models on Concreteness and Imagery
by: Jiang, Yifan, et al.
Published: (2026)

Is GPT-4 Alone Sufficient for Automated Essay Scoring?: A Comparative Judgment Approach Based on Rater Cognition
by: Kim, Seungju, et al.
Published: (2024)

Persona Dynamics: Unveiling the Impact of Personality Traits on Agents in Text-Based Games
by: Lim, Seungwon, et al.
Published: (2025)

Employing Label Models on ChatGPT Answers Improves Legal Text Entailment Performance
by: Nguyen, Chau, et al.
Published: (2024)

Answer is All You Need: Instruction-following Text Embedding via Answering the Question
by: Peng, Letian, et al.
Published: (2024)

Enhancing Answer Attribution for Faithful Text Generation with Large Language Models
by: Vladika, Juraj, et al.
Published: (2024)

Grade Guard: A Smart System for Short Answer Automated Grading
by: Dadu, Niharika, et al.
Published: (2025)

Improving LLM-as-a-Judge Inference with the Judgment Distribution
by: Wang, Victor, et al.
Published: (2025)

From Flat to Structural: Enhancing Automated Short Answer Grading with GraphRAG
by: Chu, Yucheng, et al.
Published: (2026)

When Answers Stray from Questions: Hallucination Detection via Question-Answer Orthogonal Decomposition
by: Yao, Siyang, et al.
Published: (2026)