| Main Authors: | Khatore, Manas, Sridharan, Sumana, Sulahian, Kevork, Smith, Benjamin J., Feng, Shi |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2601.08849 |
Similar Items
CFMatch: Aligning Automated Answer Equivalence Evaluation with Expert Judgments For Open-Domain Question Answering
by: Li, Zongxia, et al.
Published: (2024)
Automated Answer Validation using Text Similarity
by: Ganesan, Balaji, et al.
Published: (2024)
How Long Reasoning Chains Influence LLMs' Judgment of Answer Factuality
by: Tu, Minzhu, et al.
Published: (2026)
Human Bias in the Face of AI: Examining Human Judgment Against Text Labeled as AI Generated
by: Zhu, Tiffany, et al.
Published: (2024)
The Veln(ia)s is in the Details: Evaluating LLM Judgment on Latvian and Lithuanian Short Answer Matching
by: Kostiuk, Yevhen, et al.
Published: (2025)
LexMatcher: Dictionary-centric Data Collection for LLM-based Machine Translation
by: Yin, Yongjing, et al.
Published: (2024)
Automated Long Answer Grading with RiceChem Dataset
by: Sonkar, Shashank, et al.
Published: (2024)
Full Triple Matcher: Integrating all triple elements between heterogeneous Knowledge Graphs
by: Yamamoto, Victor Eiti, et al.
Published: (2025)
SoftMatcha: A Soft and Fast Pattern Matcher for Billion-Scale Corpus Searches
by: Deguchi, Hiroyuki, et al.
Published: (2025)
JSTR: Judgment Improves Scene Text Recognition
by: Fujitake, Masato
Published: (2024)
BioACE: An Automated Framework for Biomedical Answer and Citation Evaluations
by: Gupta, Deepak, et al.
Published: (2026)
Reasons to Reject? Aligning Language Models with Judgments
by: Xu, Weiwen, et al.
Published: (2023)
"My Answer is C": First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models
by: Wang, Xinpeng, et al.
Published: (2024)
REPA: Russian Error Types Annotation for Evaluating Text Generation and Judgment Capabilities
by: Pugachev, Alexander, et al.
Published: (2025)
LLMs Provide Unstable Answers to Legal Questions
by: Blair-Stanek, Andrew, et al.
Published: (2025)
Is this Idea Novel? An Automated Benchmark for Judgment of Research Ideas
by: Schopf, Tim, et al.
Published: (2026)
SoftMatcha 2: A Fast and Soft Pattern Matcher for Trillion-Scale Corpora
by: Yoneda, Masataka, et al.
Published: (2026)
Focusing on Students, not Machines: Grounded Question Generation and Automated Answer Grading
by: Meyer, Gérôme, et al.
Published: (2025)
AERA Chat: An Interactive Platform for Automated Explainable Student Answer Assessment
by: Li, Jiazheng, et al.
Published: (2024)
Quality-Conditioned Agreement in Automated Short Answer Scoring: Mid-Range Degradation and the Impact of Task-Specific Adaptation
by: Schleifer, Abigail Victoria Gurin, et al.
Published: (2026)
Reverse Question Answering: Can an LLM Write a Question so Hard (or Bad) that it Can't Answer?
by: Balepur, Nishant, et al.
Published: (2024)
Depression Detection at the Point of Care: Automated Analysis of Linguistic Signals from Routine Primary Care Encounters
by: Chen, Feng, et al.
Published: (2026)
JUSTICE: Judicial Unified Synthesis Through Intermediate Conclusion Emulation for Automated Judgment Document Generation
by: Wu, Binglin, et al.
Published: (2026)
A Hybrid Approach to Information Retrieval and Answer Generation for Regulatory Texts
by: Rayo, Jhon, et al.
Published: (2025)
Deontological Keyword Bias: The Impact of Modal Expressions on Normative Judgments of Language Models
by: Park, Bumjin, et al.
Published: (2025)
What Goes Into a LM Acceptability Judgment? Rethinking the Impact of Frequency and Length
by: Tjuatja, Lindia, et al.
Published: (2024)
Monte Carlo Planning with Large Language Model for Text-Based Game Agents
by: Shi, Zijing, et al.
Published: (2025)
From Fact to Judgment: Investigating the Impact of Task Framing on LLM Conviction in Dialogue Systems
by: Rabbani, Parisa, et al.
Published: (2025)
Human-LLM Hybrid Text Answer Aggregation for Crowd Annotations
by: Li, Jiyi
Published: (2024)
Enhancing Security and Strengthening Defenses in Automated Short-Answer Grading Systems
by: Yarmohammadtoosky, Sahar, et al.
Published: (2025)
Real Images, Worse Judgments: Evaluating Vision-Language Models on Concreteness and Imagery
by: Jiang, Yifan, et al.
Published: (2026)
Is GPT-4 Alone Sufficient for Automated Essay Scoring?: A Comparative Judgment Approach Based on Rater Cognition
by: Kim, Seungju, et al.
Published: (2024)
Persona Dynamics: Unveiling the Impact of Personality Traits on Agents in Text-Based Games
by: Lim, Seungwon, et al.
Published: (2025)
Employing Label Models on ChatGPT Answers Improves Legal Text Entailment Performance
by: Nguyen, Chau, et al.
Published: (2024)
Answer is All You Need: Instruction-following Text Embedding via Answering the Question
by: Peng, Letian, et al.
Published: (2024)
Enhancing Answer Attribution for Faithful Text Generation with Large Language Models
by: Vladika, Juraj, et al.
Published: (2024)
Grade Guard: A Smart System for Short Answer Automated Grading
by: Dadu, Niharika, et al.
Published: (2025)
Improving LLM-as-a-Judge Inference with the Judgment Distribution
by: Wang, Victor, et al.
Published: (2025)
From Flat to Structural: Enhancing Automated Short Answer Grading with GraphRAG
by: Chu, Yucheng, et al.
Published: (2026)
When Answers Stray from Questions: Hallucination Detection via Question-Answer Orthogonal Decomposition
by: Yao, Siyang, et al.
Published: (2026)