:: Library Catalog

Bibliographic Details
Main Authors:	Bai, Fan; Harrigian, Keith; Stremmel, Joel; Hassanzadeh, Hamid; Saeedi, Ardavan; Dredze, Mark
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2412.04573

Similar Items

LLMs are Better Than You Think: Label-Guided In-Context Learning for Named Entity Recognition
by: Bai, Fan, et al.
Published: (2025)

Are Clinical T5 Models Better for Clinical Text?
by: Li, Yahan, et al.
Published: (2024)

Generative Active Testing: Efficient LLM Evaluation via Proxy Task Adaptation
by: Ramakrishnan, Aashish Anantha, et al.
Published: (2026)

Task Matters: Knowledge Requirements Shape LLM Responses to Context-Memory Conflict
by: Sun, Kaiser, et al.
Published: (2025)

Consistency Training by Synthetic Question Generation for Conversational Question Answering
by: Hemati, Hamed Hematian, et al.
Published: (2024)

Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions
by: Chen, Hanjie, et al.
Published: (2024)

RAG LLMs are Not Safer: A Safety Analysis of Retrieval-Augmented Generation for Large Language Models
by: An, Bang, et al.
Published: (2025)

Amuro and Char: Analyzing the Relationship between Pre-Training and Fine-Tuning of Large Language Models
by: Sun, Kaiser, et al.
Published: (2024)

DnDScore: Decontextualization and Decomposition for Factuality Verification in Long-Form Text Generation
by: Wanner, Miriam, et al.
Published: (2024)

Schema-Driven Information Extraction from Heterogeneous Tables
by: Bai, Fan, et al.
Published: (2023)

Evaluating Biases in Context-Dependent Health Questions
by: Levy, Sharon, et al.
Published: (2024)

Syn-QA2: Evaluating False Assumptions in Long-tail Questions with Synthetic QA Datasets
by: Daswani, Ashwin, et al.
Published: (2024)

Can one size fit all?: Measuring Failure in Multi-Document Summarization Domain Transfer
by: DeLucia, Alexandra, et al.
Published: (2025)

Evaluating the Evaluators: Are readability metrics good measures of readability?
by: Cachola, Isabel, et al.
Published: (2025)

Evaluating Implicit Biases in LLM Reasoning through Logic Grid Puzzles
by: Jahara, Fatima, et al.
Published: (2025)

Reading, Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal LLMs
by: Sun, Kaiser, et al.
Published: (2026)

Generalizing Visual Question Answering from Synthetic to Human-Written Questions via a Chain of QA with a Large Language Model
by: Kim, Taehee, et al.
Published: (2024)

LLMs in Biomedicine: A study on clinical Named Entity Recognition
by: Monajatipoor, Masoud, et al.
Published: (2024)

From Policy to Logic for Efficient and Interpretable Coverage Assessment
by: Pokharel, Rhitabrat, et al.
Published: (2026)

ExpertQA: Expert-Curated Questions and Attributed Answers
by: Malaviya, Chaitanya, et al.
Published: (2023)

Towards Better Question Generation in QA-based Event Extraction
by: Hong, Zijin, et al.
Published: (2024)

Weird Generalization is Weirdly Brittle
by: Wanner, Miriam, et al.
Published: (2026)

NeoQA: Evidence-based Question Answering with Generated News Events
by: Glockner, Max, et al.
Published: (2025)

Prompting-based Synthetic Data Generation for Few-Shot Question Answering
by: Schmidt, Maximilian, et al.
Published: (2024)

SciFaultyQA: Benchmarking LLMs on Faulty Science Question Detection with a GAN-Inspired Approach to Synthetic Dataset Generation
by: Kundu, Debarshi
Published: (2024)

ResearchQA: Evaluating Scholarly Question Answering at Scale Across 75 Fields with Survey-Mined Questions and Rubrics
by: Yifei, Li S., et al.
Published: (2025)

PolQA: Polish Question Answering Dataset
by: Rybak, Piotr, et al.
Published: (2022)

Building Open-Retrieval Conversational Question Answering Systems by Generating Synthetic Data and Decontextualizing User Questions
by: Vlachos, Christos, et al.
Published: (2025)

Synthetic Context Generation for Question Generation
by: Liu, Naiming, et al.
Published: (2024)

Making FETCH! Happen: Finding Emergent Dog Whistles Through Common Habitats
by: Sasse, Kuleen, et al.
Published: (2024)

MedScore: Generalizable Factuality Evaluation of Free-Form Medical Answers by Domain-adapted Claim Decomposition and Verification
by: Huang, Heyuan, et al.
Published: (2025)

JDocQA: Japanese Document Question Answering Dataset for Generative Language Models
by: Onami, Eri, et al.
Published: (2024)

On the Failure of Latent State Persistence in Large Language Models
by: Huang, Jen-tse, et al.
Published: (2025)

Assessing The Potential Of Mid-Sized Language Models For Clinical QA
by: Bolton, Elliot, et al.
Published: (2024)

pdfQA: Diverse, Challenging, and Realistic Question Answering over PDFs
by: Schimanski, Tobias, et al.
Published: (2026)

DebateQA: Evaluating Question Answering on Debatable Knowledge
by: Xu, Rongwu, et al.
Published: (2024)

A Closer Look at Claim Decomposition
by: Wanner, Miriam, et al.
Published: (2024)

Give me a hint: Can LLMs take a hint to solve math problems?
by: Agrawal, Vansh, et al.
Published: (2024)

Synthetic Multimodal Question Generation
by: Wu, Ian, et al.
Published: (2024)

Improving Clinical NLP Performance through Language Model-Generated Synthetic Clinical Data
by: Chen, Shan, et al.
Published: (2024)