:: Library Catalog

Bibliographic Details
Main Authors:	Rammouz, Veronica; Gonzalez, Aaron; Cruzportillo, Carlos; Tan, Adrian; Beebe, Nicole; Rios, Anthony
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2510.09519

Similar Items

Telling Speculative Stories to Help Humans Imagine the Harms of Healthcare AI
by: Zhao, Xingmeng, et al.
Published: (2025)

Can We Predict Performance of Large Models across Vision-Language Tasks?
by: Zhao, Qinyu, et al.
Published: (2024)

Can We Trust Machine Learning? The Reliability of Features from Open-Source Speech Analysis Tools for Speech Modeling
by: Chowdhury, Tahiya, et al.
Published: (2025)

Can LLMs Rank the Harmfulness of Smaller LLMs? We are Not There Yet
by: Atil, Berk, et al.
Published: (2025)

Fact or Fiction? Can LLMs be Reliable Annotators for Political Truths?
by: Chatrath, Veronica, et al.
Published: (2024)

Crossing Domains without Labels: Distant Supervision for Term Extraction
by: Senger, Elena, et al.
Published: (2025)

Can We Afford The Perfect Prompt? Balancing Cost and Accuracy with the Economical Prompting Index
by: McDonald, Tyler, et al.
Published: (2024)

Lateral Phishing With Large Language Models: A Large Organization Comparative Study
by: Bethany, Mazal, et al.
Published: (2024)

When Can We Trust LLMs in Mental Health? Large-Scale Benchmarks for Reliable LLM Evaluation
by: Badawi, Abeer, et al.
Published: (2025)

LLM Compression: How Far Can We Go in Balancing Size and Performance?
by: Sk, Sahil, et al.
Published: (2025)

Instruction-tuned Large Language Models for Machine Translation in the Medical Domain
by: Rios, Miguel
Published: (2024)

Bottom-up Domain-specific Superintelligence: A Reliable Knowledge Graph is What We Need
by: Dedhia, Bhishma, et al.
Published: (2025)

How Much of Your Data Can Suck? Thresholds for Domain Performance and Emergent Misalignment in LLMs
by: Ouyang, Jian, et al.
Published: (2025)

Can We Trust the Performance Evaluation of Uncertainty Estimation Methods in Text Summarization?
by: He, Jianfeng, et al.
Published: (2024)

Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data?
by: Fang, Qingkai, et al.
Published: (2024)

Exploring the Performance of ML/DL Architectures on the MNIST-1D Dataset
by: Beebe, Michael, et al.
Published: (2026)

Can Language Models Represent the Past without Anachronism?
by: Underwood, Ted, et al.
Published: (2025)

LLMs as Data Annotators: How Close Are We to Human Performance
by: Haq, Muhammad Uzair Ul, et al.
Published: (2025)

Extracting Biomedical Entities from Noisy Audio Transcripts
by: Ebadi, Nima, et al.
Published: (2024)

How Much Can We Forget about Data Contamination?
by: Bordt, Sebastian, et al.
Published: (2024)

A Comprehensive Study of Gender Bias in Chemical Named Entity Recognition Models
by: Zhao, Xingmeng, et al.
Published: (2022)

Improving Expert Radiology Report Summarization by Prompting Large Language Models with a Layperson Summary
by: Zhao, Xingmeng, et al.
Published: (2024)

Can We Trust LLM Detectors?
by: Sandhan, Jivnesh, et al.
Published: (2026)

Does It Run and Is That Enough? Revisiting Text-to-Chart Generation with a Multi-Agent Approach
by: Ford, James, et al.
Published: (2025)

Unmasking Database Vulnerabilities: Zero-Knowledge Schema Inference Attacks in Text-to-SQL Systems
by: Klisura, Đorđe, et al.
Published: (2024)

Team UTSA-NLP at SemEval 2024 Task 5: Prompt Ensembling for Argument Reasoning in Civil Procedures with GPT4
by: Schumacher, Dan, et al.
Published: (2024)

Based on Data Balancing and Model Improvement for Multi-Label Sentiment Classification Performance Enhancement
by: Su, Zijin, et al.
Published: (2025)

Can We Evaluate Domain Adaptation Models Without Target-Domain Labels?
by: Yang, Jianfei, et al.
Published: (2023)

Two Directions for Clinical Data Generation with Large Language Models: Data-to-Label and Label-to-Data
by: Li, Rumeng, et al.
Published: (2023)

Can We Infer Confidential Properties of Training Data from LLMs?
by: Huang, Pengrun, et al.
Published: (2025)

Ranking Large Language Models without Ground Truth
by: Dhurandhar, Amit, et al.
Published: (2024)

How Far Can We Extract Diverse Perspectives from Large Language Models?
by: Hayati, Shirley Anugrah, et al.
Published: (2023)

Rethinking LLM Evaluation: Can We Evaluate LLMs with 200x Less Data?
by: Wang, Shaobo, et al.
Published: (2025)

Can LLMs Separate Instructions From Data? And What Do We Even Mean By That?
by: Zverev, Egor, et al.
Published: (2024)

LabelCoRank: Revolutionizing Long Tail Multi-Label Classification with Co-Occurrence Reranking
by: Yan, Yan, et al.
Published: (2025)

Editing Arbitrary Propositions in LLMs without Subject Labels
by: Feigenbaum, Itai, et al.
Published: (2024)

Can We Locate and Prevent Stereotypes in LLMs?
by: D'Souza, Alex
Published: (2026)

Generative Pseudo-Labeling for Pre-Ranking with LLMs
by: Bi, Junyu, et al.
Published: (2026)

Leveraging Interview-Informed LLMs to Model Survey Responses: Comparative Insights from AI-Generated and Human Data
by: Zhang, Jihong, et al.
Published: (2025)

Ranking Over Scoring: Towards Reliable and Robust Automated Evaluation of LLM-Generated Medical Explanatory Arguments
by: De la Iglesia, Iker, et al.
Published: (2024)