| Main authors: | Nery, Lorenzo Alfred; Catignas, Ronald Dawson; Tiam-Lee, Thomas James |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online access: | https://arxiv.org/abs/2509.06065 |
Similar documents
Robust Bias Evaluation with FilBBQ: A Filipino Bias Benchmark for Question-Answering Language Models
by: Gamboa, Lance Calvin Lim, et al.
Published: (2026)
Batayan: A Filipino NLP benchmark for evaluating Large Language Models
by: Montalan, Jann Railey, et al.
Published: (2025)
Bias Attribution in Filipino Language Models: Extending a Bias Interpretability Metric for Application on Agglutinative Languages
by: Gamboa, Lance Calvin Lim, et al.
Published: (2025)
Filipino Benchmarks for Measuring Sexist and Homophobic Bias in Multilingual Language Models from Southeast Asia
by: Gamboa, Lance Calvin Lim, et al.
Published: (2024)
FiLLM -- A Filipino-optimized Large Language Model based on Southeast Asia Large Language Model (SEALLM)
by: Maminta, Carlos Jude G., et al.
Published: (2025)
Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models
by: He, Yancheng, et al.
Published: (2024)
Truth Knows No Language: Evaluating Truthfulness Beyond English
by: Figueras, Blanca Calvo, et al.
Published: (2025)
SwaQuAD-24: QA Benchmark Dataset in Swahili
by: Kondoro, Alfred Malengo
Published: (2024)
RephQA: Evaluating Readability of Large Language Models in Public Health Question Answering
by: Qiu, Weikang, et al.
Published: (2025)
TCMD: A Traditional Chinese Medicine QA Dataset for Evaluating Large Language Models
by: Yu, Ping, et al.
Published: (2024)
BaziQA-Benchmark: Evaluating Symbolic and Temporally Compositional Reasoning in Large Language Models
by: Chen, Jiangxi, et al.
Published: (2026)
Representational and Behavioral Stability of Truth in Large Language Models
by: Dies, Samantha, et al.
Published: (2025)
Assessing Large Language Models for Medical QA: Zero-Shot and LLM-as-a-Judge Evaluation
by: Adib, Shefayat E Shams, et al.
Published: (2026)
Toward Reliable Scientific Hypothesis Generation: Evaluating Truthfulness and Hallucination in Large Language Models
by: Xiong, Guangzhi, et al.
Published: (2025)
Unconditional Truthfulness: Learning Unconditional Uncertainty of Large Language Models
by: Vazhentsev, Artem, et al.
Published: (2024)
Investigating Symbolic Triggers of Hallucination in Gemma Models Across HaluEval and TruthfulQA
by: Lamba, Naveen, et al.
Published: (2025)
FilBench: Can LLMs Understand and Generate Filipino?
by: Miranda, Lester James V., et al.
Published: (2025)
FinanceQA: A Benchmark for Evaluating Financial Analysis Capabilities of Large Language Models
by: Mateega, Spencer, et al.
Published: (2025)
TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space
by: Zhang, Shaolei, et al.
Published: (2024)
MalAlgoQA: Pedagogical Evaluation of Counterfactual Reasoning in Large Language Models and Implications for AI in Education
by: Liu, Naiming, et al.
Published: (2024)
ChineseEcomQA: A Scalable E-commerce Concept Evaluation Benchmark for Large Language Models
by: Chen, Haibin, et al.
Published: (2025)
Characterizing Truthfulness in Large Language Model Generations with Local Intrinsic Dimension
by: Yin, Fan, et al.
Published: (2024)
DesignQA: A Multimodal Benchmark for Evaluating Large Language Models' Understanding of Engineering Documentation
by: Doris, Anna C., et al.
Published: (2024)
Evaluating Monolingual and Multilingual Large Language Models for Greek Question Answering: The DemosQA Benchmark
by: Mastrokostas, Charalampos, et al.
Published: (2026)
CodeSimpleQA: Scaling Factuality in Code Large Language Models
by: Yang, Jian, et al.
Published: (2025)
Unipa-GPT: Large Language Models for university-oriented QA in Italian
by: Siragusa, Irene, et al.
Published: (2024)
SportQA: A Benchmark for Sports Understanding in Large Language Models
by: Xia, Haotian, et al.
Published: (2024)
Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models
by: Liang, Kaiqu, et al.
Published: (2025)
Truth Forest: Toward Multi-Scale Truthfulness in Large Language Models through Intervention without Tuning
by: Chen, Zhongzhi, et al.
Published: (2023)
EconLogicQA: A Question-Answering Benchmark for Evaluating Large Language Models in Economic Sequential Reasoning
by: Quan, Yinzhu, et al.
Published: (2024)
Hallucination to Truth: A Review of Fact-Checking and Factuality Evaluation in Large Language Models
by: Rahman, Subhey Sadi, et al.
Published: (2025)
TruthPrInt: Mitigating Large Vision-Language Models Object Hallucination Via Latent Truthful-Guided Pre-Intervention
by: Duan, Jinhao, et al.
Published: (2025)
SymLoc: Symbolic Localization of Hallucination across HaluEval and TruthfulQA
by: Lamba, Naveen, et al.
Published: (2025)
KG-Rank: Enhancing Large Language Models for Medical QA with Knowledge Graphs and Ranking Techniques
by: Yang, Rui, et al.
Published: (2024)
Ranking Large Language Models without Ground Truth
by: Dhurandhar, Amit, et al.
Published: (2024)
When Truth Is Overridden: Uncovering the Internal Origins of Sycophancy in Large Language Models
by: Wang, Keyu, et al.
Published: (2025)
ARREST: Adversarial Resilient Regulation Enhancing Safety and Truth in Large Language Models
by: Dasgupta, Sharanya, et al.
Published: (2026)
QA-prompting: Improving Summarization with Large Language Models using Question-Answering
by: Sinha, Neelabh
Published: (2025)
PRIV-QA: Privacy-Preserving Question Answering for Cloud Large Language Models
by: Li, Guangwei, et al.
Published: (2025)
On The Truthfulness of 'Surprisingly Likely' Responses of Large Language Models
by: Goel, Naman
Published: (2023)