Staff View: :: Library Catalog

Bibliographic Details
Main Authors:	Khan, Humam; Nafis, Md Tabrez; Sohail, Shahab Saquib; Khalique, Aqeel; Khan, Rehan Hasan
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2605.04171

_version_	1866909014852370432
author	Khan, Humam; Nafis, Md Tabrez; Sohail, Shahab Saquib; Khalique, Aqeel; Khan, Rehan Hasan
author_facet	Khan, Humam; Nafis, Md Tabrez; Sohail, Shahab Saquib; Khalique, Aqeel; Khan, Rehan Hasan
contents	Large language models (LLMs) show extraordinary abilities, but they remain prone to hallucinations, especially when used to generate academic content. We investigated four popular LLMs (ChatGPT, Grok, Gemini, and Copilot) for hallucinations in academic writing. We designed 80 prompts across four categories: reference generation, factual explanation, abstract generation, and writing improvement. We evaluated the models using a 0-5 rubric score that checks factual accuracy, reference validity, coherence, style consistency, and academic tone. A novel weighted metric, the Hallucination Index (HI), was introduced to measure hallucination in the responses generated by the models. Some of the most widely used evaluation metrics often fail to check errors that alter sentiment in machine-translated text. We found that Grok and Copilot perform better on reference generation tasks but often struggle with abstract or stylistic prompts, with HI values of 0.67 and 0.70, respectively. Gemini and ChatGPT, by contrast, show stronger tone control but are weaker on factual tasks and carry a higher hallucination risk, with HI scores of 0.53 and 0.57, respectively. Our study found that hallucination behavior depends not solely on model architecture but also on the type of task and the prompting conditions provided. We suggest that this work opens new research directions for future researchers.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_04171
institution	arXiv
publishDate	2026
record_format	arxiv
title	Not All That Is Fluent Is Factual: Investigating Hallucinations of Large Language Models in Academic Writing
topic	Computation and Language
url	https://arxiv.org/abs/2605.04171

Similar Items