Bibliographic Details
Main Authors: Khan, Humam; Nafis, Md Tabrez; Sohail, Shahab Saquib; Khalique, Aqeel; Khan, Rehan Hasan
Format: Preprint
Published: 2026
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2605.04171
_version_ 1866909014852370432
author Khan, Humam
Nafis, Md Tabrez
Sohail, Shahab Saquib
Khalique, Aqeel
Khan, Rehan Hasan
contents Large language models (LLMs) show extraordinary abilities but remain prone to hallucination, especially when used to generate academic content. We investigated four popular LLMs (ChatGPT, Grok, Gemini, and Copilot) for hallucinations in academic writing. We designed 80 prompts across four categories: reference generation, factual explanation, abstract generation, and writing improvement. We evaluated the models using a 0-5 rubric score covering factual accuracy, reference validity, coherence, style consistency, and academic tone. We introduce a novel weighted metric, the Hallucination Index (HI), to measure hallucination in the models' responses. Some of the most widely used evaluation metrics often fail to catch errors that alter sentiment in machine-translated text. We found that Grok and Copilot perform better on reference generation tasks but often struggle with abstract or stylistic prompts, with HI values of 0.67 and 0.70, respectively. In contrast, Gemini and ChatGPT show stronger tone control but are weaker on factual writing tasks and carry a higher hallucination risk, with HI scores of 0.53 and 0.57, respectively. Our study found that hallucination behavior depends not only on model architecture but also on the type of task and the prompting conditions. We suggest that this work opens new research directions.
format Preprint
id arxiv_https___arxiv_org_abs_2605_04171
institution arXiv
publishDate 2026
record_format arxiv
title Not All That Is Fluent Is Factual: Investigating Hallucinations of Large Language Models in Academic Writing
topic Computation and Language
url https://arxiv.org/abs/2605.04171