Staff View: ReviewGuard: Enhancing Deficient Peer Review Detection via LLM-Driven Data Augmentation :: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Haoxuan; Li, Ruochi; Shrestha, Sarthak; Mamidala, Shree Harshini; Putta, Revanth; Aggarwal, Arka Krishan; Xiao, Ting; Ding, Junhua; Chen, Haihua
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2510.16549

_version_	1866917096356577280
author	Zhang, Haoxuan; Li, Ruochi; Shrestha, Sarthak; Mamidala, Shree Harshini; Putta, Revanth; Aggarwal, Arka Krishan; Xiao, Ting; Ding, Junhua; Chen, Haihua
author_facet	Zhang, Haoxuan; Li, Ruochi; Shrestha, Sarthak; Mamidala, Shree Harshini; Putta, Revanth; Aggarwal, Arka Krishan; Xiao, Ting; Ding, Junhua; Chen, Haihua
contents	Peer review serves as the gatekeeper of science, yet the surge in submissions and widespread adoption of large language models (LLMs) in scholarly evaluation present unprecedented challenges. While recent work has focused on using LLMs to improve review efficiency, unchecked deficient reviews from both human experts and AI systems threaten to systematically undermine academic integrity. To address this issue, we introduce ReviewGuard, an automated system for detecting and categorizing deficient reviews through a four-stage LLM-driven framework: data collection from ICLR and NeurIPS on OpenReview, GPT-4.1 annotation with human validation, synthetic data augmentation yielding 6,634 papers with 24,657 real and 46,438 synthetic reviews, and fine-tuning of encoder-based models and open-source LLMs. Feature analysis reveals that deficient reviews exhibit lower rating scores, higher self-reported confidence, reduced structural complexity, and more negative sentiment than sufficient reviews. AI-generated text detection shows dramatic increases in AI-authored reviews since ChatGPT's emergence. Mixed training with synthetic and real data substantially improves detection performance; for example, Qwen 3-8B achieves recall of 0.6653 and F1 of 0.7073, up from 0.5499 and 0.5606, respectively. This study presents the first LLM-driven system for detecting deficient peer reviews, providing evidence to inform AI governance in peer review. Code, prompts, and data are available at https://github.com/haoxuan-unt2024/ReviewGuard
format	Preprint
id	arxiv_https___arxiv_org_abs_2510_16549
institution	arXiv
publishDate	2025
record_format	arxiv
title	ReviewGuard: Enhancing Deficient Peer Review Detection via LLM-Driven Data Augmentation
topic	Computation and Language
url	https://arxiv.org/abs/2510.16549

Similar Items