Staff View: A Large-Scale Dataset and Citation Intent Classification in Turkish with LLMs :: Library Catalog

Bibliographic Details
Main Authors:	Karaca, Kemal Sami; Eravcı, Bahaeddin
Format:	Preprint
Published:	2025
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2509.21907

_version_	1866911245713539072
author	Karaca, Kemal Sami; Eravcı, Bahaeddin
author_facet	Karaca, Kemal Sami; Eravcı, Bahaeddin
contents	Understanding the qualitative intent of citations is essential for a comprehensive assessment of academic research, a task that poses unique challenges for agglutinative languages like Turkish. This paper introduces a systematic methodology and a foundational dataset to address this problem. We first present a new, publicly available dataset of Turkish citation intents, created with a purpose-built annotation tool. We then evaluate the performance of standard In-Context Learning (ICL) with Large Language Models (LLMs), demonstrating that its effectiveness is limited by inconsistent results caused by manually designed prompts. To address this core limitation, we introduce a programmable classification pipeline built on the DSPy framework, which systematically automates prompt optimization. For final classification, we employ a stacked generalization ensemble to aggregate outputs from multiple optimized models, ensuring stable and reliable predictions. This ensemble, with an XGBoost meta-model, achieves a state-of-the-art accuracy of 91.3%. Ultimately, this study provides the Turkish NLP community and broader academic circles with a foundational dataset and a robust classification framework, paving the way for future qualitative citation studies.
format	Preprint
id	arxiv_https___arxiv_org_abs_2509_21907
institution	arXiv
publishDate	2025
record_format	arxiv
title	A Large-Scale Dataset and Citation Intent Classification in Turkish with LLMs
topic	Computation and Language; Artificial Intelligence
url	https://arxiv.org/abs/2509.21907
