Bibliographic Details
Main Authors: Dobranić, Filip, Munda, Tina, Pejić, Oliver, Gorjanc, Vojko, Šmajdek, Uroš, Bordon, David, Lenardič, Jakob, Konovšek, Tjaša, Tekavčič, Kristina Pahor de Maiti, Bohak, Ciril, Fišer, Darja
Format: Preprint
Published: 2026
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2603.25051
contents This study presents a computational analysis of the Slovene historical newspapers Slovenec and Slovenski narod from the sPeriodika corpus, combining topic modelling, large language model (LLM)-based aspect-level sentiment analysis, entity-graph visualisation, and qualitative discourse analysis to examine how collective identities, political orientations, and national belonging were represented in public discourse at the turn of the twentieth century. Using BERTopic, we identify major thematic patterns and show both shared concerns and clear ideological differences between the two newspapers, reflecting their conservative-Catholic and liberal-progressive orientations. We further evaluate four instruction-following LLMs for targeted sentiment classification in OCR-degraded historical Slovene and select the Slovene-adapted GaMS3-12B-Instruct model as the most suitable for large-scale application, while also documenting important limitations, particularly its stronger performance on neutral sentiment than on positive or negative sentiment. Applied at dataset scale, the model reveals meaningful variation in the portrayal of collective identities, with some groups appearing predominantly in neutral descriptive contexts and others more often in evaluative or conflict-related discourse. We then build named-entity graphs to explore the relationships between collective identities and places, and analyse them with a mixed-methods approach that combines quantitative network analysis with critical discourse analysis. The investigation focuses on the emergence and development of intertwined historical political and socionomic identities. Overall, the study demonstrates the value of combining scalable computational methods with critical interpretation to support digital humanities research on noisy historical newspaper data.
format Preprint
id arxiv_https___arxiv_org_abs_2603_25051
institution arXiv
publishDate 2026
record_format arxiv
title Approaches to Analysing Historical Newspapers Using LLMs
topic Computation and Language
url https://arxiv.org/abs/2603.25051