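Staff View: Lexicalized Constituency Parsing for Middle Dutch: Low-resource Training and Cross-Domain Generalization :: Library Catalog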

Bibliographic Details
Main Authors:	Liang, Yiming; Zhao, Fang
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2601.07008

_version_	1866909987492593664
author	Liang, Yiming; Zhao, Fang
author_facet	Liang, Yiming; Zhao, Fang
contents	Recent years have seen growing interest in applying neural networks and contextualized word embeddings to the parsing of historical languages. However, most advances have focused on dependency parsing, while constituency parsing for low-resource historical languages like Middle Dutch has received little attention. In this paper, we adapt a transformer-based constituency parser to Middle Dutch, a highly heterogeneous and low-resource language, and investigate methods to improve both its in-domain and cross-domain performance. We show that joint training with higher-resource auxiliary languages increases F1 scores by up to 0.73, with the greatest gains achieved from languages that are geographically and temporally closer to Middle Dutch. We further evaluate strategies for leveraging newly annotated data from additional domains, finding that fine-tuning and data combination yield comparable improvements, and our neural parser consistently outperforms the currently used PCFG-based parser for Middle Dutch. We further explore feature-separation techniques for domain adaptation and demonstrate that a minimum threshold of approximately 200 examples per domain is needed to effectively enhance cross-domain performance.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_07008
institution	arXiv
publishDate	2026
record_format	arxiv
title	Lexicalized Constituency Parsing for Middle Dutch: Low-resource Training and Cross-Domain Generalization
topic	Computation and Language
url	https://arxiv.org/abs/2601.07008
