Bibliographic Details
Main Authors: Woloszyn, Hanna, Gagl, Benjamin
Format: Preprint
Published: 2025
Subjects:
Online Access: https://arxiv.org/abs/2508.13769
author Woloszyn, Hanna
Gagl, Benjamin
author_facet Woloszyn, Hanna
Gagl, Benjamin
contents The role of large language models (LLMs) in education is increasing, yet little attention has been paid to whether LLM-generated text resembles child language. This study evaluates how well LLMs replicate child-like language by comparing LLM-generated texts to a corpus of German children's descriptions of picture stories. We generated two LLM-based corpora using the same picture stories and two prompt types: zero-shot and few-shot prompts specifying a general age from the children's corpus. We conducted a comparative analysis across psycholinguistic text properties, including word frequency, lexical richness, sentence and word length, part-of-speech tags, and semantic similarity with word embeddings. The results show that LLM-generated texts are longer but less lexically rich, rely more on high-frequency words, and under-represent nouns. Semantic vector space analysis revealed low similarity, highlighting differences between the children's and LLM corpora at the level of corpus semantics. Few-shot prompting increased the similarity between children's and LLM texts to a minor extent but still failed to replicate lexical and semantic patterns. The findings contribute to our understanding of how LLMs approximate child language through multimodal prompting (text + image) and give insights into their use in psycholinguistic research and education, while raising important questions about the appropriateness of LLM-generated language in child-directed educational tools.
format Preprint
id arxiv_https___arxiv_org_abs_2508_13769
institution arXiv
publishDate 2025
record_format arxiv
spellingShingle Can Large Language Models (LLMs) Describe Pictures Like Children? A Comparative Corpus Study
Woloszyn, Hanna
Gagl, Benjamin
Computation and Language
title Can Large Language Models (LLMs) Describe Pictures Like Children? A Comparative Corpus Study
topic Computation and Language
url https://arxiv.org/abs/2508.13769