Saved in:
Bibliographic Details
Main Author: Golino, Hudson
Format: Preprint
Published: 2026
Subjects:
Online Access:https://arxiv.org/abs/2601.17010
Tags: Add Tag
No Tags, Be the first to tag this record!
_version_ 1866908785111465984
author Golino, Hudson
author_facet Golino, Hudson
contents Large language model (LLM) embeddings are increasingly used to estimate dimensional structure in psychological item pools prior to data collection, yet current applications treat embeddings as static, cross-sectional representations. This approach implicitly assumes uniform contribution across all embedding coordinates and overlooks the possibility that optimal structural information may be concentrated in specific regions of the embedding space. This study reframes embeddings as searchable landscapes and adapts Dynamic Exploratory Graph Analysis (DynEGA) to systematically traverse embedding coordinates, treating the dimension index as a pseudo-temporal ordering analogous to intensive longitudinal trajectories. A large-scale Monte Carlo simulation embedded items representing five dimensions of grandiose narcissism using OpenAI's text-embedding-3-small model, generating network estimations across systematically varied item pool sizes (3-40 items per dimension) and embedding depths (3-1,298 dimensions). Results reveal that Total Entropy Fit Index (TEFI) and Normalized Mutual Information (NMI) leads to competing optimization trajectories across the embedding landscape. TEFI achieves minima at deep embedding ranges (900--1,200 dimensions) where entropy-based organization is maximal but structural accuracy degrades, whereas NMI peaks at shallow depths where dimensional recovery is strongest but entropy-based fit remains suboptimal. Single-metric optimization produces structurally incoherent solutions, whereas a weighted composite criterion identifies embedding dimensions depth regions that jointly balance accuracy and organization. Optimal embedding depth scales systematically with item pool size. These findings establish embedding landscapes as non-uniform semantic spaces requiring principled optimization rather than default full-vector usage.
format Preprint
id arxiv_https___arxiv_org_abs_2601_17010
institution arXiv
publishDate 2026
record_format arxiv
spellingShingle Optimizing the Landscape of LLM Embeddings with Dynamic Exploratory Graph Analysis for Generative Psychometrics: A Monte Carlo Study
Golino, Hudson
Machine Learning
Applications
Large language model (LLM) embeddings are increasingly used to estimate dimensional structure in psychological item pools prior to data collection, yet current applications treat embeddings as static, cross-sectional representations. This approach implicitly assumes uniform contribution across all embedding coordinates and overlooks the possibility that optimal structural information may be concentrated in specific regions of the embedding space. This study reframes embeddings as searchable landscapes and adapts Dynamic Exploratory Graph Analysis (DynEGA) to systematically traverse embedding coordinates, treating the dimension index as a pseudo-temporal ordering analogous to intensive longitudinal trajectories. A large-scale Monte Carlo simulation embedded items representing five dimensions of grandiose narcissism using OpenAI's text-embedding-3-small model, generating network estimations across systematically varied item pool sizes (3-40 items per dimension) and embedding depths (3-1,298 dimensions). Results reveal that Total Entropy Fit Index (TEFI) and Normalized Mutual Information (NMI) leads to competing optimization trajectories across the embedding landscape. TEFI achieves minima at deep embedding ranges (900--1,200 dimensions) where entropy-based organization is maximal but structural accuracy degrades, whereas NMI peaks at shallow depths where dimensional recovery is strongest but entropy-based fit remains suboptimal. Single-metric optimization produces structurally incoherent solutions, whereas a weighted composite criterion identifies embedding dimensions depth regions that jointly balance accuracy and organization. Optimal embedding depth scales systematically with item pool size. These findings establish embedding landscapes as non-uniform semantic spaces requiring principled optimization rather than default full-vector usage.
title Optimizing the Landscape of LLM Embeddings with Dynamic Exploratory Graph Analysis for Generative Psychometrics: A Monte Carlo Study
topic Machine Learning
Applications
url https://arxiv.org/abs/2601.17010