Staff View: LETToT: Label-Free Evaluation of Large Language Models On Tourism Using Expert Tree-of-Thought :: Library Catalog

Bibliographic Details
Main Authors:	Qi, Ruiyan; Wen, Congding; Zhou, Weibo; Li, Jiwei; Liang, Shangsong; Li, Lingbo
Format:	Preprint
Published:	2025
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2508.11280

_version_	1866911118408024064
author	Qi, Ruiyan; Wen, Congding; Zhou, Weibo; Li, Jiwei; Liang, Shangsong; Li, Lingbo
author_facet	Qi, Ruiyan; Wen, Congding; Zhou, Weibo; Li, Jiwei; Liang, Shangsong; Li, Lingbo
contents	Evaluating large language models (LLMs) in specific domains like tourism remains challenging due to the prohibitive cost of annotated benchmarks and persistent issues like hallucinations. We propose $\textbf{L}$abel-Free $\textbf{E}$valuation of LLM on $\textbf{T}$ourism using Expert $\textbf{T}$ree-$\textbf{o}$f-$\textbf{T}$hought (LETToT), a framework that leverages expert-derived reasoning structures, rather than labeled data, to assess LLMs in tourism. First, we iteratively refine and validate hierarchical ToT components through alignment with generic quality dimensions and expert feedback. Results demonstrate the effectiveness of our systematically optimized expert ToT, with 4.99-14.15\% relative quality gains over baselines. Second, we apply LETToT's optimized expert ToT to evaluate models of varying scales (32B-671B parameters), revealing: (1) scaling laws persist in specialized domains (DeepSeek-V3 leads), yet reasoning-enhanced smaller models (e.g., DeepSeek-R1-Distill-Llama-70B) close this gap; (2) for sub-72B models, explicit reasoning architectures outperform counterparts in accuracy and conciseness ($p<0.05$). Our work establishes a scalable, label-free paradigm for domain-specific LLM evaluation, offering a robust alternative to conventional annotated benchmarks.
format	Preprint
id	arxiv_https___arxiv_org_abs_2508_11280
institution	arXiv
publishDate	2025
record_format	arxiv
title	LETToT: Label-Free Evaluation of Large Language Models On Tourism Using Expert Tree-of-Thought
topic	Computation and Language; Artificial Intelligence
url	https://arxiv.org/abs/2508.11280
