:: Library Catalog

Bibliographic Details
Main authors:	Liu, Jing; Fourtassi, Abdellah
Format:	Preprint
Published:	2024
Subjects:	Computation and Language; Artificial Intelligence
Online access:	https://arxiv.org/abs/2412.09318

Similar items

Automatic Annotation of Grammaticality in Child-Caregiver Conversations
by: Nikolaus, Mitja, et al.
Published: (2024)

EduAdapt: A Question Answer Benchmark Dataset for Evaluating Grade-Level Adaptability in LLMs
by: Naeem, Numaan, et al.
Published: (2025)

TestAgent: Automatic Benchmarking and Exploratory Interaction for Evaluating LLMs in Vertical Domains
by: Wang, Wanying, et al.
Published: (2024)

UD-English-CHILDES: A Collected Resource of Gold and Silver Universal Dependencies Trees for Child Language Interactions
by: Yang, Xiulin, et al.
Published: (2025)

Benchmarking Concept-Spilling Across Languages in LLMs
by: Badanin, Ilia, et al.
Published: (2026)

RealMem: Benchmarking LLMs in Real-World Memory-Driven Interaction
by: Bian, Haonan, et al.
Published: (2026)

NATURAL PLAN: Benchmarking LLMs on Natural Language Planning
by: Zheng, Huaixiu Steven, et al.
Published: (2024)

Beyond Specialization: Benchmarking LLMs for Transliteration of Indian Languages
by: Azam, Gulfarogh, et al.
Published: (2025)

Do Large Language Models Understand Logic or Just Mimick Context?
by: Yan, Junbing, et al.
Published: (2024)

A Language-agnostic Model of Child Language Acquisition
by: Mahon, Louis, et al.
Published: (2024)

MediQ: Question-Asking LLMs and a Benchmark for Reliable Interactive Clinical Reasoning
by: Li, Shuyue Stella, et al.
Published: (2024)

MindSearch: Mimicking Human Minds Elicits Deep AI Searcher
by: Chen, Zehui, et al.
Published: (2024)

Cards Against LLMs: Benchmarking Humor Alignment in Large Language Models
by: Fettach, Yousra, et al.
Published: (2026)

Entropy-Gated Branching for Efficient Test-Time Reasoning
by: Li, Xianzhi, et al.
Published: (2025)

Understanding the Role of LLMs in Multimodal Evaluation Benchmarks
by: Jiang, Botian, et al.
Published: (2024)

Advancing and Benchmarking Personalized Tool Invocation for LLMs
by: Huang, Xu, et al.
Published: (2025)

AQA-Bench: An Interactive Benchmark for Evaluating LLMs' Sequential Reasoning Ability
by: Yang, Siwei, et al.
Published: (2024)

DHP Benchmark: Are LLMs Good NLG Evaluators?
by: Wang, Yicheng, et al.
Published: (2024)

Flames: Benchmarking Value Alignment of LLMs in Chinese
by: Huang, Kexin, et al.
Published: (2023)

League of LLMs: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models
by: Guo, Qianhong, et al.
Published: (2025)

Benchmarking Multimodal LLMs on Recognition and Understanding over Chemical Tables
by: Zhou, Yitong, et al.
Published: (2025)

Child vs. machine language learning: Can the logical structure of human language unleash LLMs?
by: Sauerland, Uli, et al.
Published: (2025)

MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback
by: Wang, Xingyao, et al.
Published: (2023)

Beyond Benchmark: LLMs Evaluation with an Anthropomorphic and Value-oriented Roadmap
by: Wang, Jun, et al.
Published: (2025)

Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs
by: Xu, Zhao, et al.
Published: (2024)

Do LLMs Recognize Your Latent Preferences? A Benchmark for Latent Information Discovery in Personalized Interaction
by: Tsaknakis, Ioannis, et al.
Published: (2025)

Can LLMs Correct Themselves? A Benchmark of Self-Correction in LLMs
by: Tie, Guiyao, et al.
Published: (2025)

DentalBench: Benchmarking and Advancing LLMs Capability for Bilingual Dentistry Understanding
by: Zhu, Hengchuan, et al.
Published: (2025)

CangjieBench: Benchmarking LLMs on a Low-Resource General-Purpose Programming Language
by: Cheng, Junhang, et al.
Published: (2026)

MatExpert: Decomposing Materials Discovery by Mimicking Human Experts
by: Ding, Qianggang, et al.
Published: (2024)

Benchmarking LLMs and SLMs for patient reported outcomes
by: Marengo, Matteo, et al.
Published: (2024)

Are You Human? An Adversarial Benchmark to Expose LLMs
by: Gressel, Gilad, et al.
Published: (2024)

WebWalker: Benchmarking LLMs in Web Traversal
by: Wu, Jialong, et al.
Published: (2025)

On Robustness and Reliability of Benchmark-Based Evaluation of LLMs
by: Lunardi, Riccardo, et al.
Published: (2025)

IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis
by: Li, Hanyu, et al.
Published: (2025)

IndoBias: A Dual Track Culturally Grounded Benchmark for LLMs Bias Evaluation in Indonesian Languages
by: Hanif, Ikhlasul Akmal, et al.
Published: (2026)

Do LLMs Understand Social Knowledge? Evaluating the Sociability of Large Language Models with SocKET Benchmark
by: Choi, Minje, et al.
Published: (2023)

Interactive Benchmarks
by: Yue, Baoqing, et al.
Published: (2026)

LLMs Deceive Unintentionally: Emergent Misalignment in Dishonesty from Misaligned Samples to Biased Human-AI Interactions
by: Hu, Xuhao, et al.
Published: (2025)

Transcending Language Boundaries: Harnessing LLMs for Low-Resource Language Translation
by: Shu, Peng, et al.
Published: (2024)