:: Library Catalog

Bibliographic Details
Main Authors:	Vyas, Kaustubh; Graux, Damien; Montella, Sébastien; Vougiouklis, Pavlos; Lai, Ruofei; Li, Keshuang; Ren, Yang; Pan, Jeff Z.
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence; Computation and Language
Online Access:	https://arxiv.org/abs/2502.20175

Similar Items

From An LLM Swarm To A PDDL-Empowered HIVE: Planning Self-Executed Instructions In A Multi-Modal Jungle
by: Vyas, Kaustubh, et al.
Published: (2024)

Improving Retrieval-augmented Text-to-SQL with AST-based Ranking and Schema Pruning
by: Shen, Zhili, et al.
Published: (2024)

Prompting Large Language Models with Knowledge Graphs for Question Answering Involving Long-tail Facts
by: Huang, Wenyu, et al.
Published: (2024)

GeAR: Graph-enhanced Agent for Retrieval-augmented Generation
by: Shen, Zhili, et al.
Published: (2024)

A Usage-centric Take on Intent Understanding in E-Commerce
by: Zhou, Wendi, et al.
Published: (2024)

Millions of $\text{GeAR}$-s: Extending GraphRAG to Millions of Documents
by: Shen, Zhili, et al.
Published: (2025)

Masking in Multi-hop QA: An Analysis of How Language Models Perform with Context Permutation
by: Huang, Wenyu, et al.
Published: (2025)

Less is More: Making Smaller Language Models Competent Subgraph Retrievers for Multi-hop KGQA
by: Huang, Wenyu, et al.
Published: (2024)

OpenSIR: Open-Ended Self-Improving Reasoner
by: Kwan, Wai-Chung, et al.
Published: (2025)

PDDL-Mind: Large Language Models are Capable on Belief Reasoning with Reliable State Tracking
by: Zhu, Wang Bill, et al.
Published: (2026)

How Reliable are LLMs as Knowledge Bases? Re-thinking Facutality and Consistency
by: Zheng, Danna, et al.
Published: (2024)

Long-Form Information Alignment Evaluation Beyond Atomic Facts
by: Zheng, Danna, et al.
Published: (2025)

Evaluating and Safeguarding the Adversarial Robustness of Retrieval-Based In-Context Learning
by: Yu, Simon, et al.
Published: (2024)

Funny or Persuasive, but Not Both: Evaluating Fine-Grained Multi-Concept Control in LLMs
by: Labroo, Arya, et al.
Published: (2026)

Automating the Generation of Prompts for LLM-based Action Choice in PDDL Planning
by: Stein, Katharina, et al.
Published: (2023)

Can Language Models Analyze Data? Evaluating Large Language Models for Question Answering over Datasets
by: Xenofontos, Andreas, et al.
Published: (2026)

Rethinking Memory in LLM based Agents: Representations, Operations, and Emerging Topics
by: Du, Yiming, et al.
Published: (2025)

Adversarial Lens: Exploiting Attention Layers to Generate Adversarial Examples for Evaluation
by: Dhole, Kaustubh
Published: (2025)

Are LLMs Effective Negotiators? Systematic Evaluation of the Multifaceted Capabilities of LLMs in Negotiation Dialogues
by: Kwon, Deuksin, et al.
Published: (2024)

Evaluating LLMs' Divergent Thinking Capabilities for Scientific Idea Generation with Minimal Context
by: Ruan, Kai, et al.
Published: (2024)

Evaluating the Capabilities of LLMs for Supporting Anticipatory Impact Assessment
by: Allaham, Mowafak, et al.
Published: (2024)

Spectral Attention Steering for Prompt Highlighting
by: Li, Weixian Waylon, et al.
Published: (2026)

PlanGenLLMs: A Modern Survey of LLM Planning Capabilities
by: Wei, Hui, et al.
Published: (2025)

How Does Alignment Enhance LLMs' Multilingual Capabilities? A Language Neurons Perspective
by: Zhang, Shimao, et al.
Published: (2025)

EarthSE: A Benchmark for Evaluating Earth Scientific Exploration Capability of LLMs
by: Xu, Wanghan, et al.
Published: (2025)

VisScience: An Extensive Benchmark for Evaluating K12 Educational Multi-modal Scientific Reasoning
by: Jiang, Zhihuan, et al.
Published: (2024)

Automated Capability Discovery via Foundation Model Self-Exploration
by: Lu, Cong, et al.
Published: (2025)

CharacterBox: Evaluating the Role-Playing Capabilities of LLMs in Text-Based Virtual Worlds
by: Wang, Lei, et al.
Published: (2024)

Assessing the Capabilities of LLMs in Humor: A Multi-dimensional Analysis of Oogiri Generation and Evaluation
by: Sakabe, Ritsu, et al.
Published: (2025)

Counterfactual Evaluation Reveals Hidden Capability Profiles in Clinical LLMs and Agents
by: Turk, Matt
Published: (2026)

Are Your LLMs Capable of Stable Reasoning?
by: Liu, Junnan, et al.
Published: (2024)

MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs
by: Sirdeshmukh, Ved, et al.
Published: (2025)

Neural Wikipedian: Generating Textual Summaries from Knowledge Base Triples
by: Vougiouklis, Pavlos, et al.
Published: (2017)

BabyReasoningBench: Generating Developmentally-Inspired Reasoning Tasks for Evaluating Baby Language Models
by: Dhole, Kaustubh D.
Published: (2026)

ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning
by: Chen, Mingyang, et al.
Published: (2025)

PLANET: A Collection of Benchmarks for Evaluating LLMs' Planning Capabilities
by: Li, Haoming, et al.
Published: (2025)

Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data
by: Liu, Xiao, et al.
Published: (2024)

Resolving Intent Ambiguities by Retrieving Discriminative Clarifying Questions
by: Dhole, Kaustubh D.
Published: (2020)

Evaluation of Multilingual LLMs Personalized Text Generation Capabilities Targeting Groups and Social-Media Platforms
by: Macko, Dominik
Published: (2026)

Satisfiability Solving with LLMs: A Matched-Pair Evaluation of Reasoning Capability
by: Zhang, Leizhen, et al.
Published: (2026)