:: Library Catalog

Bibliographic Details
Main Author:	Gupta, Kshitij
Format:	Preprint
Published:	2025
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2502.07747

Similar Items

Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks
by: Wang, Chonghua, et al.
Published: (2024)

DeduCE: Deductive Consistency as a Framework to Evaluate LLM Reasoning
by: Pandey, Atharva, et al.
Published: (2025)

SD-E$^2$: Semantic Exploration for Reasoning Under Token Budgets
by: Mishra, Kshitij, et al.
Published: (2026)

CXMArena: Unified Dataset to benchmark performance in realistic CXM Scenarios
by: Garg, Raghav, et al.
Published: (2025)

Suvach -- Generated Hindi QA benchmark
by: Narayanan, Vaishak, et al.
Published: (2024)

RE-IMAGINE: Symbolic Benchmark Synthesis for Reasoning Evaluation
by: Xu, Xinnuo, et al.
Published: (2025)

LongStory: Coherent, Complete and Length Controlled Long story Generation
by: Park, Kyeongman, et al.
Published: (2023)

MTRAG: A Multi-Turn Conversational Benchmark for Evaluating Retrieval-Augmented Generation Systems
by: Katsis, Yannis, et al.
Published: (2025)

Dynamic benchmarking framework for LLM-based conversational data capture
by: Aluffi, Pietro Alessandro, et al.
Published: (2025)

LLMzSzŁ: a comprehensive LLM benchmark for Polish
by: Jassem, Krzysztof, et al.
Published: (2025)

Batayan: A Filipino NLP benchmark for evaluating Large Language Models
by: Montalan, Jann Railey, et al.
Published: (2025)

LongTail-Swap: benchmarking language models' abilities on rare words
by: Algayres, Robin, et al.
Published: (2025)

Composite Sketch+Text Queries for Retrieving Objects with Elusive Names and Complex Interactions
by: Gatti, Prajwal, et al.
Published: (2025)

Polish-English medical knowledge transfer: A new benchmark and results
by: Grzybowski, Łukasz, et al.
Published: (2024)

Halluverse-M^3: A multitask multilingual benchmark for hallucination in LLMs
by: Abdaljalil, Samir, et al.
Published: (2026)

How good is my story? Towards quantitative metrics for evaluating LLM-generated XAI narratives
by: Ichmoukhamedov, Timour, et al.
Published: (2024)

Simple and Scalable Strategies to Continually Pre-train Large Language Models
by: Ibrahim, Adam, et al.
Published: (2024)

MinorBench: A hand-built benchmark for content-based risks for children
by: Khoo, Shaun, et al.
Published: (2025)

TelcoLM: collecting data, adapting, and benchmarking language models for the telecommunication domain
by: Barboule, Camille, et al.
Published: (2024)

Systematic Evaluation of Long-Context LLMs on Financial Concepts
by: Gupta, Lavanya, et al.
Published: (2024)

Digital Twin Ecosystem for Oncology Clinical Operations
by: Pandey, Himanshu, et al.
Published: (2024)

Multilingual LLMs Are Not Multilingual Thinkers: Evidence from Hindi Analogy Evaluation
by: Gupta, Ashray, et al.
Published: (2025)

COGNET-MD, an evaluation framework and dataset for Large Language Model benchmarks in the medical domain
by: Panagoulias, Dimitrios P., et al.
Published: (2024)

The Russian-focused embedders' exploration: ruMTEB benchmark and Russian embedding model design
by: Snegirev, Artem, et al.
Published: (2024)

A benchmark for joint dialogue satisfaction, emotion recognition, and emotion state transition prediction
by: Bian, Jing, et al.
Published: (2026)

Multilingual Controlled Generation And Gold-Standard-Agnostic Evaluation of Code-Mixed Sentences
by: Gupta, Ayushman, et al.
Published: (2024)

A thorough benchmark of automatic text classification: From traditional approaches to large language models
by: Cunha, Washington, et al.
Published: (2025)

QuanTemp: A real-world open-domain benchmark for fact-checking numerical claims
by: V, Venktesh, et al.
Published: (2024)

Leaving the barn door open for Clever Hans: Simple features predict LLM benchmark answers
by: Pacchiardi, Lorenzo, et al.
Published: (2024)

Towards Understanding the Robustness of LLM-based Evaluations under Perturbations
by: Chaudhary, Manav, et al.
Published: (2024)

BEARCUBS: A benchmark for computer-using web agents
by: Song, Yixiao, et al.
Published: (2025)

Not (yet) the whole story: Evaluating Visual Storytelling Requires More than Measuring Coherence, Grounding, and Repetition
by: Surikuchi, Aditya K, et al.
Published: (2024)

Creativity Benchmark: A benchmark for marketing creativity for large language models
by: Bhat, Ninad, et al.
Published: (2025)

The Oracle Has Spoken: A Multi-Aspect Evaluation of Dialogue in Pythia
by: Chen, Zixun, et al.
Published: (2025)

ReFeR: Improving Evaluation and Reasoning through Hierarchy of Models
by: Narsupalli, Yaswanth, et al.
Published: (2024)

Evaluating Concurrent Robustness of Language Models Across Diverse Challenge Sets
by: Gupta, Vatsal, et al.
Published: (2023)

A New HOPE: Domain-agnostic Automatic Evaluation of Text Chunking
by: Brådland, Henrik, et al.
Published: (2025)

Evaluating Large Language Models on Rare Disease Diagnosis: A Case Study using House M.D
by: Gupta, Arsh, et al.
Published: (2025)

MEDEQUALQA: Evaluating Biases in LLMs with Counterfactual Reasoning
by: Ghosh, Rajarshi, et al.
Published: (2025)

Is 'Hope' a person or an idea? A pilot benchmark for NER: comparing traditional NLP tools and large language models on ambiguous entities
by: Latifi, Payam
Published: (2025)