| Main Author: | Gupta, Kshitij |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2502.07747 |
Similar Items
Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks
by: Wang, Chonghua, et al.
Published: (2024)
DeduCE: Deductive Consistency as a Framework to Evaluate LLM Reasoning
by: Pandey, Atharva, et al.
Published: (2025)
SD-E$^2$: Semantic Exploration for Reasoning Under Token Budgets
by: Mishra, Kshitij, et al.
Published: (2026)
CXMArena: Unified Dataset to benchmark performance in realistic CXM Scenarios
by: Garg, Raghav, et al.
Published: (2025)
Suvach -- Generated Hindi QA benchmark
by: Narayanan, Vaishak, et al.
Published: (2024)
RE-IMAGINE: Symbolic Benchmark Synthesis for Reasoning Evaluation
by: Xu, Xinnuo, et al.
Published: (2025)
LongStory: Coherent, Complete and Length Controlled Long story Generation
by: Park, Kyeongman, et al.
Published: (2023)
MTRAG: A Multi-Turn Conversational Benchmark for Evaluating Retrieval-Augmented Generation Systems
by: Katsis, Yannis, et al.
Published: (2025)
Dynamic benchmarking framework for LLM-based conversational data capture
by: Aluffi, Pietro Alessandro, et al.
Published: (2025)
LLMzSzŁ: a comprehensive LLM benchmark for Polish
by: Jassem, Krzysztof, et al.
Published: (2025)
Batayan: A Filipino NLP benchmark for evaluating Large Language Models
by: Montalan, Jann Railey, et al.
Published: (2025)
LongTail-Swap: benchmarking language models' abilities on rare words
by: Algayres, Robin, et al.
Published: (2025)
Composite Sketch+Text Queries for Retrieving Objects with Elusive Names and Complex Interactions
by: Gatti, Prajwal, et al.
Published: (2025)
Polish-English medical knowledge transfer: A new benchmark and results
by: Grzybowski, Łukasz, et al.
Published: (2024)
Halluverse-M^3: A multitask multilingual benchmark for hallucination in LLMs
by: Abdaljalil, Samir, et al.
Published: (2026)
How good is my story? Towards quantitative metrics for evaluating LLM-generated XAI narratives
by: Ichmoukhamedov, Timour, et al.
Published: (2024)
Simple and Scalable Strategies to Continually Pre-train Large Language Models
by: Ibrahim, Adam, et al.
Published: (2024)
MinorBench: A hand-built benchmark for content-based risks for children
by: Khoo, Shaun, et al.
Published: (2025)
TelcoLM: collecting data, adapting, and benchmarking language models for the telecommunication domain
by: Barboule, Camille, et al.
Published: (2024)
Systematic Evaluation of Long-Context LLMs on Financial Concepts
by: Gupta, Lavanya, et al.
Published: (2024)
Digital Twin Ecosystem for Oncology Clinical Operations
by: Pandey, Himanshu, et al.
Published: (2024)
Multilingual LLMs Are Not Multilingual Thinkers: Evidence from Hindi Analogy Evaluation
by: Gupta, Ashray, et al.
Published: (2025)
COGNET-MD, an evaluation framework and dataset for Large Language Model benchmarks in the medical domain
by: Panagoulias, Dimitrios P., et al.
Published: (2024)
The Russian-focused embedders' exploration: ruMTEB benchmark and Russian embedding model design
by: Snegirev, Artem, et al.
Published: (2024)
A benchmark for joint dialogue satisfaction, emotion recognition, and emotion state transition prediction
by: Bian, Jing, et al.
Published: (2026)
Multilingual Controlled Generation And Gold-Standard-Agnostic Evaluation of Code-Mixed Sentences
by: Gupta, Ayushman, et al.
Published: (2024)
A thorough benchmark of automatic text classification: From traditional approaches to large language models
by: Cunha, Washington, et al.
Published: (2025)
QuanTemp: A real-world open-domain benchmark for fact-checking numerical claims
by: V, Venktesh, et al.
Published: (2024)
Leaving the barn door open for Clever Hans: Simple features predict LLM benchmark answers
by: Pacchiardi, Lorenzo, et al.
Published: (2024)
Towards Understanding the Robustness of LLM-based Evaluations under Perturbations
by: Chaudhary, Manav, et al.
Published: (2024)
BEARCUBS: A benchmark for computer-using web agents
by: Song, Yixiao, et al.
Published: (2025)
Not (yet) the whole story: Evaluating Visual Storytelling Requires More than Measuring Coherence, Grounding, and Repetition
by: Surikuchi, Aditya K, et al.
Published: (2024)
Creativity Benchmark: A benchmark for marketing creativity for large language models
by: Bhat, Ninad, et al.
Published: (2025)
The Oracle Has Spoken: A Multi-Aspect Evaluation of Dialogue in Pythia
by: Chen, Zixun, et al.
Published: (2025)
ReFeR: Improving Evaluation and Reasoning through Hierarchy of Models
by: Narsupalli, Yaswanth, et al.
Published: (2024)
Evaluating Concurrent Robustness of Language Models Across Diverse Challenge Sets
by: Gupta, Vatsal, et al.
Published: (2023)
A New HOPE: Domain-agnostic Automatic Evaluation of Text Chunking
by: Brådland, Henrik, et al.
Published: (2025)
Evaluating Large Language Models on Rare Disease Diagnosis: A Case Study using House M.D
by: Gupta, Arsh, et al.
Published: (2025)
MEDEQUALQA: Evaluating Biases in LLMs with Counterfactual Reasoning
by: Ghosh, Rajarshi, et al.
Published: (2025)
Is 'Hope' a person or an idea? A pilot benchmark for NER: comparing traditional NLP tools and large language models on ambiguous entities
by: Latifi, Payam
Published: (2025)