MARC21: :: Library Catalog

Bibliographic Details
Main Author:	Gupta, Satish Kumar
Format:	Digital resource
Language:	English
Published:	Zenodo 2026
Online Access:	https://doi.org/10.5281/zenodo.20005784

_version_	1866902042242449408
author	Gupta, Satish Kumar
author_facet	Gupta, Satish Kumar
contents	<p class="abstract"><span>As large language models advance toward deployment in mission-critical military information systems, existing evaluation frameworks prove inadequate for assessing their suitability in defence contexts. Traditional benchmarks such as MMLU-Pro and HELM, while effective for general capabilities, do not account for the special needs of military operations: domain-specific knowledge, operational constraints in air-gapped environments, adversarial robustness under attack, and resource limitations in tactical edge deployments. This paper presents MEBL (Military Evaluation Benchmarking for LLM), a comprehensive evaluation framework designed specifically for defence applications. Our methodology evaluates models across four critical axes (foundational capabilities, mission-specific performance, operational robustness, and resource efficiency) using Multi-Criteria Decision Analysis (MCDA) with hardware-specific normalization. We assessed ten prominent open-source models using our large, carefully curated Military Golden Dataset, derived from several publicly available defence documents and military scenarios, with evaluations conducted on representative military hardware configurations: an RTX 3060 GPU workstation for strategic deployments and Intel Core i7 CPU systems for tactical edge scenarios. Results show that model size is only weakly correlated with military utility, highlighting the importance of efficiency, mission-specific performance, and operational robustness over raw scale. MEBL further assigns deployment-readiness categories (Strategic, Tactical, and Limited Use) to guide procurement and fielding decisions. The findings show notable disparities in deployment preparedness: 60% of the assessed models need considerable improvement prior to military deployment, while only the Gemma 12B and Mistral 7B models achieved operational readiness. 
These findings suggest that existing LLM evaluation benchmarks inadequately predict performance in military contexts, where operational constraints often outweigh raw accuracy. MEBL addresses this evaluation gap by providing defence organisations with a framework tailored to their unique deployment requirements and constraints.<br><br>Datasets: <br>1. MEBL Custom Military Dataset<br>2. Mil Law Dataset<br>3. Fine-tuning IFT Dataset</span></p>
format	Digital resource
id	zenodo_https___doi_org_10_5281_zenodo_20005784
institution	Zenodo
language	eng
publishDate	2026
publisher	Zenodo
record_format	zenodo
title	Custom Mil Dataset
url	https://doi.org/10.5281/zenodo.20005784

Similar Items