Bibliographic Details
Main Author: Gupta, Satish Kumar
Format: Digital resource
Language: English
Published: Zenodo 2026
Online Access: https://doi.org/10.5281/zenodo.20005784
_version_ 1866902042242449408
author Gupta, Satish Kumar
author_facet Gupta, Satish Kumar
contents <p class="abstract"><span>As large language models advance toward deployment in mission-critical military information systems, existing evaluation frameworks prove inadequate for assessing their suitability in defence contexts. Traditional benchmarks such as MMLU-Pro and HELM, while effective for general capabilities, do not account for the special needs of military operations: domain-specific knowledge, operational constraints in air-gapped environments, adversarial robustness under attack, and resource limitations in tactical edge deployments. This paper presents MEBL (Military Evaluation Benchmarking for LLM), a comprehensive evaluation framework designed specifically for defence applications. Our methodology evaluates models across four critical axes (foundational capabilities, mission-specific performance, operational robustness, and resource efficiency) using Multi-Criteria Decision Analysis (MCDA) with hardware-specific normalization. We assessed ten prominent open-source models using our large, carefully curated Military Golden Dataset, derived from several publicly available defence documents and military scenarios, with evaluations conducted on representative military hardware configurations: an RTX 3060 GPU workstation for strategic deployments and Intel Core i7 CPU systems for tactical edge scenarios. Results show that model size is only weakly correlated with military utility, highlighting the importance of efficiency, mission-specific performance, and operational robustness over raw scale. MEBL further assigns deployment-readiness categories (Strategic, Tactical, and Limited Use) to guide procurement and fielding decisions. The findings show notable disparities in deployment preparedness: 60% of the assessed models need considerable improvements prior to military deployment, while only the Gemma 12B and Mistral 7B models achieved operational readiness.
These findings suggest that existing LLM evaluation benchmarks inadequately predict performance in military contexts, where operational constraints often outweigh raw accuracy. MEBL addresses this evaluation gap by providing defence organisations with a framework tailored to their unique deployment requirements and constraints.<br><br>Datasets: <br>1. MEBL Custom Military Dataset<br>2. Mil Law Dataset<br>3. Fine-tuning IFT Dataset</span></p>
format Digital resource
id zenodo_https___doi_org_10_5281_zenodo_20005784
institution Zenodo
language eng
publishDate 2026
publisher Zenodo
record_format zenodo
title Custom Mil Dataset
url https://doi.org/10.5281/zenodo.20005784