Bibliographic Details
Main Author: Gokul Chandra, Purnachandra Reddy
Format: Digital resource
Language:
Published: Zenodo 2026
Online Access: https://doi.org/10.5281/zenodo.20320651
Table of Contents:
  • <p class="MsoBodyText">Production deployments of large language models (LLMs) routinely assume that two endpoints serving the same open-weight model produce equivalent output. We test that assumption empirically. Using the OpenRouter aggregator with strict provider pinning, we issue 3,564 temperature-zero completions across seven open-weight models (Llama-3.1-8B, Llama-3.3-70B, Mistral-Small-3.2-24B, DeepSeek-V3.1, Qwen3-235B-A22B, Gemma-3-27B, and GPT-OSS-20B) served by 22 distinct (model, provider) cells spanning FP4, FP8, BF16, FP16, and INT8 quantizations, on a 54-prompt corpus covering five task categories. We report four findings.</p>
    <p class="MsoBodyText"><strong>(1)</strong> Inter-provider exact-output agreement on creative tasks is 0% across every model in the panel, even at temperature zero. <strong>(2)</strong> Inter-provider semantic-output agreement varies sharply by task: 66%–100% on factual and arithmetic questions, but only 30%–70% on code generation.</p>
    <p class="MsoBodyText"><strong>(3)</strong> Intra-provider determinism at temperature zero is widely violated: per-cell exact-match-across-3-reps rates range from 67% to 100%, indicating that several production endpoints are not bit-exact deterministic at temperature 0 despite the API contract. <strong>(4)</strong> On GPT-OSS-20B, two of three providers return empty visible output for short-output prompts because their serving stack does not surface the reasoning-channel final response, a categorical interoperability failure for which the API surfaces no warning. The phenomenon is provider-stack-specific rather than weight-specific. All prompts, raw model outputs, comparators, and analysis code are released.</p>
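The two exact-match measures named in findings (1) and (3) can be illustrated with a minimal Python sketch. This is not the authors' released analysis code. The record layout (`model`, `provider`, `prompt_id`, `output`), the function names, and the choice to compare only the first repetition across providers are all assumptions made for illustration, and the paper's precise metric definitions may differ. For orientation, 22 cells × 54 prompts × 3 repetitions equals 3,564, which matches the reported completion count if every cell received every prompt three times.

```python
from collections import defaultdict
from itertools import combinations


def intra_provider_exact_match(records):
    """Per (model, provider) cell: fraction of prompts whose repetitions
    are all byte-identical. Hypothetical definition, for illustration."""
    by_cell = defaultdict(lambda: defaultdict(list))
    for r in records:
        by_cell[(r["model"], r["provider"])][r["prompt_id"]].append(r["output"])
    return {
        cell: sum(len(set(outs)) == 1 for outs in prompts.values()) / len(prompts)
        for cell, prompts in by_cell.items()
    }


def inter_provider_exact_agreement(records):
    """Per model: fraction of (prompt, provider-pair) comparisons whose
    first-repetition outputs are identical. Assumed comparison rule."""
    first = {}
    for r in records:
        # Keep only the first repetition seen for each (model, prompt, provider).
        first.setdefault((r["model"], r["prompt_id"], r["provider"]), r["output"])

    by_model_prompt = defaultdict(dict)
    for (model, prompt_id, provider), out in first.items():
        by_model_prompt[(model, prompt_id)][provider] = out

    agree, total = defaultdict(int), defaultdict(int)
    for (model, _), outs in by_model_prompt.items():
        for a, b in combinations(sorted(outs), 2):
            total[model] += 1
            agree[model] += int(outs[a] == outs[b])
    return {m: agree[m] / total[m] for m in total}


def _reps(model, provider, prompt_id, outputs):
    """Build toy records; the data below is invented, not from the paper."""
    return [
        {"model": model, "provider": provider, "prompt_id": prompt_id, "output": o}
        for o in outputs
    ]


toy_records = (
    _reps("toy-model", "A", 1, ["4", "4", "4"])
    + _reps("toy-model", "A", 2, ["x", "x", "y"])   # one repetition differs
    + _reps("toy-model", "B", 1, ["4", "4", "4"])
    + _reps("toy-model", "B", 2, ["z", "z", "z"])
)

intra = intra_provider_exact_match(toy_records)
inter = inter_provider_exact_agreement(toy_records)
print(intra)
print(inter)
```

On the toy data, provider A is stable on one of its two prompts and provider B on both, while the two providers agree on one of two prompts. Semantic agreement, as in finding (2), would need a task-specific comparator in place of the string equality used here.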