Table of Contents: :: Library Catalog

Bibliographic Details
Main Author:	Gokul Chandra, Purnachandra Reddy
Format:	Digital resource
Language:
Published:	Zenodo 2026
Online Access:	https://doi.org/10.5281/zenodo.20320651

Table of Contents:

Production deployments of large language models (LLMs) routinely assume that two endpoints serving the same open-weight model produce equivalent output. We test that assumption empirically. Using the OpenRouter aggregator with strict provider pinning, we issue 3,564 temperature-zero completions across seven open-weight models (Llama-3.1-8B, Llama-3.3-70B, Mistral-Small-3.2-24B, DeepSeek-V3.1, Qwen3-235B-A22B, Gemma-3-27B, and GPT-OSS-20B) served by 22 distinct (model, provider) cells spanning FP4, FP8, BF16, FP16, and INT8 quantizations, on a 54-prompt corpus covering five task categories. We report four findings. (1) Inter-provider exact-output agreement on creative tasks is 0% for every model in the panel, even at temperature zero. (2) Inter-provider semantic-output agreement varies sharply by task: 66%–100% on factual and arithmetic questions, but only 30%–70% on code generation. (3) Intra-provider determinism at temperature zero is widely violated: the per-cell rate of exact match across 3 repetitions ranges from 67% to 100%, indicating that several production endpoints are not bit-exact deterministic at temperature 0 despite the API contract. (4) On GPT-OSS-20B, two of three providers return empty visible output for short-output prompts because their serving stack does not surface the final response from the reasoning channel. This is a categorical interoperability failure, and the API surfaces no warning for it. The phenomenon is specific to the provider stack rather than to the weights. All prompts, raw model outputs, comparators, and analysis code are released.
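
The exact-match metrics named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' released analysis code: the function names, the data layout, and the toy outputs below are all hypothetical, chosen only to show how a per-cell repetition-stability rate, an inter-provider exact-agreement rate, and an empty-output check might be computed. The final line checks that the stated totals are consistent with 22 cells × 54 prompts × 3 repetitions, which is an assumption about the design that the abstract does not spell out.

```python
from typing import Dict, List


def intra_cell_exact_rate(reps_by_prompt: Dict[str, List[str]]) -> float:
    """Fraction of prompts whose repeated completions are all identical strings."""
    if not reps_by_prompt:
        raise ValueError("no prompts supplied")
    stable = sum(1 for reps in reps_by_prompt.values() if len(set(reps)) == 1)
    return stable / len(reps_by_prompt)


def inter_provider_exact_rate(outputs_by_provider: Dict[str, Dict[str, str]]) -> float:
    """Fraction of shared prompts on which every provider returns the same string."""
    if len(outputs_by_provider) < 2:
        raise ValueError("need at least two providers to compare")
    shared = set.intersection(*(set(outs) for outs in outputs_by_provider.values()))
    if not shared:
        raise ValueError("providers share no prompts")
    agree = sum(
        1
        for pid in shared
        if len({outs[pid] for outs in outputs_by_provider.values()}) == 1
    )
    return agree / len(shared)


def empty_output_prompts(outputs: Dict[str, str]) -> List[str]:
    """Prompt ids whose visible output is empty or whitespace only."""
    return sorted(pid for pid, text in outputs.items() if not text.strip())


# Toy data (hypothetical, not taken from the released corpus).
cell_reps = {
    "p1": ["4", "4", "4"],
    "p2": ["Paris", "Paris", "Paris"],
    "p3": ["a poem", "a poem!", "a poem"],  # one repetition differs
}
providers = {
    "prov_a": {"p1": "4", "p2": "Paris"},
    "prov_b": {"p1": "4", "p2": "Paris."},  # trailing period breaks exact match
}

intra_rate = intra_cell_exact_rate(cell_reps)
inter_rate = inter_provider_exact_rate(providers)
empties = empty_output_prompts({"p1": "", "p2": "Paris"})

# Assumed design: 22 cells x 54 prompts x 3 repetitions.
expected_total = 22 * 54 * 3
```

Exact string equality is the strictest comparator; the semantic-agreement figures in the abstract would need a task-specific comparator (for example, numeric equality for arithmetic answers) in place of the set-of-strings test used here.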
