| Main author: | |
|---|---|
| Format: | Digital resource |
| Language: | |
| Published: | Zenodo, 2026 |
| Subjects: | |
| Online access: | https://doi.org/10.5281/zenodo.19314968 |
Contents:
- This dataset provides comparative tokenization metrics for 17 commercially available large language models from 9 providers (OpenAI, Anthropic, Google, Meta, Mistral, DeepSeek, Alibaba, Cohere, and xAI) as of Q1 2026. Each model is characterized by its tokenizer family, average tokens-per-word ratio, context window size, maximum output length, input and output pricing per million tokens, first-token latency, and sustained generation throughput.

  Token estimation accuracy is critical for production LLM applications: underestimating input tokens leads to context window overflow and truncated prompts, while overestimating leads to unnecessary model downgrades or prompt compression. This dataset quantifies the variation in tokenization efficiency across model families. Tokens-per-word ratios range from 1.18 (DeepSeek's efficient tokenizer) to 1.35 (Anthropic's Claude tokenizer), a gap of roughly 14% in token count for the same text, which compounds at scale.

  The benchmarking methodology uses a standardized corpus of 500 mixed-content web documents (averaging 1,400 words each), including technical documentation, news articles, creative writing, and code snippets. For each model's tokenizer, the dataset reports the mean, median, and 95th percentile token counts, along with the variance, so that developers can build cost estimation models with appropriate safety margins.

  Cost-performance analysis is also supported: the dataset includes current API pricing, enabling computation of cost-per-token, cost-per-word, and throughput-adjusted cost metrics. This is particularly relevant as the pricing landscape has compressed dramatically, with frontier model input costs spanning more than two orders of magnitude ($0.05 to $15.00 per million tokens, a 300-fold range).

  Maintained by Mohit Khare (https://mohitkhare.me), a software engineer and researcher focused on developer tooling and AI infrastructure.
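The description's point about token estimation and safety margins can be illustrated with a minimal sketch. It assumes a simple linear model (tokens ≈ words × tokens-per-word ratio), which the record does not state is the dataset's own method. The function names, the 10% safety margin, and the pairing of each price with a ratio are hypothetical and chosen only for illustration. The ratios (1.18 and 1.35), the 1,400-word document length, and the price bounds ($0.05 and $15.00 per million tokens) are the figures quoted above.

```python
import math


def estimate_tokens(word_count: int, tokens_per_word: float,
                    safety_margin: float = 0.0) -> int:
    """Estimate a token count from a word count, rounded up.

    Assumes tokens scale linearly with words; safety_margin is a
    fractional buffer (0.10 means 10% extra headroom).
    """
    if word_count < 0 or tokens_per_word <= 0 or safety_margin < 0:
        raise ValueError("inputs must be non-negative, ratio must be positive")
    raw = word_count * tokens_per_word * (1.0 + safety_margin)
    # Round away floating-point noise before taking the ceiling.
    return math.ceil(round(raw, 6))


def input_cost_usd(tokens: int, price_per_million_usd: float) -> float:
    """Cost of sending `tokens` input tokens at a per-million-token price."""
    return tokens / 1_000_000 * price_per_million_usd


def fits_context(tokens: int, context_window: int,
                 reserved_output: int = 0) -> bool:
    """Check that the prompt plus reserved output fits the context window."""
    return tokens + reserved_output <= context_window


if __name__ == "__main__":
    words = 1_400  # average document length quoted in the description
    low = estimate_tokens(words, 1.18)   # lowest ratio quoted
    high = estimate_tokens(words, 1.35)  # highest ratio quoted
    padded = estimate_tokens(words, 1.35, safety_margin=0.10)  # hypothetical margin
    print(f"tokens at 1.18: {low}, at 1.35: {high}, with 10% margin: {padded}")
    # Price bounds from the description; pairing them with a ratio is illustrative.
    print(f"cost at $0.05/M: {input_cost_usd(high, 0.05):.6f} USD")
    print(f"cost at $15.00/M: {input_cost_usd(high, 15.00):.6f} USD")
```

A linear ratio is only a first approximation. The dataset also reports 95th percentile token counts and variance, which would be a better basis for choosing the safety margin than a fixed percentage.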