Bibliographic Details
Main Author: Barros, Sebastian
Format: Preprint
Published: 2025
Online Access: https://arxiv.org/abs/2504.03708
Abstract:
  • Latency remains a critical bottleneck for deploying foundational artificial intelligence (AI) models, such as large language models (LLMs), in customer-facing, real-time applications. While cloud-based inference offers scalability, it frequently introduces delays that are unacceptable for interactive experiences such as semantic search, personalized recommendations, or conversational interfaces. Telecommunications operators (telcos), which historically solved content latency challenges through partnerships with providers like Google and Facebook, now have a unique opportunity to address similar AI latency concerns. This paper presents a technical framework that uses telco infrastructure, spanning regional data centers, existing content delivery network (CDN) nodes, and near-radio access network (RAN) sites, as hierarchical "AI edges" for caching and partial inference. We explore the architectural feasibility of embedding semantic and vector-based AI inference caches within existing telco assets, and we propose tiered caching strategies and split-inference architectures that can significantly reduce latency and compute costs. Additionally, we address technical challenges specific to telcos, such as cache synchronization, model distribution, privacy, and hardware acceleration. Finally, we discuss viable partnership models between telcos and AI providers, highlighting how this use of telco infrastructure can both improve the AI user experience and unlock new revenue streams.
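The tiered semantic caching idea described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation. The class and function names (SemanticCache, tiered_lookup), the cosine-similarity threshold of 0.9, the tier names, and the backfill-on-miss policy are all illustrative assumptions. Each tier stores (embedding, response) pairs. A lookup walks the tiers nearest-first (near-RAN, then CDN, then regional data center) and returns the first cached response whose stored embedding is similar enough to the query embedding. If every tier misses, the lookup falls back to the cloud origin and backfills the tiers that missed.

```python
import math
from typing import Callable, List, Optional, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two embeddings (0.0 if either is a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class SemanticCache:
    """Hypothetical single cache tier: hits when a stored query embedding is
    at least `threshold` cosine-similar to the incoming query embedding."""

    def __init__(self, name: str, threshold: float = 0.9):
        self.name = name
        self.threshold = threshold  # assumed similarity cut-off, not from the paper
        self.entries: List[Tuple[Vector, str]] = []

    def lookup(self, query: Vector) -> Optional[str]:
        # Linear scan for the most similar stored embedding (illustration only).
        best, best_sim = None, -1.0
        for emb, resp in self.entries:
            sim = cosine(query, emb)
            if sim > best_sim:
                best, best_sim = resp, sim
        return best if best_sim >= self.threshold else None

    def insert(self, query: Vector, response: str) -> None:
        self.entries.append((query, response))


def tiered_lookup(
    query: Vector,
    tiers: List[SemanticCache],
    origin: Callable[[Vector], str],
) -> Tuple[str, str]:
    """Check tiers nearest-first; on a full miss, call the origin model.
    Returns (response, name of the tier or 'origin' that served it)."""
    missed: List[SemanticCache] = []
    for tier in tiers:
        hit = tier.lookup(query)
        if hit is not None:
            # Backfill the nearer tiers that missed, so the next similar query hits earlier.
            for t in missed:
                t.insert(query, hit)
            return hit, tier.name
        missed.append(tier)
    response = origin(query)
    for t in missed:
        t.insert(query, response)
    return response, "origin"
```

For example, after one query for the embedding [1.0, 0.0] is served by the origin, a near-duplicate such as [0.99, 0.05] is answered from the near-RAN tier without another origin call. A production cache would replace the linear scan with an approximate nearest-neighbor index and add eviction and the cache-synchronization policies the abstract mentions.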