| Main Author: | |
|---|---|
| Format: | Digital resource |
| Language: | |
| Published: | Zenodo, 2026 |
| Online Access: | https://doi.org/10.5281/zenodo.19356078 |
Table of Contents:
- <p>Large-scale distributed systems make it hard to understand performance degradation at the tail of the latency distribution. Traditional monitoring relies on averages across all requests, which mask what users experience at the outer edges of the distribution. Existing methods generally either miss rare performance anomalies or consume too many computational resources by profiling routine transactions. This article introduces a telemetry architecture that provides detailed insight into tail latency without reducing system efficiency. A centralized controller maintains continuously updated, dynamic percentile cutoffs over rolling time windows for thousands of concurrent experiments. Minimal reporting keeps the volume of data exchanged between the serving tasks and the controller very small. Conditional profiling activates detailed diagnostics collection only when a request exceeds its latency threshold. The architecture accommodates hybrid sharding schemes that define different traffic profiles in global deployments. Multi-slice sharding allows horizontal scaling and increased metric storage while preserving the centralized coordination needed to manage thresholds accurately. Real-time performance monitoring in both production and load-test environments improves developer velocity. The work demonstrates that advanced observability can operate efficiently at planetary scale when it is built on outcome-aware principles and allocates system resources effectively.</p>
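
The controller and conditional-profiling ideas in the abstract can be illustrated with a minimal Python sketch. It is not the article's implementation: the names `ThresholdController`, `report`, `cutoff`, and `should_profile` are hypothetical, the nearest-rank percentile and the in-memory per-experiment window are assumptions, and every sample is reported here for simplicity, whereas the article describes a much lighter reporting scheme.

```python
import math
from collections import defaultdict, deque


class ThresholdController:
    """Hypothetical central controller: one rolling window per experiment."""

    def __init__(self, percentile=99.0, window_seconds=60.0):
        self.percentile = percentile
        self.window_seconds = window_seconds
        # experiment id -> deque of (timestamp, latency_ms), oldest first
        self._samples = defaultdict(deque)

    def report(self, experiment, timestamp, latency_ms):
        """Record one latency sample and evict samples outside the window."""
        window = self._samples[experiment]
        window.append((timestamp, latency_ms))
        while window and window[0][0] <= timestamp - self.window_seconds:
            window.popleft()

    def cutoff(self, experiment):
        """Return the nearest-rank percentile cutoff, or None with no data."""
        window = self._samples.get(experiment)
        if not window:
            return None
        latencies = sorted(latency for _, latency in window)
        rank = math.ceil(self.percentile * len(latencies) / 100.0)
        return latencies[max(rank, 1) - 1]


def should_profile(latency_ms, cutoff_ms):
    """Serving-side check: collect diagnostics only above the cutoff."""
    return cutoff_ms is not None and latency_ms > cutoff_ms
```

In this sketch a serving task would fetch `cutoff(experiment)` periodically and call `should_profile` locally on each request, so the expensive diagnostics path runs only for requests in the tail. Sorting the window on every `cutoff` call is acceptable for illustration; a production system would more plausibly use a streaming quantile estimate.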