:: Library Catalog

Bibliographic Details
Main Author:	Bikshandi, Ganesh
Format:	Preprint
Published:	2026
Subjects:	Distributed, Parallel, and Cluster Computing; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2601.11608

Similar Items

SwizzlePerf: Hardware-Aware LLMs for GPU Kernel Performance Optimization
by: Tschand, Arya, et al.
Published: (2025)

The Case for Co-Designing Model Architectures with Hardware
by: Anthony, Quentin, et al.
Published: (2024)

Sustainable AI Training via Hardware-Software Co-Design on NVIDIA, AMD, and Emerging GPU Architectures
by: Makin, Yashasvi, et al.
Published: (2025)

Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning
by: An, Wei, et al.
Published: (2024)

The Energy Blind Spot: NVIDIA's Flagship Edge AI Hardware Cannot Support Process-Level Energy Attribution
by: Panigrahy, Deepak, et al.
Published: (2026)

BanditWare: A Contextual Bandit-based Framework for Hardware Prediction
by: Coleman, Tainã, et al.
Published: (2025)

Confidential Computing on NVIDIA Hopper GPUs: A Performance Benchmark Study
by: Zhu, Jianwei, et al.
Published: (2024)

Hardware Utilization and Inference Performance of Edge Object Detection Under Fault Injection
by: Pasandideh, Faezeh, et al.
Published: (2026)

HadaCore: Tensor Core Accelerated Hadamard Transform Kernel
by: Agarwal, Krish, et al.
Published: (2024)

LLMServingSim2.0: A Unified Simulator for Heterogeneous Hardware and Serving Techniques in LLM Infrastructure
by: Cho, Jaehong, et al.
Published: (2025)

MoE-Lens: Towards the Hardware Limit of High-Throughput MoE LLM Serving Under Resource Constraints
by: Yuan, Yichao, et al.
Published: (2025)

Viability and Performance of a Private LLM Server for SMBs: A Benchmark Analysis of Qwen3-30B on Consumer-Grade Hardware
by: Khalil, Alex, et al.
Published: (2025)

TACO: Efficient Communication Compression of Intermediate Tensors for Scalable Tensor-Parallel LLM Training
by: Liu, Man, et al.
Published: (2026)

LiquidGEMM: Hardware-Efficient W4A8 GEMM Kernel for High-Performance LLM Serving
by: Hu, Huanqi, et al.
Published: (2025)

Deploying Atmospheric and Oceanic AI Models on Chinese Hardware and Framework: Migration Strategies, Performance Optimization and Analysis
by: Sun, Yuze, et al.
Published: (2025)

AEG: A Baremetal Framework for AI Acceleration via Direct Hardware Access in Heterogeneous Accelerators
by: Jiang, Hua, et al.
Published: (2026)

Act While Thinking: Accelerating LLM Agents via Pattern-Aware Speculative Tool Execution
by: Sui, Yifan, et al.
Published: (2026)

B-PASTE: Beam-Aware Pattern-Guided Speculative Execution for Resource-Constrained LLM Agents
by: Song, Yanfei
Published: (2026)

Lightweight Software Kernels and Hardware Extensions for Efficient Sparse Deep Neural Networks on Microcontrollers
by: Daghero, Francesco, et al.
Published: (2025)

STAGE: A Symbolic Tensor grAph GEnerator for distributed AI system co-design
by: Man, Changhai, et al.
Published: (2025)

FlashMLA-ETAP: Efficient Transpose Attention Pipeline for Accelerating MLA Inference on NVIDIA H20 GPUs
by: Dege, Pengcuo, et al.
Published: (2025)

Optimizing Data Distribution and Kernel Performance for Efficient Training of Chemistry Foundation Models: A Case Study with MACE
by: Firoz, Jesun, et al.
Published: (2025)

Vortex: Efficient Sample-Free Dynamic Tensor Program Optimization via Hardware-aware Strategy Space Hierarchization
by: Zhou, Yangjie, et al.
Published: (2024)

Efficient Edge AI: Deploying Convolutional Neural Networks on FPGA with the Gemmini Accelerator
by: Peccia, Federico Nicolas, et al.
Published: (2024)

PhD Thesis Summary: Methods for Reliability Assessment and Enhancement of Deep Neural Network Hardware Accelerators
by: Taheri, Mahdi
Published: (2026)

Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures
by: Zhao, Chenggang, et al.
Published: (2025)

Tiny but Mighty: A Software-Hardware Co-Design Approach for Efficient Multimodal Inference on Battery-Powered Small Devices
by: Li, Yilong, et al.
Published: (2025)

PREBA: A Hardware/Software Co-Design for Multi-Instance GPU based AI Inference Servers
by: Yeo, Gwangoo, et al.
Published: (2024)

Towards Resource-Efficient Compound AI Systems
by: Chaudhry, Gohar Irfan, et al.
Published: (2025)

Dora: QoE-Aware Hybrid Parallelism for Distributed Edge AI
by: Jin, Jianli, et al.
Published: (2025)

TT-Edge: A Hardware-Software Co-Design for Energy-Efficient Tensor-Train Decomposition on Edge AI
by: Kwak, Hyunseok, et al.
Published: (2025)

Advancing AI-assisted Hardware Design with Hierarchical Decentralized Training and Personalized Inference-Time Optimization
by: Chen, Hao Mark, et al.
Published: (2025)

Cloud to Edge: Benchmarking LLM Inference On Hardware-Accelerated Single-Board Computers
by: Renney, Harri, et al.
Published: (2026)

KAIROS: Stateful, Context-Aware Power-Efficient Agentic Inference Serving
by: Yuan, Yichao, et al.
Published: (2026)

Token-Budget-Aware Pool Routing for Cost-Efficient LLM Inference
by: Chen, Huamin, et al.
Published: (2026)

TinyServe: Query-Aware Cache Selection for Efficient LLM Serving
by: Liu, Dong, et al.
Published: (2025)

CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization
by: Zhang, Zijian, et al.
Published: (2025)

Towards Energy-Efficient Serverless Computing with Hardware Isolation
by: Carl, Natalie, et al.
Published: (2025)

HiAER-Spike: Hardware-Software Co-Design for Large-Scale Reconfigurable Event-Driven Neuromorphic Computing
by: Frank, Gwenevere, et al.
Published: (2025)

TensorHub: Scalable and Elastic Weight Transfer for LLM RL Training
by: Ye, Chenhao, et al.
Published: (2026)