| Main Author: | Bikshandi, Ganesh |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2601.11608 |
Similar Items
SwizzlePerf: Hardware-Aware LLMs for GPU Kernel Performance Optimization
by: Tschand, Arya, et al.
Published: (2025)
The Case for Co-Designing Model Architectures with Hardware
by: Anthony, Quentin, et al.
Published: (2024)
Sustainable AI Training via Hardware-Software Co-Design on NVIDIA, AMD, and Emerging GPU Architectures
by: Makin, Yashasvi, et al.
Published: (2025)
Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning
by: An, Wei, et al.
Published: (2024)
The Energy Blind Spot: NVIDIA's Flagship Edge AI Hardware Cannot Support Process-Level Energy Attribution
by: Panigrahy, Deepak, et al.
Published: (2026)
BanditWare: A Contextual Bandit-based Framework for Hardware Prediction
by: Coleman, Tainã, et al.
Published: (2025)
Confidential Computing on NVIDIA Hopper GPUs: A Performance Benchmark Study
by: Zhu, Jianwei, et al.
Published: (2024)
Hardware Utilization and Inference Performance of Edge Object Detection Under Fault Injection
by: Pasandideh, Faezeh, et al.
Published: (2026)
HadaCore: Tensor Core Accelerated Hadamard Transform Kernel
by: Agarwal, Krish, et al.
Published: (2024)
LLMServingSim2.0: A Unified Simulator for Heterogeneous Hardware and Serving Techniques in LLM Infrastructure
by: Cho, Jaehong, et al.
Published: (2025)
MoE-Lens: Towards the Hardware Limit of High-Throughput MoE LLM Serving Under Resource Constraints
by: Yuan, Yichao, et al.
Published: (2025)
Viability and Performance of a Private LLM Server for SMBs: A Benchmark Analysis of Qwen3-30B on Consumer-Grade Hardware
by: Khalil, Alex, et al.
Published: (2025)
TACO: Efficient Communication Compression of Intermediate Tensors for Scalable Tensor-Parallel LLM Training
by: Liu, Man, et al.
Published: (2026)
LiquidGEMM: Hardware-Efficient W4A8 GEMM Kernel for High-Performance LLM Serving
by: Hu, Huanqi, et al.
Published: (2025)
Deploying Atmospheric and Oceanic AI Models on Chinese Hardware and Framework: Migration Strategies, Performance Optimization and Analysis
by: Sun, Yuze, et al.
Published: (2025)
AEG: A Baremetal Framework for AI Acceleration via Direct Hardware Access in Heterogeneous Accelerators
by: Jiang, Hua, et al.
Published: (2026)
Act While Thinking: Accelerating LLM Agents via Pattern-Aware Speculative Tool Execution
by: Sui, Yifan, et al.
Published: (2026)
B-PASTE: Beam-Aware Pattern-Guided Speculative Execution for Resource-Constrained LLM Agents
by: Song, Yanfei
Published: (2026)
Lightweight Software Kernels and Hardware Extensions for Efficient Sparse Deep Neural Networks on Microcontrollers
by: Daghero, Francesco, et al.
Published: (2025)
STAGE: A Symbolic Tensor grAph GEnerator for distributed AI system co-design
by: Man, Changhai, et al.
Published: (2025)
FlashMLA-ETAP: Efficient Transpose Attention Pipeline for Accelerating MLA Inference on NVIDIA H20 GPUs
by: Dege, Pengcuo, et al.
Published: (2025)
Optimizing Data Distribution and Kernel Performance for Efficient Training of Chemistry Foundation Models: A Case Study with MACE
by: Firoz, Jesun, et al.
Published: (2025)
Vortex: Efficient Sample-Free Dynamic Tensor Program Optimization via Hardware-aware Strategy Space Hierarchization
by: Zhou, Yangjie, et al.
Published: (2024)
Efficient Edge AI: Deploying Convolutional Neural Networks on FPGA with the Gemmini Accelerator
by: Peccia, Federico Nicolas, et al.
Published: (2024)
PhD Thesis Summary: Methods for Reliability Assessment and Enhancement of Deep Neural Network Hardware Accelerators
by: Taheri, Mahdi
Published: (2026)
Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures
by: Zhao, Chenggang, et al.
Published: (2025)
Tiny but Mighty: A Software-Hardware Co-Design Approach for Efficient Multimodal Inference on Battery-Powered Small Devices
by: Li, Yilong, et al.
Published: (2025)
PREBA: A Hardware/Software Co-Design for Multi-Instance GPU based AI Inference Servers
by: Yeo, Gwangoo, et al.
Published: (2024)
Towards Resource-Efficient Compound AI Systems
by: Chaudhry, Gohar Irfan, et al.
Published: (2025)
Dora: QoE-Aware Hybrid Parallelism for Distributed Edge AI
by: Jin, Jianli, et al.
Published: (2025)
TT-Edge: A Hardware-Software Co-Design for Energy-Efficient Tensor-Train Decomposition on Edge AI
by: Kwak, Hyunseok, et al.
Published: (2025)
Advancing AI-assisted Hardware Design with Hierarchical Decentralized Training and Personalized Inference-Time Optimization
by: Chen, Hao Mark, et al.
Published: (2025)
Cloud to Edge: Benchmarking LLM Inference On Hardware-Accelerated Single-Board Computers
by: Renney, Harri, et al.
Published: (2026)
KAIROS: Stateful, Context-Aware Power-Efficient Agentic Inference Serving
by: Yuan, Yichao, et al.
Published: (2026)
Token-Budget-Aware Pool Routing for Cost-Efficient LLM Inference
by: Chen, Huamin, et al.
Published: (2026)
TinyServe: Query-Aware Cache Selection for Efficient LLM Serving
by: Liu, Dong, et al.
Published: (2025)
CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization
by: Zhang, Zijian, et al.
Published: (2025)
Towards Energy-Efficient Serverless Computing with Hardware Isolation
by: Carl, Natalie, et al.
Published: (2025)
HiAER-Spike: Hardware-Software Co-Design for Large-Scale Reconfigurable Event-Driven Neuromorphic Computing
by: Frank, Gwenevere, et al.
Published: (2025)
TensorHub: Scalable and Elastic Weight Transfer for LLM RL Training
by: Ye, Chenhao, et al.
Published: (2026)