Library Catalog

Bibliographic Details
Main Authors:	Zhang, Yijia; Gou, Zhihong; Cao, Shijie; Feng, Weigang; Zhang, Sicheng; Dai, Guohao; Xu, Ningyi
Format:	Preprint
Published:	2024
Subjects:	Performance; Machine Learning
Online Access:	https://arxiv.org/abs/2411.18873

Similar Items

DSO: A GPU Energy Efficiency Optimizer by Fusing Dynamic and Static Information
by: Wang, Qiang, et al.
Published: (2024)

WaveTune: Wave-aware Bilinear Modeling for Efficient GPU Kernel Auto-tuning
by: Zhang, Kaixuan, et al.
Published: (2026)

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels
by: Wang, Han, et al.
Published: (2026)

AutoKernel: Autonomous GPU Kernel Optimization via Iterative Agent-Driven Search
by: Jaber, Jaber, et al.
Published: (2026)

KernelBench: Can LLMs Write Efficient GPU Kernels?
by: Ouyang, Anne, et al.
Published: (2025)

FlipFlop: A Static Analysis-based Energy Optimization Framework for GPU Kernels
by: Rajput, Saurabhsingh, et al.
Published: (2026)

oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation
by: Li, Jianhui, et al.
Published: (2023)

Insum: Sparse GPU Kernels Simplified and Optimized with Indirect Einsums
by: Won, Jaeyeon, et al.
Published: (2025)

Record-Remix-Replay: Hierarchical GPU Kernel Optimization using Evolutionary Search
by: Nichols, Daniel, et al.
Published: (2026)

Conformer-Based Speech Recognition On Extreme Edge-Computing Devices
by: Xu, Mingbin, et al.
Published: (2023)

GPU Kernel Scientist: An LLM-Driven Framework for Iterative Kernel Optimization
by: Andrews, Martin, et al.
Published: (2025)

KEET: Explaining Performance of GPU Kernels Using LLM Agents
by: Davis, Joshua H., et al.
Published: (2026)

PipeWeave: Synergizing Analytical and Learning Models for Unified GPU Performance Prediction
by: Zhang, Kaixuan, et al.
Published: (2026)

Integrating Performance Tools in Model Reasoning for GPU Kernel Optimization
by: Nichols, Daniel, et al.
Published: (2025)

Reducing Latency of LLM Search Agent via Speculation-based Algorithm-System Co-Design
by: Huang, Zixiao, et al.
Published: (2025)

FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines
by: He, Jiaao, et al.
Published: (2024)

Efficient GPU implementation of randomized SVD and its applications
by: Struski, Łukasz, et al.
Published: (2021)

The Energy Cost of Execution-Idle in GPU Clusters
by: Lei, Yiran, et al.
Published: (2026)

A Kernel-Based Approach for Accurate Steady-State Detection in Performance Time Series
by: Beseda, Martin, et al.
Published: (2025)

Efficient GPU-Centered Singular Value Decomposition Using the Divide-and-Conquer Method
by: Liu, Shifang, et al.
Published: (2025)

GCL-Sampler: Discovering Kernel Similarity for Sampled GPU Simulation via Graph Contrastive Learning
by: Wang, Jiaqi, et al.
Published: (2026)

An Empirical Study on the Performance and Energy Usage of Compiled Python Code
by: Stoico, Vincenzo, et al.
Published: (2025)

On Combining Two Server Control Policies for Energy Efficiency
by: Dai, Jingze, et al.
Published: (2025)

HybridGen: Efficient LLM Generative Inference via CPU-GPU Hybrid Computing
by: Lin, Mao, et al.
Published: (2026)

MultiKernelBench: A Multi-Platform Benchmark for Kernel Generation
by: Wen, Zhongzhen, et al.
Published: (2025)

Flex Attention: A Programming Model for Generating Optimized Attention Kernels
by: Dong, Juechu, et al.
Published: (2024)

CAPSim: A Fast CPU Performance Simulator Using Attention-based Predictor
by: Xu, Buqing, et al.
Published: (2025)

FLuRKA: Fast and accurate unified Low-Rank & Kernel Attention
by: Gupta, Ahan, et al.
Published: (2023)

DRIM-ANN: An Approximate Nearest Neighbor Search Engine based on Commercial DRAM-PIMs
by: Chen, Mingkai, et al.
Published: (2024)

Fast and Scalable Mixed Precision Euclidean Distance Calculations Using GPU Tensor Cores
by: Curless, Brian, et al.
Published: (2025)

Cloud Computing Energy Consumption Prediction Based on Kernel Extreme Learning Machine Algorithm Improved by Vector Weighted Average Algorithm
by: Wang, Yuqing, et al.
Published: (2025)

DCC: Data-Centric Compilation of Machine Learning Kernels for Processing-In-Memory Architectures
by: Yang, Peiming, et al.
Published: (2025)

AMD MI300X GPU Performance Analysis
by: Ambati, Chandrish, et al.
Published: (2025)

CuTeGen: An LLM-Based Agentic Framework for Generation and Optimization of High-Performance GPU Kernels using CuTe
by: Saba, Tara, et al.
Published: (2026)

Prism: Unleashing GPU Sharing for Cost-Efficient Multi-LLM Serving
by: Yu, Shan, et al.
Published: (2025)

gDist: Efficient Distance Computation between 3D Meshes on GPU
by: Fang, Peng, et al.
Published: (2024)

Canvas: End-to-End Kernel Architecture Search in Neural Networks
by: Zhao, Chenggang, et al.
Published: (2023)

Disaggregated Design for GPU-Based Volumetric Data Structures
by: Meneghin, Massimiliano, et al.
Published: (2025)

Efficient allocation of image recognition and LLM tasks on multi-GPU system
by: Lawenda, Marcin, et al.
Published: (2025)

CCSS: Hardware-Accelerated RTL Simulation with Fast Combinational Logic Computing and Sequential Logic Synchronization
by: Feng, Weigang, et al.
Published: (2025)