:: Library Catalog

Bibliographic Details
Main Authors:	Klein, Bernhard; Selker, Falk; Borras, Hendrik; Steger, Sophie; Pernkopf, Franz; Fröning, Holger
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Hardware Architecture; Distributed, Parallel, and Cluster Computing
Online Access:	https://arxiv.org/abs/2511.23440

Similar Items

MVDRAM: Enabling GeMV Execution in Unmodified DRAM for Low-Bit LLM Acceleration
by: Kubo, Tatsuya, et al.
Published: (2025)

Kitsune: Enabling Dataflow Execution on GPUs
by: Davies, Michael, et al.
Published: (2025)

Execution-Centric Characterization of FP8 Matrix Cores, Asynchronous Execution, and Structured Sparsity on AMD MI300A
by: Jarmusch, Aaron, et al.
Published: (2026)

Achieving Dependability of AI Execution with Radiation Hardened Processors
by: Taquichiri, Carlos Rafael Tordoya, et al.
Published: (2025)

Data-aware Dynamic Execution of Irregular Workloads on Heterogeneous Systems
by: Bai, Zhenyu, et al.
Published: (2025)

Lit Silicon: A Case Where Thermal Imbalance Couples Concurrent Execution in Multiple GPUs
by: Kurzynski, Marco, et al.
Published: (2025)

Evaluating Rapid Makespan Predictions for Heterogeneous Systems with Programmable Logic
by: Wilhelm, Martin, et al.
Published: (2025)

Sparse MTTKRP Acceleration for Tensor Decomposition on GPU
by: Wijeratne, Sasindu, et al.
Published: (2024)

Leveraging SIMD for Accelerating Large-number Arithmetic
by: Das, Subhrajit, et al.
Published: (2026)

Next-generation Probabilistic Computing Hardware with 3D MOSAICs, Illusion Scale-up, and Co-design
by: Srimani, Tathagata, et al.
Published: (2024)

Accelerating Triangle Counting with Real Processing-in-Memory Systems
by: Asquini, Lorenzo, et al.
Published: (2025)

Balanced Data Placement for GEMV Acceleration with Processing-In-Memory
by: Ibrahim, Mohamed Assem, et al.
Published: (2024)

Accelerating Data Chunking in Deduplication Systems using Vector Instructions
by: Udayashankar, Sreeharsha, et al.
Published: (2025)

Accelerating MoE with Dynamic In-Switch Computing on Multi-GPUs
by: Zhang, Qijun, et al.
Published: (2026)

A Heterogeneous Chiplet Architecture for Accelerating End-to-End Transformer Models
by: Sharma, Harsh, et al.
Published: (2023)

Application Experiences on a GPU-Accelerated Arm-based HPC Testbed
by: Elwasif, Wael, et al.
Published: (2022)

Workload-Aware Hardware Accelerator Mining for Distributed Deep Learning Training
by: Adnan, Muhammad, et al.
Published: (2024)

TAPA-CS: Enabling Scalable Accelerator Design on Distributed HBM-FPGAs
by: Prakriya, Neha, et al.
Published: (2023)

Design in Tiles: Automating GEMM Deployment on Tile-Based Many-PE Accelerators
by: Shen, Aofeng, et al.
Published: (2025)

FLEX: Leveraging FPGA-CPU Synergy for Mixed-Cell-Height Legalization Acceleration
by: Liu, Xingyu, et al.
Published: (2025)

Compiler Support for Speculation in Decoupled Access/Execute Architectures
by: Szafarczyk, Robert, et al.
Published: (2025)

An Evaluation and Comparison of GPU Hardware and Solver Libraries for Accelerating the OPM Flow Reservoir Simulator
by: Qiu, Tong Dong, et al.
Published: (2023)

CMDS: Cross-layer Dataflow Optimization for DNN Accelerators Exploiting Multi-bank Memories
by: Shi, Man, et al.
Published: (2024)

DiP: A Scalable, Energy-Efficient Systolic Array for Matrix Multiplication Acceleration
by: Abdelmaksoud, Ahmed J., et al.
Published: (2024)

EDEA: Efficient Dual-Engine Accelerator for Depthwise Separable Convolution with Direct Data Transfer
by: Chen, Yi, et al.
Published: (2025)

DP-HLS: A High-Level Synthesis Framework for Accelerating Dynamic Programming Algorithms in Bioinformatics
by: Cao, Yingqi, et al.
Published: (2024)

CCSS: Hardware-Accelerated RTL Simulation with Fast Combinational Logic Computing and Sequential Logic Synchronization
by: Feng, Weigang, et al.
Published: (2025)

A Survey of Real-time Scheduling on Accelerator-based Heterogeneous Architecture for Time Critical Applications
by: Zou, An, et al.
Published: (2025)

A Lightweight High-Throughput Collective-Capable NoC for Large-Scale ML Accelerators
by: Colagrande, Luca, et al.
Published: (2026)

SLIM: A Heterogeneous Accelerator for Edge Inference of Sparse Large Language Model via Adaptive Thresholding
by: Xu, Weihong, et al.
Published: (2025)

DeepStack: Scalable and Accurate Design Space Exploration for Distributed 3D-Stacked AI Accelerators
by: Mo, Zhiwen, et al.
Published: (2026)

MANOJAVAM: A Scalable, Unified FPGA Accelerator for Matrix Multiplication and Singular Value Decomposition in Principal Component Analysis
by: Ramasubramanian, Srivaths, et al.
Published: (2026)

XDMA: A Distributed, Extensible DMA Architecture for Layout-Flexible Data Movements in Heterogeneous Multi-Accelerator SoCs
by: Kong, Fanchen, et al.
Published: (2025)

DeFiNES: Enabling Fast Exploration of the Depth-first Scheduling Space for DNN Accelerators through Analytical Modeling
by: Mei, Linyan, et al.
Published: (2022)

MoE-Hub: Taming Software Complexity for Seamless MoE Overlap with Hardware-Accelerated Communication on Multi-GPU Systems
by: Zhou, Zhuoshan, et al.
Published: (2026)

Mitigating Shared Storage Congestion Using Control Theory
by: Collignon, Thomas, et al.
Published: (2025)

MAD Max Beyond Single-Node: Enabling Large Machine Learning Model Acceleration on Distributed Systems
by: Hsia, Samuel, et al.
Published: (2023)

NMP-PaK: Near-Memory Processing Acceleration of Scalable De Novo Genome Assembly
by: Kim, Heewoo, et al.
Published: (2025)

NasZip: Software and Hardware Co-Design to Accelerate Approximate Nearest Neighbor Search with DIMM-Based Near-Data Processing
by: Zou, Cheng, et al.
Published: (2026)

Evaluation of POSIT Arithmetic with Accelerators
by: Nakasato, Naohito, et al.
Published: (2024)