:: Library Catalog

Bibliographic Details
Main Authors:	Sedukhin, Stanislav; Tomioka, Yoichi; Matsumoto, Kazuya; Okuyama, Yuichi
Format:	Preprint
Published:	2025
Subjects:	Distributed, Parallel, and Cluster Computing; Artificial Intelligence; Hardware Architecture; Emerging Technologies; Signal Processing; C.1.4; C.3; F.2.1; G.1.3; G.4
Online Access:	https://arxiv.org/abs/2506.22818

Similar Items

FlexiBit: Fully Flexible Precision Bit-parallel Accelerator Architecture for Arbitrary Mixed Precision AI
by: Tahmasebi, Faraz, et al.
Published: (2024)

Optimizing Tensor Train Decomposition in DNNs for RISC-V Architectures Using Design Space Exploration and Compiler Optimizations
by: Anthimopoulos, Theologos, et al.
Published: (2026)

HYLU: Hybrid Parallel Sparse LU Factorization
by: Chen, Xiaoming
Published: (2025)

Toward a Universal GPU Instruction Set Architecture: A Cross-Vendor Analysis of Hardware-Invariant Computational Primitives in Parallel Processors
by: Abraham, Ojima, et al.
Published: (2026)

LUT Tensor Core: A Software-Hardware Co-Design for LUT-Based Low-Bit LLM Inference
by: Mo, Zhiwen, et al.
Published: (2024)

A Per-Access Upper Bound for Shared-Resource Interference in Direct-Mapped Multicore Architectures
by: Pedroni, Felipe T.
Published: (2026)

A Compilation Framework for Quantum Circuits with Mid-Circuit Measurement Error Awareness
by: Zhong, Ming, et al.
Published: (2025)

Ember: A Compiler for Efficient Embedding Operations on Decoupled Access-Execute Architectures
by: Siracusa, Marco, et al.
Published: (2025)

NM-SpMM: Accelerating Matrix Multiplication Using N:M Sparsity with GPGPU
by: Ma, Cong, et al.
Published: (2025)

TokenStack: A Heterogeneous HBM-PIM Architecture and Runtime for Efficient LLM Inference
by: Li, Zhuoran, et al.
Published: (2026)

Ten-Four: An Open-Source Fused Dot Product Unit for Mixed-Precision GPGPU Tensor Cores
by: Rout, Nikhil, et al.
Published: (2025)

The Monte Carlo Method and New Device and Architectural Techniques for Accelerating It
by: Petangoda, Janith, et al.
Published: (2025)

RowHammer Vulnerability Counter (RVC): Redefining RowHammer Detection with Victim-Centric Tracking
by: Jain, Lavi, et al.
Published: (2026)

An SMT Formalization of Mixed-Precision Matrix Multiplication: Modeling Three Generations of Tensor Cores
by: Valpey, Benjamin, et al.
Published: (2025)

Exploring the Design Space for Message-Driven Systems for Dynamic Graph Processing using CCA
by: Chandio, Bibrak Qamar, et al.
Published: (2024)

Kernel Looping: Eliminating Synchronization Boundaries for Peak Inference Performance
by: Koeplinger, David, et al.
Published: (2024)

Multi-level informed optimization via decomposed Kriging for large design problems under uncertainty
by: Ampellio, Enrico, et al.
Published: (2025)

pLUTo: Enabling Massively Parallel Computation in DRAM via Lookup Tables
by: Ferreira, João Dinis, et al.
Published: (2021)

Speed, power and cost implications for GPU acceleration of Computational Fluid Dynamics on HPC systems
by: Cooper-Baldock, Zachary, et al.
Published: (2024)

Solving Large Rank-Deficient Linear Least-Squares Problems on Shared-Memory CPU Architectures and GPU Architectures
by: Chillarón, Mónica, et al.
Published: (2024)

Efficient Parallel Scheduling for Sparse Triangular Solvers
by: Böhnlein, Toni, et al.
Published: (2025)

Covariance Matrix Analysis for Optimal Portfolio Selection
by: Keith, Lim Hao Shen
Published: (2024)

Scalable s-step Preconditioned Conjugate Gradient with Chebyshev Basis and Gauss-Seidel Gram Solve
by: D'Ambra, Pasqua, et al.
Published: (2026)

Fast and Practical Strassen's Matrix Multiplication using FPGAs
by: Ahmad, Afzal, et al.
Published: (2024)

Factor Machine: Mixed-signal Architecture for Fine-Grained Graph-Based Computing
by: Dudek, Piotr
Published: (2024)

ArchAgent: Agentic AI-driven Computer Architecture Discovery
by: Gupta, Raghav, et al.
Published: (2026)

Random Alloy Codes and the Fundamental Limits of Coded Distributed Tensors
by: Soto, Pedro
Published: (2022)

A Comparative Analysis of ARM and x86-64 Laptop-Class Processors: Architecture, Assembly-Level Performance, and Energy Efficiency
by: Özyılmaz, Mustafa Mert
Published: (2026)

Assembly of FETI dual operator using CUDA
by: Homola, Jakub, et al.
Published: (2025)

Guess-Verify-Refine: Data-Aware Top-K for Sparse-Attention Decoding on Blackwell via Temporal Correlation
by: Cheng, Long, et al.
Published: (2026)

Hardware Trends Impacting Floating-Point Computations In Scientific Applications
by: Dongarra, Jack, et al.
Published: (2024)

Tensor Decompositions for Count Data that Leverage Stochastic and Deterministic Optimization
by: Myers, Jeremy M., et al.
Published: (2022)

RV-IM100: Quantifying ISA Extension, Datapath Width, and Pipeline Depth Trade-offs in RISC-V Microarchitectures
by: Kang, Hyunwoo
Published: (2026)

MEDEA: A Design-Time Multi-Objective Manager for Energy-Efficient DNN Inference on Heterogeneous Ultra-Low Power Platforms
by: Taji, Hossein, et al.
Published: (2025)

Dataflow & Tiling Strategies in Edge-AI FPGA Accelerators: A Comprehensive Literature Review
by: Li, Richie
Published: (2025)

SynapticCore-X: A Modular Neural Processing Architecture for Low-Cost FPGA Acceleration
by: Parameshwara, Arya
Published: (2025)

Deep Recommender Models Inference: Automatic Asymmetric Data Flow Optimization
by: Ruggeri, Giuseppe, et al.
Published: (2025)

Local Adjoints for Simultaneous Preaccumulations with Shared Inputs
by: Blühdorn, Johannes, et al.
Published: (2024)

Hybrid parallel discrete adjoints in SU2
by: Blühdorn, Johannes, et al.
Published: (2024)

Flexible Quaternion Generalized Minimal Residual Method for Ill-Posed Quaternion Inverse Problems
by: Liu, Xuan, et al.
Published: (2024)