| Main Authors: | Sedukhin, Stanislav, Tomioka, Yoichi, Matsumoto, Kazuya, Okuyama, Yuichi |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2506.22818 |
Similar Items
FlexiBit: Fully Flexible Precision Bit-parallel Accelerator Architecture for Arbitrary Mixed Precision AI
by: Tahmasebi, Faraz, et al.
Published: (2024)
Optimizing Tensor Train Decomposition in DNNs for RISC-V Architectures Using Design Space Exploration and Compiler Optimizations
by: Anthimopoulos, Theologos, et al.
Published: (2026)
HYLU: Hybrid Parallel Sparse LU Factorization
by: Chen, Xiaoming
Published: (2025)
Toward a Universal GPU Instruction Set Architecture: A Cross-Vendor Analysis of Hardware-Invariant Computational Primitives in Parallel Processors
by: Abraham, Ojima, et al.
Published: (2026)
LUT Tensor Core: A Software-Hardware Co-Design for LUT-Based Low-Bit LLM Inference
by: Mo, Zhiwen, et al.
Published: (2024)
A Per-Access Upper Bound for Shared-Resource Interference in Direct-Mapped Multicore Architectures
by: Pedroni, Felipe T.
Published: (2026)
A Compilation Framework for Quantum Circuits with Mid-Circuit Measurement Error Awareness
by: Zhong, Ming, et al.
Published: (2025)
Ember: A Compiler for Efficient Embedding Operations on Decoupled Access-Execute Architectures
by: Siracusa, Marco, et al.
Published: (2025)
NM-SpMM: Accelerating Matrix Multiplication Using N:M Sparsity with GPGPU
by: Ma, Cong, et al.
Published: (2025)
TokenStack: A Heterogeneous HBM-PIM Architecture and Runtime for Efficient LLM Inference
by: Li, Zhuoran, et al.
Published: (2026)
Ten-Four: An Open-Source Fused Dot Product Unit for Mixed-Precision GPGPU Tensor Cores
by: Rout, Nikhil, et al.
Published: (2025)
The Monte Carlo Method and New Device and Architectural Techniques for Accelerating It
by: Petangoda, Janith, et al.
Published: (2025)
RowHammer Vulnerability Counter (RVC): Redefining RowHammer Detection with Victim-Centric Tracking
by: Jain, Lavi, et al.
Published: (2026)
An SMT Formalization of Mixed-Precision Matrix Multiplication: Modeling Three Generations of Tensor Cores
by: Valpey, Benjamin, et al.
Published: (2025)
Exploring the Design Space for Message-Driven Systems for Dynamic Graph Processing using CCA
by: Chandio, Bibrak Qamar, et al.
Published: (2024)
Kernel Looping: Eliminating Synchronization Boundaries for Peak Inference Performance
by: Koeplinger, David, et al.
Published: (2024)
Multi-level informed optimization via decomposed Kriging for large design problems under uncertainty
by: Ampellio, Enrico, et al.
Published: (2025)
pLUTo: Enabling Massively Parallel Computation in DRAM via Lookup Tables
by: Ferreira, João Dinis, et al.
Published: (2021)
Speed, power and cost implications for GPU acceleration of Computational Fluid Dynamics on HPC systems
by: Cooper-Baldock, Zachary, et al.
Published: (2024)
Solving Large Rank-Deficient Linear Least-Squares Problems on Shared-Memory CPU Architectures and GPU Architectures
by: Chillarón, Mónica, et al.
Published: (2024)
Efficient Parallel Scheduling for Sparse Triangular Solvers
by: Böhnlein, Toni, et al.
Published: (2025)
Covariance Matrix Analysis for Optimal Portfolio Selection
by: Keith, Lim Hao Shen
Published: (2024)
Scalable s-step Preconditioned Conjugate Gradient with Chebyshev Basis and Gauss-Seidel Gram Solve
by: D'Ambra, Pasqua, et al.
Published: (2026)
Fast and Practical Strassen's Matrix Multiplication using FPGAs
by: Ahmad, Afzal, et al.
Published: (2024)
Factor Machine: Mixed-signal Architecture for Fine-Grained Graph-Based Computing
by: Dudek, Piotr
Published: (2024)
ArchAgent: Agentic AI-driven Computer Architecture Discovery
by: Gupta, Raghav, et al.
Published: (2026)
Random Alloy Codes and the Fundamental Limits of Coded Distributed Tensors
by: Soto, Pedro
Published: (2022)
A Comparative Analysis of ARM and x86-64 Laptop-Class Processors: Architecture, Assembly-Level Performance, and Energy Efficiency
by: Özyılmaz, Mustafa Mert
Published: (2026)
Assembly of FETI dual operator using CUDA
by: Homola, Jakub, et al.
Published: (2025)
Guess-Verify-Refine: Data-Aware Top-K for Sparse-Attention Decoding on Blackwell via Temporal Correlation
by: Cheng, Long, et al.
Published: (2026)
Hardware Trends Impacting Floating-Point Computations In Scientific Applications
by: Dongarra, Jack, et al.
Published: (2024)
Tensor Decompositions for Count Data that Leverage Stochastic and Deterministic Optimization
by: Myers, Jeremy M., et al.
Published: (2022)
RV-IM100: Quantifying ISA Extension, Datapath Width, and Pipeline Depth Trade-offs in RISC-V Microarchitectures
by: Kang, Hyunwoo
Published: (2026)
MEDEA: A Design-Time Multi-Objective Manager for Energy-Efficient DNN Inference on Heterogeneous Ultra-Low Power Platforms
by: Taji, Hossein, et al.
Published: (2025)
Dataflow & Tiling Strategies in Edge-AI FPGA Accelerators: A Comprehensive Literature Review
by: Li, Richie
Published: (2025)
SynapticCore-X: A Modular Neural Processing Architecture for Low-Cost FPGA Acceleration
by: Parameshwara, Arya
Published: (2025)
Deep Recommender Models Inference: Automatic Asymmetric Data Flow Optimization
by: Ruggeri, Giuseppe, et al.
Published: (2025)
Local Adjoints for Simultaneous Preaccumulations with Shared Inputs
by: Blühdorn, Johannes, et al.
Published: (2024)
Hybrid parallel discrete adjoints in SU2
by: Blühdorn, Johannes, et al.
Published: (2024)
Flexible Quaternion Generalized Minimal Residual Method for Ill-Posed Quaternion Inverse Problems
by: Liu, Xuan, et al.
Published: (2024)