:: Library Catalog

Bibliographic Details
Main Authors:	Feng, Dahu; Feng, Erhu; Du, Dong; Xu, Pinjie; Xia, Yubin; Chen, Haibo; Zhao, Rong
Format:	Preprint
Published:	2025
Subjects:	Hardware Architecture; Distributed, Parallel, and Cluster Computing
Online Access:	https://arxiv.org/abs/2506.11446

Similar Items

Characterizing Mobile SoC for Accelerating Heterogeneous LLM Inference
by: Chen, Le, et al.
Published: (2025)

Enabling Time-Aware Priority Traffic Management over Distributed FPGA Nodes
by: Scionti, Alberto, et al.
Published: (2025)

Towards Compute-Aware In-Switch Computing for LLMs Tensor-Parallelism on Multi-GPU Systems
by: Zhang, Chen, et al.
Published: (2026)

Handling of Memory Page Faults during Virtual-Address RDMA
by: Psistakis, Antonis
Published: (2025)

ELK: Exploring the Efficiency of Inter-core Connected AI Chips with Deep Learning Compiler Techniques
by: Liu, Yiqi, et al.
Published: (2025)

IOMMU Support for Virtual-Address Remote DMA in an ARMv8 environment
by: Psistakis, Antonis
Published: (2025)

Deadlock-free routing for Full-mesh networks without using Virtual Channels
by: Cano, Alejandro, et al.
Published: (2025)

TeraPool: A Physical Design Aware, 1024 RISC-V Cores Shared-L1-Memory Scaled-up Cluster Design with High Bandwidth Main Memory Link
by: Zhang, Yichao, et al.
Published: (2026)

NetSmith: An Optimization Framework for Machine-Discovered Network Topologies
by: Green, Conor, et al.
Published: (2024)

cMPI: Using CXL Memory Sharing for MPI One-Sided and Two-Sided Inter-Node Communications
by: Wang, Xi, et al.
Published: (2025)

Performance Implications of Multi-Chiplet Neural Processing Units on Autonomous Driving Perception
by: Odema, Mohanad, et al.
Published: (2024)

A Modern Primer on Processing in Memory
by: Mutlu, Onur, et al.
Published: (2020)

ALPHA-PIM: Analysis of Linear Algebraic Processing for High-Performance Graph Applications on a Real Processing-In-Memory System
by: Barkhordar, Marzieh, et al.
Published: (2026)

A New Family of Thread to Core Allocation Policies for an SMT ARM Processor
by: Navarro, Marta, et al.
Published: (2025)

CCSS: Hardware-Accelerated RTL Simulation with Fast Combinational Logic Computing and Sequential Logic Synchronization
by: Feng, Weigang, et al.
Published: (2025)

Optimizing Task Scheduling in Fog Computing with Deadline Awareness
by: Sirjani, Mohammad Sadegh, et al.
Published: (2025)

Accelerating Triangle Counting with Real Processing-in-Memory Systems
by: Asquini, Lorenzo, et al.
Published: (2025)

Balanced Data Placement for GEMV Acceleration with Processing-In-Memory
by: Ibrahim, Mohamed Assem, et al.
Published: (2024)

Memory-Centric Computing: Recent Advances in Processing-in-DRAM
by: Mutlu, Onur, et al.
Published: (2024)

MIMDRAM: An End-to-End Processing-Using-DRAM System for High-Throughput, Energy-Efficient and Programmer-Transparent Multiple-Instruction Multiple-Data Processing
by: Oliveira, Geraldo F., et al.
Published: (2024)

SpeedMalloc: Improving Multi-threaded Applications via a Lightweight Core for Memory Allocation
by: Li, Ruihao, et al.
Published: (2025)

Workload-Aware Hardware Accelerator Mining for Distributed Deep Learning Training
by: Adnan, Muhammad, et al.
Published: (2024)

New Tools, Programming Models, and System Support for Processing-in-Memory Architectures
by: Oliveira, Geraldo F.
Published: (2025)

RevaMp3D: Architecting the Processor Core and Cache Hierarchy for Systems with Monolithically-Integrated Logic and Memory
by: Ghiasi, Nika Mansouri, et al.
Published: (2022)

MoE-Hub: Taming Software Complexity for Seamless MoE Overlap with Hardware-Accelerated Communication on Multi-GPU Systems
by: Zhou, Zhuoshan, et al.
Published: (2026)

Accelerating MoE with Dynamic In-Switch Computing on Multi-GPUs
by: Zhang, Qijun, et al.
Published: (2026)

PUDTune: Multi-Level Charging for High-Precision Calibration in Processing-Using-DRAM
by: Kubo, Tatsuya, et al.
Published: (2025)

Execution-Centric Characterization of FP8 Matrix Cores, Asynchronous Execution, and Structured Sparsity on AMD MI300A
by: Jarmusch, Aaron, et al.
Published: (2026)

PAM: Processing Across Memory Hierarchy for Efficient KV-centric LLM Serving System
by: Liu, Lian, et al.
Published: (2026)

Cache Your Prompt When It's Green: Carbon-Aware Caching for Large Language Model Serving
by: Tian, Yuyang, et al.
Published: (2025)

PID-Comm: A Fast and Flexible Collective Communication Framework for Commodity Processing-in-DIMM Devices
by: Noh, Si Ung, et al.
Published: (2024)

Automated Deep Neural Network Inference Partitioning for Distributed Embedded Systems
by: Kreß, Fabian, et al.
Published: (2024)

RAPID-Graph: Recursive All-Pairs Shortest Paths Using Processing-in-Memory for Dynamic Programming on Graphs
by: Chen, Yanru, et al.
Published: (2025)

Sequence-Aware Split Heuristic to Mitigate SM Underutilization in FlashAttention-3 Low-Head-Count Decoding
by: Font, Martí Llopart, et al.
Published: (2026)

Experimental Assessment of Containers Running on Top of Virtual Machines
by: Aqasizade, Hossein, et al.
Published: (2024)

Conduit: Programmer-Transparent Near-Data Processing Using Multiple Compute-Capable Resources in Solid State Drives
by: Nadig, Rakesh, et al.
Published: (2026)

Proteus: Enabling High-Performance Processing-Using-DRAM with Dynamic Bit-Precision, Adaptive Data Representation, and Flexible Arithmetic
by: Oliveira, Geraldo F., et al.
Published: (2025)

TeraPool-SDR: An 1.89TOPS 1024 RV-Cores 4MiB Shared-L1 Cluster for Next-Generation Open-Source Software-Defined Radios
by: Zhang, Yichao, et al.
Published: (2024)

FLEX: Leveraging FPGA-CPU Synergy for Mixed-Cell-Height Legalization Acceleration
by: Liu, Xingyu, et al.
Published: (2025)

Adaptive Multi-Objective Tiered Storage Configuration for KV Cache in LLM Service
by: Zheng, Xianzhe, et al.
Published: (2026)