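Staff View: Co-Designing Graph-based Approximate Nearest Neighbor Search at Billion Scale for Processing-in-Memory :: Library Catalog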

Bibliographic Details
Main Authors:	Chen, Sitian, Li, Yusen, Chen, Yao, Deng, Minwen, Meng, Jintao, Zhou, Amelie Chi
Format:	Preprint
Published:	2026
Subjects:	Hardware Architecture
Online Access:	https://arxiv.org/abs/2605.25522

_version_	1866911715390652416
author	Chen, Sitian Li, Yusen Chen, Yao Deng, Minwen Meng, Jintao Zhou, Amelie Chi
author_facet	Chen, Sitian Li, Yusen Chen, Yao Deng, Minwen Meng, Jintao Zhou, Amelie Chi
contents	Approximate Nearest Neighbor Search (ANNS) is a core primitive in modern AI systems, and graph-based methods currently offer the best accuracy-efficiency trade-off at scale. The workload is fundamentally memory-bound: graph traversal produces frequent, irregular memory accesses that cap CPU throughput at main-memory bandwidth, while GPUs lack the high-bandwidth memory capacity to host billion-scale indexes. Processing-in-Memory (PIM) is a natural candidate, as placing computation next to data unlocks the abundant internal bandwidth that such bandwidth-starved workloads demand. Porting graph-based ANNS to PIM, however, exposes several architectural mismatches: each processing unit has only a small local memory, inter-unit communication is costly, host coordination adds overhead, and in-memory compute units are relatively weak. These limitations have forced prior PIM-based ANNS designs to fall back on cluster-based indexing, whose recall ceiling is far below that of graph methods. This paper presents an algorithm-architecture co-design that overcomes these obstacles through three components: a compacted index layout that shrinks the PIM-resident memory footprint by 14.5x; an asynchronous pipelined scheduler that keeps the host-to-PIM interconnect saturated; and a multiplication-free distance kernel that loses under 0.08% recall. Across three billion-scale benchmarks, the proposed design achieves up to 20x and 17.1x higher throughput than CPU and GPU baselines, respectively, outperforms prior PIM accelerators by 129x in the high-recall regime, and scales gracefully across multi-node deployments and emerging PIM architectures.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_25522
institution	arXiv
publishDate	2026
record_format	arxiv
title	Co-Designing Graph-based Approximate Nearest Neighbor Search at Billion Scale for Processing-in-Memory
topic	Hardware Architecture
url	https://arxiv.org/abs/2605.25522
