Staff View: 4-Pipeline Parallel Dispatch for Low-Bit GPU Neural Inference :: Library Catalog

Bibliographic Details
Main Author:	Pirolo, Andres
Format:	Digital resource
Language:
Published:	Zenodo 2026
Subjects:	BNN; CNN; Edge AI; Data center optimization
Online Access:	https://doi.org/10.5281/zenodo.19025822

_version_	1866901517572767744
author	Pirolo, Andres
author_facet	Pirolo, Andres
contents	<p>This work extends the adaptive per-block mode selector introduced in Part I with three main contributions.</p> <p>First, we introduce a fourth pipeline mode that encodes quaternary weights {0,1,2,3} using a dual-plane XNOR representation.<br>Second, we present a compiler optimization ablation showing that two standard GPU optimization techniques, replacing <code>if</code> cascades with <code>switch</code> statements and padding shared memory, do not improve performance on the Adreno 740.<br>Third, we provide the first empirical characterization of multi-layer parallel dispatch overlap on a commodity mobile GPU.</p> <p>The quaternary pipeline achieves throughput between that of the ternary and binary modes overall, yet outperforms binary by up to <strong>3.1× at low batch sizes</strong>, revealing a previously undocumented regime in which <strong>instruction count per thread dominates arithmetic intensity</strong>.</p> <p>The compiler ablation shows that <code>if</code> cascades and <code>switch()</code> statements produce equivalent throughput on the Adreno 740, indicating that the compiler already generates efficient jump tables. Additionally, shared memory padding techniques reported in prior GPU optimization literature degrade performance by up to <strong>2.7%</strong>, confirming that the Adreno 740 does not exhibit the shared-memory bank conflict patterns characteristic of NVIDIA architectures.</p> <p>Finally, multi-layer parallel dispatch (submitting compute shaders for multiple layers without synchronization barriers between them) achieves <strong>34–44% execution overlap</strong> on the validation GPU. Overlap efficiency scales with layer size: at <strong>N = 16384</strong>, two-layer dispatch achieves <strong>44.3% overlap</strong>, yielding a <strong>1.58× forward-pass speedup</strong> for <strong>32-layer transformer inference</strong>. Asymmetric pipeline pairing consistently outperforms symmetric pairing, revealing a scheduling property with direct implications for neural network architecture design.</p> <p><em>See license in repository zip.</em></p>
format	Digital resource
id	zenodo_https___doi_org_10_5281_zenodo_19025822
institution	Zenodo
language
publishDate	2026
publisher	Zenodo
record_format	zenodo
title	4-Pipeline Parallel Dispatch for Low-Bit GPU Neural Inference
topic	BNN; CNN; Edge AI; Data center optimization
url	https://doi.org/10.5281/zenodo.19025822

Similar Items