Bibliographic Details
Main Author: Pirolo, Andres
Format: Digital resource
Language:
Published: Zenodo 2026
Subjects: BNN; CNN; Edge AI; Data center optimization
Online Access: https://doi.org/10.5281/zenodo.19025822
_version_ 1866901517572767744
author Pirolo, Andres
author_facet Pirolo, Andres
contents <p>This work extends the adaptive per-block mode selector introduced in Part I with three main contributions.</p> <p>First, we introduce a fourth pipeline mode that encodes quaternary weights {0,1,2,3} using a dual-plane XNOR representation.<br>Second, we present a compiler optimization ablation showing that standard GPU optimization techniques, such as replacing <code>if</code> cascades with <code>switch</code> statements and padding shared memory, do not improve performance under the Adreno 740 compiler.<br>Third, we provide the first empirical characterization of multi-layer parallel dispatch overlap on a commodity mobile GPU.</p> <p>The quaternary pipeline's throughput generally falls between that of the ternary and binary modes, but at low batch sizes it outperforms binary by up to <strong>3.1×</strong>, revealing a previously undocumented regime in which <strong>instruction count per thread dominates over arithmetic intensity</strong>.</p> <p>The compiler ablation shows that <code>if</code> cascades and <code>switch</code> statements produce equivalent throughput on Adreno 740, indicating that the compiler already generates efficient jump tables. In addition, shared memory padding techniques reported in prior GPU optimization literature degrade performance by up to <strong>2.7%</strong>, confirming that Adreno 740 does not exhibit the shared-memory bank conflict patterns characteristic of NVIDIA architectures.</p> <p>Finally, multi-layer parallel dispatch (submitting compute shaders for multiple layers without synchronization barriers between them) achieves <strong>34–44% execution overlap</strong> on the validation GPU. Overlap efficiency scales with layer size: at <strong>N = 16384</strong>, two-layer dispatch achieves <strong>44.3% overlap</strong>, yielding a <strong>1.58× forward-pass speedup</strong> for <strong>32-layer transformer inference</strong>. Asymmetric pipeline pairing consistently outperforms symmetric pairing, revealing a scheduling property with direct implications for neural network architecture design.</p> <p><em>See license in repository zip.</em></p>
format Digital resource
id zenodo_https___doi_org_10_5281_zenodo_19025822
institution Zenodo
language
publishDate 2026
publisher Zenodo
record_format zenodo
spellingShingle 4-Pipeline Parallel Dispatch for Low-Bit GPU Neural Inference
Pirolo, Andres
BNN
CNN
Edge AI
Data center optimization
title 4-Pipeline Parallel Dispatch for Low-Bit GPU Neural Inference
topic BNN
CNN
Edge AI
Data center optimization
url https://doi.org/10.5281/zenodo.19025822