| Main Author: | Pirolo, Andres |
|---|---|
| Format: | Digital resource |
| Language: | |
| Published: | Zenodo, 2026 |
| Subjects: | |
| Online Access: | https://doi.org/10.5281/zenodo.19025822 |
| _version_ | 1866901517572767744 |
|---|---|
| author | Pirolo, Andres |
| author_facet | Pirolo, Andres |
| contents | <p>This work extends the adaptive per-block mode selector introduced in Part I with three main contributions.</p> <p>First, we introduce a fourth pipeline mode that encodes quaternary weights {0,1,2,3} using a dual-plane XNOR representation.<br>Second, we present a compiler optimization ablation demonstrating that standard GPU optimization techniques, such as <code>switch</code> statements and shared memory padding, do not improve performance on the Adreno 740.<br>Third, we provide the first empirical characterization of multi-layer parallel dispatch overlap on a commodity mobile GPU.</p> <p>The quaternary pipeline achieves throughput between that of the ternary and binary modes and outperforms binary by up to <strong>3.1× at low batch sizes</strong>, revealing a previously undocumented regime where <strong>instruction count per thread dominates over arithmetic intensity</strong>.</p> <p>The compiler ablation shows that <code>if</code> cascades and <code>switch()</code> statements produce equivalent throughput on Adreno 740, indicating that the compiler already generates efficient jump tables. Additionally, shared memory padding techniques reported in prior GPU optimization literature degrade performance by up to <strong>2.7%</strong>, confirming that Adreno 740 does not exhibit the shared-memory bank conflict patterns characteristic of NVIDIA architectures.</p> <p>Finally, multi-layer parallel dispatch (submitting compute shaders for multiple layers without synchronization barriers between them) achieves <strong>34–44% execution overlap</strong> on the validation GPU. Overlap efficiency scales with layer size: at <strong>N = 16384</strong>, two-layer dispatch achieves <strong>44.3% overlap</strong>, yielding a <strong>1.58× forward-pass speedup</strong> for <strong>32-layer transformer inference</strong>. Asymmetric pipeline pairing consistently outperforms symmetric pairing, revealing a scheduling property with direct implications for neural network architecture design.</p> <p><em>See license in repository zip.</em></p> |
| format | Digital resource |
| id | zenodo_https___doi_org_10_5281_zenodo_19025822 |
| institution | Zenodo |
| language | |
| publishDate | 2026 |
| publisher | Zenodo |
| record_format | zenodo |
| title | 4-Pipeline Parallel Dispatch for Low-Bit GPU Neural Inference |
| topic | BNN, CNN, Edge AI, Data center optimization |
| url | https://doi.org/10.5281/zenodo.19025822 |
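
The abstract mentions a dual-plane XNOR representation for quaternary weights {0,1,2,3} but this record does not say how the planes are laid out. The sketch below shows one plausible reading, not the author's method: each weight is split into a high and a low bit plane, activations are assumed to be binary (+1/-1), and every name (`pack_bits`, `xnor_dot`, `quaternary_dot`) is hypothetical. Treating each plane as a ±1 vector, the identity w = p_hi + p_lo/2 + 3/2 lets the dot product be recovered from three XNOR-popcount passes.

```python
def pack_bits(bits):
    """Pack a list of 0/1 values into one integer, element i at bit i."""
    word = 0
    for i, b in enumerate(bits):
        word |= (b & 1) << i
    return word


def xnor_dot(x, y, n):
    """Dot product of two +1/-1 vectors stored as n-bit words (bit 1 = +1, bit 0 = -1)."""
    mask = (1 << n) - 1
    agree = ~(x ^ y) & mask          # XNOR: 1 wherever the signs agree
    return 2 * bin(agree).count("1") - n


def quaternary_dot(weights, activations):
    """Dot product of weights in {0,1,2,3} with +1/-1 activations via two bit planes.

    Assumed encoding (not taken from the source): w = 2*hi + lo with hi, lo in {0,1}.
    """
    n = len(weights)
    lo = pack_bits([w & 1 for w in weights])
    hi = pack_bits([(w >> 1) & 1 for w in weights])
    act = pack_bits([1 if a > 0 else 0 for a in activations])
    ones = (1 << n) - 1              # all-(+1) vector, used to sum the activations

    d_hi = xnor_dot(act, hi, n)      # activations . (2*hi - 1)
    d_lo = xnor_dot(act, lo, n)      # activations . (2*lo - 1)
    s = xnor_dot(act, ones, n)       # sum of activations

    # 2 * (a . w) = 2*d_hi + d_lo + 3*s, which is always even.
    return (2 * d_hi + d_lo + 3 * s) // 2


if __name__ == "__main__":
    w = [0, 1, 2, 3, 3, 0, 2, 1]
    a = [1, -1, 1, -1, 1, 1, -1, -1]
    print(quaternary_dot(w, a), sum(x * y for x, y in zip(w, a)))
```

A GPU shader would presumably hold the planes in 32-bit words and use a hardware popcount; Python integers stand in for those words here.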