| Main Authors: | Li, Qingyuan, Zhang, Bo, Ye, Liang, Zhang, Yifan, Wu, Wei, Sun, Yerui, Ma, Lin, Xie, Yuchen |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2412.04964 |
Similar Items
Integer Scale: A Free Lunch for Faster Fine-grained Quantization of LLMs
by: Li, Qingyuan, et al.
Published: (2024)
FlashCommunication V2: Bit Splitting and Spike Reserving for Any Bit Communication
by: Li, Qingyuan, et al.
Published: (2025)
MONA: Muon Optimizer with Nesterov Acceleration for Scalable Language Model Training
by: Li, Jiacheng, et al.
Published: (2026)
AsyncTLS: Efficient Generative LLM Inference with Asynchronous Two-level Sparse Attention
by: Hu, Yuxuan, et al.
Published: (2026)
SVIP: Towards Verifiable Inference of Open-source Large Language Models
by: Sun, Yifan, et al.
Published: (2024)
TAPAS: Fast and Automatic Derivation of Tensor Parallel Strategies for Large Neural Networks
by: Shi, Ziji, et al.
Published: (2023)
Solving Schrödinger Equation Using Tensor Neural Network
by: Liao, Yangfei, et al.
Published: (2022)
FG$^2$-GDN: Enhancing Long-Context Gated Delta Networks with Doubly Fine-Grained Control
by: Sun, Pingwei, et al.
Published: (2026)
Communication Compression for Tensor Parallel LLM Inference
by: Hansen-Palmus, Jan, et al.
Published: (2024)
Communication-Efficient and Tensorized Federated Fine-Tuning of Large Language Models
by: Ghiasvand, Sajjad, et al.
Published: (2024)
Optimizing Native Sparse Attention with Latent Attention and Local Global Alternating Strategies
by: Hu, Yuxuan, et al.
Published: (2025)
Flash-Searcher: Fast and Effective Web Agents via DAG-Based Parallel Execution
by: Qin, Tianrui, et al.
Published: (2025)
Unveiling Super Experts in Mixture-of-Experts Large Language Models
by: Su, Zunhai, et al.
Published: (2025)
BBox-Adapter: Lightweight Adapting for Black-Box Large Language Models
by: Sun, Haotian, et al.
Published: (2024)
Concept Bottleneck Large Language Models
by: Sun, Chung-En, et al.
Published: (2024)
Fast Distributed Inference Serving for Large Language Models
by: Wu, Bingyang, et al.
Published: (2023)
Online Tensor Inference
by: Wen, Xin, et al.
Published: (2023)
Pruning All-Rounder: Rethinking and Improving Inference Efficiency for Large Vision Language Models
by: Suo, Wei, et al.
Published: (2024)
FlashDecoding++: Faster Large Language Model Inference on GPUs
by: Hong, Ke, et al.
Published: (2023)
Efficient Inference for Large Vision-Language Models: Bottlenecks, Techniques, and Prospects
by: Zhang, Jun, et al.
Published: (2026)
FlashSampling: Fast and Memory-Efficient Exact Sampling
by: Ruiz, Tomas, et al.
Published: (2026)
FlashBias: Fast Computation of Attention with Bias
by: Wu, Haixu, et al.
Published: (2025)
Towards Low-bit Communication for Tensor Parallel LLM Inference
by: Dong, Harry, et al.
Published: (2024)
SonicBench: Dissecting the Physical Perception Bottleneck in Large Audio Language Models
by: Sun, Yirong, et al.
Published: (2026)
Fast Inference for Augmented Large Language Models
by: Shahout, Rana, et al.
Published: (2024)
LearnedFTL: A Learning-Based Page-Level FTL for Reducing Double Reads in Flash-Based SSDs
by: Wang, Shengzhe, et al.
Published: (2023)
Parallel Track Transformers: Enabling Fast GPU Inference with Reduced Synchronization
by: Wang, Chong, et al.
Published: (2026)
AFA-LoRA: Enabling Non-Linear Adaptations in LoRA with Activation Function Annealing
by: Li, Jiacheng, et al.
Published: (2025)
FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores
by: Shi, Jinliang, et al.
Published: (2024)
Token Communication in the Era of Large Models: An Information Bottleneck-Based Approach
by: Wei, Hao, et al.
Published: (2025)
FlashSVD v1.5: Making Low-Rank Transformers Inference Actually Fast
by: Wu, Wenhao, et al.
Published: (2026)
DeInfer: Efficient Parallel Inferencing for Decomposed Large Language Models
by: Huang, You-Liang, et al.
Published: (2026)
FlashMP: Fast Discrete Transform-Based Solver for Preconditioning Maxwell's Equations on GPUs
by: Zhang, Haoyuan, et al.
Published: (2025)
Scaling Embeddings Outperforms Scaling Experts in Language Models
by: Liu, Hong, et al.
Published: (2026)
FlashMoE: Reducing SSD I/O Bottlenecks via ML-Based Cache Replacement for Mixture-of-Experts Inference on Edge Devices
by: Kim, Byeongju, et al.
Published: (2026)
λScale: Enabling Fast Scaling for Serverless Large Language Model Inference
by: Yu, Minchen, et al.
Published: (2025)
Do Advanced Language Models Eliminate the Need for Prompt Engineering in Software Engineering?
by: Wang, Guoqing, et al.
Published: (2024)
Phi: Preference Hijacking in Multi-modal Large Language Models at Inference Time
by: Lan, Yifan, et al.
Published: (2025)
FlashRecovery: Fast and Low-Cost Recovery from Failures for Large-Scale Training of LLMs
by: Zhang, Haijun, et al.
Published: (2025)
TPLA: Tensor Parallel Latent Attention for Efficient Disaggregated Prefill and Decode Inference
by: Tang, Xiaojuan, et al.
Published: (2025)