| Main Authors: | Zhou, Yang, Chen, Zhuoming, Xu, Zhaozhuo, Lin, Victoria, Chen, Beidi |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2409.03856 |
Similar Items
GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity?
by: Zhou, Yang, et al.
Published: (2025)
TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
by: Sun, Hanshi, et al.
Published: (2024)
Kinetics: Rethinking Test-Time Scaling Laws
by: Sadhukhan, Ranajoy, et al.
Published: (2025)
SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices
by: Svirschevski, Ruslan, et al.
Published: (2024)
Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding
by: Chen, Zhuoming, et al.
Published: (2024)
MagicPIG: LSH Sampling for Efficient LLM Generation
by: Chen, Zhuoming, et al.
Published: (2024)
Learn To be Efficient: Build Structured Sparsity in Large Language Models
by: Zheng, Haizhong, et al.
Published: (2024)
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
by: Liu, Zirui, et al.
Published: (2024)
Do LLMs Know to Respect Copyright Notice?
by: Xu, Jialiang, et al.
Published: (2024)
Sensitivity Meets Sparsity: The Impact of Extremely Sparse Parameter Patterns on Theory-of-Mind of Large Language Models
by: Wu, Yuheng, et al.
Published: (2025)
MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding
by: Sadhukhan, Ranajoy, et al.
Published: (2024)
Prompt-prompted Adaptive Structured Pruning for Efficient LLM Generation
by: Dong, Harry, et al.
Published: (2024)
WWW.Serve: Interconnecting Global LLM Services through Decentralization
by: Wang, Huanyu, et al.
Published: (2026)
Nearest Neighbor Speculative Decoding for LLM Generation and Attribution
by: Li, Minghan, et al.
Published: (2024)
Efficient Streaming Language Models with Attention Sinks
by: Xiao, Guangxuan, et al.
Published: (2023)
FSA: An Alternative Efficient Implementation of Native Sparse Attention Kernel
by: Yan, Ran, et al.
Published: (2025)
Speculative Prefill: Turbocharging TTFT with Lightweight and Training-Free Token Importance Estimation
by: Liu, Jingyu, et al.
Published: (2025)
AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning
by: Zhong, Yiwu, et al.
Published: (2024)
Detecting Contextual Hallucinations in LLMs with Frequency-Aware Attention
by: Qi, Siya, et al.
Published: (2026)
The Diminishing Returns of Early-Exit Decoding in Modern LLMs
by: Wei, Rui, et al.
Published: (2026)
Token-wise Influential Training Data Retrieval for Large Language Models
by: Lin, Huawei, et al.
Published: (2024)
CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models
by: Lee, Donghyun, et al.
Published: (2024)
Pangu Pro MoE: Mixture of Grouped Experts for Efficient Sparsity
by: Tang, Yehui, et al.
Published: (2025)
Towards Extreme Pruning of LLMs with Plug-and-Play Mixed Sparsity
by: Xu, Chi, et al.
Published: (2025)
LoCoCo: Dropping In Convolutions for Long Context Compression
by: Cai, Ruisi, et al.
Published: (2024)
Self-Correction Makes LLMs Better Parsers
by: Zhang, Ziyan, et al.
Published: (2025)
Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length
by: Ma, Xuezhe, et al.
Published: (2024)
SFR-RAG: Towards Contextually Faithful LLMs
by: Nguyen, Xuan-Phi, et al.
Published: (2024)
Zeroth-Order Fine-Tuning of LLMs with Extreme Sparsity
by: Guo, Wentao, et al.
Published: (2024)
Supervised Optimism Correction: Be Confident When LLMs Are Sure
by: Zhang, Junjie, et al.
Published: (2025)
DEL-ToM: Inference-Time Scaling for Theory-of-Mind Reasoning via Dynamic Epistemic Logic
by: Wu, Yuheng, et al.
Published: (2025)
NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-add-free Attention
by: Zhang, Tianyi, et al.
Published: (2024)
Long Exposure: Accelerating Parameter-Efficient Fine-Tuning for LLMs under Shadowy Sparsity
by: Wang, Tuowei, et al.
Published: (2025)
Enhancing Cross-Tokenizer Knowledge Distillation with Contextual Dynamical Mapping
by: Chen, Yijie, et al.
Published: (2025)
When Does Sparsity Mitigate the Curse of Depth in LLMs
by: Muhtar, Dilxat, et al.
Published: (2026)
Instruction Tuning and CoT Prompting for Contextual Medical QA with LLMs
by: Le, Chenqian, et al.
Published: (2025)
Scalable LLM Reasoning Acceleration with Low-rank Distillation
by: Dong, Harry, et al.
Published: (2025)
Jackpot: Optimal Budgeted Rejection Sampling for Extreme Actor-Policy Mismatch Reinforcement Learning
by: Chen, Zhuoming, et al.
Published: (2026)
Confidence v.s. Critique: A Decomposition of Self-Correction Capability for LLMs
by: Yang, Zhe, et al.
Published: (2024)
SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models
by: Lu, Xudong, et al.
Published: (2024)