:: Library Catalog

Bibliographic Details
Main Authors:	Zhou, Yang; Chen, Zhuoming; Xu, Zhaozhuo; Lin, Victoria; Chen, Beidi
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2409.03856

Similar Items

GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity?
by: Zhou, Yang, et al.
Published: (2025)

TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
by: Sun, Hanshi, et al.
Published: (2024)

Kinetics: Rethinking Test-Time Scaling Laws
by: Sadhukhan, Ranajoy, et al.
Published: (2025)

SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices
by: Svirschevski, Ruslan, et al.
Published: (2024)

Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding
by: Chen, Zhuoming, et al.
Published: (2024)

MagicPIG: LSH Sampling for Efficient LLM Generation
by: Chen, Zhuoming, et al.
Published: (2024)

Learn To be Efficient: Build Structured Sparsity in Large Language Models
by: Zheng, Haizhong, et al.
Published: (2024)

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
by: Liu, Zirui, et al.
Published: (2024)

Do LLMs Know to Respect Copyright Notice?
by: Xu, Jialiang, et al.
Published: (2024)

Sensitivity Meets Sparsity: The Impact of Extremely Sparse Parameter Patterns on Theory-of-Mind of Large Language Models
by: Wu, Yuheng, et al.
Published: (2025)

MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding
by: Sadhukhan, Ranajoy, et al.
Published: (2024)

Prompt-prompted Adaptive Structured Pruning for Efficient LLM Generation
by: Dong, Harry, et al.
Published: (2024)

WWW.Serve: Interconnecting Global LLM Services through Decentralization
by: Wang, Huanyu, et al.
Published: (2026)

Nearest Neighbor Speculative Decoding for LLM Generation and Attribution
by: Li, Minghan, et al.
Published: (2024)

Efficient Streaming Language Models with Attention Sinks
by: Xiao, Guangxuan, et al.
Published: (2023)

FSA: An Alternative Efficient Implementation of Native Sparse Attention Kernel
by: Yan, Ran, et al.
Published: (2025)

Speculative Prefill: Turbocharging TTFT with Lightweight and Training-Free Token Importance Estimation
by: Liu, Jingyu, et al.
Published: (2025)

AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning
by: Zhong, Yiwu, et al.
Published: (2024)

Detecting Contextual Hallucinations in LLMs with Frequency-Aware Attention
by: Qi, Siya, et al.
Published: (2026)

The Diminishing Returns of Early-Exit Decoding in Modern LLMs
by: Wei, Rui, et al.
Published: (2026)

Token-wise Influential Training Data Retrieval for Large Language Models
by: Lin, Huawei, et al.
Published: (2024)

CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models
by: Lee, Donghyun, et al.
Published: (2024)

Pangu Pro MoE: Mixture of Grouped Experts for Efficient Sparsity
by: Tang, Yehui, et al.
Published: (2025)

Towards Extreme Pruning of LLMs with Plug-and-Play Mixed Sparsity
by: Xu, Chi, et al.
Published: (2025)

LoCoCo: Dropping In Convolutions for Long Context Compression
by: Cai, Ruisi, et al.
Published: (2024)

Self-Correction Makes LLMs Better Parsers
by: Zhang, Ziyan, et al.
Published: (2025)

Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length
by: Ma, Xuezhe, et al.
Published: (2024)

SFR-RAG: Towards Contextually Faithful LLMs
by: Nguyen, Xuan-Phi, et al.
Published: (2024)

Zeroth-Order Fine-Tuning of LLMs with Extreme Sparsity
by: Guo, Wentao, et al.
Published: (2024)

Supervised Optimism Correction: Be Confident When LLMs Are Sure
by: Zhang, Junjie, et al.
Published: (2025)

DEL-ToM: Inference-Time Scaling for Theory-of-Mind Reasoning via Dynamic Epistemic Logic
by: Wu, Yuheng, et al.
Published: (2025)

NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-add-free Attention
by: Zhang, Tianyi, et al.
Published: (2024)

Long Exposure: Accelerating Parameter-Efficient Fine-Tuning for LLMs under Shadowy Sparsity
by: Wang, Tuowei, et al.
Published: (2025)

Enhancing Cross-Tokenizer Knowledge Distillation with Contextual Dynamical Mapping
by: Chen, Yijie, et al.
Published: (2025)

When Does Sparsity Mitigate the Curse of Depth in LLMs
by: Muhtar, Dilxat, et al.
Published: (2026)

Instruction Tuning and CoT Prompting for Contextual Medical QA with LLMs
by: Le, Chenqian, et al.
Published: (2025)

Scalable LLM Reasoning Acceleration with Low-rank Distillation
by: Dong, Harry, et al.
Published: (2025)

Jackpot: Optimal Budgeted Rejection Sampling for Extreme Actor-Policy Mismatch Reinforcement Learning
by: Chen, Zhuoming, et al.
Published: (2026)

Confidence v.s. Critique: A Decomposition of Self-Correction Capability for LLMs
by: Yang, Zhe, et al.
Published: (2024)

SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models
by: Lu, Xudong, et al.
Published: (2024)