Library Catalog

Bibliographic Details
Main Authors:	Liu, James, Xiao, Guangxuan, Li, Kai, Lee, Jason D., Han, Song, Dao, Tri, Cai, Tianle
Format:	Preprint
Published:	2024
Subjects:	Machine Learning, Computation and Language
Online Access:	https://arxiv.org/abs/2402.10193

Similar Items

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
by: Cai, Tianle, et al.
Published: (2024)

OneBit: Towards Extremely Low-bit Large Language Models
by: Xu, Yuzhuang, et al.
Published: (2024)

An Extra RMSNorm is All You Need for Fine Tuning to 1.58 Bits
by: Steinmetz, Cody, et al.
Published: (2025)

LittleBit: Ultra Low-Bit Quantization via Latent Factorization
by: Lee, Banseok, et al.
Published: (2025)

BitNet Distillation
by: Wu, Xun, et al.
Published: (2025)

LowRA: Accurate and Efficient LoRA Fine-Tuning of LLMs under 2 Bits
by: Zhou, Zikai, et al.
Published: (2025)

Optimizing Mixture of Block Attention
by: Xiao, Guangxuan, et al.
Published: (2025)

BitDistiller: Unleashing the Potential of Sub-4-Bit LLMs via Self-Distillation
by: Du, Dayou, et al.
Published: (2024)

Reward Collapse in Aligning Large Language Models
by: Song, Ziang, et al.
Published: (2023)

Bit Blasting Probabilistic Programs
by: Garg, Poorva, et al.
Published: (2023)

Multi-Bit Distortion-Free Watermarking for Large Language Models
by: Boroujeny, Massieh Kordi, et al.
Published: (2024)

I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models
by: Hu, Xing, et al.
Published: (2024)

Bit-Vector CHC Solving for Binary Analysis and Binary Analysis for Bit-Vector CHC Solving
by: Bembenek, Aaron, et al.
Published: (2026)

Equational Bit-Vector Solving via Strong Gröbner Bases
by: Song, Jiaxin, et al.
Published: (2024)

Mixed-Precision Graph Neural Quantization for Low Bit Large Language Models
by: Liu, Wanlong, et al.
Published: (2025)

Efficient Streaming Language Models with Attention Sinks
by: Xiao, Guangxuan, et al.
Published: (2023)

QuIP: 2-Bit Quantization of Large Language Models With Guarantees
by: Chee, Jerry, et al.
Published: (2023)

BitDecoding: Unlocking Tensor Cores for Long-Context LLMs with Low-Bit KV Cache
by: Du, Dayou, et al.
Published: (2025)

Dr. SoW: Density Ratio of Strong-over-weak LLMs for Reducing the Cost of Human Annotation in Preference Tuning
by: Xu, Guangxuan, et al.
Published: (2024)

BitMar: Low-Bit Multimodal Fusion with Episodic Memory for Edge Devices
by: Aman, Euhid, et al.
Published: (2025)

Bit-level BPE: Below the byte boundary
by: Moon, Sangwhan, et al.
Published: (2025)

XAttention: Block Sparse Attention with Antidiagonal Scoring
by: Xu, Ruyi, et al.
Published: (2025)

Unlocking the Theory Behind Scaling 1-Bit Neural Networks
by: Daliri, Majid, et al.
Published: (2024)

Marking: Visual Grading with Highlighting Errors and Annotating Missing Bits
by: Sonkar, Shashank, et al.
Published: (2024)

To be Continuous, or to be Discrete, Those are Bits of Questions
by: Wang, Yiran, et al.
Published: (2024)

BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models
by: Ben-Zaken, Elad, et al.
Published: (2021)

Low-Resource Domain Adaptation for Speech LLMs via Text-Only Fine-Tuning
by: Fang, Yangui, et al.
Published: (2025)

Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
by: Tang, Jiaming, et al.
Published: (2024)

Learning to Prioritize IT Tickets: A Comparative Evaluation of Embedding-based Approaches and Fine-Tuned Transformer Models
by: LÊ, Minh Tri, et al.
Published: (2025)

BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion
by: Zhuang, Shaobin, et al.
Published: (2026)

Hardware-Efficient Attention for Fast Decoding
by: Zadouri, Ted, et al.
Published: (2025)

Majority Bit-Aware Watermarking For Large Language Models
by: Xu, Jiahao, et al.
Published: (2025)

FrameQuant: Flexible Low-Bit Quantization for Transformers
by: Adepu, Harshavardhan, et al.
Published: (2024)

BitNet b1.58 2B4T Technical Report
by: Ma, Shuming, et al.
Published: (2025)

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
by: Xiao, Guangxuan, et al.
Published: (2022)

H1B-KV: Hybrid One-Bit Caches for Memory-Efficient Large Language Model Inference
by: Vejendla, Harshil
Published: (2025)

A New Pipeline For Generating Instruction Dataset via RAG and Self Fine-Tuning
by: Song, Chih-Wei, et al.
Published: (2024)

QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead
by: Zandieh, Amir, et al.
Published: (2024)

REST: Retrieval-Based Speculative Decoding
by: He, Zhenyu, et al.
Published: (2023)

StreamingVLM: Real-Time Understanding for Infinite Video Streams
by: Xu, Ruyi, et al.
Published: (2025)