KVzap
KVzap is a KV cache pruning method that aims to accelerate LLM inference in both prefilling and decoding. It applies a lightweight model to the hidden states to predict an importance score for every KV pair and prunes those whose score falls below a given threshold.
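As a rough illustration of the idea (not the actual KVzap implementation), the sketch below scores each position with a small MLP over the hidden states and keeps only the KV pairs above a threshold. The scorer architecture, dimensions, and threshold value are assumptions made for the example.

```python
import torch
import torch.nn as nn

class ToyKVScorer(nn.Module):
    """Illustrative lightweight scorer: hidden state -> one importance score per token."""
    def __init__(self, hidden_dim: int = 4096, proj_dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, proj_dim),
            nn.ReLU(),
            nn.Linear(proj_dim, 1),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim) -> scores: (batch, seq_len)
        return self.mlp(hidden_states).squeeze(-1)

# Toy shapes standing in for a real model's KV cache
batch, seq_len, hidden_dim, n_heads, head_dim = 1, 10, 4096, 8, 64
hidden = torch.randn(batch, seq_len, hidden_dim)
keys = torch.randn(batch, n_heads, seq_len, head_dim)
values = torch.randn(batch, n_heads, seq_len, head_dim)

scorer = ToyKVScorer(hidden_dim)
scores = scorer(hidden)                # (batch, seq_len)
keep = scores[0] > 0.0                 # threshold 0.0 is an arbitrary placeholder
pruned_keys = keys[:, :, keep, :]      # retain only the selected positions
pruned_values = values[:, :, keep, :]
print(f"kept {keep.sum().item()} of {seq_len} KV pairs")
```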
KVzap is trained as a fast approximation of KVzip+, using 1.2M samples from Nemotron-Pretraining-Dataset-sample. Training code is available in the kvpress repository (source).
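A minimal sketch of how such a scorer could be fit as a fast approximation of a slower teacher's importance scores. The loss choice, batch shapes, and random stand-in data below are assumptions for illustration, not the training recipe from the kvpress repository.

```python
import torch
import torch.nn as nn

# Assumed setup: precomputed (hidden_state, teacher_score) pairs, where the
# teacher scores come from a slower importance estimator such as KVzip+.
hidden_dim = 4096
scorer = nn.Sequential(nn.Linear(hidden_dim, 128), nn.ReLU(), nn.Linear(128, 1))
optimizer = torch.optim.AdamW(scorer.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

for step in range(100):                          # placeholder number of steps
    hidden = torch.randn(32, hidden_dim)         # stand-in for real hidden states
    teacher_scores = torch.rand(32, 1)           # stand-in for teacher importance scores
    pred = scorer(hidden)                        # lightweight model's predicted scores
    loss = loss_fn(pred, teacher_scores)         # regress onto the teacher's scores
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```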