Also known as: GGML, Q4_K_M, llama.cpp quantization, K-quants
TL;DR
The file format and quantization scheme that powers llama.cpp — the de-facto local-inference stack for LLMs on commodity hardware. GGUF embeds tokenizer, chat template, and quantized weights in a single mmap-able artifact.
GGUF, the successor to GGML, is the file format and quantization scheme used by llama.cpp, the de-facto local-inference stack for LLMs on commodity hardware. While GPTQ and AWQ dominate server-side INT4 weight-only quantization on data-center GPUs, GGUF rules the other deployment regime: CPU inference, Apple Silicon (Metal), consumer GPUs, and anything where streaming weights from disk via mmap matters more than tensor-core throughput.
The K-quant family — Q4_K_M, Q5_K_M, Q6_K, Q3_K_M, and friends — is what most users actually mean when they say “GGUF.” It’s a two-level hierarchical quantization scheme tuned for CPU SIMD decode, with per-layer mixed-precision allocation hand-tuned by the llama.cpp authors based on layer sensitivity.
What GGUF actually is
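A GGUF file is a single binary artifact with three sections: a header, metadata (key-value pairs plus tensor info), and the tensor data itself. Three properties follow from that layout: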
Tokenizer + chat template embedded. A .gguf file is portable — copy it to a Mac mini and llama-server knows how to tokenize and how to format chat messages. No external tokenizer.json, no chat_template.jinja. This is why GGUF distribution beat the alternatives.
Mmap-friendly. Tensor data is aligned, contiguous, and readable in-place — llama.cpp mmaps the file and the OS handles paging weights into RAM as they’re touched. That makes “load a 70B model from disk” a sub-second operation that pages in only the layers actually used. On a Mac with 24 GB of RAM, you can run a 30 GB model — the OS just streams the weights it needs.
Mixed precision per tensor. The quantization dtype is per-tensor in the file. A single GGUF file can store some tensors in Q4_K, others in Q6_K, others in F16 — exactly what the K-quant _S/_M/_L policies exploit.
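To make the single-file layout concrete, here is a minimal sketch of reading the fixed-size header at the start of a GGUF file. It assumes the GGUF v2/v3 layout as I understand it (4-byte magic, uint32 version, uint64 tensor count, uint64 metadata key-value count, all little-endian); the function name `parse_gguf_header` and the synthetic example values are made up for illustration.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_FORMAT = "<4sIQQ"  # magic, version, tensor count, metadata KV count (assumed v2/v3 layout)
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 24 bytes


def parse_gguf_header(buf):
    """Parse the fixed-size header from a bytes-like object (bytes, bytearray, or mmap)."""
    if len(buf) < HEADER_SIZE:
        raise ValueError("buffer too short for a GGUF header")
    magic, version, tensor_count, kv_count = struct.unpack_from(HEADER_FORMAT, buf, 0)
    if magic != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Synthetic header built in memory; the counts are arbitrary illustration values.
example = struct.pack(HEADER_FORMAT, GGUF_MAGIC, 3, 291, 24)
header = parse_gguf_header(example)
print(header)
```

Because the function only needs a bytes-like object, the same call should work on an mmap of a real file; the metadata key-value pairs and tensor info that follow the header are variable-length and are not parsed here.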
The K-quant family
K-quants use a two-level hierarchical scaling scheme. Weights are grouped into blocks of 16 (for 2/3/6-bit) or 32 (for 4/5-bit); 16 blocks of 16, or 8 blocks of 32, form a super-block of 256 weights. The super-block carries one FP16 scale; each block within it carries a tiny (4 or 6-bit) sub-scale relative to the super-block scale.
The two-level scheme does two jobs at once: the FP16 super-scale handles the macro magnitude variation across the model, and the 6-bit sub-scales handle the local distribution shifts within a super-block. Per-block FP16 scales (what naive group quantization uses) cost 0.5 bits/elem for a scale alone (16 bits per 32 weights) and 1.0 bits/elem if each block also stores an FP16 min. The K-quant sub-scale-of-a-super-scale design fits both a scale and a min per block into 0.5 bits/elem total for Q4_K: two FP16 super-block values plus eight 6-bit scales and eight 6-bit mins come to 128 bits per 256 weights. That recovered budget is what lets Q4_K_M outperform Q4_0 (legacy uniform per-block) at the same bit count.
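The two-level idea can be sketched in a few lines. This is a simplified, symmetric illustration (no per-block min, no bit packing), not the actual ggml encoding: it assumes 8 blocks of 32 weights, signed 4-bit codes, 6-bit sub-scales, and one FP16 super-scale. All function names are hypothetical.

```python
import random
import struct

BLOCK = 32             # weights per block (4-bit layout)
BLOCKS_PER_SUPER = 8   # 8 * 32 = 256 weights per super-block


def to_fp16(x):
    """Round a Python float to the nearest FP16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def quantize_superblock(weights):
    """Return (FP16 super-scale, 6-bit sub-scales, 4-bit codes) for 256 weights."""
    assert len(weights) == BLOCK * BLOCKS_PER_SUPER
    blocks = [weights[i:i + BLOCK] for i in range(0, len(weights), BLOCK)]
    # Level 1: the scale each block would like to have on its own.
    ideal = [max(abs(w) for w in b) / 7.0 for b in blocks]
    # Level 2: express those scales as 6-bit multiples of one FP16 super-scale.
    d = to_fp16(max(ideal) / 63.0)
    sub = [min(63, round(s / d)) if d > 0 else 0 for s in ideal]
    codes = []
    for b, q in zip(blocks, sub):
        eff = d * q  # effective per-block scale after both levels
        for w in b:
            c = round(w / eff) if eff > 0 else 0
            codes.append(max(-8, min(7, c)))
    return d, sub, codes


def dequantize_superblock(d, sub, codes):
    """Reconstruct weights as super-scale * sub-scale * code."""
    return [d * sub[i // BLOCK] * c for i, c in enumerate(codes)]


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(256)]
d, sub, codes = quantize_superblock(weights)
restored = dequantize_superblock(d, sub, codes)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"super-scale={d:.6f}  max abs error={max_err:.6f}")
```

The storage cost of the scales in this sketch is 16 bits for `d` plus 8 x 6 bits for `sub`, which is where the overhead saving over one FP16 scale per block comes from.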
The bit-budget table
| Format | Bits/elem | Block size | Notes |
|---|---|---|---|
| Q2_K | 2.625 | 16 | Aggressive. Quality cliff for most models below 70B. |
| Q4_K_M | 4.5 | 32 | The recommended default. Sensitive layers in Q5_K/Q6_K. |
| Q5_K_S | 5.5 | 32 | Small-K 5-bit. |
| Q5_K_M | 5.5 | 32 | Medium-K 5-bit. Quality leans on the FP16 escalations. |
| Q6_K | 6.5625 | 16 | Near-FP16 quality. Often used as the high-precision tensor in K_M mixes. |
| Q8_0 | 8.5 | 32 | Per-block INT8 with one FP16 scale per 32 weights, no K-quant tricks. The “boring but safe” 8-bit. |
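The bits/elem figures follow from simple arithmetic: payload bits per weight plus metadata bits amortized over the block or super-block. The sketch below reproduces that arithmetic; the metadata layouts in `LAYOUTS` reflect my understanding of the ggml block structs and should be read as assumptions, not as a specification.

```python
def bits_per_elem(weight_bits, n_weights, metadata_bits):
    """Payload bits per weight plus metadata amortized over the block."""
    return weight_bits + metadata_bits / n_weights


# name: (weight bits, weights per block or super-block, metadata bits)
LAYOUTS = {
    "Q4_0": (4, 32, 16),                        # one FP16 scale per 32 weights
    "Q8_0": (8, 32, 16),                        # one FP16 scale per 32 weights
    "Q2_K": (2, 256, 2 * 16 + 16 * (4 + 4)),    # 2 FP16 values + 16 x (4-bit scale, 4-bit min)
    "Q4_K": (4, 256, 2 * 16 + 8 * (6 + 6)),     # 2 FP16 values + 8 x (6-bit scale, 6-bit min)
    "Q5_K": (5, 256, 2 * 16 + 8 * (6 + 6)),     # same metadata as Q4_K, one more payload bit
    "Q6_K": (6, 256, 16 + 16 * 8),              # 1 FP16 value + 16 x 8-bit scales
}

table = {name: bits_per_elem(*layout) for name, layout in LAYOUTS.items()}
for name, bits in table.items():
    print(f"{name}: {bits} bits/elem")
```

Note that Q4_0 and Q4_K land on the same total: the K-quant spends the same metadata budget on finer-grained information (a scale and a min per block) instead of a single FP16 scale.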
The S/M/L suffix on K-quants indicates the mix policy — what fraction of layers are escalated to higher-precision quants:
Q4_K_S (small-K-quant 4-bit):
    most attn + ffn        Q4_K
    norms, embed           F16
Q4_K_M (medium - default):
    attn_q, attn_k         Q4_K
    attn_v, attn_output    Q6_K   <- escalated to high precision
    ffn_gate, ffn_up       Q4_K
    ffn_down               Q6_K   <- escalated to high precision
    norms, embed           F16
Q4_K_L (large):
    most attn + ffn        Q4_K
    attn_v, attn_output    Q6_K
    ffn_down               Q6_K
    ffn_gate, ffn_up       Q5_K   <- additional escalations
    norms, embed           F16
Why those layers specifically? attn_v and the output projection are sensitivity hotspots: tensor importance studies on LLaMA showed perturbing them costs disproportionately on downstream evals. ffn_down matters because it writes directly into the residual stream, so small errors there add up across depth. The K_M policy is hand-tuned, not learned, but it’s a remarkably good Pareto point: at a nominal 4.5 bits/elem it consistently beats Q4_0 (also 4.5 bits/elem) on WikiText perplexity, with the size of the gap depending on the model.
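A mix policy like this amounts to a lookup from tensor name to quant type. The sketch below encodes the Q4_K_M policy as described in this article; the rule list, the function name `pick_quant`, and the example tensor names (written in the style llama.cpp uses for tensors) are illustrative assumptions, not the project's actual selection code.

```python
# (substring of the tensor name, quant type); first match wins.
Q4_K_M_POLICY = [
    ("attn_v", "Q6_K"),       # escalated
    ("attn_output", "Q6_K"),  # escalated
    ("ffn_down", "Q6_K"),     # escalated
    ("attn_q", "Q4_K"),
    ("attn_k", "Q4_K"),
    ("ffn_gate", "Q4_K"),
    ("ffn_up", "Q4_K"),
]


def pick_quant(tensor_name, policy=Q4_K_M_POLICY, default="F16"):
    """Return the quant type for a tensor; norms and embeddings fall through to F16."""
    for key, qtype in policy:
        if key in tensor_name:
            return qtype
    return default


for name in ["blk.0.attn_q.weight", "blk.0.attn_v.weight",
             "blk.0.ffn_down.weight", "blk.0.attn_norm.weight",
             "token_embd.weight"]:
    print(f"{name:26s} -> {pick_quant(name)}")
```

Swapping in a different rule list is all it takes to express the _S or _L variants, which is why the suffix is best read as a policy name instead of a distinct encoding.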
The K-quants are formally specified by ggml-quants.c in llama.cpp. The “K” doesn’t stand for a person or a thing — it’s the project-internal name for “the better quants we made in 2023 to replace the legacy Q4_0 / Q4_1 / Q5_0 / Q5_1 / Q8_0 family that just had per-block FP16 scales.”
The structural advances over legacy quants:
Two-level scaling. Legacy quants stored an FP16 scale per block of 32 weights, which is 0.5 bits/elem of overhead. K-quants store an FP16 super-scale per 256 weights plus a 6-bit sub-scale per 32, so the scale overhead drops to 0.25 bits/elem (64 bits per 256 weights) and the local fitting is better, not worse.
Asymmetric quantization with per-block min. The legacy _0 quants used symmetric (zero-centered) ranges, and the _1 variants paid for a second FP16 value per block to add a min. K-quants such as Q4_K and Q5_K store both a per-block min and a per-block scale at sub-scale precision, giving a better fit for weight blocks whose values are not centered on zero.
6-bit sub-scales. Storing the sub-scales themselves at 6 bits (rather than 8 or 16) is a big chunk of the overhead saving. The sub-scales are themselves quantized and then reconstructed — a third level of quantization that pays for itself in headline bits/elem.
Mixed-precision per-layer policy (the _S/_M/_L suffix). Strictly outside the per-tensor encoding but central to what users mean by “K-quant” — different layers get different K-quant dtypes based on their measured sensitivity.
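The benefit of a per-block min is easy to see on a block whose values are all on one side of zero. The following sketch compares a symmetric 4-bit round trip with a min-plus-scale round trip on such a block. It is a generic illustration of the two approaches, not the Q4_K bit layout, and the helper names are hypothetical.

```python
def roundtrip_symmetric(block, bits=4):
    """Quantize with a zero-centered range and reconstruct."""
    qmax = 2 ** (bits - 1) - 1                  # 7 for 4-bit
    scale = max(abs(w) for w in block) / qmax
    if scale == 0:
        return [0.0] * len(block)
    return [scale * max(-qmax - 1, min(qmax, round(w / scale))) for w in block]


def roundtrip_asymmetric(block, bits=4):
    """Quantize with a per-block min and scale and reconstruct."""
    levels = 2 ** bits - 1                      # 15 steps for 4-bit
    lo, hi = min(block), max(block)
    scale = (hi - lo) / levels
    if scale == 0:
        return [lo] * len(block)
    return [lo + scale * max(0, min(levels, round((w - lo) / scale))) for w in block]


def max_error(block, restored):
    return max(abs(a - b) for a, b in zip(block, restored))


# 32 values spread evenly over [0.1, 0.2]: nothing near zero, nothing negative.
block = [0.1 + 0.1 * i / 31 for i in range(32)]
sym_err = max_error(block, roundtrip_symmetric(block))
asym_err = max_error(block, roundtrip_asymmetric(block))
print(f"symmetric max error:  {sym_err:.4f}")
print(f"asymmetric max error: {asym_err:.4f}")
```

The symmetric version wastes most of its 16 codes on values the block never takes, while the asymmetric version spends all of them on the range that is actually occupied.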
The practical upshot: at 4.5 bits/elem, Q4_K_M is roughly equivalent to GPTQ-INT4 with group size 128 in measured perplexity on LLaMA-class models, while being CPU-decodable in real time. That’s the value proposition.
What llama.cpp actually does at inference
The runtime side matters because that’s where GGUF beats GPU formats on the relevant hardware:
llama.cpp inference flow (CPU/Metal):
1. mmap(model.gguf) -> kernel maps tensor data into virtual address space
2. Parse header + metadata + tensor info -> small in-RAM index
3. For each generation step:
   a. Tokenize input via embedded BPE/SentencePiece
   b. For each layer:
      - SIMD dequantize Q4_K block of 256 weights into FP32 scratch
      - Multiply with FP16/FP32 activation, accumulate in FP32
      - Apply norm, residual, attention, FFN
      - (On Metal: same pipeline but in Metal compute shaders)
   c. Sample next token, append to context
4. KV cache stays in RAM in FP16 (or quantized via --cache-type-k / --cache-type-v)
Key properties:
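Streaming-from-disk. Page faults pull only the pages being touched into RAM. This is why a Mac mini with 16 GB of RAM can run a 70B-parameter Q4_K_M GGUF (roughly 40 GB on disk at 4.5 bits/elem): most of the model is not resident in RAM at any given moment. The trade-off is speed, since a dense model touches every layer for every token and the OS has to re-read evicted pages from disk.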
CPU SIMD friendly. Q4_K decode is hand-tuned for AVX2, AVX-512, ARM NEON, and Apple’s AMX. The SIMD path dequantizes a super-block of 256 weights in tens of cycles.
Metal first-class. Apple Silicon is a primary target. Metal compute shaders for K-quant GEMM are part of the llama.cpp distribution; perf on M-series chips approaches the theoretical memory-bandwidth ceiling.
CUDA secondary. llama.cpp also runs on CUDA, but the K-quant kernels there are slower than the equivalent GPTQ/AWQ INT4 kernels on the same GPU. CUDA is a fallback path; CPU + Metal is the primary one.
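The streaming-from-disk behavior comes from ordinary mmap semantics, which can be demonstrated with the standard library. This sketch maps a small temporary file and decodes four float32 values at a known offset without reading the rest of the file into a buffer; the file contents and the helper name `read_floats_lazily` are made up for illustration and have nothing GGUF-specific about them.

```python
import mmap
import os
import struct
import tempfile


def read_floats_lazily(path, offset, count):
    """Map the file read-only and decode `count` little-endian float32 values at `offset`."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # Only the pages backing this range need to be faulted in by the OS.
            return list(struct.unpack_from(f"<{count}f", mapped, offset))


# Build a toy file: 4096 bytes of padding followed by four float32 values.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * 4096)
    tmp.write(struct.pack("<4f", 1.0, 2.0, 3.0, 4.0))
    path = tmp.name

values = read_floats_lazily(path, 4096, 4)
os.remove(path)
print(values)
```

A real loader would keep the mapping open for the lifetime of the model and hand out views into it; the point here is only that access by offset does not require loading the file up front.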
The local-LLM software stack
GPTQ and AWQ are server-side production quants; GGUF is the consumer/researcher quant. Both ecosystems matter; they evolved in parallel for different hardware targets. If your inference target is “data-center GPU with vLLM / TensorRT-LLM” you want AWQ. If your target is “anything else” (laptop, Mac, consumer GPU, CPU server) you want GGUF.
When to use GGUF vs the alternatives
The deployment-target decision
Mac (M-series) + local inference → GGUF. Metal kernels + mmap make this the only sane choice.
CPU-only inference (commodity server, edge box, no GPU) → GGUF. No real alternative exists at INT4 quality.
Consumer GPU (4090, 5090, etc.) + local inference → GGUF or AWQ; GGUF if the workflow already lives in Ollama / LM Studio, AWQ if you want maximum throughput.
Data-center GPU (H100, B200) + production serving → AWQ or NVFP4. GGUF kernels exist on CUDA but are slower than purpose-built GPU formats.
Mobile / embedded → GGUF, or further-quantized formats like Q2_K / Q3_K. The mmap + streaming-from-flash story matters here too.
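Once the target is GGUF, the follow-up question is usually which quant fits the memory available. A rough sketch of that arithmetic, using the bits/elem figures from the bit-budget table (with Q4_K_M at this article's 4.5 figure): it counts weights only and ignores the KV cache, metadata, and runtime buffers, so treat the result as a lower bound. The function names are hypothetical.

```python
# Bits per element, taken from the bit-budget table above.
BITS_PER_ELEM = {
    "Q2_K": 2.625,
    "Q4_K_M": 4.5,
    "Q5_K_M": 5.5,
    "Q6_K": 6.5625,
    "Q8_0": 8.5,
}


def estimated_size_gb(n_params, bits_per_elem):
    """Weights-only size estimate in decimal gigabytes."""
    return n_params * bits_per_elem / 8 / 1e9


def largest_quant_that_fits(n_params, budget_gb):
    """Return the highest-precision format whose weights fit the budget, or None."""
    fitting = [(bits, name) for name, bits in BITS_PER_ELEM.items()
               if estimated_size_gb(n_params, bits) <= budget_gb]
    return max(fitting)[1] if fitting else None


print(estimated_size_gb(70e9, 4.5))         # weights-only size of a 70B model at 4.5 bits/elem
print(largest_quant_that_fits(7e9, 8))      # 7B model, 8 GB budget
print(largest_quant_that_fits(70e9, 48))    # 70B model, 48 GB budget
```

When nothing fits, the function returns None; that is the case where mmap streaming or a smaller model becomes the practical option.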
Where GGUF is going
The K-quant family is mature; the active development frontier is imatrix-based quantization (using importance-matrix weights computed on a calibration set, similar in spirit to AWQ’s saliency analysis but applied to the K-quant grid placement) and iQuants (IQ2_XXS, IQ3_XS, etc.), very-low-bit quants at roughly 2 to 3.5 bits/elem that beat the original Q2_K/Q3_K_S on perplexity at the same bit budget. The iQuants closed most of the quality gap between GGUF and the academic SOTA at sub-4-bit (AQLM, QuIP#) for CPU-deployable formats.
GGUF is also expanding outside LLMs: CLIP encoders, audio models (Whisper), and embedding models all increasingly ship in .gguf for the same reasons text models do — mmap-able single-file artifacts that run anywhere.
The format won the consumer LLM ecosystem the moment llama.cpp won the consumer LLM inference engine, and the parallel evolution with GPTQ / AWQ on the server side is now stable: server folks ship AWQ, consumer folks ship Q4_K_M, and the smart open-weight model release packages both.
Go further
What does the _K mean in Q4_K_M?
K-quants use a two-level scaling scheme (a per-super-block FP16 scale plus a small 4- or 6-bit sub-scale per block) and a hand-tuned mixed-precision allocation that puts more sensitive layers (attention output projections, FFN down projections) in a higher-precision quant than less sensitive ones. The trailing letter (_S, _M, _L) denotes the mix policy: _S = small (more layers in lower precision), _M = medium (the field default), _L = large (more layers in higher precision). Q4_K_M is the recommended default because it sits at the quality-vs-storage knee for typical LLaMA-class models.
Why doesn't GGUF use GPTQ or AWQ instead of K-quants?
Different hardware, different kernels. GPTQ and AWQ kernels are GPU-tuned — they assume tensor cores, fast HBM, and INT4 GEMM hardware. K-quants are CPU and Metal-tuned: they prioritize quantize/dequantize speed in CPU SIMD lanes and Apple's GPU shading units, smaller per-block metadata to fit cache, and a layout that mmap'd disk-streamed weights can decode as they arrive. The K-quant family was developed by the llama.cpp authors specifically for the CPU + consumer-GPU regime that GPTQ/AWQ ignored. The two ecosystems evolved in parallel for different hardware targets, and most open-weight models on Hugging Face ship in both — model.safetensors (FP16) + model-AWQ.safetensors + model.Q4_K_M.gguf are now standard companion artifacts.
Is Ollama just llama.cpp under the hood?
Effectively yes. Ollama is a Go-based service wrapper that bundles a pinned llama.cpp build, adds an HTTP API and a model-management layer (pulling GGUFs from a registry), and ships a friendlier CLI. Under the hood every Ollama generation call goes through llama.cpp's inference engine and reads GGUF files. LM Studio, KoboldCPP, GPT4All, and Jan are all variations on the same idea: a UX layer over llama.cpp + GGUF. This is why GGUF won the consumer-LLM ecosystem: once llama.cpp was the inference engine, the format llama.cpp consumed was destiny.