GGUF and the K-Quant Family

Also known as: GGML, Q4_K_M, llama.cpp quantization, K-quants

TL;DR

The file format and quantization scheme that powers llama.cpp — the de-facto local-inference stack for LLMs on commodity hardware. GGUF embeds tokenizer, chat template, and quantized weights in a single mmap-able artifact.

GGUF, the successor to GGML, is the file format and quantization scheme used by llama.cpp, the de-facto local-inference stack for LLMs on commodity hardware. While GPTQ and AWQ dominate server-side INT4 weight-only quantization on data-center GPUs, GGUF rules the other deployment regime: CPU inference, Apple Silicon (Metal), consumer GPUs, and anything where streaming weights from disk via mmap matters more than tensor-core throughput.

The K-quant family — Q4_K_M, Q5_K_M, Q6_K, Q3_K_M, and friends — is what most users actually mean when they say “GGUF.” It’s a two-level hierarchical quantization scheme tuned for CPU SIMD decode, with per-layer mixed-precision allocation hand-tuned by the llama.cpp authors based on layer sensitivity.

What GGUF actually is

A GGUF file is a single binary artifact with four sections:

GGUF file layout:

   +---------------------------------------------------+
   |  Magic number: "GGUF" (4 bytes)                   |
   |  Version (4 bytes)                                |
   |  Tensor count (8 bytes)                           |
   |  Metadata KV count (8 bytes)                      |
   +---------------------------------------------------+
   |  Metadata (key-value pairs)                       |
   |   - tokenizer.ggml.tokens                         |
   |   - tokenizer.ggml.merges                         |
   |   - tokenizer.chat_template                       |
   |   - llama.context_length                          |
   |   - llama.embedding_length                        |
   |   - llama.rope.dimension_count                    |
   |   - general.quantization_version                  |
   |   - ... (all model hyperparameters live here)     |
   +---------------------------------------------------+
   |  Tensor info (name, shape, dtype, file offset)    |
   |   - "blk.0.attn_q.weight"  | shape  | Q4_K  | off |
   |   - "blk.0.attn_k.weight"  | shape  | Q4_K  | off |
   |   - "blk.0.attn_v.weight"  | shape  | Q4_K  | off |
   |   - "blk.0.attn_output.weight"  | shape | Q6_K | off  <-- mixed precision
   |   - ... (one entry per tensor)                    |
   +---------------------------------------------------+
   |  Tensor data (aligned, mmap-friendly)             |
   |   - Quantized weight blobs back-to-back           |
   |   - 32-byte aligned for SIMD                      |
   +---------------------------------------------------+
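The fixed-size header at the top of that layout is easy to decode by hand. The sketch below assumes little-endian byte order and the field widths shown in the diagram (4 + 4 + 8 + 8 bytes); `parse_gguf_header` and the synthetic header values are illustrative names and numbers, not part of llama.cpp.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_FORMAT = "<4sIQQ"  # magic, version (u32), tensor count (u64), metadata KV count (u64)
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 24 bytes with no padding


def parse_gguf_header(buf: bytes) -> dict:
    """Decode the fixed header fields from the start of a GGUF byte buffer."""
    if len(buf) < HEADER_SIZE:
        raise ValueError("buffer too short for a GGUF header")
    magic, version, tensor_count, kv_count = struct.unpack_from(HEADER_FORMAT, buf, 0)
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file, magic was {magic!r}")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Synthetic header for illustration only; a real file would be read from disk.
fake_header = struct.pack(HEADER_FORMAT, GGUF_MAGIC, 3, 291, 24)
header = parse_gguf_header(fake_header)
print(header)
```

For a real model, the same function could be fed the first 24 bytes of the file; the metadata and tensor-info sections that follow use variable-length records and need a fuller parser.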

Three things matter about this layout:

  1. Tokenizer + chat template embedded. A .gguf file is portable — copy it to a Mac mini and llama-server knows how to tokenize and how to format chat messages. No external tokenizer.json, no chat_template.jinja. This is why GGUF distribution beat the alternatives.
  2. Mmap-friendly. Tensor data is aligned, contiguous, and readable in-place: llama.cpp mmaps the file and the OS handles paging weights into RAM as they’re touched. That makes “load a 70B model from disk” a sub-second operation, because pages are read only when first touched. On a Mac with 24 GB of RAM, you can run a 30 GB model; the OS streams the weights it needs and evicts the rest, at the price of lower throughput than a fully resident model.
  3. Mixed precision per tensor. The quantization dtype is per-tensor in the file. A single GGUF file can store some tensors in Q4_K, others in Q6_K, others in F16 — exactly what the K-quant _S/_M/_L policies exploit.
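The mmap point can be illustrated with Python's standard library. This is a minimal sketch, assuming the 32-byte alignment from the layout diagram; `align_up`, `read_tensor_bytes`, and the temporary demo file are hypothetical helpers, not llama.cpp code.

```python
import mmap
import os
import tempfile

ALIGNMENT = 32  # tensor data alignment taken from the layout diagram


def align_up(offset: int, alignment: int = ALIGNMENT) -> int:
    """Round an offset up to the next multiple of the alignment."""
    return (offset + alignment - 1) // alignment * alignment


def read_tensor_bytes(path: str, offset: int, size: int) -> bytes:
    """Map the file read-only and copy out one byte range.

    Only the pages backing [offset, offset + size) need to be faulted in;
    the rest of the file is never read.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset:offset + size]


# Build a tiny stand-in file: 45 bytes of "header", padding, then 16 data bytes.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    header_len = 45
    data_start = align_up(header_len)
    tmp.write(b"\x00" * data_start)
    tmp.write(bytes(range(16)))
    demo_path = tmp.name

chunk = read_tensor_bytes(demo_path, data_start, 16)
os.remove(demo_path)
print(data_start, chunk.hex())
```

A runtime would keep the mapping open and hand out views into it rather than copying; the copy here just keeps the example short.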

The K-quant family

K-quants use a two-level hierarchical scaling scheme. Weights are grouped into blocks of 16 (for 2/3/6-bit) or 32 (for 4/5-bit); 16 or 8 of those blocks, respectively, form a super-block of 256 weights. The super-block carries one FP16 scale; each block within it carries a small integer sub-scale (4 to 8 bits depending on the type; 6 bits in Q4_K) relative to the super-block scale.

Q4_K super-block (256 weights):

  Super-block FP16 scale    [16 bits]
  Super-block FP16 min      [16 bits]
       (used as offset for asymmetric quantization)

   block 0   block 1   block 2  ...  block 7
  +--------+--------+--------+ ... +--------+
  | sub-sc | sub-sc | sub-sc |     | sub-sc |   <- 8 * 6 bits each
  | sub-min| sub-min| sub-min|     | sub-min|   <- 8 * 6 bits each
  | 32 x   | 32 x   | 32 x   | ... | 32 x   |
  | 4-bit  | 4-bit  | 4-bit  |     | 4-bit  |
  | codes  | codes  | codes  |     | codes  |
  +--------+--------+--------+ ... +--------+
   32 wts   32 wts   32 wts          32 wts

  Total per super-block:
     4 bits/elem * 256       =  128 bytes (data)
     2 * 6 bits * 8 sub-blocks = 12 bytes (sub-scales + sub-mins)
     2 * 16 bits              =   4 bytes (super scale + super min)
                              ----------
                              = 144 bytes / 256 weights = 4.5 bits/elem

  Reconstruction:
     w[i] = super_scale * sub_scale[b] * code[i] - super_min * sub_min[b]

The two-level scheme does two jobs at once: the FP16 super-scale handles the macro magnitude variation across the model, and the 6-bit sub-scales handle the local distribution shifts within a super-block. Storing an FP16 scale and an FP16 min for every block of 32 weights (the naive asymmetric group-quantization layout) would cost 1.0 bits/elem of overhead; the K-quant sub-scale-of-a-super-scale design cuts that to 0.5 bits/elem across the super-block of 256 weights. That recovered budget lets Q4_K carry a per-block min as well as a per-block scale for the same 4.5 bits/elem that Q4_0 (legacy, uniform per-block scale only) spends, which is why Q4_K_M outperforms Q4_0 at the same bit count.
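To make the two levels concrete, here is a minimal dequantization sketch for one Q4_K-style super-block. It assumes the fields have already been unpacked into plain Python integers (the real format bit-packs them), and it uses the scale-times-code-minus-min form, which is my understanding of how llama.cpp reconstructs Q4_K weights. The function name and sample values are hypothetical.

```python
def dequantize_q4k_superblock(d, dmin, sub_scales, sub_mins, codes):
    """Rebuild 256 floats from an unpacked Q4_K-style super-block.

    d, dmin     : super-block scale and min (stored as FP16 in the file)
    sub_scales  : 8 integers in 0..63, one 6-bit sub-scale per block
    sub_mins    : 8 integers in 0..63, one 6-bit sub-min per block
    codes       : 256 integers in 0..15, the 4-bit weight codes
    """
    if not (len(sub_scales) == len(sub_mins) == 8 and len(codes) == 256):
        raise ValueError("expected 8 sub-blocks of 32 codes each")
    out = []
    for b in range(8):
        block_scale = d * sub_scales[b]   # level 1 times level 2
        block_offset = dmin * sub_mins[b]
        for code in codes[b * 32:(b + 1) * 32]:
            out.append(block_scale * code - block_offset)
    return out


# Hypothetical values chosen so the arithmetic is exact in floating point.
weights = dequantize_q4k_superblock(
    d=0.5,
    dmin=0.25,
    sub_scales=[2] * 8,
    sub_mins=[4] * 8,
    codes=[i % 16 for i in range(256)],
)
print(weights[:4])
```

With these inputs every block has an effective scale of 1.0 and an offset of 1.0, so code 0 maps to -1.0 and code 15 maps to 14.0.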

The bit-budget table

Format    Bits/elem   Block size   Notes
Q2_K      2.625       16           Aggressive. Quality cliff for most models below 70B.
Q3_K_S    3.4375      16           Small-K-quant 3-bit. Lower-precision layer mix.
Q3_K_M    3.875       16           Medium-K 3-bit. Some layers escalated to 4-bit.
Q4_0      4.5         32           Legacy. Uniform per-block scale. Pre-K-quant baseline.
Q4_K_S    4.5         32           Small-K-quant 4-bit. Most layers in Q4_K.
Q4_K_M    4.5         32           The recommended default. Sensitive layers in Q5_K/Q6_K.
Q5_K_S    5.5         32           Small-K 5-bit.
Q5_K_M    5.5         32           Medium-K 5-bit. Quality leans on the FP16 escalations.
Q6_K      6.5625      16           Near-FP16 quality. Often used as the high-precision tensor in K_M mixes.
Q8_0      8.5         32           Per-block INT8 with one FP16 scale per 32 weights, no K-quant tricks. The “boring but safe” 8-bit.
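The bits/elem figures fall straight out of the block sizes in bytes. The sketch below recomputes a few of them; the Q4_K byte counts come from the super-block diagram above, while the Q4_0, Q6_K, and Q8_0 layouts are my understanding of the ggml block structs and should be treated as assumptions.

```python
def bits_per_weight(block_bytes: int, weights_per_block: int) -> float:
    """Storage cost of one quantized block, expressed in bits per weight."""
    return block_bytes * 8 / weights_per_block


# (bytes per block, weights per block)
BLOCK_LAYOUTS = {
    # 128 B of 4-bit codes + 12 B packed sub-scales/sub-mins + 4 B scale and min
    "Q4_K": (128 + 12 + 4, 256),
    # Assumed: 16 B of 4-bit codes + 2 B FP16 scale
    "Q4_0": (16 + 2, 32),
    # Assumed: 128 B low bits + 64 B high bits + 16 B int8 sub-scales + 2 B FP16 scale
    "Q6_K": (128 + 64 + 16 + 2, 256),
    # Assumed: 32 B of int8 codes + 2 B FP16 scale
    "Q8_0": (32 + 2, 32),
}

rates = {name: bits_per_weight(*layout) for name, layout in BLOCK_LAYOUTS.items()}
for name, rate in rates.items():
    print(f"{name}: {rate} bits/elem")
```

These are rates for a single block type. A mixed file such as Q4_K_M blends several block types, so its file-level average depends on how many weights land in each.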

The S/M/L suffix on K-quants indicates the mix policy — what fraction of layers are escalated to higher-precision quants:

Q4_K_S (small-K-quant 4-bit):
   most attn + ffn      Q4_K
   norms, embed         F16

Q4_K_M (medium - default):
   attn_q, attn_k       Q4_K
   attn_v, attn_output  Q6_K   <- escalated to high precision
   ffn_gate, ffn_up     Q4_K
   ffn_down             Q6_K   <- escalated to high precision
   norms, embed         F16

Q4_K_L (large; a community mix, since stock llama.cpp only ships an _L preset at 3-bit):
   most attn + ffn      Q4_K
   attn_v, attn_output  Q6_K
   ffn_down             Q6_K
   ffn_gate, ffn_up     Q5_K   <- additional escalations
   norms, embed         F16

Why those layers specifically? attn_v and the output projection are sensitivity hotspots: tensor importance studies on LLaMA showed perturbing them costs disproportionately on downstream evals. ffn_down matters because it sits at the residual-stream sum and small errors there add up across depth. The K_M policy is hand-tuned, not learned, but it’s a remarkably good Pareto point: built on 4.5 bits/elem Q4_K blocks, it consistently beats Q4_0 (also 4.5 bits/elem) on WikiText perplexity.
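A mix policy is just a lookup from tensor role to quant type. The sketch below encodes the simplified Q4_K_M table from this section (not llama.cpp's actual selection logic, which has more cases) and computes the blended bits/elem for one transformer layer. The layer dimensions are hypothetical LLaMA-7B-like values, and `dtype_for` and `effective_bits` are illustrative names.

```python
# Simplified Q4_K_M policy as listed in this article.
Q4_K_M_POLICY = {
    "attn_q": "Q4_K",
    "attn_k": "Q4_K",
    "attn_v": "Q6_K",
    "attn_output": "Q6_K",
    "ffn_gate": "Q4_K",
    "ffn_up": "Q4_K",
    "ffn_down": "Q6_K",
}
BITS = {"Q4_K": 4.5, "Q6_K": 6.5625, "F16": 16.0}


def dtype_for(tensor_name: str) -> str:
    """Map a name like 'blk.0.attn_q.weight' to a quant type; default to F16."""
    parts = tensor_name.split(".")
    role = parts[2] if len(parts) >= 4 and parts[0] == "blk" else None
    return Q4_K_M_POLICY.get(role, "F16")


def effective_bits(tensor_sizes: dict) -> float:
    """Average bits per weight over a set of tensors, weighted by element count."""
    total_bits = sum(BITS[dtype_for(name)] * n for name, n in tensor_sizes.items())
    return total_bits / sum(tensor_sizes.values())


# Hypothetical dimensions: hidden size 4096, FFN size 11008.
hidden, ffn = 4096, 11008
layer = {
    "blk.0.attn_q.weight": hidden * hidden,
    "blk.0.attn_k.weight": hidden * hidden,
    "blk.0.attn_v.weight": hidden * hidden,
    "blk.0.attn_output.weight": hidden * hidden,
    "blk.0.ffn_gate.weight": hidden * ffn,
    "blk.0.ffn_up.weight": hidden * ffn,
    "blk.0.ffn_down.weight": hidden * ffn,
}
layer_bits = effective_bits(layer)
print(f"blended rate for one layer: {layer_bits:.3f} bits/elem")
```

The blended rate sits between the Q4_K and Q6_K rates, which is the cost side of the trade: every escalated tensor buys quality with extra bits.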

The K-quants are formally specified by ggml-quants.c in llama.cpp. The “K” doesn’t stand for a person or a thing — it’s the project-internal name for “the better quants we made in 2023 to replace the legacy Q4_0 / Q4_1 / Q5_0 / Q5_1 / Q8_0 family that just had per-block FP16 scales.”

The structural advances over legacy quants:

  • Two-level scaling. Legacy quants stored an FP16 scale per block of 32 weights, which is 0.5 bits/elem of overhead for the scale alone. K-quants store an FP16 super-scale per 256 weights plus a 6-bit sub-scale per 32, which is about 0.25 bits/elem for scales (0.5 bits/elem in Q4_K once the matching mins are included), and the local fitting is better, not worse.
  • Asymmetric quantization with per-block min. The legacy Q4_0/Q5_0 formats used symmetric (zero-centered) ranges, and Q4_1/Q5_1 paid a full FP16 per block for a min; K-quants such as Q4_K and Q5_K store both a per-block min and a per-block scale at 6 bits each, giving a better fit for weight blocks whose values are not centered on zero.
  • 6-bit sub-scales. Storing the sub-scales themselves at 6 bits (rather than 8 or 16) is a big chunk of the overhead saving. The sub-scales are themselves quantized and then reconstructed — a third level of quantization that pays for itself in headline bits/elem.
  • Mixed-precision per-layer policy (the _S/_M/_L suffix). Strictly outside the per-tensor encoding but central to what users mean by “K-quant” — different layers get different K-quant dtypes based on their measured sensitivity.
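The benefit of a per-block min is easiest to see on a block whose values are not centered on zero. The following is a generic 4-bit round-to-nearest sketch, not the exact Q4_0 or Q4_K rounding rules; the function names and the sample block are assumptions made for illustration.

```python
def quantize_symmetric(block, bits=4):
    """Scale-only quantization: codes span a signed range centered on zero."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    amax = max(abs(x) for x in block)
    scale = amax / qmax if amax else 1.0
    return [max(-qmax - 1, min(qmax, round(x / scale))) * scale for x in block]


def quantize_asymmetric(block, bits=4):
    """Scale-plus-min quantization: codes span [min, max] of the block."""
    levels = 2 ** bits - 1  # 15 for 4-bit
    lo, hi = min(block), max(block)
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [lo + min(levels, max(0, round((x - lo) / scale))) * scale for x in block]


def mse(original, rebuilt):
    """Mean squared reconstruction error."""
    return sum((a - b) ** 2 for a, b in zip(original, rebuilt)) / len(original)


# Hypothetical block: 32 values clustered between 1.0 and 1.31, all positive.
block = [1.0 + 0.01 * i for i in range(32)]
mse_sym = mse(block, quantize_symmetric(block))
mse_asym = mse(block, quantize_asymmetric(block))
print(f"symmetric MSE:  {mse_sym:.6f}")
print(f"asymmetric MSE: {mse_asym:.6f}")
```

The symmetric quantizer spends most of its 16 codes on a range the block never visits, while the asymmetric one places all 16 between the block's min and max.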

The practical upshot: at 4.5 bits/elem, Q4_K_M is roughly equivalent to GPTQ-INT4 with group size 128 in measured perplexity on LLaMA-class models, while being CPU-decodable in real time. That’s the value proposition.

What llama.cpp actually does at inference

The runtime side matters because that’s where GGUF beats GPU formats on the relevant hardware:

llama.cpp inference flow (CPU/Metal):

   1. mmap(model.gguf) -> kernel maps tensor data into virtual address space
   2. Parse header + metadata + tensor info -> small in-RAM index
   3. For each generation step:
        a. Tokenize input via embedded BPE/SentencePiece
        b. For each layer:
             - SIMD dequantize Q4_K block of 256 weights into FP32 scratch
             - Multiply with FP16/FP32 activation, accumulate in FP32
             - Apply norm, residual, attention, FFN
             - (On Metal: same pipeline but in Metal compute shaders)
        c. Sample next token, append to context
   4. KV cache stays in RAM in FP16 (or quantized via --cache-type-k / --cache-type-v)

Key properties:

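  • Streaming-from-disk. Page faults pull only the pages being touched into RAM. This is why a Mac mini with 16 GB of RAM can load a 70B-parameter Q4_K_M GGUF (roughly 40 GB on disk): the whole file never has to be resident at once. The cost is speed, because a dense model touches every layer for each token, so throughput falls toward disk-read rates once the file outgrows RAM.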
  • CPU SIMD friendly. Q4_K decode is hand-tuned for AVX2, AVX-512, ARM NEON, and Apple’s AMX. The SIMD path dequantizes a super-block of 256 weights in tens of cycles.
  • Metal first-class. Apple Silicon is a primary target. Metal compute shaders for K-quant GEMM are part of the llama.cpp distribution; perf on M-series chips approaches the theoretical memory-bandwidth ceiling.
  • CUDA secondary. llama.cpp also runs on CUDA, but the K-quant kernels there are slower than the equivalent GPTQ/AWQ INT4 kernels on the same GPU. CUDA is a fallback path; CPU + Metal is the primary one.
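The memory-bandwidth ceiling mentioned above can be estimated with back-of-the-envelope arithmetic. This sketch assumes a dense model whose weights are all read once per generated token and ignores KV-cache traffic and compute; the 400 GB/s bandwidth figure is a hypothetical example, not a measurement.

```python
def model_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in bytes at a given average bit rate."""
    return n_params * bits_per_weight / 8


def decode_ceiling_tokens_per_s(size_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on decode speed if every weight is read once per token."""
    return bandwidth_bytes_per_s / size_bytes


# 70B parameters at the nominal 4.5 bits/elem rate.
size = model_bytes(70e9, 4.5)
# Hypothetical memory bandwidth of 400 GB/s.
ceiling = decode_ceiling_tokens_per_s(size, 400e9)
print(f"weights: {size / 1e9:.1f} GB, ceiling: {ceiling:.1f} tokens/s")
```

Swapping the bandwidth for an SSD's read rate gives a rough sense of how much decode slows when weights have to be paged from disk instead of RAM.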

The local-LLM software stack

GPTQ and AWQ are server-side production quants; GGUF is the consumer/researcher quant. Both ecosystems matter; they evolved in parallel for different hardware targets. If your inference target is “data-center GPU with vLLM / TensorRT-LLM” you want AWQ. If your target is “anything else” (laptop, Mac, consumer GPU, CPU server) you want GGUF.

When to use GGUF vs the alternatives

The deployment-target decision
  • Mac (M-series) + local inference → GGUF. Metal kernels + mmap make this the only sane choice.
  • CPU-only inference (commodity server, edge box, no GPU) → GGUF. No real alternative exists at INT4 quality.
  • Consumer GPU (4090, 5090, etc.) + local inference → GGUF or AWQ; GGUF if the workflow already lives in Ollama / LM Studio, AWQ if you want maximum throughput.
  • Data-center GPU (H100, B200) + production serving → AWQ or GPTQ. GGUF kernels exist on CUDA but are slower than purpose-built GPU formats.
  • Mobile / embedded → GGUF, or further-quantized formats like Q2_K / Q3_K. The mmap + streaming-from-flash story matters here too.

Where GGUF is going

The K-quant family is mature; the active development frontier is imatrix-based quantization (using importance-matrix weights computed on a calibration set, similar in spirit to AWQ’s saliency analysis but applied to the K-quant grid placement) and iQuants (IQ2_XXS, IQ3_XS, etc.) — sub-3-bit K-quants that beat the original Q2_K/Q3_K_S on perplexity at the same bit budget. The iQuants closed most of the quality gap between GGUF and the academic SOTA at sub-4-bit (AQLM, QuIP#) for CPU-deployable formats.
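The idea behind importance-weighted quantization can be sketched in a few lines: instead of minimizing plain squared error, pick the block scale that minimizes error weighted by how much each weight matters. This is a toy grid search under assumed importance values, not llama.cpp's imatrix algorithm; all names and numbers are hypothetical.

```python
def weighted_error(block, importance, scale, qmax=7):
    """Importance-weighted squared error of scale-only 4-bit rounding."""
    err = 0.0
    for x, w in zip(block, importance):
        q = max(-qmax - 1, min(qmax, round(x / scale)))
        err += w * (x - q * scale) ** 2
    return err


def best_scale(block, importance, qmax=7, steps=64):
    """Search scales from 0.5x to 1.5x of the max-abs scale for the lowest weighted error."""
    amax = max(abs(x) for x in block)
    if amax == 0:
        return 1.0
    base = amax / qmax
    candidates = [base * (0.5 + i / steps) for i in range(steps + 1)]
    return min(candidates, key=lambda s: weighted_error(block, importance, s, qmax))


# Hypothetical block and importance values; the first weight matters most.
block = [0.9, -0.1, 0.05, 0.3, -0.7, 0.02, 0.4, -0.25]
importance = [10.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

base_scale = max(abs(x) for x in block) / 7
tuned_scale = best_scale(block, importance)
err_base = weighted_error(block, importance, base_scale)
err_tuned = weighted_error(block, importance, tuned_scale)
print(f"base scale error:  {err_base:.6f}")
print(f"tuned scale error: {err_tuned:.6f}")
```

Because the max-abs scale is one of the candidates, the tuned scale can never do worse on the weighted objective. In a real pipeline the importance values would come from activation statistics gathered on a calibration set.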

GGUF is also expanding outside LLMs: CLIP encoders, audio models (Whisper), and embedding models all increasingly ship in .gguf for the same reasons text models do — mmap-able single-file artifacts that run anywhere.

The format won the consumer LLM ecosystem the moment llama.cpp won the consumer LLM inference engine, and the parallel evolution with GPTQ / AWQ on the server side is now stable: server folks ship AWQ, consumer folks ship Q4_K_M, and the smart open-weight model release packages both.

Go further

What does the _K mean in Q4_K_M?

K-quants use a two-level scaling scheme: a low-bit sub-scale per block (6 bits in Q4_K) plus an FP16 scale per 256-weight super-block, combined with a hand-tuned mixed-precision allocation that puts more sensitive layers (attention output projections, FFN down projections) in a higher-precision quant than less sensitive ones. The trailing letter (_S, _M, or _L) denotes the mix policy: _S = small (more layers in lower precision), _M = medium (the field default), _L = large (more layers in higher precision). Q4_K_M is the recommended default because it sits at the quality-vs-size knee for typical LLaMA-class models.

Why doesn't GGUF use GPTQ or AWQ instead of K-quants?

Different hardware, different kernels. GPTQ and AWQ kernels are GPU-tuned: they assume tensor cores, fast HBM, and INT4 GEMM hardware. K-quants are CPU and Metal-tuned: they prioritize quantize/dequantize speed in CPU SIMD lanes and Apple's GPU shading units, smaller per-block metadata to fit cache, and a layout that mmap'd disk-streamed weights can decode as they arrive. The K-quant family was developed by the llama.cpp authors specifically for the CPU + consumer-GPU regime that GPTQ/AWQ ignored. The two ecosystems evolved in parallel for different hardware targets, and most open-weight models on Hugging Face ship in both: model.safetensors (FP16), model-AWQ.safetensors, and model.Q4_K_M.gguf are now standard companion artifacts.

Is Ollama just a wrapper around llama.cpp?

Effectively yes. Ollama is a Go-based service wrapper that bundles a pinned llama.cpp build, adds an HTTP API and a model-management layer (pulling GGUFs from a registry), and ships a friendlier CLI. Under the hood every Ollama generation call goes through llama.cpp's inference engine and reads GGUF files. LM Studio, KoboldCPP, GPT4All, and Jan are all variations on the same idea: a UX layer over llama.cpp + GGUF. This is why GGUF won the consumer-LLM ecosystem: once llama.cpp was the inference engine, the format llama.cpp consumed was destiny.
