Production
From notebook to live traffic.
The patterns and pitfalls when retrieval moves out of the demo and into actual production. The concepts below cover the operational discipline that separates RAG that works from RAG that breaks: latency tail behavior under bursty load, context compression before the LLM call, monitoring drift in calibrated relevance scores, and the load-balancing tradeoffs of running specialized models behind tight SLOs. These topics are less glamorous than the model-architecture material but tend to be where production ZE deployments actually win or lose.
- Caching Strategies
Three layers of caching for LLM-driven systems: exact-match (request → response), prompt-prefix (KV cache reuse for shared prefixes), and semantic (similar-query reuse via embeddings). Each helps different production workloads in different ways.
- Context Compression
Context compression shrinks a retrieval result set or agent trace down to just the spans the LLM actually needs, before sending it to the model. Crucial for long-running agentic systems where context blows past the model's effective attention window.
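
As a rough illustration of the idea, the sketch below keeps the highest-scoring spans that fit a token budget and returns them in their original order. The function name `compress_context`, the `(text, score)` span format, and the whitespace token count are all assumptions made for this example; a real system would use the model's tokenizer and a learned relevance scorer.

```python
def compress_context(spans, budget_tokens):
    """Keep the most relevant spans that fit the budget, in original order.

    spans: list of (text, relevance_score) pairs (hypothetical format).
    """
    def n_tokens(text):
        # Crude proxy: whitespace-separated words, not real tokenizer output.
        return len(text.split())

    # Visit spans from most to least relevant, greedily filling the budget.
    ranked = sorted(range(len(spans)), key=lambda i: spans[i][1], reverse=True)
    kept, used = set(), 0
    for i in ranked:
        cost = n_tokens(spans[i][0])
        if used + cost <= budget_tokens:
            kept.add(i)
            used += cost

    # Restore document order so the LLM sees spans in their original sequence.
    return [spans[i][0] for i in sorted(kept)]


spans = [
    ("alpha beta gamma", 0.2),
    ("delta epsilon", 0.9),
    ("zeta eta theta iota", 0.8),
    ("kappa", 0.5),
]
compressed = compress_context(spans, budget_tokens=7)
```

Here the lowest-scoring span is dropped because it no longer fits once the three more relevant spans have used the budget.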
- Continuous Batching
The vLLM-style scheduling trick where requests join and leave a batch dynamically while it is in flight. It substantially improves GPU utilization for variable-length generation compared to naive static batching, and is the default in most modern LLM serving stacks.
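
A toy simulation can make the utilization gap concrete. The sketch below counts decode steps for the same workload under static and in-flight scheduling; it assumes every request costs one step per output token and ignores prefill, memory limits, and everything else a real scheduler handles. Both function names are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    return sum(
        max(lengths[i:i + batch_size])
        for i in range(0, len(lengths), batch_size)
    )


def continuous_batching_steps(lengths, batch_size):
    """Finished requests free their slot, and waiting requests join at once."""
    queue = deque(lengths)
    active = []  # remaining tokens for each in-flight request
    steps = 0
    while queue or active:
        # Fill any free slots before the next decode step.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Every active request emits one token; drop the ones that are done.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


output_lengths = [8, 1, 1, 1, 8, 1, 1, 1]  # tokens to generate per request
static_steps = static_batching_steps(output_lengths, batch_size=2)
continuous_steps = continuous_batching_steps(output_lengths, batch_size=2)
```

With this workload the static scheduler takes 18 steps and the continuous one takes 11, because short requests no longer wait for the long request that shares their batch.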
- Cost per Token
The economics primitive of LLM-driven systems. Per-token pricing (input and output, with output usually 3-5× input) is what makes a feature financially viable or not. Few numbers weigh more heavily on production decisions.
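
The arithmetic is simple enough to sketch. The prices below are hypothetical placeholders (1 USD per million input tokens, 4 USD per million output tokens), chosen only to show a 4× output-to-input ratio; substitute a provider's actual price sheet.

```python
def request_cost_usd(input_tokens, output_tokens,
                     input_price_per_mtok, output_price_per_mtok):
    """Cost of one call, given prices quoted per million tokens."""
    return (
        input_tokens * input_price_per_mtok
        + output_tokens * output_price_per_mtok
    ) / 1_000_000


# Hypothetical request: 3,000 prompt tokens in, 500 generated tokens out.
cost = request_cost_usd(3_000, 500,
                        input_price_per_mtok=1.0,
                        output_price_per_mtok=4.0)

# Scale to an assumed one million requests per month.
monthly_cost = cost * 1_000_000
```

Under these assumed prices one request costs half a cent, so a million requests a month costs about 5,000 USD, with the 500 output tokens accounting for 40% of the bill.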
- Drift Detection
Monitoring distributional shift in inputs, outputs, or intermediate signals of a retrieval or LLM pipeline. The discipline that catches 'the metric is silently moving' before users notice.
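
One common way to quantify such a shift is the population stability index (PSI), which compares a binned baseline distribution against a current one. The sketch below is a minimal stdlib version under assumed defaults (ten equal-width bins over the baseline range, a small floor for empty bins); it is one option among several, not a prescribed method.

```python
import math


def population_stability_index(baseline, current, n_bins=10, eps=1e-6):
    """PSI between two samples, binned on the baseline's range."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant data

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clamp out-of-range values
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    p = bin_fractions(baseline)
    q = bin_fractions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


# Hypothetical relevance scores: last month vs. a shifted current window.
baseline_scores = [i / 100 for i in range(100)]
shifted_scores = [0.5 + i / 200 for i in range(100)]

psi_same = population_stability_index(baseline_scores, baseline_scores)
psi_shifted = population_stability_index(baseline_scores, shifted_scores)
```

A frequently quoted rule of thumb treats PSI above roughly 0.25 as a significant shift, though the right alert threshold depends on the signal being monitored.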
- LanceDB
LanceDB is an open-source vector database built on the Lance columnar format: append-only, Rust core, columnar-on-disk. It is designed to handle incremental indexing without full index rebuilds.
- Latency Tail (P95, P99)
P50 is the median; P95 and P99 are the 95th and 99th percentile latencies. The tail is what wakes oncall, not the median — a 200ms median with a 5s P99 means 1% of users see your system as broken.
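
The example in the entry above can be reproduced with a nearest-rank percentile. The sample data is synthetic, constructed so that 2% of requests take 5 seconds; the helper name `percentile` is an assumption for this sketch.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic latencies: 98 fast requests and 2 slow ones.
latencies_ms = [200] * 98 + [5000] * 2

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
```

Both P50 and P95 come out at 200 ms here while P99 is 5000 ms, which is why dashboards that stop at the median or P95 can miss a tail problem entirely.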
- LLM Observability
The operational discipline of monitoring LLM-driven systems: tracing per-call inputs/outputs, eval-in-prod against held-out sets, drift detection on inputs and outputs, latency and cost percentiles.
- MDX
MDX is Markdown extended with JSX: write prose that imports components and renders them inline. It is the format behind many modern docs sites, including those built with Next.js, Astro, and Docusaurus.
- PII Redaction
Detecting and removing personally-identifiable information from LLM inputs and outputs — names, emails, phone numbers, addresses, IDs. A classic small-model task: high-volume, narrow, latency-sensitive, with structured target output.
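
For the structured parts of the task, a pattern-based baseline is easy to sketch. The two regular expressions below are simplified assumptions that cover only common email and phone formats; names and addresses generally need a trained entity recognizer, which this example does not attempt.

```python
import re

# Simplified, illustrative patterns. Real deployments need broader coverage.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def redact(text):
    """Replace each matched span with a bracketed type label."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


redacted = redact("Mail jane.doe@example.com or call +1 (555) 010-9999.")
```

Replacing matches with typed placeholders rather than deleting them keeps the sentence structure intact for the downstream model.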
- Pydantic
Pydantic is the runtime type-validation library that has quietly become a hard dependency of the Python ML ecosystem. You declare a `BaseModel`, get validation, JSON-schema export, and a v2 Rust core for free.
- Semantic Cache
A semantic cache returns a cached LLM response when an incoming query is similar enough — by embedding cosine similarity — to a previous query, rather than requiring exact-string match.
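
A minimal sketch of the lookup logic follows. The `toy_embed` function is a stand-in that counts words from a tiny fixed vocabulary so the example runs without a model; in practice it would be replaced by a real embedding model. The linear scan would likewise be replaced by an ANN index, and the 0.9 threshold is an arbitrary assumption.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def toy_embed(text):
    """Stand-in embedding: word counts over a tiny vocabulary (not a model)."""
    vocab = ["reset", "password", "refund", "order", "cancel"]
    words = text.lower().replace("?", "").split()
    return [float(words.count(w)) for w in vocab]


class SemanticCache:
    def __init__(self, embed_fn, threshold=0.9):
        self.embed_fn = embed_fn
        self.threshold = threshold
        self.entries = []  # (embedding, cached_response) pairs

    def store(self, query, response):
        self.entries.append((self.embed_fn(query), response))

    def lookup(self, query):
        """Return the best cached response at or above the threshold, else None."""
        q = self.embed_fn(query)
        best_score, best_response = 0.0, None
        for emb, response in self.entries:
            score = cosine(q, emb)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None


cache = SemanticCache(toy_embed)
cache.store("How do I reset my password?", "Use the reset link.")
hit = cache.lookup("reset password")
miss = cache.lookup("cancel my order")
```

The threshold is the main tuning knob: set it too low and users receive answers to questions they did not ask, set it too high and the cache rarely hits.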
- Speculative Decoding
Use a small 'draft' model to predict the next several tokens, then have the big 'target' model verify them in a single forward pass. The standard latency-reduction trick for LLM inference — typically 2-4× faster generation at the same output quality.
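
The propose-then-verify loop can be sketched with toy next-token functions. This is the greedy-acceptance variant, in which the output matches what the target would produce on its own under greedy decoding; the sampling variant uses a probabilistic accept/reject rule not shown here. `target_next` and `draft_next` are hypothetical stand-ins for real models, and the per-position calls to the target stand in for what a real system does in one batched forward pass.

```python
def target_next(seq):
    """Toy 'target model': always counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def draft_next(seq):
    """Toy 'draft model': agrees with the target except after a 2."""
    return 9 if seq[-1] == 2 else (seq[-1] + 1) % 10


def speculative_decode(prompt, max_new_tokens, k=4):
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The draft proposes k tokens autoregressively (cheap).
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))

        # 2. The target checks every proposed position. Counted as one pass
        #    because a real model scores them all in one batched forward pass.
        target_passes += 1
        accepted = []
        for i in range(k):
            expected = target_next(tokens + proposal[:i])
            if proposal[i] == expected:
                accepted.append(proposal[i])
            else:
                accepted.append(expected)  # the target's correction
                break
        else:
            # All k accepted: the same pass also yields one extra token.
            accepted.append(target_next(tokens + proposal))
        tokens.extend(accepted)
    return tokens[:len(prompt) + max_new_tokens], target_passes


generated, passes = speculative_decode([0], max_new_tokens=8)
```

In this toy run eight tokens are produced with two target passes instead of eight, and the sequence is identical to plain greedy decoding with the target alone.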
- Throughput (Tokens per Second)
Tokens per second per GPU is the production planning metric for LLM serving. Throughput scales with batch size up to a memory-bound ceiling, then plateaus. The key number for capacity planning, autoscaling, and unit-economic analysis.
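
For capacity planning, the metric turns into a fleet-size estimate with one division. All numbers below are hypothetical, and the 70% target utilization is an assumed headroom factor for bursts, not a recommendation.

```python
import math


def gpus_needed(peak_requests_per_sec, avg_output_tokens,
                tokens_per_sec_per_gpu, target_utilization=0.7):
    """GPUs required to serve peak output-token demand with headroom."""
    required_tokens_per_sec = peak_requests_per_sec * avg_output_tokens
    usable_per_gpu = tokens_per_sec_per_gpu * target_utilization
    return math.ceil(required_tokens_per_sec / usable_per_gpu)


# Hypothetical workload: 50 requests/s at peak, 400 output tokens each,
# and a measured 2,500 tokens/s per GPU at the chosen batch size.
fleet_size = gpus_needed(50, 400, 2500)
```

The per-GPU figure should come from a load test at the batch size the service actually runs, since throughput below the memory-bound ceiling varies with batch size.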
- Vector Database
A vector database is a database whose primary index is an approximate-nearest-neighbor structure over high-dimensional vectors. The system substrate for production dense retrieval — it wraps an ANN algorithm (HNSW, IVF, PQ) with persistence, replication, metadata filtering, and incremental updates.
- vLLM Serving
vLLM is the dominant open-source LLM serving framework. Its core innovations are PagedAttention for KV-cache memory management, continuous batching for throughput, and prefix caching for prompt reuse.
- Foundations 48
The bedrock primitives every other topic builds on.
- Data 18
The corpora, curation, and quality decisions that make models possible.
- Language Models 32
The foundational substrate of modern AI.
- Multimodal 13
When text isn't the only signal — vision, audio, and joint embedding spaces.
- Prompting 16
How you talk to an LLM, and when you stop.
- Agents 12
When LLMs become decision-makers in a loop.
- Search & Retrieval 21
How systems find relevant documents in the first place.
- Embeddings 16
The dense-vector layer of modern retrieval.
- Rerankers 9
The second stage that puts the right answer at the top.
- Evaluation 21
How to measure retrieval quality and trust the numbers.
- Training Methodology 21
How modern retrieval models get their relevance signal.
- Performance Engineering 25
Squeezing throughput, latency, and memory out of GPUs.
