Weak supervision is the practice of generating training labels programmatically. Domain experts encode their knowledge as rules or labeling functions; an aggregation layer combines those noisy signals into a single soft label per example; a model trains on the soft labels and generalizes past the rules’ explicit coverage. Stanford’s Snorkel popularized the modern shape in 2017, and the playbook has since shipped inside healthcare triage, e-commerce search, fraud detection, and large-scale corpus filtering for LLM pretraining.
Why programmatic labels work
Experts have enormous prior knowledge they can’t economically transfer one example at a time. A radiologist who would take six months to label 50,000 chest X-rays can write a hundred labeling functions — “if the report mentions ‘opacity’ and not ‘no evidence of’”, “if the previous study mentioned interstitial markings” — in a week. Each function is wrong some of the time, but their agreements and disagreements carry signal: when ten functions vote one way and two vote the other, the ten are usually right.
Snorkel formalizes this. Each labeling function emits a label or abstains. The label model learns each function’s accuracy and pairwise correlations without ground truth by exploiting the fact that a function which disagrees often with the majority is probably less accurate. The output is a probabilistic label per example, which a downstream discriminative model then trains on.
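A minimal sketch of what labeling functions look like, using the radiology rules mentioned earlier. The function names, the vote encoding (0 for abstain), and the unweighted majority vote are assumptions for illustration; the majority vote is only a stand-in for Snorkel's label model, which weights each function by its learned accuracy.

```python
# Vote encoding assumed for this sketch: -1 negative, 0 abstain, +1 positive.
NEGATIVE, ABSTAIN, POSITIVE = -1, 0, 1


def lf_opacity(report: str) -> int:
    """Vote positive if 'opacity' is mentioned without an explicit negation."""
    text = report.lower()
    if "opacity" in text and "no evidence of" not in text:
        return POSITIVE
    return ABSTAIN


def lf_explicit_negation(report: str) -> int:
    """Vote negative when the report contains 'no evidence of'."""
    return NEGATIVE if "no evidence of" in report.lower() else ABSTAIN


def lf_interstitial(report: str) -> int:
    """Vote positive if interstitial markings are mentioned."""
    return POSITIVE if "interstitial markings" in report.lower() else ABSTAIN


LABELING_FUNCTIONS = [lf_opacity, lf_explicit_negation, lf_interstitial]


def majority_soft_label(report: str):
    """Return the fraction of non-abstaining votes that are positive.

    Returns None when every function abstains (no coverage).
    """
    votes = [lf(report) for lf in LABELING_FUNCTIONS]
    positives = votes.count(POSITIVE)
    negatives = votes.count(NEGATIVE)
    if positives + negatives == 0:
        return None
    return positives / (positives + negatives)


if __name__ == "__main__":
    for text in [
        "Right lower lobe opacity with interstitial markings.",
        "No evidence of opacity.",
        "Normal study.",
    ]:
        print(text, "->", majority_soft_label(text))
```

Examples where every function abstains get no label at all; covering them is the job of the downstream model.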
Where weak supervision shines
Production weak-supervision deployments
Medical triage and ICD coding. Clinically motivated rules over discharge summaries; soft labels train a transformer that codes faster and more consistently than residents.
Spam and fraud detection. Sender reputation, regex patterns, embedding-distance to known spam, link reputation, account age — the aggregate is the production classifier’s training label.
E-commerce attribute extraction. Template-based extraction from product titles plus catalog priors; the discriminative model handles templates the rules don’t cover.
LLM corpus filtering. Heuristic rules over Common Crawl drive downstream data curation.
Specialized retrieval. Labeling functions over (query, document) pairs produce noisy relevance labels for fine-tuning rerankers in domains where nobody has annotated data.
Each labeling function $\lambda_j$ on example $x_i$ emits a vote $\lambda_j(x_i) \in \{-1, 0, +1\}$ (negative, abstain, positive). Snorkel models the joint distribution $p(\Lambda, Y)$, where $Y$ is the latent true label and $\Lambda$ collects the votes, with each $\lambda_j$ conditionally dependent on $Y$ and possibly on correlated peers.
The clever bit: the parameters (each function’s accuracy and the correlation structure) are learned with no labeled data. The agreement statistics among functions are observable and, under mild assumptions, identify each function’s accuracy. The output is $p(Y \mid \Lambda)$, a soft label that a downstream discriminative model trains on. The downstream model often outperforms the label model itself, because it can use input features the labeling functions never touched.
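To make the "no labeled data" claim concrete, here is a small simulation of one way agreement statistics can identify accuracies. It uses a triplet-style method-of-moments estimator, which is a swapped-in technique and not necessarily the estimator Snorkel uses. The assumptions are strong and stated in the comments: binary votes with no abstains, balanced classes, conditionally independent functions, and every function better than chance. All names are hypothetical.

```python
import random
from math import sqrt


def simulate_votes(accuracies, n, seed=0):
    """Draw n examples; function j votes the true label with prob accuracies[j].

    The true labels are generated here only to drive the simulation and are
    discarded: the estimator below never sees them.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        y = rng.choice([-1, 1])
        rows.append([y if rng.random() < acc else -y for acc in accuracies])
    return rows


def mean_agreement(votes, i, j):
    """Empirical E[vote_i * vote_j]: +1 when i and j agree, -1 otherwise."""
    return sum(row[i] * row[j] for row in votes) / len(votes)


def triplet_accuracies(votes):
    """Estimate each function's accuracy from pairwise agreement rates only.

    Under the stated assumptions E[v_i v_j] = a_i * a_j with a_i = 2*acc_i - 1,
    so a_i = sqrt(E[v_i v_j] * E[v_i v_k] / E[v_j v_k]) for any two peers j, k.
    Requires at least three functions.
    """
    m = len(votes[0])
    estimates = []
    for i in range(m):
        j, k = [x for x in range(m) if x != i][:2]
        a_i = sqrt(abs(
            mean_agreement(votes, i, j) * mean_agreement(votes, i, k)
            / mean_agreement(votes, j, k)
        ))
        estimates.append((1 + a_i) / 2)
    return estimates


if __name__ == "__main__":
    true_acc = [0.9, 0.8, 0.7]
    votes = simulate_votes(true_acc, n=20000)
    for t, e in zip(true_acc, triplet_accuracies(votes)):
        print(f"true={t:.2f} estimated={e:.3f}")
```

When functions are correlated the identity E[v_i v_j] = a_i * a_j no longer holds, which is why the dependency structure matters.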
The naive label model assumes that labeling functions are conditionally independent given the true label. In practice they almost never are: two functions that both check for “pneumonia” agree whenever the keyword fires regardless of the actual label, and the model double-counts their evidence.
Snorkel learns the dependency structure from data, with optional user hints. The pragmatic shape is to keep the correlation graph sparse and lean on the discriminative model to absorb residual structure. Failure mode to watch: a tightly correlated cluster of functions all reflecting the same underlying heuristic. The cluster gets assigned high collective accuracy, the discriminative model learns the heuristic, and the system fails on examples outside it. Fix: write labeling functions that exploit different features.
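The double-counting effect can be seen with a few lines of arithmetic. The sketch below assumes a naive-Bayes-style aggregator that multiplies likelihood ratios as if every function were independent; `naive_posterior` is a hypothetical helper, not part of any library.

```python
def naive_posterior(votes, accuracies, prior=0.5):
    """P(Y = +1 | votes) if functions were independent given Y.

    votes use -1 / 0 / +1 for negative / abstain / positive.
    """
    odds = prior / (1 - prior)
    for vote, acc in zip(votes, accuracies):
        if vote == 0:
            continue  # abstains carry no evidence
        ratio = acc / (1 - acc)
        odds *= ratio if vote == 1 else 1 / ratio
    return odds / (1 + odds)


# One keyword rule with 70% accuracy votes positive.
single = naive_posterior([1], [0.7])          # about 0.70

# The same rule written twice: no new evidence, but more confidence.
duplicated = naive_posterior([1, 1], [0.7, 0.7])  # about 0.845

print(f"single={single:.3f} duplicated={duplicated:.3f}")
```

The second copy of the rule adds no information, yet the posterior jumps from roughly 0.70 to roughly 0.845. Modeling the correlation, or merging the duplicates, removes that false confidence.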
What weak supervision can’t do
The fundamental limit: a weakly-supervised model inherits the bias of the rules. If your rules are systematically blind to a sub-population — non-English documents, atypical phrasings, novel product categories — the model is blind too. The bias is silent unless a ground-truth audit set surfaces it.
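One way to run such an audit, sketched under the assumption that you hold a small hand-labeled set tagged with the sub-population each example belongs to. The slice names and rows below are invented purely for illustration, and `recall_by_slice` is a hypothetical helper.

```python
from collections import defaultdict


def recall_by_slice(audit_rows):
    """Compute recall on the positive class separately for each slice.

    audit_rows: iterable of (slice_name, true_label, predicted_label),
    with labels in {0, 1}. Slices with no positives are omitted.
    """
    hits = defaultdict(int)
    positives = defaultdict(int)
    for slice_name, truth, predicted in audit_rows:
        if truth == 1:
            positives[slice_name] += 1
            hits[slice_name] += int(predicted == 1)
    return {name: hits[name] / positives[name] for name in positives}


# Made-up audit rows: (slice, hand label, model prediction).
audit = [
    ("english", 1, 1), ("english", 1, 1), ("english", 1, 0), ("english", 0, 0),
    ("non_english", 1, 0), ("non_english", 1, 0), ("non_english", 1, 1),
    ("non_english", 0, 0),
]

for name, recall in recall_by_slice(audit).items():
    print(f"{name}: recall={recall:.2f}")
```

A large recall gap between slices is the signal that the rules, and therefore the model, are blind to one of them.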
The other limit is rule-writability. Tasks where expertise is irreducibly perceptual (“does this scan look concerning”) or stylistic resist labeling functions. Those are where human labeling or LLM-assisted synthetic supervision still dominates. The right 2026 pattern is hybrid — labeling functions for the rule-shaped portion, humans or LLMs for the rest.
Go further
When does weak supervision beat human labeling?
When the labels are rule-derivable and the volume is high. Examples: detecting spam emails (regex patterns plus sender reputation lookups), classifying medical reports for follow-up codes (keyword presence plus negation parsing), labeling product attributes from listings (template extraction). Anywhere a domain expert would spend their first hour writing rules instead of labeling, those rules can be the labeler.
Users write labeling functions — small Python predicates that emit a label or abstain. Snorkel models the agreement and disagreement structure across functions to learn each function's accuracy without ground truth, then produces a probabilistic label per example. A discriminative model trains on those soft labels and generalizes beyond the rules' coverage.
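Training on soft labels differs from ordinary training only in the target: the loss is cross-entropy against a probability instead of a hard 0 or 1. The sketch below fits a tiny logistic regression by plain gradient descent; the function names, learning rate, and toy data are assumptions for illustration, and a real system would use a neural network or a library classifier.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_on_soft_labels(features, soft_labels, lr=0.5, epochs=500):
    """Fit logistic regression against probabilistic targets in [0, 1].

    The gradient of soft-target cross-entropy with respect to the logit is
    (prediction - soft_label), the same form as with hard labels.
    """
    n = len(features)
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, target in zip(features, soft_labels):
            pred = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = pred - target
            for k, xi in enumerate(x):
                grad_w[k] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    """Predicted probability of the positive class for one feature vector."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Toy one-feature data with soft labels as a label model might emit them.
X = [[0.0], [1.0], [2.0], [3.0]]
soft = [0.1, 0.3, 0.7, 0.9]
w, b = train_on_soft_labels(X, soft)
print(predict(w, b, [0.0]), predict(w, b, [3.0]))
```

Because the targets keep the label model's uncertainty, examples where the functions disagreed pull on the weights less than examples where they agreed.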
Inheriting the bias of the rules. Labeling functions reflect the author's assumptions about the task, and a model trained on programmatic labels learns those assumptions as ground truth. If your spam-detection rules implicitly correlate with sender language (English speakers under-flagged), the model bakes the bias in. Holding out a hand-labeled audit slice is the only reliable check.