A system for rapidly creating training sets with weak supervision


Snorkel: A System for Fast Training Data Creation

Snorkel is a system for rapidly creating, modeling, and managing training data.

Today's state-of-the-art machine learning models require massive labeled training sets, which usually do not exist for real-world applications. Instead, Snorkel is based around the new data programming paradigm, in which the developer focuses on writing a set of labeling functions: scripts that programmatically label data. The resulting labels are noisy, but Snorkel automatically models this labeling process, learning, in essence, which labeling functions are more accurate than others, and then uses these estimates to train an end model (for example, a deep neural network in TensorFlow).


Surprisingly, by modeling the noisy training set creation process in this way, we can take potentially low-quality labeling functions from the user and use them to train high-quality end models. We see Snorkel as providing a general framework for many weak supervision techniques, and as defining a new programming model for weakly-supervised machine learning systems.
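To make the idea concrete, here is a minimal, self-contained sketch of what labeling functions look like. Note that this is illustrative plain Python, not Snorkel's actual API, and it combines labels with a simple majority vote, whereas Snorkel learns the accuracies of the labeling functions to denoise them:

```python
# Hypothetical label values; Snorkel-style conventions, not its real API.
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

def lf_contains_causes(text):
    # Heuristic: the word "causes" suggests a positive (causal) relation.
    return POSITIVE if "causes" in text.lower() else ABSTAIN

def lf_contains_not(text):
    # Heuristic: an explicit negation suggests a negative label.
    return NEGATIVE if " not " in f" {text.lower()} " else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_causes, lf_contains_not]

def majority_vote(text):
    # Combine the noisy votes; Snorkel instead learns per-LF accuracies.
    votes = [lf(text) for lf in LABELING_FUNCTIONS if lf(text) != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)
```

The resulting (noisy) labels would then be used as training data for a downstream discriminative model.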

Users & Sponsors

Snorkel @ Google


[3/14/19] Check out a new blog post and accompanying SIGMOD paper on Snorkel's usage at Google, and some thoughts about applying weak supervision at industrial scale! Big theme: use organizational knowledge resources to quickly and effectively supervise modern ML. We're really excited to keep working with these teams!

Snorkel on GLUE Benchmark

GLUE Leaderboard

[3/21/19] Using Snorkel MeTaL, our multi-task version of Snorkel, we achieved new state-of-the-art scores on the GLUE Benchmark and four of its component tasks. The primary theme of our success: bringing as much supervision signal to bear as possible. Read our blog post for more detail. The Massive Multi-Task Learning (MMTL) module of Snorkel MeTaL that we used will be released as a part of v0.5 in April.

Snorkel @ NeurIPS 2018


We're excited that Snorkel will be featured in Kunle Olukotun's keynote at NeurIPS this year; to support this, we've assembled some pointers:

Snorkel @ VLDB 2018


We're excited to be presenting on Snorkel at this year's VLDB conference in Rio de Janeiro, on Tuesday 8/28 in the "Database Techniques for Machine Learning" session. We're also honored that the corresponding paper, Snorkel: Rapid Training Data Creation with Weak Supervision, has been invited to the annual "Best of VLDB" Special Issue!

Introducing Snorkel MeTaL: Weak Supervision for Massively Multi-Task Learning


Complex problems are often composed of multiple tasks, and may have many different types of weak supervision that provide labels for one or more of these tasks. In Snorkel MeTaL, we use a new modeling approach to denoise this massively multi-task weak supervision, and then train an auto-compiled multi-task network with it. Check out:

Snorkel ACM Summer School Workshop

Check out the workshop materials for the upcoming ACM summer school lecture on Snorkel, Weak Supervision & Software 2.0!

Snorkel in the Wild

A partial sampling of some public reports of Snorkel / data programming usage:

Blog Posts and Tutorials

If you're new, get started with the first blog post on data programming, and then check out the Snorkel intro tutorial!


Best References:

Further Reading:



Sponsored in part by DARPA as part of the D3M program under contract No. FA8750-17-2-0095 and the SIMPLEX program under contract No. N66001-15-C-4043, and also by the NIH through the Mobilize Center under grant number U54EB020405.

Mailing List

Feel free to subscribe to the Snorkel-dev mailing list for Snorkel-related announcements, notifications, and discussions!

Next Steps

See the Snorkel repository