The System for Programmatically Building and Managing Training Data


Snorkel: The System for Programmatically Building and Managing Training Data

Snorkel is a system for programmatically building and managing training datasets to rapidly and flexibly fuel machine learning models.

Today's state-of-the-art machine learning models are more powerful and easier to use than ever before; however, they require massive training datasets. Traditionally, these training datasets require slow and often prohibitively expensive manual labeling by domain experts. In Snorkel, users instead write programmatic operations to label, transform, and structure training datasets for machine learning, without needing to hand-label any training data; Snorkel then uses modern, theoretically grounded modeling techniques to clean and integrate the resulting training data.

In a wide range of applications, from medical image monitoring to text information extraction to industrial deployments over web data, Snorkel provides a radically faster and more flexible way to build machine learning applications by letting users programmatically build and manipulate training data rather than label it by hand. Snorkel focuses on three key operations: labeling data, for example using heuristic rules or distant supervision techniques; transforming data, for example to perform data augmentation and express invariances in the data; and slicing data into different critical subsets.


Snorkel Users and Sponsors

Recent News

[6/4/19] Two papers using Snorkel (cardiac MRI imaging, GWAS studies) accepted to Nature Communications

[4/21/19] Two papers around learning weak supervision structure and augmentation theory accepted to ICML 2019

[3/14/19] SIGMOD paper and Google AI blog on Snorkel's usage at Google

[See References for more]

Snorkel Highlights

Snorkel has been used in production applications at places like Google, IBM, and Intel, and has recently been used to achieve state-of-the-art performance on the GLUE and SuperGLUE language understanding benchmarks!


Snorkel has been used extensively in medical settings. Two recent Nature Communications publications highlight this work, covering automated GWAS curation and cardiac MRI classification. Snorkel has also been used in various radiology and neurological monitoring settings, where it has replaced person-months of hand-labeling, and to extract information from electronic health record (EHR) data, offering a scalable solution for national medical device surveillance.

For more use cases, as well as the various academic publications explaining the technical work underlying Snorkel, see the References section below.

Snorkel in More Detail

Snorkel is a general framework that supports several weak supervision techniques and allows domain experts to encode their knowledge programmatically to provide supervision through the following operations:



Rather than requiring users to label training data points by hand, Snorkel takes as input labeling functions (LFs), functions that heuristically or noisily label some subset of the training examples. Snorkel then models the quality and correlations of these LFs using novel, theoretically-grounded statistical modeling techniques. Read more here:
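To make the idea of labeling functions concrete, here is a minimal sketch in plain Python. It does not use Snorkel's own API: the function names, the spam/ham task, and the `ABSTAIN` convention are illustrative assumptions, and a simple majority vote stands in for the statistical model Snorkel learns over LF quality and correlations.

```python
from collections import Counter

# Hypothetical label scheme for an illustrative spam task.
ABSTAIN, HAM, SPAM = -1, 0, 1

def lf_contains_link(text):
    """Vote SPAM if the message contains a URL, otherwise abstain."""
    return SPAM if "http" in text.lower() else ABSTAIN

def lf_short_message(text):
    """Vote HAM for very short messages, otherwise abstain."""
    return HAM if len(text.split()) < 5 else ABSTAIN

def lf_mentions_prize(text):
    """Vote SPAM if the message mentions a prize, otherwise abstain."""
    return SPAM if "prize" in text.lower() else ABSTAIN

LFS = [lf_contains_link, lf_short_message, lf_mentions_prize]

def apply_lfs(texts, lfs=LFS):
    """Build the label matrix: one row per example, one column per LF."""
    return [[lf(text) for lf in lfs] for text in texts]

def majority_vote(row):
    """Combine one row of LF votes; abstain when no LF fires or on a tie.

    This is a stand-in for a learned label model, not the model itself.
    """
    votes = Counter(v for v in row if v != ABSTAIN)
    if not votes:
        return ABSTAIN
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]

texts = [
    "Win a prize at http://x.example",
    "thanks, see you",
    "the quarterly report is attached for your review today",
]
label_matrix = apply_lfs(texts)
training_labels = [majority_vote(row) for row in label_matrix]
```

Each LF labels only the subset of examples it recognizes and abstains elsewhere, which is why the combination step has to handle missing and conflicting votes.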



Snorkel also lets users write transformation functions (TFs) to heuristically generate new, modified training examples by transforming existing ones, a strategy often referred to as data augmentation. Rather than requiring users to tune these data augmentation or transformation strategies by hand, Snorkel learns compositions of transformations across various domain-specific tasks to optimize for a representative training set. Read more here:
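The following plain-Python sketch illustrates what transformation functions and their composition might look like for text data. The TF names and the `augment` helper are hypothetical, not Snorkel's API, and the sequences of TFs are sampled uniformly at random here, whereas Snorkel is described above as learning which compositions to apply.

```python
import random

def tf_swap_adjacent_words(text, rng):
    """Swap one random pair of adjacent words; return None if not applicable."""
    words = text.split()
    if len(words) < 2:
        return None
    i = rng.randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

def tf_drop_word(text, rng):
    """Drop one random word; return None for texts that are too short."""
    words = text.split()
    if len(words) < 3:
        return None
    del words[rng.randrange(len(words))]
    return " ".join(words)

TFS = [tf_swap_adjacent_words, tf_drop_word]

def augment(examples, tfs=TFS, n_per_example=2, seq_len=2, seed=0):
    """Return the original (text, label) pairs plus transformed copies.

    Each copy is produced by a random sequence of `seq_len` TFs (an
    assumption for illustration; a learned policy would replace this).
    Labels are carried over unchanged, which encodes the invariance
    that these edits should not change the class.
    """
    rng = random.Random(seed)
    augmented = []
    for text, label in examples:
        augmented.append((text, label))
        for _ in range(n_per_example):
            new_text = text
            for tf in rng.choices(tfs, k=seq_len):
                result = tf(new_text, rng)
                if result is not None:
                    new_text = result
            if new_text != text:  # skip copies that ended up unchanged
                augmented.append((new_text, label))
    return augmented

train = [("the movie was really great", 1)]
augmented_train = augment(train)
```

Keeping the label fixed under each transformation is the design choice that matters: a TF should only be written for edits the user believes are label-preserving.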



Finally, Snorkel lets users write slicing functions (SFs) to heuristically identify subsets of the data that the model should particularly care about, for example by devoting extra representational capacity to them, because of their difficulty and/or importance. Snorkel models slices in the style of multi-task learning and then learns an attention mechanism over the resulting heads. Read more here:
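As a rough illustration of slicing functions, the sketch below uses plain Python to mark slice membership and report accuracy per slice. The SF names, the example schema (`text` and `label` keys), and both helpers are assumptions made for this example. It covers only identifying and monitoring slices, not the multi-task and attention-based modeling described above.

```python
def sf_short_text(example):
    """Slice of very short inputs, which may be harder to classify."""
    return len(example["text"].split()) < 4

def sf_has_question(example):
    """Slice of inputs that contain a question."""
    return "?" in example["text"]

SFS = {"short_text": sf_short_text, "has_question": sf_has_question}

def slice_membership(examples, sfs=SFS):
    """Map each slice name to the indices of the examples it contains."""
    return {
        name: [i for i, ex in enumerate(examples) if sf(ex)]
        for name, sf in sfs.items()
    }

def per_slice_accuracy(examples, predictions, sfs=SFS):
    """Accuracy restricted to each slice; None for empty slices."""
    report = {}
    for name, indices in slice_membership(examples, sfs).items():
        if not indices:
            report[name] = None
            continue
        correct = sum(predictions[i] == examples[i]["label"] for i in indices)
        report[name] = correct / len(indices)
    return report

examples = [
    {"text": "is this spam?", "label": 1},
    {"text": "ok thanks", "label": 0},
    {"text": "please review the attached document today", "label": 0},
    {"text": "why would you send this to me?", "label": 1},
]
predictions = [1, 1, 0, 1]
report = per_slice_accuracy(examples, predictions)
```

Slices may overlap (the first example belongs to both), so per-slice metrics are computed independently rather than as a partition of the data.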

Snorkel can also operate over other forms of weak supervision, such as crowdsourcing, by modeling individual workers as labeling functions. To take full advantage of all available supervision signal, Snorkel can use multi-task learning and transfer learning, moving towards massive multi-task learning to incorporate diverse supervision of varying granularity at large scale.
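One way to picture "workers as labeling functions" is the plain-Python sketch below. The annotation format (worker, example ID, label), the helper names, and the `ABSTAIN` value are all assumptions for illustration rather than Snorkel's API. Each worker becomes a function that returns the label they gave an example and abstains on examples they never saw.

```python
ABSTAIN = -1  # hypothetical marker for "this worker did not label this example"

def make_worker_lf(worker_id, annotations):
    """Wrap one crowd worker's annotations as a labeling function."""
    worker_labels = {
        example_id: label
        for (worker, example_id, label) in annotations
        if worker == worker_id
    }

    def lf(example_id):
        return worker_labels.get(example_id, ABSTAIN)

    lf.__name__ = f"lf_worker_{worker_id}"
    return lf

def build_label_matrix(example_ids, annotations):
    """One row per example, one column per worker (sorted by worker ID)."""
    worker_ids = sorted({worker for (worker, _, _) in annotations})
    lfs = [make_worker_lf(worker, annotations) for worker in worker_ids]
    matrix = [[lf(example_id) for lf in lfs] for example_id in example_ids]
    return worker_ids, matrix

annotations = [
    ("alice", "t1", 1),
    ("bob", "t1", 0),
    ("alice", "t2", 0),
]
worker_ids, matrix = build_label_matrix(["t1", "t2", "t3"], annotations)
```

The resulting matrix has the same shape as one produced by hand-written LFs, so worker votes and heuristic votes can be combined by the same downstream model.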


Blogs and Tutorials

Papers and Pre-Prints

Snorkel Use Cases