A training data creation and management system focused on information extraction

View the Project on GitHub HazyResearch/snorkel

Blog

Acknowledgements

Sponsored in part by DARPA as part of the SIMPLEX program under contract number N66001-15-C-4043 and also by the NIH through the Mobilize Center under grant number U54EB020405.

Getting Started

Motivation

Snorkel is a system for rapidly creating, modeling, and managing training data, currently focused on accelerating the development of structured or "dark" data extraction applications for domains in which large labeled training sets are not available or easy to obtain.

Today's state-of-the-art machine learning models require massive labeled training sets--which usually do not exist for real-world applications. Instead, Snorkel is based around the new data programming paradigm, in which the developer focuses on writing a set of labeling functions, which are just scripts that programmatically label data. The resulting labels are noisy, but Snorkel automatically models this process—learning, essentially, which labeling functions are more accurate than others—and then uses this to train an end model (for example, a deep neural network in TensorFlow).

Surprisingly, by modeling a noisy training set creation process in this way, we can take potentially low-quality labeling functions from the user, and use these to train high-quality end models. We see Snorkel as providing a general framework for many weak supervision techniques, and as defining a new programming model for weakly-supervised machine learning systems.

Users

We're lucky to have some amazing collaborators who are currently using Snorkel!

References

Next Steps

See the Snorkel repository