Scaling Up Snorkel with Spark

Post by Alex Ratner, Stephen Bach, Theo Rekatsinas, and Chris Ré
Snorkel Blog

And referencing work by many other members of Hazy Research

For the attendees of the upcoming Spark Summit 2017, we’ve begun creating Spark hooks in Snorkel!

We'll start with a quick refresher about Snorkel:

Snorkel: Machine Learning without Hand-Labeled Training Data

In the age of deep learning, getting enough labeled training data is now the main bottleneck in developing real-world machine learning applications. Snorkel tackles this issue by providing a unifying framework for programmatically generated training data, i.e., training data labeled using domain expert patterns, heuristics, rules of thumb, data resources, etc. In Snorkel, these are all encoded as labeling functions, which are just Python functions, enabling arbitrary expressivity. We call this type of training data weak supervision because it’s noisier and less accurate than the expensive, manually curated “gold” labels that machine learning models are usually trained on. However, Snorkel automatically denoises this training data, so that we can then use it to train state-of-the-art models.
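To make the idea of labeling functions concrete, here is a minimal sketch in plain Python. It does not use the Snorkel API: the function names, the label convention (1 = positive, -1 = negative, 0 = abstain), and the toy spouse-extraction task are all assumptions made for illustration. The majority vote at the end is only a simple stand-in for Snorkel's denoising step, which learns the accuracies of the labeling functions with a generative model instead.

```python
import re

# Assumed label convention for this sketch (not taken from the Snorkel codebase).
POSITIVE, NEGATIVE, ABSTAIN = 1, -1, 0

# Hypothetical external data resource: pairs already known to be spouses.
KNOWN_COUPLES = {("Alice", "Bob")}


def lf_spouse_keyword(sentence):
    """Heuristic: spouse-related words suggest a positive example."""
    pattern = r"\b(wife|husband|married)\b"
    return POSITIVE if re.search(pattern, sentence, re.I) else ABSTAIN


def lf_family_keyword(sentence):
    """Heuristic: other family words suggest the pair are not spouses."""
    pattern = r"\b(brother|sister|mother|father)\b"
    return NEGATIVE if re.search(pattern, sentence, re.I) else ABSTAIN


def lf_known_couple(sentence):
    """Data resource: both members of a known couple appear in the sentence."""
    for a, b in KNOWN_COUPLES:
        if a in sentence and b in sentence:
            return POSITIVE
    return ABSTAIN


LABELING_FUNCTIONS = [lf_spouse_keyword, lf_family_keyword, lf_known_couple]


def majority_vote(votes):
    """Simplified stand-in for denoising: the sign of the summed votes."""
    total = sum(votes)
    if total > 0:
        return POSITIVE
    if total < 0:
        return NEGATIVE
    return ABSTAIN


sentences = [
    "Alice and her husband Bob went home.",
    "Carol visited her brother Dan.",
    "Eve met Frank at work.",
]

# One row per sentence, one column per labeling function.
label_matrix = [[lf(s) for lf in LABELING_FUNCTIONS] for s in sentences]
training_labels = [majority_vote(row) for row in label_matrix]
```

Each labeling function is free to abstain, so the label matrix is typically sparse, and sentences on which every function abstains simply receive no training label.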

Check out the basic Snorkel pipeline on the challenging task of information extraction from natural-language text with these tutorials:

Snark: Snorkel on Spark


One of the key advantages of programmatic supervision is that it can be applied to massive quantities of unlabeled data. In our research, we've found that larger quantities of noisier training data (especially when denoised by Snorkel) can lead to higher-quality trained models than small sets of hand-labeled data. The key is being able to apply the user's programmatic supervision at scale. With Spark integration, this is no problem for Snorkel!
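Applying labeling functions is a good fit for Spark because each data point is labeled independently of the others. The sketch below shows the general shape of that computation; it is not the actual Snorkel Spark hook. The labeling functions, the `label_partition` helper, and the `(doc_id, text)` row layout are all hypothetical. To stay runnable without a cluster, it simulates partitions with plain Python lists; the comment at the end shows how the same function could be handed to PySpark's `RDD.mapPartitions`, assuming a `SparkContext` named `sc` is available.

```python
# Assumed label convention for this sketch: 1 = positive, -1 = negative, 0 = abstain.
POSITIVE, NEGATIVE, ABSTAIN = 1, -1, 0


def lf_spouse_word(text):
    """Hypothetical labeling function: spouse words vote positive."""
    lowered = text.lower()
    return POSITIVE if ("husband" in lowered or "wife" in lowered) else ABSTAIN


def lf_sibling_word(text):
    """Hypothetical labeling function: sibling words vote negative."""
    lowered = text.lower()
    return NEGATIVE if ("brother" in lowered or "sister" in lowered) else ABSTAIN


LABELING_FUNCTIONS = [lf_spouse_word, lf_sibling_word]


def label_partition(rows):
    """Apply every labeling function to each (doc_id, text) row of one partition.

    Takes an iterator and yields results lazily, which is the calling
    convention that PySpark's RDD.mapPartitions expects.
    """
    for doc_id, text in rows:
        yield doc_id, [lf(text) for lf in LABELING_FUNCTIONS]


def split_into_partitions(rows, num_partitions):
    """Local stand-in for Spark partitioning: a round-robin split."""
    return [rows[i::num_partitions] for i in range(num_partitions)]


rows = [
    (0, "Alice and her husband Bob."),
    (1, "Carol and her brother Dan."),
    (2, "Eve met Frank."),
    (3, "Grace is the wife of Henry."),
]

# Simulate two workers, each labeling its own partition independently.
partitions = split_into_partitions(rows, 2)
labels = dict(pair for part in partitions for pair in label_partition(iter(part)))

# With PySpark and an existing SparkContext `sc` (an assumption here), the same
# function could be applied across a cluster, for example:
#   labels = dict(sc.parallelize(rows).mapPartitions(label_partition).collect())
```

Because no labeling function depends on any other row, the work needs no coordination between partitions, so adding workers should scale the labeling step roughly linearly.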

Check out some basic examples of Snorkel + Spark integration:

We'll be rolling out more Spark hooks for different components of the Snorkel pipeline, but if you have ideas for your own, let us know!

More Coming Soon!

As we approach Snorkel v0.6, we've been gradually releasing a lot of new features (the Spark hooks, as well as categorical variables with sparse support, easy integration of labeled data, custom priors in the generative model, and much more) that will enable a new set of applications and capabilities. Stay tuned for upcoming posts and tutorials on: