Scaling up Snorkel with Spark
Post by Alex Ratner, Stephen Bach, Theo Rekatsinas, and Chris Ré
And referencing work by many other members of Hazy Research
For the attendees of the upcoming Spark Summit 2017, we’ve begun creating Spark hooks in Snorkel!
We'll start with a quick refresher about Snorkel:
Snorkel: Machine Learning without Hand-Labeled Training Data
In the age of deep learning, getting enough labeled training data is now the key bottleneck in developing real-world machine learning applications. Snorkel tackles this issue by providing a unifying framework for programmatically generated training data--i.e., training data labeled using domain-expert patterns, heuristics, rules of thumb, data resources, etc. In Snorkel, these are all encoded as labeling functions, which are just Python functions, enabling arbitrary expressivity. We call this type of training data weak supervision because it's noisier and less accurate than the expensive, manually-curated "gold" labels that machine learning models are usually trained on. However, Snorkel automatically denoises this noisy training data, so that we can then use it to train state-of-the-art models.
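To make this concrete, here is a minimal sketch of what labeling functions look like. The dict-based candidate and the function names are simplified assumptions for illustration (Snorkel's actual candidate objects and decorators differ); the voting convention shown (+1 / -1 / 0 for abstain) is the key idea.

```python
import re

# Snorkel-style vote convention: +1 (true), -1 (false), 0 (abstain).
SPOUSE, NOT_SPOUSE, ABSTAIN = 1, -1, 0

def lf_marriage_words(candidate):
    """Vote for the spouse relation if marriage-related words appear."""
    if re.search(r"\b(wife|husband|married|spouse)\b", candidate["sentence"].lower()):
        return SPOUSE
    return ABSTAIN

def lf_sibling_words(candidate):
    """Vote against the spouse relation if sibling-related words appear."""
    if re.search(r"\b(brother|sister|sibling)\b", candidate["sentence"].lower()):
        return NOT_SPOUSE
    return ABSTAIN

# A simplified candidate: a pair of person mentions plus their sentence.
candidate = {
    "person1": "Barack Obama",
    "person2": "Michelle Obama",
    "sentence": "Barack Obama and his wife Michelle attended the gala.",
}

print(lf_marriage_words(candidate))  # 1 (votes "spouse")
print(lf_sibling_words(candidate))   # 0 (abstains)
```

Each labeling function is allowed to be noisy and to abstain; Snorkel's generative model then learns their accuracies and correlations to produce denoised training labels.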
Check out the basic Snorkel pipeline on the challenging task of information extraction from natural-language text with these tutorials:
- Intro Tutorial: A gentle introduction to Snorkel on a classic problem from the relation extraction literature: extracting mentions of spouse relations from news articles!
- Chemical-Disease Relations Tutorial: Get much closer to a real deployment application, tackling the challenging task of extracting chemical-causes-disease relations from PubMed abstracts using Snorkel.
Snark: Snorkel on Spark
One of the key advantages of using programmatic supervision is that it can be applied to massive quantities of unlabeled data. In our research, we've found that larger quantities of noisier training data (especially when denoised by Snorkel) can lead to higher-quality trained models than small sets of hand-labeled data. The key is being able to apply the user's programmatic supervision at scale.
With Spark integration, this is no problem for Snorkel!
Check out some basic examples of Snorkel + Spark integration below. We'll be rolling out more Spark hooks for different components of the Snorkel pipeline, but if you have ideas of your own, let us know at https://github.com/HazyResearch/snorkel/issues!
- Spark-Distributed Labeling Functions Tutorial: In developing Snorkel applications, applying labeling functions is the central data-intensive operation, and it can be slow at scale. In this tutorial, we show how to run labeling functions at scale using Spark!
- Crowdsourcing Tutorial: Snorkel can subsume many other types of noisy training signal as well. One great example is labels from potentially unreliable (and conflicting) crowd workers. Check out this tutorial for a simple example of how to process crowdsourced data using SparkSQL + DataFrames, and then load it into Snorkel's native format. Then we use the data to train an LSTM sentiment analysis model and classify new data!
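The reason labeling functions distribute so well is that applying them is embarrassingly parallel: each candidate is labeled independently, so the operation maps directly onto Spark's `RDD.map`. Here is a sketch with hypothetical labeling functions; the local list comprehension at the bottom is the single line that becomes a distributed map on Spark (with a `SparkContext` `sc`: `sc.parallelize(candidates).map(apply_lfs).collect()`).

```python
# Two toy labeling functions for a chemical-causes-disease style task
# (hypothetical names; votes are +1, -1, or 0 for abstain).
def lf_contains_cause(candidate):
    return 1 if "causes" in candidate["sentence"] else 0

def lf_no_evidence(candidate):
    return -1 if "no evidence" in candidate["sentence"] else 0

labeling_functions = [lf_contains_cause, lf_no_evidence]

def apply_lfs(candidate):
    """Return one row of the label matrix: each LF's vote on this candidate."""
    return [lf(candidate) for lf in labeling_functions]

candidates = [
    {"sentence": "Aspirin causes stomach irritation in some patients."},
    {"sentence": "There is no evidence that the drug causes headaches."},
]

# Locally this is a plain map; on Spark, the same apply_lfs function is
# shipped to the workers: sc.parallelize(candidates).map(apply_lfs).collect()
label_matrix = [apply_lfs(c) for c in candidates]
print(label_matrix)  # [[1, 0], [1, -1]]
```

Because labeling functions are ordinary Python functions with no shared state, they serialize cleanly to Spark workers, and the resulting label matrix can be collected for Snorkel's generative model to denoise.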
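On the crowdsourcing side, the key framing is that each crowd worker can be treated as one more noisy labeling source. Snorkel's generative model learns per-worker accuracies; the sketch below (with a hypothetical data layout) shows only the simpler majority-vote baseline that such a model improves on.

```python
from collections import Counter

# Hypothetical layout: item id -> list of (worker_id, label) votes,
# where 1 = positive sentiment and -1 = negative sentiment.
worker_labels = {
    "tweet_1": [("w1", 1), ("w2", 1), ("w3", -1)],
    "tweet_2": [("w1", -1), ("w2", -1), ("w3", -1)],
}

def majority_vote(votes):
    """Return the most common label among the workers' votes."""
    counts = Counter(label for _, label in votes)
    return counts.most_common(1)[0][0]

labels = {item: majority_vote(votes) for item, votes in worker_labels.items()}
print(labels)  # {'tweet_1': 1, 'tweet_2': -1}
```

Majority vote weights every worker equally; modeling workers as labeling sources instead lets unreliable or conflicting annotators be down-weighted before the labels are used to train a downstream model like the LSTM in the tutorial.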
More Coming Soon!
As we approach Snorkel v0.6, we've been gradually releasing a lot of new features (including the Spark hooks, as well as categorical variables with sparse support, easy integration of labeled data, custom priors in the generative model, and much more) that will enable a new set of applications and capabilities.
Stay tuned for upcoming posts and tutorials on:
- Data cleaning and integration in Snorkel
- Entity tagging + linking
- Image classification
- Extraction from messy tabular data (blog)
- And much more!