Structure Learning: Are Your Sources Only Telling You What You Want to Hear?

Post by Stephen Bach, Bryan He, Alex Ratner, and Chris Ré
Snorkel Blog

And referencing work by many other members of Hazy Research

In data programming, we use a generative model to synthesize training labels from weak supervision sources like heuristic rules and distant supervision. The model's dependency structure directly affects the quality of the results. We've introduced a method to learn the structure without any labeled training data. Highlights include:

  1. Correlated supervision sources act like "yes people," and the generative model should account for them.
  2. A pseudolikelihood-based method that learns the dependency structure without any labeled data.
  3. Theoretical guarantees, a 100× speedup over maximum likelihood estimation, and an average boost of 1.5 F1 points on existing applications.
  4. An easy-to-use implementation in Snorkel.

Protecting Your Generative Model from "Yes People"

In data programming, correlated supervision sources are like "yes people." Just because they agree doesn't mean they're correct. The generative model should account for correlation!

Nobody likes a "yes person." If someone isn't giving you their independent opinion, then you might think they're a more reliable source than they actually are. Trusting correlated sources can lead to disaster. You'll be overconfident, thinking that you're correct because so many sources agree. Take this case study from a British team of management consultants as an example: a Hollywood producer whose writers agree with everything he says.

In data programming, our sources of information aren't Hollywood movie writers; they're labeling functions. Recall that each labeling function is a little function written in code that votes (or abstains) on the true label for a data point. We want to combine their votes into an estimate of the true labels so that we can train machine learning models without hand-labeled data. Of course, these functions are noisy, so we use a generative model to estimate their accuracies and denoise their outputs.
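To make this concrete, here is a minimal sketch of what a labeling function might look like (the task, constants, and keyword are hypothetical, not from any particular Snorkel application):

# A hypothetical labeling function for spam detection: vote SPAM if a
# telltale phrase appears in the text, otherwise abstain
SPAM, ABSTAIN = 1, 0

def lf_contains_offer(text):
    # Vote SPAM when the trigger phrase is present; abstain otherwise
    return SPAM if "special offer" in text.lower() else ABSTAIN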

What do these generative models look like? They all start with the same basic skeleton. There's a latent variable—the true class—that generates the outputs of each labeling function. If we stop there and only have one dependency connecting each labeling function with the true label, then we have an independent, or Naive Bayes, model. This structure makes a strong assumption: each labeling function provides conditionally independent information about the class label.
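In symbols, writing Y for the true label and λ1, …, λn for the labeling function outputs, the independent model is the standard Naive Bayes factorization (the notation here is ours):

$$P(Y, \lambda_1, \ldots, \lambda_n) = P(Y) \prod_{i=1}^{n} P(\lambda_i \mid Y)$$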

The independent model is susceptible to "yes people." If multiple labeling functions are correlated with each other even after conditioning on the true label, then the maximum likelihood estimate of the parameters will overestimate the labeling functions' accuracies. It's exactly the same problem the Hollywood producer has: the sources of information seem trustworthy because they agree with each other.
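Here is a quick back-of-the-envelope illustration with invented numbers: take a single source that is 80% accurate and duplicate it three times. A model that treats the copies as independent multiplies their evidence together:

# Invented numbers for illustration: one 80%-accurate source, copied
# three times. Under a (false) independence assumption, three identical
# votes look much stronger than the single vote they really are.
acc = 0.8

def posterior(num_agreeing_votes, prior=0.5):
    # Posterior probability of the positive class if votes were independent
    like_pos = acc ** num_agreeing_votes
    like_neg = (1 - acc) ** num_agreeing_votes
    return prior * like_pos / (prior * like_pos + (1 - prior) * like_neg)

print(posterior(1))  # 0.80 -- the honest confidence
print(posterior(3))  # ~0.98 -- inflated, since the three votes are really one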

Of course, labeling functions aren't correlated because they're trying to please you and avoid getting fired. But they can be correlated for lots of other reasons. A few examples we often see with users are:

  1. Labeling functions that are variations on a theme, such as looking for keywords within different sized windows of text (see the sketch after this list).
  2. Applying the same heuristic to correlated inputs, such as one that runs on the raw tokens of a sentence and another that runs on the lemmatized tokens.
  3. Using correlated sources of external domain knowledge, such as distantly supervising with knowledge bases that cover overlapping topics.

If we don't model these correlations, we're going to overestimate the accuracies of these labeling functions, which directly affects the quality of the trained discriminative model.
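For instance, two window-based labeling functions in the spirit of case 1 (everything here is hypothetical) will agree far more often than chance, whether or not they are correct:

POSITIVE, ABSTAIN = 1, 0

def lf_keyword_window(tokens, span_index, window):
    # Vote POSITIVE if "cause" appears within `window` tokens of the span
    lo, hi = max(0, span_index - window), span_index + window + 1
    return POSITIVE if "cause" in tokens[lo:hi] else ABSTAIN

# Variations on a theme: the small window is contained in the large one,
# so these two functions' votes are strongly correlated
lf_small_window = lambda tokens, i: lf_keyword_window(tokens, i, window=3)
lf_large_window = lambda tokens, i: lf_keyword_window(tokens, i, window=10)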

Learning the Structure of Generative Models without Labeled Data

Structure learning is challenging because the true class labels are latent variables. We identify which dependencies help us predict a single labeling function in isolation, marginalizing out our uncertainty about the true class label.

How can we identify dependencies like correlations between labeling functions? We need to be able to do it quickly, because users want to iterate rapidly on their labeling functions and get performance feedback fast. On top of that, we need to be able to do it without any labeled data.

We could try a brute force approach: add all possible dependencies to the independent model and try to learn the parameters via maximum likelihood estimation. This is the same procedure as learning the parameters of the independent model, but it's going to be much slower when all possible dependencies are included. To see why, consider the structure of the model when we consider all possible correlations, illustrated as a factor graph:

Possible correlations that could be added to the conditionally independent model.

The number of possible correlations ("Cor") grows quadratically in the number of labeling functions. Maximizing the marginal likelihood of the target variables will become expensive because computing the gradient of the objective for a general factor graph requires Gibbs sampling (or some other approximation). Can we select which dependencies beyond the accuracy ("Acc") factors are worth including without computing the likelihood gradient over all these possible correlations? We can!

We propose maximizing a pseudolikelihood objective instead. The idea is to maximize the likelihood of each labeling function's output as a separate optimization, conditioned on the others' outputs. We use ℓ1-regularization and include the dependencies with sufficiently large weights in the final model. A related approach by Pradeep Ravikumar et al. works well for fully supervised problems. In data programming, the fact that the true class label is latent complicates matters. Here, each pseudolikelihood maximization is equivalent to maximum likelihood estimation of the parameters of a factor graph that looks like this:
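Concretely, for each labeling function λj we solve a separate regularized problem of roughly this form (our notation: m indexes the M unlabeled data points, λ∖j denotes all outputs except λj, and ε controls the strength of the ℓ1 penalty):

$$\hat{\theta}_j = \arg\min_{\theta} \; -\sum_{m=1}^{M} \log \sum_{y} p_{\theta}\big(\lambda_j^{(m)}, Y = y \mid \lambda_{\setminus j}^{(m)}\big) + \epsilon \lVert \theta \rVert_1$$

The inner sum marginalizes out the latent class label, and the dependencies whose learned weights are sufficiently large are kept in the final model.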

Maximum pseudolikelihood estimator for correlations for a single labeling function.

This model structure gives us a computational advantage: with only one target variable and one latent variable, we can compute the gradient of the objective exactly in polynomial time. Recall that for factor graphs with latent variables, the gradient of the maximum likelihood objective is the difference between the expected value of the feature function conditioned on the target variables and the expected value without conditioning. We can compute these expectations exactly because of the simplified distribution structure.
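In symbols, for a factor (feature) function φ with weight θ, that gradient is the standard exponential-family identity (notation ours):

$$\nabla_{\theta} \log p_{\theta}(\lambda_j \mid \lambda_{\setminus j}) = \mathbb{E}\big[\phi \mid \lambda_j, \lambda_{\setminus j}\big] - \mathbb{E}\big[\phi \mid \lambda_{\setminus j}\big]$$

Each expectation is a sum over the few possible values of the latent label Y (and, for the second term, of λj), so both terms can be evaluated exactly.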

Theoretical Guarantees and Experiments

Structure learning is more challenging than in the supervised case because the objective is nonconvex, but we show that it works under reasonable assumptions. It is also 100× faster than a maximum likelihood approach and gives a nice performance boost on existing applications.

In our paper, we analyze the probability that this structure learning method succeeds, i.e., returns exactly the correct dependency structure. Our analysis shows that for n labeling functions, O(n log n) samples are sufficient to learn models with pairwise dependencies between labeling functions with high probability. (For higher-order dependencies, the sample complexity is O(n² log n).) In contrast, in the supervised setting, the known guarantee is O(d³ log n), where d is the maximum variable degree in the true factor graph. The difference is that the supervised case admits an analysis using Lagrangian duality, enabling the bound to depend on the true distribution's sparsity structure. Since the objective for a generative model with a latent variable is nonconvex, we cannot apply this analysis.

We also measure the running time of our approach versus a brute-force maximum likelihood approach, for increasing numbers of labeling functions.

Structure learning is two orders of magnitude faster than a maximum likelihood approach.

We find that our approach is two orders of magnitude faster.

Finally, we tested structure learning on a number of existing data programming applications, like disease tagging and chemical-disease relation extraction in PubMed, and extracting information from tables in hardware manufacturers' spec sheets. We find that it gives an average boost of 1.5 F1 points.

How to Use Structure Learning in Snorkel

Structure learning is implemented in Snorkel and easy to use.

You can easily try out structure learning in Snorkel. An example is in the CDR tutorial. You can also easily add structure learning to your own Snorkel application. Before you train your generative model, apply the dependency selector to your label matrix as follows:

from snorkel.learning.structure import DependencySelector

# Identify likely correlations among labeling functions from the label matrix
ds = DependencySelector()
deps = ds.select(L, threshold=0.1)

where L is the label matrix produced by a LabelManager. To incorporate the selected dependencies into your generative model, just pass them in as a keyword argument:

from snorkel.learning import GenerativeModel

# Train the generative model, accounting for the selected dependencies
gen_model = GenerativeModel()
gen_model.train(L, deps=deps)

That's all there is to it!

Next Steps

We're excited to continue research on automatically learning the structure of generative models for weak supervision. A few next steps: