Structure Learning: Are Your Sources Only Telling You What You Want to Hear?

Post by Stephen Bach, Bryan He, Alex Ratner, and Chris Ré

Snorkel Blog

*And referencing work by many other members of Hazy Research*

In data programming, we use a generative model to synthesize training labels from weak supervision sources like heuristic rules and distant supervision. The model's dependency structure directly affects the quality of the results. We've introduced a method to learn this structure without any labeled training data. Highlights include:

- Sublinear sample complexity in the number of possible dependencies for many model classes.
- 100× faster than maximum likelihood parameter estimation over all possible dependencies.
- An average 1.5 F1 point boost on existing applications over a conditionally independent model.

In data programming, correlated supervision sources are like "yes people." Just because they agree doesn't mean they're correct. The generative model should account for correlation!

Nobody likes a "yes person." If someone isn't giving you their independent opinion, then you might think they're a more reliable source than they actually are. Trusting correlated sources can lead to disaster. You'll be overconfident, thinking that you're correct because so many sources agree. Take this case study from a British team of management consultants as an example:

In data programming, our sources of information aren't Hollywood movie writers; they're labeling functions. Recall that each labeling function is a little function written in code that votes (or abstains) on the true label for a data point. We want to combine their votes into an estimate of the true labels so that we can train machine learning models without hand-labeled data. Of course, these functions are noisy, so we use a generative model to estimate their accuracies and denoise their outputs.

What do these generative models look like? They all start with the same basic skeleton. There's a latent variable—the true class—that generates the outputs of each labeling function. If we stop there and only have one dependency connecting each labeling function with the true label, then we have an independent, or Naive Bayes, model. This structure makes a strong assumption: each labeling function provides conditionally independent information about the class label.
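The skeleton above can be sketched in a few lines. This is an illustrative toy, not Snorkel's implementation: a latent label `y` generates every labeling function's vote independently, with labeling function `i` agreeing with `y` with probability `accuracies[i]` (abstentions are omitted for brevity).

```python
import random

def sample_independent_model(accuracies, prior=0.5, rng=random):
    """Draw (true label, votes) from a toy conditionally independent
    ("Naive Bayes") model: a latent label y in {-1, +1} generates each
    labeling function's vote independently, and labeling function i
    agrees with y with probability accuracies[i]."""
    y = 1 if rng.random() < prior else -1          # latent true class
    votes = [y if rng.random() < a else -y         # each vote is correct
             for a in accuracies]                  # with its own accuracy
    return y, votes
```

Sampling many points from this model and measuring how often each labeling function matches the latent label recovers the accuracies, which is exactly the conditional-independence assumption the structure makes.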

The independent model is susceptible to "yes people." If multiple labeling functions are actually correlated, independent of the true label, then the maximum likelihood estimate of the parameters will overweight the labeling functions' accuracies. It's exactly the same problem the Hollywood producer has: the sources of information seem trustworthy because they agree with each other.
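The overconfidence is easy to see numerically. Here is a hedged sketch (a hypothetical helper, not part of Snorkel) of the posterior when several identical votes are wrongly treated as conditionally independent:

```python
import math

def nb_posterior(num_agreeing_votes, accuracy, prior=0.5):
    """P(y = +1 | votes) when `num_agreeing_votes` identical +1 votes,
    each with the given accuracy, are treated as conditionally
    independent (the Naive Bayes assumption)."""
    log_odds = math.log(prior / (1 - prior)) \
             + num_agreeing_votes * math.log(accuracy / (1 - accuracy))
    return 1.0 / (1.0 + math.exp(-log_odds))

# One 80%-accurate source gives a posterior of 0.80; treating two
# exact copies of it as independent inflates the posterior past 0.98,
# even though the copies add no new information.
```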

Of course, labeling functions aren't correlated because they're trying to please you and avoid getting fired. But they can be correlated for lots of other reasons. A few examples we often see with users are:

- Labeling functions that are variations on a theme, such as looking for keywords within different sized windows of text.
- Applying the same heuristic to correlated inputs, such as one that runs on the raw tokens of a sentence and another that runs on the lemmatized tokens.
- Using correlated sources of external domain knowledge, such as distantly supervising with knowledge bases that cover overlapping topics.
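The first pattern above might look like this pair of hypothetical labeling functions (the names and keyword are made up for illustration): both search for the same keyword, just within different sized windows, so the larger window fires whenever the smaller one does.

```python
# Two "variations on a theme": both vote by looking for the same
# keyword, in different sized windows around a position of interest,
# so their votes are correlated even given the true label.
def lf_keyword_window(tokens, pivot, keyword="causes", window=3):
    """Vote 1 if `keyword` occurs within `window` tokens of position
    `pivot`, else 0 (abstain)."""
    lo, hi = max(0, pivot - window), pivot + window + 1
    return 1 if keyword in tokens[lo:hi] else 0

def lf_small_window(tokens, pivot):
    return lf_keyword_window(tokens, pivot, window=3)

def lf_large_window(tokens, pivot):
    # Fires whenever lf_small_window does, and sometimes when it doesn't.
    return lf_keyword_window(tokens, pivot, window=10)
```

Given the true label, these two votes are clearly not independent: knowing that `lf_small_window` fired tells you that `lf_large_window` fired too.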

Structure learning is challenging because the true class labels are latent variables. We identify which dependencies help us predict a single labeling function in isolation, marginalizing out our uncertainty about the true class label.

How can we identify dependencies like correlations between labeling functions? We need to be able to do it quickly, because users want to iterate rapidly on their labeling functions and get performance feedback fast. On top of that, we need to be able to do it without any labeled data.

We could try a brute force approach: add all possible dependencies to the independent model and try to learn the parameters via maximum likelihood estimation. This is the same procedure as learning the parameters of the independent model, but it's going to be much slower when all possible dependencies are included. To see why, consider the structure of the model when we consider all possible correlations, illustrated as a factor graph:

We propose maximizing a *pseudolikelihood* objective instead. The idea is to maximize the likelihood of each labeling function's output as a separate optimization, conditioned on the others' outputs. We use ℓ1-regularization and include the dependencies with sufficiently large weights in the final model. A related approach by Pradeep Ravikumar et al. works well for fully supervised problems. In data programming, the fact that the true class label is latent complicates matters. Here, each pseudolikelihood maximization is equivalent to maximum likelihood estimation of the parameters of a factor graph that looks like this:
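To make the neighborhood-selection idea concrete, here is a deliberately simplified sketch in the fully observed setting (it ignores the latent class label, which our actual method marginalizes out; the function name, hyperparameters, and plain proximal-gradient solver are all illustrative): each labeling function's output is regressed on the others with an ℓ1 penalty, and sufficiently large weights flag candidate dependencies.

```python
import numpy as np

def select_dependencies(L, l1=0.1, lr=0.1, epochs=500, threshold=0.05):
    """Flag candidate dependencies among labeling functions.

    L is an (n_samples, n_LFs) matrix of votes in {-1, +1}. For each
    labeling function j, fit an l1-regularized logistic model that
    predicts column j from the other columns via proximal gradient
    descent; weights above `threshold` become candidate dependencies.
    """
    n, m = L.shape
    deps = set()
    for j in range(m):
        t = (L[:, j] + 1) / 2.0               # targets in {0, 1}
        X = np.delete(L, j, axis=1).astype(float)
        w = np.zeros(m - 1)
        for _ in range(epochs):
            p = 1.0 / (1.0 + np.exp(-X @ w))  # predicted P(vote_j = +1)
            w -= lr * (X.T @ (p - t)) / n     # logistic-loss gradient step
            # soft-thresholding: the proximal step for the l1 penalty
            w = np.sign(w) * np.maximum(np.abs(w) - lr * l1, 0.0)
        others = [k for k in range(m) if k != j]
        deps.update(tuple(sorted((j, k)))
                    for idx, k in enumerate(others)
                    if abs(w[idx]) > threshold)
    return deps
```

On synthetic votes where two labeling functions are exact copies and a third is independent noise, this sketch recovers the copied pair and drops the spurious candidates; the real method additionally has to handle the latent true label.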

Structure learning is more challenging than in the supervised case because the objective is nonconvex, but we show that it works under reasonable assumptions. It also is 100× faster than a maximum likelihood approach and gives a nice performance boost on existing applications.

In our paper, we analyze the probability that this structure learning method succeeds, i.e., returns exactly the correct dependency structure. Our analysis shows that for n labeling functions, O(n log n) samples are sufficient to learn models with pairwise dependencies between labeling functions with high probability. (For higher-order dependencies, the sample complexity is O(n² log n).) In contrast, in the supervised setting, the known guarantee is O(d³ log n), where d is the maximum variable degree in the true factor graph. The difference is that the supervised case admits an analysis using Lagrangian duality, enabling the bound to depend on the true distribution's sparsity structure. Since the objective for a generative model with a latent variable is nonconvex, we cannot apply that analysis.

We also measure the running time of our approach versus a brute-force maximum likelihood approach, for increasing numbers of labeling functions.

Finally, we tested structure learning on a number of existing data programming applications, including disease tagging and chemical-disease relation extraction in PubMed, and information extraction from tables in hardware manufacturers' spec sheets. We find that it gives an average boost of 1.5 F1 points.

Structure learning is implemented in Snorkel and easy to use.

You can easily try out structure learning in Snorkel; an example is in the CDR tutorial. Adding structure learning to your own Snorkel application is also easy. Before you train your generative model, apply the dependency selector to your label matrix as follows:

```
from snorkel.learning.structure import DependencySelector

# Learn which labeling functions in the label matrix L are dependent
ds = DependencySelector()
deps = ds.select(L, threshold=0.1)
```

where `L` is the label matrix produced by a `LabelManager`. A few notes:

- The `deps` object is a collection of tuples specifying which labeling functions are related by which types of dependencies.
- The keyword argument `threshold` is a positive float that indicates how strong a dependency has to be for it to be returned in the collection. Too many dependencies? Turn it up. Too few? Turn it down.
- By default, the `DependencySelector` looks for pairwise correlations between labeling functions. Pass the keyword argument `higher_order=True` to the `select` method to also look for reinforcing and fixing dependencies (described in the data programming paper).

Then pass the selected dependencies to the generative model when you train it:

```
from snorkel.learning import GenerativeModel

gen_model = GenerativeModel()
gen_model.train(L, deps=deps)
```

That's all there is to it!
We're excited to continue research on automatically learning the structure of generative models for weak supervision. A few next steps:

- Can we strengthen the theoretical guarantees for structure learning to depend only on the sparsity of the true dependency structure, perhaps to match the guarantees known in the supervised case? We find in practice that structure learning scales linearly in the maximum labeling function degree of the true factor graph and logarithmically in the number of possible dependencies, analogous to the bounds known for the supervised case.
- Can we automatically tune the selection threshold for optimal performance? "Optimal" depends on both the degree to which the labeling functions are correlated and the compute budget of the user. We're developing an optimizer to set the threshold automatically.
- We want to learn which other types of dependencies among labeling functions are useful to model. Which ones have the biggest impact on end quality?