Snorkel Blog

In recent years, *deep learning* models have become some of the most popular choices in machine learning for a variety of problems, in large part because they greatly reduce (or eliminate) the need for manual feature engineering.

In turn, TensorFlow has quickly become one of the most popular frameworks for training said *deep learning* models. TensorFlow's symbolic execution engine makes it easy to define an arbitrary loss function--whether for a deep model or something more traditional--and then call your optimizer of choice to minimize the function using gradient descent. In this way, the barrier to deep learning has never been lower!

One of the biggest impediments to actually using deep learning in practice, however, is the requirement of **large hand-labeled training sets**. Increasingly, one approach has been to use weaker forms of supervision, i.e. programmatic or heuristic generation of training set labels which are often noisy and give conflicting signals.

Whichever way you label your training set, however, there is *some* process that you follow. The core idea in **data programming (NIPS 2016)** is that by modeling this training set creation process, you can improve quality. Right now, we're working on using this to power a new information extraction framework, Snorkel, but the concept of data programming is much more general.

In this tutorial, we'll walk through a simple toy example with synthetic data, showing how you can use data programming with TensorFlow to train arbitrary models like neural networks with only weak supervision. We'll walk through the three high-level steps of data programming:

1. Creating a noisy training set by writing *labeling functions*
2. Modeling this training set to *denoise it*
3. Training a *noise-aware* discriminative model

We note that for the most part, this is a tutorial on data programming, not deep learning. In fact, we won't use any neural networks or "deep" models in this tutorial--but everything here will be easily extendable to such models within TensorFlow! As you'll see below, step 3 can just as easily use a neural network; we simply apply a different loss function after the top layer.

The goal of this tutorial is to go through a simple but end-to-end example of data programming, along with enough math to understand the basics, e.g. what objectives are we optimizing, how do they tie together, etc. If you are comfortable setting up and training machine learning models in a framework like TensorFlow, this tutorial should set you up to try out data programming with your favorite models! Other resources:

- For a more detailed treatment, especially on the theory side, see our NIPS 2016 paper
- To see how we use this technique on real information extraction problems, check out Snorkel, in particular the intro tutorial
- For a slightly higher-level overview, see this blog post

Here we'll load the necessary libraries and generate some synthetic data, which we'll store as an $n \times d$ matrix $X_s$, where each row represents a data point $x \in \{0,1\}^d$.

We'll also generate a vector of ground-truth labels $Y_s \in \{-1,1\}^n$; we'll henceforth consider all but a small set of these labels *unseen*, as the whole point of our approach is to make do without labeled training data!

```
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import tensorflow as tf
np.random.seed(123)
%matplotlib inline
%load_ext autoreload
%autoreload 2

# n is the number of data points, and d is the dimension of the feature vector
# that we represent each of them as
n = 10000
d = 100
Ys = 2 * np.random.randint(2, size=(n,)) - 1

# We think of the binary features as functions each having some (unknown)
# correlation with the target label, which we'll set randomly in [0.4,0.6]
feature_accs = 0.2 * np.random.random((d,)) + 0.4
Xs = np.zeros((n, d))
for i in range(n):
    for j in range(d):
        if np.random.random() > feature_accs[j]:
            Xs[i,j] = 1 if Ys[i] == 1 else 0
        else:
            Xs[i,j] = 0 if Ys[i] == 1 else 1
```
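The nested Python loop above makes a million calls to the random number generator, which can be slow. As a rough sketch of the same sampling scheme in vectorized NumPy (the function name `make_synthetic_data` and the `_vec` variable names are ours, and the draws happen in a different order, so the values will not match the loop version exactly):

```python
import numpy as np

def make_synthetic_data(n=10000, d=100, seed=123):
    """Vectorized sketch of the data-generation loop above."""
    rng = np.random.RandomState(seed)
    # Ground-truth labels in {-1, 1}
    Ys = 2 * rng.randint(2, size=n) - 1
    # Per-feature thresholds in [0.4, 0.6], as in the loop version
    feature_accs = 0.2 * rng.random_sample(d) + 0.4
    # agree[i, j] is True when feature j "votes with" the label of point i
    agree = rng.random_sample((n, d)) > feature_accs
    is_pos = (Ys == 1)[:, None]
    # Agreeing features copy the label (1 for y=1, 0 for y=-1); the rest flip it
    Xs = np.where(agree, is_pos, ~is_pos).astype(float)
    return Xs, Ys, feature_accs

Xs_vec, Ys_vec, accs_vec = make_synthetic_data()
```

Each column of `Xs_vec` should then agree with the label on roughly a `1 - accs_vec[j]` fraction of the rows, which is the same weak per-feature signal the loop produces.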

In this tutorial our goal is to train a *classification model* that, when given an unseen data point $x$, will predict a correct label $y$. Here, we'll consider the binary classification setting ($y\in\{-1,1\}$) for simplicity.

The most important part about our setup is that we'll assume we **don't have access to any ground truth training labels**; instead, we'll use noisy labeling functions to approximate these training labels.

In this step, we'll:

i. Explain the concept of *labeling functions*

ii. Make some synthetic toy labeling functions, and generate a noisy label matrix $L$

Recall that we already generated some synthetic data, which we stored as an $n \times d$ matrix $X_s$, where each row represents a data point $x \in \{0,1\}^d$.

In most cases, we do not have access to more than a small number of the ground truth labels, and it is not feasible to get these labels. However, there are often many ways that we can provide *weaker* supervision: in other words, noisy and possibly conflicting approximations of subsets of $Y_s$. Data programming provides a simple, unifying framework for such strategies: namely, we express them as *labeling functions*, which simply take in a data point $x$ and either abstain (return 0) or return a label.

In other words, instead of hand-labeling data to create a training set for our model, we write functions that look something like this:

```
# NOTE: match, pattern, noisy_set, weak_clf_1 and crowd_labels are illustrative
# placeholders; they are not defined anywhere in this tutorial.

def positive_heuristic(x):
    """Label points as true if they match some heuristic value"""
    return 1 if match(pattern, x) else 0

def negative_distant_supervision(x):
    """Label as false if they are in a certain set of examples"""
    return -1 if x in noisy_set else 0

def weak_classifier_1(x):
    """Use a weak or biased classifier of unknown accuracy / relevance."""
    # The model object needs a name distinct from this function
    return weak_clf_1.predict(x)

def crowd_worker_1(x):
    """Represent a crowdsourcer as an LF."""
    return crowd_labels[0][x.id] if x.id in crowd_labels[0] else 0
```

We won't actually use the above LFs in this tutorial, but they serve to show some of the expressive range that labeling functions can capture. If you're interested in more on this sort of task--tutorials, results, etc.--check out Snorkel!
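For something concrete that does run, here is a minimal sketch of labeling functions over binary feature vectors like the rows of $X_s$. The three functions, the feature indices they inspect, and the helper `apply_lfs` are all made up for illustration; they are not part of Snorkel or of the rest of this tutorial:

```python
import numpy as np

# Hypothetical labeling functions over a binary feature vector x.
# Each returns 1 or -1 to vote, or 0 to abstain.
def lf_feature_0_on(x):
    """Vote +1 when feature 0 is set; otherwise abstain."""
    return 1 if x[0] == 1 else 0

def lf_feature_1_off(x):
    """Vote -1 when feature 1 is unset; otherwise abstain."""
    return -1 if x[1] == 0 else 0

def lf_features_0_and_2(x):
    """Vote +1 only when features 0 and 2 are both set."""
    return 1 if (x[0] == 1 and x[2] == 1) else 0

def apply_lfs(lfs, X):
    """Build an n x m label matrix with entries in {-1, 0, 1}."""
    return np.array([[lf(x) for lf in lfs] for x in X])

X_toy = np.array([[1, 0, 1],
                  [0, 1, 0],
                  [1, 1, 0]])
L_toy = apply_lfs([lf_feature_0_on, lf_feature_1_off, lf_features_0_and_2], X_toy)
print(L_toy)
```

The first row already shows a conflict (two votes for +1, one for -1), and the second row gets no votes at all; both situations are what the generative model later has to deal with.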

Writing and iterating on labeling functions is the core development task in a data programming-based ML pipeline; for this tutorial though, we'll just use synthetic labeling functions which have random accuracies between 45% and 75%, and coverage of 25%, generating a *noisy label matrix* $L_s$:

```
m = 20
lf_accs = 0.3 * np.random.random((m,)) + 0.45
LF_COVERAGE = 0.25
Ls = np.zeros((n, m))
for i in range(n):
    for j in range(m):
        if np.random.random() < LF_COVERAGE:
            Ls[i,j] = Ys[i] if np.random.random() < lf_accs[j] else -Ys[i]
```
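It can help to look at a few summary statistics of a label matrix before modeling it. The sketch below is one way to do that, assuming only that entries are in $\{-1, 0, 1\}$ with 0 meaning "abstain"; the helper name `label_matrix_stats` and the `_demo` variables are ours, and the demo matrix is generated with its own random state rather than reusing `Ls`:

```python
import numpy as np

def label_matrix_stats(L):
    """Per-LF coverage, fraction of conflicted points, fraction of unlabeled points."""
    L = np.asarray(L)
    coverage = (L != 0).mean(axis=0)
    has_pos = (L == 1).any(axis=1)
    has_neg = (L == -1).any(axis=1)
    conflict_rate = (has_pos & has_neg).mean()
    unlabeled_rate = (L == 0).all(axis=1).mean()
    return coverage, conflict_rate, unlabeled_rate

# Demo matrix built with the same recipe as above (accuracies in [0.45, 0.75],
# coverage 0.25), but vectorized and with a separate RNG
rng = np.random.RandomState(0)
n_demo, m_demo = 5000, 20
Y_demo = 2 * rng.randint(2, size=n_demo) - 1
acc_demo = 0.3 * rng.random_sample(m_demo) + 0.45
votes = np.where(rng.random_sample((n_demo, m_demo)) < acc_demo, 1, -1) * Y_demo[:, None]
L_demo = votes * (rng.random_sample((n_demo, m_demo)) < 0.25)

coverage, conflict_rate, unlabeled_rate = label_matrix_stats(L_demo)
print("Mean coverage: %.3f" % coverage.mean())
print("Points with conflicting votes: %.3f" % conflict_rate)
print("Points with no votes: %.3f" % unlabeled_rate)
```

With twenty labeling functions of modest accuracy, most points receive at least one vote for each class, which is why simply trusting any single labeling function does not work well here.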

The problem with the labels $L_s$ that we generated above is, of course, that they're noisy, and conflict on certain examples. The key technical idea in our data programming approach is that we can automatically model and denoise them!

In this step, we'll:

i. Explain the idea of expressing our noisy training set as a *generative model* $\pi_\theta(L, Y)$

ii. Solve this, as a matrix completion problem in TensorFlow, to learn the accuracies of our labeling functions

iii. Produce our "denoised" training set: the predictions of this generative model

We can see that our label matrix $L_s$ is considerably noisier than the actual ground truth labels $Y_s$; for example, if we take the majority vote of the labeling functions and compare to $Y_s$, we get the following pretty poor accuracy:

```
print("Accuracy: %0.3f" % (np.sum(0.5 * (np.sign(Ls.sum(1)) * Ys + 1)) / n,))
```

One of the key insights in the data programming approach is that instead of trying to directly use our noisy training set $L_s$ as supervision for our model, we will first *model it* as a **generative model**.

Here, we'll consider the simplest possible version of such a model, where we (correctly) assume that our LFs $\lambda_i$ are conditionally independent given $y$ and have a probability of generating a non-zero label that is independent of $y$. For simplicity we'll also assume balanced classes, so that $P(y)=\frac12$. We are then just learning the standard model: $$ P(\vec{\lambda}(x), y) = P(y)\prod_{i=1}^m P(\lambda_i(x)|y) $$

Note that we can relax these assumptions, ending up with more complex generative models (as in the paper). Here, however, we'll stick with the simple model defined above, and express this modeling task as a matrix completion / low-rank matrix approximation one:

The way that we learn the accuracies of the labeling functions is by learning from the overlaps between them. Concretely, given our conditional independence assumptions, we learn the accuracies of the labeling functions that best explain the empirical overlaps we see between them.

We start by building an empirical "overlaps matrix" $Z\in [-1,1]^{m\times m}$ where $Z_{i,j} = 2\hat{p}_{i,j} - 1$, with $\hat{p}_{i,j}$ being the empirical probability of labeling functions $i$ and $j$ agreeing.

```
Z = np.dot(Ls.T, Ls) / (np.dot(np.abs(Ls).T, np.abs(Ls)))
```

If labeling functions $\lambda_i$ and $\lambda_j$ are conditionally independent (as we have assumed here), then letting $p_{i,j}$ be the probability of them agreeing, and $p_i,p_j$ be their respective accuracies, the true probability of them agreeing on a given example should be

$$ \begin{align*} p_{i,j} &= p_ip_j + (1-p_i)(1-p_j) \end{align*} $$

However, if you take the product of the transposed label matrix with the label matrix itself (i.e. the Gram matrix of its columns), properly normalized for labeling propensity (i.e. how often each labeling function does not abstain) as above, each entry is the rate of agreements minus the rate of disagreements, i.e. $Z_{i,j} = p_{i,j} - (1-p_{i,j}) = 2p_{i,j}-1$. If we write this out with the above probabilities and simplify, we see that we can represent the above $Z$ matrix as the outer product of the below $q$ vector with itself:

$$ \begin{align*} Z_{i,j} &= 2p_{i,j} - 1 = (2p_i-1)(2p_j-1) = q_iq_j \end{align*} $$

Where we have defined the rescaled LF accuracies $q_i=2p_i-1$. So, our objective is now to find the LF accuracies that best match the data, in other words, the low rank approximation problem:

$$ \min_q ||Z-qq^T||_F^2 $$
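Before fitting anything, the identity $Z_{i,j} \approx q_iq_j$ can be sanity-checked numerically. The sketch below simulates a small set of conditionally independent labeling functions with *known* accuracies (the values in `p_true`, the coverage of 0.5, and all `_chk` names are arbitrary choices for this check) and compares the empirical overlaps matrix with the outer product of the rescaled accuracies:

```python
import numpy as np

rng = np.random.RandomState(1)
n_chk, m_chk, cov_chk = 50000, 5, 0.5
p_true = np.array([0.55, 0.6, 0.65, 0.7, 0.75])  # assumed LF accuracies

# Simulate votes: correct with probability p_true[j], abstain with probability 1 - cov_chk
Y_chk = 2 * rng.randint(2, size=n_chk) - 1
correct = rng.random_sample((n_chk, m_chk)) < p_true
fires = rng.random_sample((n_chk, m_chk)) < cov_chk
L_chk = np.where(correct, 1, -1) * Y_chk[:, None] * fires

# Empirical overlaps matrix: (agreements - disagreements) / number of co-labeled points
Z_chk = L_chk.T.dot(L_chk) / np.abs(L_chk).T.dot(np.abs(L_chk)).astype(float)

# Compare the off-diagonal entries with q q^T, where q = 2p - 1
q_true = 2 * p_true - 1
off_diag = ~np.eye(m_chk, dtype=bool)
max_gap = np.abs(Z_chk - np.outer(q_true, q_true))[off_diag].max()
print("Largest off-diagonal gap: %.3f" % max_gap)
```

The gap is pure sampling noise, so it should shrink as `n_chk` grows. The diagonal is excluded because every labeling function trivially agrees with itself.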

```
# Here we set up TF placeholder variables for Z and q
z = tf.placeholder(tf.float32, Z.shape)
q = tf.Variable(tf.random_normal([ Z.shape[0],1], mean=0.5, stddev=.15))
# y = qq^T
y = q * tf.transpose(q)
# Here we just zero-out the diagonals, because we don't care about learning
# them (they are always 1 since an LF will always agree with itself!)
diag = tf.zeros((Z.shape[0]))
mask = tf.ones((Z.shape))
mask = tf.matrix_set_diag(mask, diag)
y_aug = tf.multiply(y, mask)
z_aug = tf.multiply(z, mask)
# Our loss function: sum((Z - qq^T)^2)
loss = tf.reduce_sum((z_aug - y_aug) * (z_aug - y_aug))
train_step = tf.train.GradientDescentOptimizer(0.0005).minimize(loss)
```

Note that we actually don't want to include the diagonal of $Z$ since it is always all ones, so we do a small trick to mask out the diagonals.

Now, we just run gradient descent to learn the LF accuracies which minimize our loss function:

```
sess = tf.Session()
sess.run(tf.global_variables_initializer())
for step in range(5000):
    sess.run(train_step, feed_dict={z : Z})
q_final = sess.run(q)
est_accs = (q_final+1)/2
```
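There is nothing TensorFlow-specific about this objective. As a rough NumPy-only sketch of the same masked rank-one fit, with the gradient written out by hand (the function name `fit_rank_one` and its defaults are ours; for the masked loss, the gradient with respect to $q$ is $-2(R + R^T)q$, where $R$ is the masked residual $Z - qq^T$):

```python
import numpy as np

def fit_rank_one(Z, lr=0.0005, n_steps=5000, seed=0):
    """Minimize sum_{i != j} (Z_ij - q_i q_j)^2 by plain gradient descent."""
    rng = np.random.RandomState(seed)
    m = Z.shape[0]
    q = rng.normal(0.5, 0.15, size=m)   # same initialization scale as above
    mask = 1.0 - np.eye(m)              # zero out the diagonal
    for _ in range(n_steps):
        R = mask * (Z - np.outer(q, q))      # masked residual
        q = q + lr * 2.0 * (R + R.T).dot(q)  # step along the negative gradient
    return q

# Noise-free check: build Z exactly from known accuracies and try to recover them
p_known = np.linspace(0.45, 0.75, 20)
q_known = 2 * p_known - 1
q_hat = fit_rank_one(np.outer(q_known, q_known), lr=0.005)
p_hat = (q_hat + 1) / 2
print("Largest accuracy error: %.4f" % np.abs(p_hat - p_known).max())
```

Note that $q$ and $-q$ give the same $qq^T$, so the objective cannot tell "all LFs are better than chance" from "all LFs are worse than chance". Initializing around a positive mean, as both versions do, is what nudges the solution toward the former.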

**Despite not using any ground truth in this step**, we do pretty decently at recovering the accuracies of the labeling functions:

```
from pandas import DataFrame, Series
data = {
    'Error' : Series(abs(est_accs[:,0] - lf_accs)),
    'True Acc.' : Series(lf_accs),
    'Est. Acc.' : Series(est_accs[:,0])
}
DataFrame(data=data)
```

But how do we use this as a training set? Basically, we now want to plug the accuracies we just estimated into our simple generative model, and use it to produce the probability that each data point is true, $P(y=1|\lambda(x))$. We'll then use *these probabilities* as our training labels!

We'll now quickly go through the math to do this; you can also feel free to skip this section! First, we'll express our labeling functions' conditional probabilities in the following standard form:

$$ P(\lambda_i(x)|y) = \frac{1}{Z}\exp(\frac12w_i\lambda_i(x)y) $$

Here $\frac{1}{Z}$ is a normalizing constant (not the overlaps matrix from earlier). Thus, we have re-expressed our labeling functions' accuracies $p_i$ (where we calculate accuracy over non-zero labels) in terms of the *log odds accuracy* $w_i$.

Then, recalling the simple generative model we defined earlier, we just have:

$$ \begin{align*} P(y=1|\lambda(x)) &= \frac{P(y=1)P(\lambda(x)|y=1)}{\sum_{y'\in\{-1,1\}}P(y=y') P(\lambda(x)|y=y')}\\ &= \frac{\prod_{i=1}^mP(\lambda_i(x)|y=1)}{\sum_{y'\in\{-1,1\}} \prod_{i=1}^mP(\lambda_i(x)|y=y')}\\ &= \frac{\exp\left(\frac12w^T\lambda(x)\right)}{\exp\left(\frac12w^T\lambda(x)\right) + \exp\left(-\frac12w^T\lambda(x)\right)}\\ &= \frac{1}{1 + \exp(-w^T\lambda(x))}\\ &= \sigma(w^T\lambda(x)) \end{align*} $$

Where to convert our accuracies to log odds, we just do:

$$ w_i = \log\left(\frac{p_i}{1-p_i}\right) $$

```
lo = np.log( est_accs / (1.0 - est_accs))
Yp = 1 / (1 + np.exp(-np.ravel(Ls.dot(lo))))
```
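To make the derivation above concrete, here is a small worked check that the sigmoid of the weighted vote really is the same thing as applying Bayes' rule directly. It is only a sketch under the same assumptions as the text (balanced classes, conditionally independent LFs, abstains that carry no information about $y$); the function names and the example vote vector and accuracies are invented for illustration:

```python
import numpy as np

def posterior_from_bayes(lam, p):
    """P(y=1 | lam) by multiplying per-LF likelihoods (balanced classes)."""
    lam, p = np.asarray(lam), np.asarray(p, dtype=float)
    # Likelihood of each vote under y=+1 and under y=-1; abstains contribute equally
    like_pos = np.where(lam == 1, p, np.where(lam == -1, 1 - p, 1.0)).prod()
    like_neg = np.where(lam == -1, p, np.where(lam == 1, 1 - p, 1.0)).prod()
    return like_pos / (like_pos + like_neg)

def posterior_from_log_odds(lam, p):
    """The closed form derived above: sigmoid(w^T lam) with w = log(p / (1 - p))."""
    p = np.asarray(p, dtype=float)
    w = np.log(p / (1.0 - p))
    return 1.0 / (1.0 + np.exp(-np.dot(w, lam)))

lam_example = np.array([1, 1, -1, 0])          # two votes for, one against, one abstain
p_example = np.array([0.7, 0.6, 0.65, 0.55])   # hypothetical LF accuracies
bayes = posterior_from_bayes(lam_example, p_example)
closed_form = posterior_from_log_odds(lam_example, p_example)
print("Bayes: %.4f, closed form: %.4f" % (bayes, closed_form))
```

Both routes give $0.147 / (0.147 + 0.078) \approx 0.653$ for this example: the two "for" votes outweigh the single "against" vote, but only mildly, because none of the three LFs is very accurate.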

```
print("Accuracy: %0.3f" % (np.sum(0.5 * ((np.sign(2*Yp - 1) * Ys) + 1)) / n,))
```

Remember, though, that the point of denoising our training set was **so that we could use it to train a higher-performance discriminative model**. On to that now!

In this step, we'll:

i. Set up a *noise-aware* variant of our favorite discriminative model

ii. Train our model in TensorFlow

iii. See how close we come to the fully supervised version!

We'll start by splitting our data into a train and test set; note we'll use the ground-truth labels only in the test set for evaluation:

```
# Convert Y to correct format
Yc = 0.5 * (Ys + 1)
# Split into training and test set
N_split = int(0.8*n)
X_train = Xs[:N_split, :]
X_test = Xs[N_split:, :]
# Note we *DO NOT* use the training set labels
Y_train = Yp[:N_split].reshape((-1,1))
Y_test = Yc[N_split:].reshape((-1,1))
```

We're now right back to the standard old goal in machine learning: train a discriminative model that learns to *generalize* beyond the training set, and thus gets high performance on the test set.

Our only difference is that now we have a set of values $P(y=1|\lambda) \in [0,1]$ as our training labels, where these values express the varying degrees of confidence in our training labels, or alternatively, the amount of *noise* in our training set. Rather than naively treat these noisy training labels as ground truth, we want our discriminative model to be *noise-aware*, i.e. to learn more from high-confidence training labels. This is actually a quite simple tweak that's well supported 'right out of the box' in TensorFlow!

In this tutorial, we will just use a **linear model** over the $d$-dimensional feature vectors $x$, $h_{w,b}(x) = w^Tx + b$:

```
X = tf.placeholder(tf.float32, [None, d])
Y = tf.placeholder(tf.float32, [None, 1])
w = tf.Variable(tf.random_normal([d, 1], mean=0, stddev=(np.sqrt(6/d + 2)), name="weights"))
b = tf.Variable(tf.random_normal([1, 1], mean=0, stddev=(np.sqrt(6/d + 2)), name="bias"))
# Defining our predictive model h
h = tf.add(tf.matmul(X, w), b)
```

And, we'll use a **logistic loss**. In other words, we'll end up with our old friend logistic regression (disappointed that we're not at least using a single-layer neural net? Algebra will make you feel better!)

Now, our *noise-aware loss* is just the expected loss with respect to this noisy training set model.

If we do out the algebra,

$$ \begin{align*} l(w,b) &= \mathbb{E}_{(x,y)\sim \pi_\theta}\left[ l(h_{w,b}(x), y) \right]\\ &= \sum_{x\in T} P(y=1)l(h_{w,b}(x),1) + (1-P(y=1))l(h_{w,b}(x),-1)\\ &= \sum_{x\in T} P(y=1)\log(1 + \exp(-h_{w,b}(x))) + (1-P(y=1))\log(1 + \exp(h_{w,b}(x)))\\ &= \sum_{x\in T} -P(y=1)\log(\sigma(h_{w,b}(x))) - (1-P(y=1))\log(1-\sigma(h_{w,b}(x)))\\ &= H_{x\in T}\left(p, q\right) \end{align*} $$

we see that our noise-aware loss function is actually just the *cross-entropy* between our training set predictions $p(x) = P_{\pi}(y=1|\lambda(x))$, and our logistic function $q(x) = \sigma(h_{w,b}(x)) = \frac{1}{Z}\exp(w^Tx+b)$, with $Z = 1 + \exp(w^Tx+b)$.

In TensorFlow, this loss function (`sigmoid_cross_entropy_with_logits`) comes included right out of the box. **This means that we can use data programming for any model by just swapping it in for h in the code below!**

```
# Defining the noise-aware logistic loss function
# (TensorFlow 1.x requires labels and logits to be passed as keyword arguments)
loss_fn = tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(labels=Y, logits=h))
# Some setup for computing test set accuracy, setting learning rate, etc.
correct_OP = tf.equal(tf.greater(h, 0), tf.equal(Y, 1))
accuracy_OP = tf.reduce_mean(tf.cast(correct_OP, "float"))
learningRate = tf.train.exponential_decay(learning_rate=0.00001,
                                          global_step=1,
                                          decay_steps=X_train.shape[0],
                                          decay_rate=0.95,
                                          staircase=True)
training_OP = tf.train.GradientDescentOptimizer(learningRate).minimize(loss_fn)
```
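To see why feeding *probabilities* as labels into a sigmoid cross-entropy gives exactly the noise-aware loss, it may help to check the algebra numerically. The sketch below is plain NumPy, not TensorFlow code; the function names and the example logits and soft labels are ours. It compares the expected logistic loss under a soft label with the numerically stable cross-entropy form $\max(h,0) - hp + \log(1+e^{-|h|})$:

```python
import numpy as np

def logistic_loss(h, y):
    """Logistic loss for a hard label y in {-1, 1} and a real-valued score h."""
    return np.log1p(np.exp(-y * h))

def expected_logistic_loss(h, p):
    """Expected logistic loss when the label is +1 with probability p."""
    return p * logistic_loss(h, 1) + (1 - p) * logistic_loss(h, -1)

def sigmoid_cross_entropy(h, p):
    """Cross-entropy between soft label p and sigmoid(h), in a stable form."""
    return np.maximum(h, 0) - h * p + np.log1p(np.exp(-np.abs(h)))

h_vals = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])   # example model scores
p_vals = np.array([0.1, 0.4, 0.5, 0.8, 0.95])    # example soft labels P(y=1)

print(np.allclose(expected_logistic_loss(h_vals, p_vals),
                  sigmoid_cross_entropy(h_vals, p_vals)))
```

With a hard label ($p$ equal to 0 or 1) both expressions reduce to the ordinary logistic loss, so training on ground-truth labels is just a special case of the same objective.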

Now, finally, we'll train our noise aware model!

```
def train_model(X_t, Y_t, n_epochs=5000, display=False):
    sess = tf.Session()
    sess.run(tf.global_variables_initializer())

    # Train the model
    prev_loss = 0
    diff = 1
    for i in range(n_epochs):
        if i > 1 and diff < .0001:
            print("Change in loss = %g; done!" % diff)
            break
        else:
            step = sess.run(training_OP, feed_dict={X: X_t, Y: Y_t})
            # Report occasional stats
            if display and i % 500 == 0:
                loss = sess.run(loss_fn, feed_dict={X: X_t, Y: Y_t})
                diff = abs(loss - prev_loss)
                prev_loss = loss
                print("Step %0.4d \tLoss %0.2f" % (i, loss))
    print("Final accuracy on test set: %s" %str(sess.run(accuracy_OP, feed_dict={X: X_test, Y: Y_test})))

train_model(X_train, Y_train)
```

Ok, we've been able to achieve **much higher accuracy** on the test set with our newly-trained discriminative model, than we had before with our heuristic labeling functions. But how far are we from what we could have accomplished with supervised learning, if we had ground truth labels? Let's see...

```
Y_train_supervised = Yc[:N_split].reshape((-1,1))
train_model(X_train, Y_train_supervised)
```

Let's quickly recap what we did:

1. We created a dataset $X_s$ and generated noisy and conflicting labels for it ($L_s$) by writing *labeling functions*
2. We *modeled* this noisy training set generation process by learning a generative model, i.e. learning our labeling functions' accuracies by observing their overlaps (using a matrix completion objective)
3. We trained a *noise-aware* discriminative model using our denoised labels, and showed that the performance was close to that of a directly-supervised model!

Key takeaways:

- You can use data programming to weakly supervise machine learning models--i.e. to train them without large hand-labeled training sets--and still get strong end performance.
- You can prototype these models quickly and efficiently with TensorFlow.

For more, check out the NIPS 2016 data programming paper and our system using data programming for information extraction, Snorkel!