Data Programming in TensorFlow


By Dan Iter, Alex Ratner, and Chris Ré
Snorkel Blog

TensorFlow

In recent years, deep learning models have become some of the most popular choices in machine learning for a variety of problems, in large part because they greatly reduce (or eliminate) the need for manual feature engineering.

In turn, TensorFlow has quickly become one of the most popular frameworks for training such deep learning models. TensorFlow's symbolic execution engine makes it easy to define an arbitrary loss function--whether for a deep model or something more traditional--and then call your favorite optimizer to minimize it with gradient descent. In this way, the barrier to deep learning has never been lower!

Data Programming

One of the biggest impediments to actually using deep learning in practice, however, is the requirement of large hand-labeled training sets. Increasingly, one approach has been to use weaker forms of supervision, i.e. programmatic or heuristic generation of training set labels which are often noisy and give conflicting signals.

Whichever way you label your training set, however, there is some process that you follow. The core idea in data programming (NIPS 2016) is that by modeling this training set creation process, you can improve quality. Right now, we're working on using this to power a new information extraction framework, Snorkel, but the concept of data programming is much more general.

This tutorial

In this tutorial, we'll walk through a simple toy example with synthetic data, showing how you can use data programming with TensorFlow to train arbitrary models like neural networks with only weak supervision. We'll walk through the three high-level steps of data programming:

  1. Creating a noisy training set by writing labeling functions

  2. Modeling this training set to denoise it

  3. Training a noise-aware discriminative model

We note that for the most part, this is a tutorial on data programming, not deep learning. In fact, we won't use any neural networks or "deep" models in this tutorial--but everything here is easily extendable to such models within TensorFlow! As you'll see below, step 3 could just as easily use a neural network; we would simply apply a different loss function after the top layer.

Who this is for

The goal of this tutorial is to go through a simple but end-to-end example of data programming, along with enough math to understand the basics, e.g. what objectives are we optimizing, how do they tie together, etc. If you are comfortable setting up and training machine learning models in a framework like TensorFlow, this tutorial should set you up to try out data programming with your favorite models! Other resources:

  • For a more detailed treatment, especially on the theory side, see our NIPS 2016 paper

  • To see how we use this technique on real information extraction problems, check out Snorkel, in particular the intro tutorial

  • For a slightly higher-level overview, see this blog post

STEP 0: Set up


Here we'll load the necessary libraries and generate some synthetic data, which we'll store as an $n \times d$ matrix $X_s$, where each row represents a data point $x \in \{0,1\}^d$.

We'll also generate a vector of ground-truth labels $Y_s \in \{-1,1\}^n$; we'll henceforth consider all but a small set of these labels unseen, as the whole point of our approach is to make do without labeled training data!

In [ ]:


import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import tensorflow as tf

np.random.seed(123)

%matplotlib inline
%load_ext autoreload
%autoreload 2

# n is the number of data points, and d is the dimension of the feature vector
# that we represent each of them as
n  = 10000
d  = 100
Ys = 2 * np.random.randint(2, size=(n,)) - 1

# We think of the binary features as functions each having some (unknown)
# correlation with the target label, which we'll set randomly in [0.4,0.6]
feature_accs = 0.2 * np.random.random((d,)) + 0.4
Xs           = np.zeros((n, d))
for i in range(n):
    for j in range(d):
        if np.random.random() > feature_accs[j]:
            Xs[i,j] = 1 if Ys[i] == 1 else 0
        else:
            Xs[i,j] = 0 if Ys[i] == 1 else 1

STEP 1: Creating a noisy training set by writing labeling functions


In this tutorial, our goal is to train a classification model that, when given an unseen data point $x$, will predict the correct label $y$. Here, we'll consider the binary classification setting ($y\in\{-1,1\}$) for simplicity.

The most important part about our setup is that we'll assume we don't have access to any ground truth training labels; instead, we'll use noisy labeling functions to approximate these training labels.

In this step, we'll:

i. Explain the concept of labeling functions

ii. Make some synthetic toy labeling functions, and generate a noisy label matrix $L$

Recall that we already generated some synthetic data, which we stored as an $n \times d$ matrix $X_s$, where each row represents a data point $x \in \{0,1\}^d$.

(1.i) Labeling functions (LFs): A unifying framework for weak supervision

In most cases, we do not have access to more than a small number of the ground truth labels, and it is not feasible to obtain more. However, there are often many ways that we can provide weaker supervision--in other words, noisy and possibly conflicting approximations of subsets of $Y_s$. Data programming provides a simple, unifying framework for such strategies: namely, we express them as labeling functions, which simply take in a data point $x$ and either abstain or return a label.

In other words, instead of hand-labeling data to create a training set for our model, we write functions that look something like this:

In [2]:


def positive_heuristic(x):
    """Label points as true if they match some heuristic value"""
    return 1 if match(pattern, x) else 0

def negative_distant_supervision(x):
    """Label as false if they are in a certain set of examples"""
    return -1 if x in noisy_set else 0

def weak_classifier_1(x):
    """Use a weak or biased classifier of unknown accuracy / relevance.
    (Here weak_model_1 stands in for some pre-trained classifier.)"""
    return weak_model_1.predict(x)

def crowd_worker_1(x):
    """Represent a crowdsourcer as an LF."""
    return crowd_labels[0][x.id] if x.id in crowd_labels[0] else 0

(1.ii) Using synthetic (toy) LFs

We won't actually use the above LFs in this tutorial, but they serve to show some of the expressive range that labeling functions can capture. If you're interested in more on this sort of task--tutorials, results, etc.--check out Snorkel!
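In practice, applying a set of labeling functions to a dataset just means evaluating each one on every data point and collecting the outputs into an $n \times m$ label matrix. A minimal sketch of that mechanism (with lfs a hypothetical list of labeling functions and data an iterable of data points--neither is defined in this tutorial) might look like:

# Minimal sketch (not run in this tutorial): build an n x m label matrix by
# applying each labeling function to each data point. Each LF returns
# 1, -1, or 0 (abstain).
def apply_lfs(lfs, data):
    L = np.zeros((len(data), len(lfs)))
    for i, x in enumerate(data):
        for j, lf in enumerate(lfs):
            L[i, j] = lf(x)
    return L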

Writing and iterating on labeling functions is the core development task in a data programming-based ML pipeline; for this tutorial though, we'll just use synthetic labeling functions which have random accuracies between 45% and 75%, and coverage of 25%, generating a noisy label matrix $L_s$:

In [3]:


m           = 20
lf_accs     = 0.3 * np.random.random((m,)) + 0.45
LF_COVERAGE = 0.25
Ls          = np.zeros((n, m))
for i in range(n):
    for j in range(m):
        if np.random.random() < LF_COVERAGE:
            Ls[i,j] = Ys[i] if np.random.random() < lf_accs[j] else -Ys[i]

STEP 2: Modeling our noisy training set


The problem with the labels $L_s$ that we generated above is, of course, that they're noisy, and conflict on certain examples. The key technical idea in our data programming approach is that we can automatically model and denoise them!

In this step, we'll:

  1. Explain the idea of expressing our noisy training set as a generative model $\pi_\theta(L, Y)$

  2. Solve this as a matrix completion problem in TensorFlow, learning the accuracies of our labeling functions

  3. Produce our "denoised" training set: the predictions of this generative model

(2.i) Defining a generative model of our noisy labeling process

We can see that our label matrix $L_s$ is considerably noisier than the actual ground truth labels $Y_s$; for example, if we take the majority vote of the labeling functions and compare to $Y_s$, we get the following pretty poor accuracy:

In [4]:


print "Accuracy: %0.3f" % (np.sum(0.5 * (np.sign(Ls.sum(1)) * Ys + 1)) / n,)
Accuracy: 0.693

One of the key insights in the data programming approach is that instead of trying to directly use our noisy training set $L_s$ as supervision for our model, we will first model it with a generative model.

Here, we'll consider the simplest possible version of such a model, where we (correctly) assume that our LFs $\lambda_i$ are conditionally independent and have a probability of generating a non-zero label that is independent of $y$. For simplicity we'll also assume balanced classes. We are then just learning the standard model: $$ P(\vec{\lambda}(x), y) = \prod_{i=1}^m P(\lambda_i(x)|y) $$

Note that we can relax these assumptions, ending up with more complex generative models (as in the paper). Here, however, we'll stick with the simple model defined above, and express this modeling task as a matrix completion / low-rank matrix approximation one:

(2.ii) The labeling function model as a matrix completion problem

The way that we learn the accuracies of the labeling functions is by looking at the overlaps between them. Concretely, given our conditional independence assumptions, we learn the accuracies of the labeling functions that best explain the empirical overlaps we see between them.

We start by building an empirical "overlaps matrix" $Z\in [-1,1]^{m\times m}$ where $Z_{i,j} = 2\hat{p}_{i,j} - 1$, with $\hat{p}_{i,j}$ being the empirical probability of labeling functions $i$ and $j$ agreeing.

In [5]:


# Entry (i,j) is (# agreements - # disagreements) / (# overlaps) between LFs i and j
Z = np.dot(Ls.T, Ls) / (np.dot(np.abs(Ls).T, np.abs(Ls)))

If labeling functions $\lambda_i$ and $\lambda_j$ are conditionally independent (as we have assumed here), then letting $p_{i,j}$ be the probability of them agreeing, and $p_i,p_j$ be their respective accuracies, the true probability of them agreeing on a given example should be

$$ \begin{align*} p_{i,j} &= p_ip_j + (1-p_i)(1-p_j)\\ \end{align*} $$

Meanwhile, entry $(i,j)$ of $L^TL$ is the number of agreements between $\lambda_i$ and $\lambda_j$ minus the number of disagreements, so normalizing by the number of overlaps (which is exactly how we computed $Z$ above) gives an estimate of $p_{i,j} - (1-p_{i,j}) = 2\hat{p}_{i,j}-1$. Substituting the expression for $p_{i,j}$ above and simplifying, we see that we can represent the $Z$ matrix as the outer product of a vector $q$ with itself:

$$ \begin{align*} Z_{i,j} &= (2p_i-1)(2p_j-1) = q_iq_j \end{align*} $$

where we have defined the rescaled LF accuracies $q_i=2p_i-1$. So our objective is now to find the LF accuracies that best match the data--in other words, to solve the low-rank approximation problem:

$$ \min_q ||Z-qq^T||_2^2 $$
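Before solving this with TensorFlow, here's a quick optional sanity check. Since this is synthetic data, we know the true LF accuracies, so we can verify that the off-diagonal entries of the empirical $Z$ roughly match the outer product $qq^T$ built from the true rescaled accuracies, up to sampling noise (in a real application we of course couldn't do this):

# Optional sanity check: uses the *true* synthetic LF accuracies, which we
# would not have in a real application.
q_true = 2 * lf_accs - 1
Z_true = np.outer(q_true, q_true)
off_diag = ~np.eye(m, dtype=bool)
print("Mean abs. deviation on off-diagonal entries: %0.4f" % np.mean(np.abs(Z[off_diag] - Z_true[off_diag])))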
In [6]:


# Here we set up TF placeholder variables for Z and q
z = tf.placeholder(tf.float32, Z.shape)
q = tf.Variable(tf.random_normal([ Z.shape[0],1], mean=0.5, stddev=.15))

# y = qq^T
y = q * tf.transpose(q)

# Here we just zero-out the diagonals, because we don't care about learning 
# them (they are always 1 since an LF will always agree with itself!)
diag  = tf.zeros((Z.shape[0]))
mask  = tf.ones((Z.shape))
mask  = tf.matrix_set_diag(mask, diag)
y_aug = tf.multiply(y, mask)
z_aug = tf.multiply(z, mask)

# Our loss function: sum((Z - qq^T)^2)
loss = tf.reduce_sum((z_aug - y_aug) * (z_aug - y_aug))
train_step = tf.train.GradientDescentOptimizer(0.0005).minimize(loss)

Note that we actually don't want to include the diagonal of $Z$ since it is always all ones, so we do a small trick to mask out the diagonals.

Now, we just run gradient descent to learn the LF accuracies which minimize our loss function:

In [7]:


sess = tf.Session()
sess.run(tf.global_variables_initializer())
for step in range(5000):
    sess.run(train_step, feed_dict={z : Z})
q_final  = sess.run(q)
est_accs = (q_final+1)/2

We see that, despite the low coverage and accuracies of the labeling functions, and despite not using any ground truth in this step, we do pretty decently at recovering the accuracies of the labeling functions:

In [8]:


from pandas import DataFrame, Series
data = {
    'Error'     : Series(abs(est_accs[:,0] - lf_accs)),
    'True Acc.' : Series(lf_accs),
    'Est. Acc.' : Series(est_accs[:,0])
}
DataFrame(data=data)
Out[8]:
    Error     Est. Acc.  True Acc.
0   0.004139  0.576521   0.572382
1   0.001266  0.734905   0.733639
2   0.018826  0.720113   0.701287
3   0.003343  0.695316   0.691973
4   0.007981  0.495499   0.503481
5   0.008199  0.587892   0.596091
6   0.024090  0.561517   0.585607
7   0.018300  0.528726   0.547026
8   0.005928  0.562733   0.556805
9   0.014248  0.597528   0.611776
10  0.005118  0.708596   0.703478
11  0.024335  0.521921   0.546256
12  0.018733  0.661807   0.680540
13  0.028933  0.538604   0.509671
14  0.042563  0.617966   0.660529
15  0.016062  0.543253   0.527191
16  0.035134  0.530542   0.565676
17  0.023854  0.761859   0.738005
18  0.025054  0.585379   0.610434
19  0.017244  0.647492   0.664736
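As an optional extra, we can also summarize the table above with a single number, the mean absolute error of the estimated accuracies:

# Summarize the estimation error across all LFs with one number
print("Mean absolute error of estimated accuracies: %0.4f" % np.mean(np.abs(est_accs[:,0] - lf_accs)))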

(2.iii) Getting our denoised training set

But how do we use this as a training set? Basically, we now want to plug the accuracies we just estimated into our simple generative model, and use it to produce, for each data point, the probability that its label is positive, $P(y=1|\lambda(x))$. We'll then use these probabilities as our training labels!

We'll now quickly go through the math to do this; you can also feel free to skip this section! First, we'll express our labeling functions' conditional probabilities in the following standard form:

$$ P(\lambda_i(x)|y) = \frac{1}{Z}\exp(\frac12w_i\lambda_i(x)y) $$

Thus, we have re-expressed our labeling functions' accuracies $p_i$ (where we calculate accuracy over non-zero labels) in terms of the log odds accuracy $w$:

$$ p_i = P(\lambda_i(x)=1|y=1,\lambda_i(x)\neq0) = \frac{\exp\left(\frac{w_i}{2}\right)}{\exp\left(-\frac{w_i}{2}\right) + \exp\left(\frac{w_i}{2}\right)} $$

Then, recalling the simple generative model we defined earlier, we just have:

$$ \begin{align*} P(y=1|\lambda(x)) &= \frac{P(y=1)P(\lambda(x)|y=1)}{\sum_{y'\in\{-1,1\}}P(y=y') P(\lambda(x)|y=y')}\\ &= \frac{\prod_{i=1}^mP(\lambda_i(x)|y=1)}{\sum_{y'\in\{-1,1\}} \prod_{i=1}^mP(\lambda_i(x)|y=y')}\\ &= \frac{\exp\left(\frac12w^T\lambda(x)\right)}{\exp\left(\frac12w^T\lambda(x)\right) + \exp\left(-\frac12w^T\lambda(x)\right)}\\ &= \frac{1}{1 + \exp(-w^T\lambda(x))}\\ &= \sigma(w^T\lambda(x)) \end{align*} $$

where, to convert our accuracies to log odds, we just compute:

$$ w_i = \log\left(\frac{p_i}{1-p_i}\right) $$
In [9]:


lo = np.log( est_accs / (1.0 - est_accs))
Yp = 1 / (1 + np.exp(-np.ravel(Ls.dot(lo))))

We now have our training set labels $P(y=1|\lambda)$, and see that they're more accurate than a simple majority vote:

In [10]:


print "Accuracy: %0.3f" % (np.sum(0.5 * ((np.sign(2*Yp - 1) * Ys) + 1)) / n,)
Accuracy: 0.727

However, the point here was just to model our noisy training set so that we could use it to train a higher-performance discriminative model. On to that now!

STEP 3: Training a noise-aware discriminative model


In this step, we'll:

  1. Set up a noise-aware variant of our favorite discriminative model

  2. Train our model in TensorFlow

  3. See how close we come to the fully supervised version!

We'll start by splitting our data into a train and test set; note we'll use the ground-truth labels only in the test set for evaluation:

In [11]:


# Convert Ys from {-1,1} to {0,1} format
Yc = 0.5 * (Ys + 1)

# Split into training and test set
N_split = int(0.8*n)
X_train = Xs[:N_split, :]
X_test  = Xs[N_split:, :]

# Note we *DO NOT* use the ground-truth training labels--only the noisy estimates Yp!
Y_train = Yp[:N_split].reshape((-1,1))
Y_test  = Yc[N_split:].reshape((-1,1))

(3.i) Setting up a noise-aware discriminative model

We're now right back to the standard goal in machine learning: train a discriminative model that learns to generalize beyond the training set, and thus achieves high performance on the test set.

Our only difference is that we now have a set of values $P(y=1|\lambda) \in [0,1]$ as our training labels, where these values express varying degrees of confidence in our training labels--or, alternatively, the amount of noise in our training set. Rather than naively treating these noisy training labels as ground truth, we want our discriminative model to be noise-aware, i.e. to learn more from high-confidence training labels. This is actually quite a simple tweak that's well supported 'right out of the box' in TensorFlow!

In this tutorial, we will just use a linear model over the $d$-dimensional feature vectors $x$, $h_{w,b}(x) = w^Tx + b$:

In [12]:


X = tf.placeholder(tf.float32, [None, d])
Y = tf.placeholder(tf.float32, [None, 1])
w = tf.Variable(tf.random_normal([d, 1], mean=0, stddev=np.sqrt(6/d + 2)), name="weights")
b = tf.Variable(tf.random_normal([1, 1], mean=0, stddev=np.sqrt(6/d + 2)), name="bias")

# Defining our predictive model h
h = tf.add(tf.matmul(X, w), b)

And we'll use a logistic loss. In other words, we'll end up with our old friend, logistic regression (disappointed that we're not at least using a single-layer neural net? The algebra will make you feel better!)

Now, our noise-aware loss is just the expected loss with respect to this noisy training set model:

$$ l(w,b) = \mathbb{E}_{(x,y)\sim \pi_\theta}\left[ l(h_{w,b}(x), y) \right] $$

If we work out the algebra (using the logistic loss $l(h, y) = \log(1 + \exp(-yh))$),

$$ \begin{align*} l(w,b) &= \mathbb{E}_{(x,y)\sim \pi_\theta}\left[ l(h_{w,b}(x), y) \right]\\ &= \sum_{x\in T} P(y=1)l(h_{w,b}(x),1) + (1-P(y=1))l(h_{w,b}(x),-1)\\ &= \sum_{x\in T} P(y=1)\log(1 + \exp(-h_{w,b}(x))) + (1-P(y=1))\log(1 + \exp(h_{w,b}(x)))\\ &= \sum_{x\in T} -P(y=1)\log(\sigma(h_{w,b}(x))) - (1-P(y=1))\log(1 - \sigma(h_{w,b}(x)))\\ &= H_{x\in T}\left(p, q\right) \end{align*} $$

we see that our noise-aware loss function is actually just the cross-entropy between our training set predictions $p(x) = P_{\pi}(y=1|\lambda(x))$ and our logistic model $q(x) = \sigma(h_{w,b}(x)) = \frac{1}{1 + \exp(-(w^Tx+b))}$.

In TensorFlow, this loss function (sigmoid_cross_entropy_with_logits) comes included right out of the box. This means that we can use data programming with any model by just swapping that model in for h in the code below!

In [13]:


# Defining the noise-aware logistic loss function
loss_fn = tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(logits=h, labels=Y))

# Some setup for computing test set accuracy, setting learning rate, etc.
correct_OP   = tf.equal(tf.greater(h, 0), tf.equal(Y, 1))
accuracy_OP  = tf.reduce_mean(tf.cast(correct_OP, "float"))
learningRate = tf.train.exponential_decay(learning_rate=0.00001,
                                          global_step= 1,
                                          decay_steps=X_train.shape[0],
                                          decay_rate=0.95,
                                          staircase=True)

training_OP = tf.train.GradientDescentOptimizer(learningRate).minimize(loss_fn)
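As mentioned above, nothing here is specific to the linear model: to use, say, a small neural network instead, we would only need to swap in a different definition of h and reuse the same noise-aware loss. Here's a minimal, hypothetical sketch of what that swap might look like--the hidden-layer size and initialization are arbitrary choices for illustration, not part of this tutorial's experiments:

# Hypothetical sketch: a one-hidden-layer network in place of the linear model.
# Only the definition of h changes; the noise-aware loss, optimizer, and
# training loop stay exactly the same.
n_hidden = 50
W1 = tf.Variable(tf.random_normal([d, n_hidden], stddev=0.1))
b1 = tf.Variable(tf.zeros([n_hidden]))
W2 = tf.Variable(tf.random_normal([n_hidden, 1], stddev=0.1))
b2 = tf.Variable(tf.zeros([1]))

hidden = tf.nn.relu(tf.add(tf.matmul(X, W1), b1))
h_nn   = tf.add(tf.matmul(hidden, W2), b2)

# The same noise-aware logistic loss, applied to the network's output
loss_nn = tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(logits=h_nn, labels=Y))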

(3.ii) Training our model in TensorFlow

Now, finally, we'll train our noise-aware model!

In [14]:


def train_model(X_t, Y_t, n_epochs=5000, display=False):
    sess = tf.Session()
    sess.run(tf.global_variables_initializer())

    # Train the model
    prev_loss = 0
    diff      = 1
    for i in range(n_epochs):
        if i > 1 and diff < .0001:
            print("Change in loss = %g; done!" % diff)
            break
        else:    
            step = sess.run(training_OP, feed_dict={X: X_t, Y: Y_t})
        
        # Report occasional stats
        if display and i % 500 == 0:
            loss      = sess.run(loss_fn, feed_dict={X: X_t, Y: Y_t})
            diff      = abs(loss - prev_loss)
            prev_loss = loss
            print("Step %0.4d \tLoss %0.2f" % (i, loss))
    print("Final accuracy on test set: %s" %str(sess.run(accuracy_OP, feed_dict={X: X_test, Y: Y_test})))

train_model(X_train, Y_train)
Final accuracy on test set: 0.8545

(3.iii) Comparing to supervised learning

Ok--we've been able to achieve much higher accuracy on the test set with our newly-trained discriminative model than we had before with our heuristic labeling functions. But how far are we from what we could have accomplished with supervised learning, if we had ground-truth labels? Let's see...

In [16]:


Y_train_supervised = Yc[:N_split].reshape((-1,1))
train_model(X_train, Y_train_supervised)
Final accuracy on test set: 0.869

As you can see, the accuracy of our data programming model is remarkably close to that of fully supervised training with ground-truth labels--and we could close this gap further by iterating on our labeling functions.

Conclusion


Let's quickly recap what we did:

  1. We created a dataset $X_s$ and generated noisy and conflicting labels for it ($L_s$) by writing labeling functions

  2. We modeled this noisy training set generation process by learning a generative model, i.e. learning our labeling functions' accuracies by observing their overlaps (using a matrix completion objective)

  3. We trained a noise-aware discriminative model using our denoised labels, and showed that the performance was close to that of a directly-supervised model!

Key takeaways:

  • You can use data programming to weakly supervise machine learning models--i.e. to train them without large hand-labeled training sets--and still get strong end performance.
  • You can prototype these models quickly and efficiently with TensorFlow.

For more, check out the NIPS 2016 data programming paper and our system using data programming for information extraction, Snorkel!