\n",
"\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"

\n",
"

"
],
"text/plain": [
" Error Est. Acc. True Acc.\n",
"0 0.004139 0.576521 0.572382\n",
"1 0.001266 0.734905 0.733639\n",
"2 0.018826 0.720113 0.701287\n",
"3 0.003343 0.695316 0.691973\n",
"4 0.007981 0.495499 0.503481\n",
"5 0.008199 0.587892 0.596091\n",
"6 0.024090 0.561517 0.585607\n",
"7 0.018300 0.528726 0.547026\n",
"8 0.005928 0.562733 0.556805\n",
"9 0.014248 0.597528 0.611776\n",
"10 0.005118 0.708596 0.703478\n",
"11 0.024335 0.521921 0.546256\n",
"12 0.018733 0.661807 0.680540\n",
"13 0.028933 0.538604 0.509671\n",
"14 0.042563 0.617966 0.660529\n",
"15 0.016062 0.543253 0.527191\n",
"16 0.035134 0.530542 0.565676\n",
"17 0.023854 0.761859 0.738005\n",
"18 0.025054 0.585379 0.610434\n",
"19 0.017244 0.647492 0.664736"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from pandas import DataFrame, Series\n",
"data = {\n",
" 'Error' : Series(abs(est_accs[:,0] - lf_accs)),\n",
" 'True Acc.' : Series(lf_accs),\n",
" 'Est. Acc.' : Series(est_accs[:,0])\n",
"}\n",
"DataFrame(data=data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### (2.iii) Getting our denoised training set\n",
"\n",
"But how do we use this as a training set? Basically, we now want to plug the accuracies we just estimated into our simple generative model, and use it to produce the probability that each data point is true, $P(y=1|\\lambda(x))$. We'll then use _these probabilities_ as our training labels!\n",
"\n",
"We'll now quickly go through the math to do this; you can also feel free to skip this section! First, we'll express our labeling functions' conditional probabilities in the following standard form:\n",
"\n",
"$$\n",
"P(\\lambda_i(x)|y) = \\frac{1}{Z}\\exp(\\frac12w_i\\lambda_i(x)y)\n",
"$$\n",
"\n",
"Thus, we have re-expressed our labeling functions' accuracies $p_i$ (where we calculate accuracy over non-zero labels) in terms of the _log odds accuracy_ $w$:\n",
"\n",
"$$\n",
"p_i = P(\\lambda_i(x)=1|y=1,\\lambda_i(x)\\neq0)\n",
"= \\frac{\\exp\\left(\\frac{w_i}{2}\\right)}{\\exp\\left(-\\frac{w_i}{2}\\right) + \\exp\\left(\\frac{w_i}{2}\\right)}\n",
"$$\n",
"\n",
"Then, recalling the simple generative model we defined earlier, we just have:\n",
"\n",
"$$\n",
"\\begin{align*}\n",
"P(y=1|\\lambda(x)) &= \\frac{P(y=1)P(\\lambda(x)|y=1)}{\\sum_{y'\\in\\{-1,1\\}}P(y=y') P(\\lambda(x)|y=y')}\\\\\n",
"&= \\frac{\\prod_{i=1}^mP(\\lambda_i(x)|y=1)}{\\sum_{y'\\in\\{-1,1\\}} \\prod_{i=1}^mP(\\lambda_i(x)|y=y')}\\\\\n",
"&= \\frac{\\exp\\left(\\frac12w^T\\lambda(x)\\right)}{\\exp\\left(\\frac12w^T\\lambda(x)\\right) + \\exp\\left(-\\frac12w^T\\lambda(x)\\right)}\\\\\n",
"&= \\frac{1}{1 + \\exp(-w^T\\lambda(x))}\\\\\n",
"&= \\sigma(w^T\\lambda(x))\n",
"\\end{align*}\n",
"$$\n",
"\n",
"Where to convert our accuracies to log odds, we just do:\n",
"\n",
"$$\n",
"w_i = \\log\\left(\\frac{p_i}{1-p_i}\\right)\n",
"$$"
]
},
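Before applying this to the full label matrix, here is a quick standalone sanity check of the posterior formula (a minimal sketch in plain NumPy, separate from the notebook's variables): for a single labeling function with non-abstain accuracy $p$ that votes $\lambda = +1$, the posterior should recover exactly $p$, since $\sigma(\log(p/(1-p))) = p$.

```python
import numpy as np

# Single labeling function with (non-abstain) accuracy p = 0.8
p = 0.8

# Log odds of the accuracy: w = log(p / (1 - p))
w = np.log(p / (1 - p))

# Posterior P(y=1 | lambda) = sigma(w * lambda) for a positive vote lambda = +1
posterior = 1.0 / (1.0 + np.exp(-w * 1.0))

print(posterior)  # recovers p (up to floating point)
```

With several labeling functions, the votes' log odds simply add up in the exponent, which is exactly what `Ls.dot(lo)` computes in the next cell.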
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"lo = np.log( est_accs / (1.0 - est_accs))\n",
"Yp = 1 / (1 + np.exp(-np.ravel(Ls.dot(lo))))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now have our training set labels $P(y=1|\\lambda)$, and see that they're more accurate than a simple majority vote:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy: 0.727\n"
]
}
],
"source": [
"print \"Accuracy: %0.3f\" % (np.sum(0.5 * ((np.sign(2*Yp - 1) * Ys) + 1)) / n,)"
]
},
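For comparison, the majority-vote baseline mentioned above can be computed directly. A minimal sketch (assuming, as in the notebook, that `Ls` is the $n \times m$ matrix of labels in $\{-1, 0, 1\}$ and `Ys` is the vector of ground-truth labels in $\{-1, 1\}$; the helper name is ours):

```python
import numpy as np

def majority_vote_accuracy(L, y):
    """Accuracy of the unweighted majority vote over labeling function outputs.

    L is an (n, m) array with entries in {-1, 0, 1} (0 = abstain);
    y is the (n,) vector of ground-truth labels in {-1, 1}.
    Ties (a row summing to 0) count as incorrect here for simplicity.
    """
    votes = np.sign(np.ravel(np.asarray(L.sum(axis=1))))
    return float(np.mean(votes == y))
```

In the notebook, `majority_vote_accuracy(Ls, Ys)` gives the baseline to compare against the denoised-label accuracy printed above.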
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However, the point here was just to model our noisy training set **_so that we could use it to train a higher-performance discriminative model_**. On to that now!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## _STEP 3:_ Training a _noise-aware_ discriminative model\n",
"---\n",
"\n",
"In this step, we'll:\n",
"\n",
"1. Set up a _noise-aware_ variant of our favorite discriminative model\n",
"\n",
"2. Train our model in TensorFlow\n",
"\n",
"3. See how close we come to the fully supervised version!\n",
"\n",
"We'll start by splitting our data into a train and test set; note we'll use the ground-truth labels only in the test set for evaluation:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Convert Y to correct format\n",
"Yc = 0.5 * (Ys + 1)\n",
"\n",
"# Split into training and test set\n",
"N_split = int(0.8*n)\n",
"X_train = Xs[:N_split, :]\n",
"X_test = Xs[N_split:, :]\n",
"\n",
"# Note we *DO NOT* use the training set labels\n",
"Y_train = Yp[:N_split].reshape((-1,1))\n",
"Y_test = Yc[N_split:].reshape((-1,1))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### (3.i) Setting up a noise-aware discriminative model\n",
"\n",
"We're now right back to the standard old goal in machine learning- train a discriminative model that learns to _generalize_ beyond the training set, and thus get high performance on the test set.\n",
"\n",
"Our only difference is that now we have a set of values $P(y=1|\\lambda) \\in [0,1]$ as our training labels, where these values express the varying degrees of confidence in our training labels, or alternatively, the amount of _noise_ in our training set. Rather than naively treat these noisy training labels as ground truth, we want our discriminative model to be _noise-aware_, i.e. to learn more from high-confidence training labels. This is actually a quite simple tweak that's well supported 'right out of the box' in TensorFlow!\n",
"\n",
"In this tutorial, we will just use a **linear model** over the $d$-dimensional feature vectors $x$, $h_{w,b}(x) = w^Tx + b$:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"X = tf.placeholder(tf.float32, [None, d])\n",
"Y = tf.placeholder(tf.float32, [None, 1])\n",
"w = tf.Variable(tf.random_normal([d, 1], mean=0, stddev=(np.sqrt(6/d + 2)), name=\"weights\"))\n",
"b = tf.Variable(tf.random_normal([1, 1], mean=0, stddev=(np.sqrt(6/d + 2)), name=\"bias\"))\n",
"\n",
"# Defining our predictive model h\n",
"h = tf.add(tf.matmul(X, w), b)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And, we'll use a **logistic loss**. In other words, we'll end up with our old friend logistic regression (dissapointed that we're not at least using a single-layer neural net? Algebra will make you feel better!)\n",
"\n",
"Now, our _noise-aware loss_ is just the expected loss with respect to this noisy training set model\n",
"\n",
"$$\n",
"l(w,b) = \\mathbb{E}_{(x,y)\\sim \\pi_\\theta}\\left[ l(h_{w,b}(x), y) \\right]\n",
"$$\n",
"\n",
"If we do out the algebra,\n",
"\n",
"$$\n",
"\\begin{align*}\n",
"l(w,b) &= \\mathbb{E}_{(x,y)\\sim \\pi_\\theta}\\left[ l(h_{w,b}(x), y) \\right]\\\\\n",
"&= \\sum_{x\\in T} P(y=1)l(h_{w,b}(x),1) + (1-P(y=1))l(h_{w,b}(x),-1)\\\\\n",
"&= \\sum_{x\\in T} P(y=1)\\log(1 + \\exp(h_{w,b}(x))) + (1-P(y=1))\\log(1 + \\exp(-h_{w,b}(x)))\\\\\n",
"&= \\sum_{x\\in T} -P(y=1)\\log(\\sigma(h_{w,b}(x))) - (1-P(y=1))\\log(\\sigma(h_{w,b}(x)))\\\\\n",
"&= H_{x\\in T}\\left(p, q\\right)\n",
"\\end{align*}\n",
"$$\n",
"\n",
"we see that our noise-aware loss function is actually just the _cross-entropy_ between our training set predictions $p(x) = P_{\\pi}(y=1|\\lambda(x))$, and our logistic function $q(x) = \\sigma(h_{w,b}(x)) = \\frac{1}{Z}\\exp(w^Tx+b)$.\n",
"\n",
"In TensorFlow, this loss function (`sigmoid_cross_entropy_with_logits`) comes included right out of the box. **This means that we can use data programming for _any_ model by just swapping it in for h in the code below!**"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Defining the noise-aware logistic loss function\n",
"loss_fn = tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(h, Y))\n",
"\n",
"# Some setup for computing test set accuracy, setting learning rate, etc.\n",
"correct_OP = tf.equal(tf.greater(h, 0), tf.equal(Y, 1))\n",
"accuracy_OP = tf.reduce_mean(tf.cast(correct_OP, \"float\"))\n",
"learningRate = tf.train.exponential_decay(learning_rate=0.00001,\n",
" global_step= 1,\n",
" decay_steps=X_train.shape[0],\n",
" decay_rate=0.95,\n",
" staircase=True)\n",
"\n",
"training_OP = tf.train.GradientDescentOptimizer(learningRate).minimize(loss_fn)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### (3.ii) Training our model in TensorFlow\n",
"\n",
"Now, finally, we'll train our noise aware model!"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def train_model(X_t, Y_t, n_epochs=5000, display=False):\n",
" sess = tf.Session()\n",
" sess.run(tf.global_variables_initializer())\n",
"\n",
" # Train the model\n",
" prev_loss = 0\n",
" diff = 1\n",
" for i in range(n_epochs):\n",
" if i > 1 and diff < .0001:\n",
" print(\"Change in loss = %g; done!\" % diff)\n",
" break\n",
" else: \n",
" step = sess.run(training_OP, feed_dict={X: X_t, Y: Y_t})\n",
" \n",
" # Report occasional stats\n",
" if display and i % 500 == 0:\n",
" loss = sess.run(loss_fn, feed_dict={X: X_t, Y: Y_t})\n",
" diff = abs(loss - prev_loss)\n",
" prev_loss = loss\n",
" print(\"Step %0.4d \\tLoss %0.2f\" % (i, loss))\n",
" print(\"Final accuracy on test set: %s\" %str(sess.run(accuracy_OP, feed_dict={X: X_test, Y: Y_test})))"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Final accuracy on test set: 0.8545\n"
]
}
],
"source": [
"train_model(X_train, Y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### (3.iii) Comparing to supervised learning\n",
"\n",
"Ok, we've been able to achieve **much higher accuracy** on the test set with our newly-trained discriminative model, than we had before with our heuristic labeling functions. But how far are we from what we could have accomplished with supervised learning, if we had ground truth labels? Let's see..."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Final accuracy on test set: 0.869\n"
]
}
],
"source": [
"Y_train_supervised = Yc[:N_split].reshape((-1,1))\n",
"train_model(X_train, Y_train_supervised)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see, the final accuracy for the fully supervised training with ground truth labels is remarkably close to the data programming model accuracy; and we could close this gap further by iterating further on our labeling functions."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusion\n",
"---\n",
"\n",
"Let's quickly recap what we did: \n",
"1. We created a dataset $X_s$ and generated noisy and conflicting labels for it ($L_s$) by writing _labeling functions_\n",
"\n",
"2. We _modeled_ this noisy training set generation process by learning a generative model, i.e. learning our labeling functions' accuracies by observing their overlaps (using a matrix completion objective)\n",
"\n",
"3. We trained a _noise-aware_ discriminative model using our denoised labels, and showed that the performance was close to that of a directly-supervised model!\n",
"\n",
"Key takeaways:\n",
"* You can use data programming to weakly supervise machine learning models--i.e. to train them without large hand-labeled training sets--and still get strong end performance.\n",
"* You can prototype these models quickly and efficiently with TensorFlow. \n",
"\n",
"For more, check out the [NIPS 2016 data programming paper](https://papers.nips.cc/paper/6523-data-programming-creating-large-training-sets-quickly) and our system using data programming for information extraction, [Snorkel](http://snorkel.stanford.edu)!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.10"
}
},
"nbformat": 4,
"nbformat_minor": 0
}