Fig. 1: A schematic of a multi-task network with a single shared hidden layer, from Rich Caruana’s 1997 thesis [1].

Multi-task learning (MTL), the technique of training a model to predict multiple tasks using a shared representation, has recently experienced an upsurge of interest and usage [3] as a powerful way to amortize labeling costs and learn more robust representations across multiple tasks. The idea of multi-task learning was first proposed in the 1990s [1], and has recently reemerged in new forms using modern deep learning techniques, concepts, and tools [3]. While many recent works have focused on novel MTL model architectures (i.e., various models for learning which parameters to share between tasks and to what extent), there are equally important questions about how to build the next generation of MTL systems—the data management systems and high-level frameworks that support MTL. We focus on three of these in this post:
  1. A unifying paradigm for information sharing between tasks
  2. Bringing more signal into the fold with multi-task weak supervision
  3. Building data management systems for the massively multi-task regime

A Unifying Paradigm for Information Sharing

One emerging topic in MTL is how it relates to other approaches for sharing information between tasks. For example, a related and quite popular technique is transfer learning (TL), which broadly describes the approach of training a model on some task or dataset A, and then using this model (potentially with some form of fine-tuning) on a separate task or dataset B. While often treated as a separate concept, to a first approximation TL can in fact be viewed as a sub-case of MTL where two tasks are trained in a serial fashion (and potentially over different datasets). Similarly, MTL can be viewed as a natural generalization of TL to multiple tasks, trained jointly and potentially with more complex information sharing patterns.

Fig. 2: A basic modern MTL model, with hard parameter sharing: two tasks, A and B, have certain shared layers in common, while having their own final task head layers.
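To make this picture concrete, here is a minimal PyTorch sketch of the hard parameter sharing setup in Fig. 2; the class name, layer sizes, and two-task interface are illustrative choices rather than any particular published architecture.

```python
import torch
import torch.nn as nn

class HardSharingMTL(nn.Module):
    """Two tasks, A and B, share a trunk of hidden layers; each task has
    its own small head mapping the shared representation to task-specific
    outputs. All sizes here are illustrative placeholders."""

    def __init__(self, input_dim=100, hidden_dim=50, num_classes_a=2, num_classes_b=2):
        super().__init__()
        # Shared layers: receive gradients from both tasks' losses.
        self.shared = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )
        # Task-specific heads: each is updated only by its own task's loss.
        self.head_a = nn.Linear(hidden_dim, num_classes_a)
        self.head_b = nn.Linear(hidden_dim, num_classes_b)

    def forward(self, x, task):
        z = self.shared(x)
        return self.head_a(z) if task == "A" else self.head_b(z)

# Example usage: a batch of 32 inputs routed through task A's head.
model = HardSharingMTL()
logits_a = model(torch.randn(32, 100), task="A")
```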

To make this connection more concrete, one potential formalism is to model the training procedure of a multi-layered neural network with a 2D training schedule matrix: one axis indexes the groups of parameters in the network (e.g., the shared layers and each task head), the other indexes training time steps, and each entry specifies how intensely that parameter group is trained at that step (e.g., a learning rate multiplier, where zero means the group is frozen).

With these two simple axes, we can capture a wide variety of potential information sharing techniques (a code sketch of these schedules appears after Fig. 3):
  1. MTL (“round robin”): One classic MTL approach is to alternate between batches of task A and task B, freezing the task head that does not correspond to the current batch.
  2. TL (“fine-tuning”): A classic TL approach is to first train completely on task A, then freeze those layers and only train the task head for task B.
  3. TL (“gradual unfreezing”): Howard and Ruder (2018) [7] saw improvements from gradually unfreezing the shared parameters while fine-tuning on task B.
Note that for simplicity here, we omit a specification of which dataset is being fed in at each time step; however, this can be easily incorporated as well.

Fig. 3: Example training schedule matrices depicting three different multi-task or transfer learning techniques for a hard parameter sharing neural network.
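As a concrete rendering of Fig. 3, here is one way these three schedules might be written down in code, assuming (as described above) that rows correspond to parameter groups (the shared layers and the two task heads), columns correspond to training steps, and each entry is a learning-rate multiplier with 0 meaning frozen; the eight-step horizon and the specific ramp values are arbitrary.

```python
import numpy as np

# Rows: parameter groups (shared layers, task head A, task head B).
# Columns: training steps (8 steps here, purely illustrative).
# Entries: learning-rate multipliers; 0.0 means the group is frozen.
GROUPS = ["shared", "head_A", "head_B"]

# 1. MTL ("round robin"): alternate task-A and task-B batches,
#    freezing the head that does not match the current batch.
round_robin = np.array([
    [1, 1, 1, 1, 1, 1, 1, 1],   # shared: always trained
    [1, 0, 1, 0, 1, 0, 1, 0],   # head A: only on task-A batches
    [0, 1, 0, 1, 0, 1, 0, 1],   # head B: only on task-B batches
], dtype=float)

# 2. TL ("fine-tuning"): train fully on task A, then freeze everything
#    but task B's head.
fine_tuning = np.array([
    [1, 1, 1, 1, 0, 0, 0, 0],   # shared
    [1, 1, 1, 1, 0, 0, 0, 0],   # head A
    [0, 0, 0, 0, 1, 1, 1, 1],   # head B
], dtype=float)

# 3. TL ("gradual unfreezing"): while fine-tuning on task B, ramp the
#    shared layers back up from 0 instead of keeping them frozen.
gradual_unfreezing = np.array([
    [1, 1, 1, 1, 0.00, 0.25, 0.50, 1.00],   # shared
    [1, 1, 1, 1, 0.00, 0.00, 0.00, 0.00],   # head A
    [0, 0, 0, 0, 1.00, 1.00, 1.00, 1.00],   # head B
], dtype=float)
```

Read column by column, fine-tuning is simply the special case where the shared row drops to zero once training switches to task B, while gradual unfreezing ramps it back up.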

Additional techniques that can be captured by this simple abstraction include learning rate decay (gradual reduction in intensity from left to right), learning rate warmup (gradually increasing from 0 to maximum LR in the first few columns on the left), proportional sampling of batches [8] (rotating between tasks proportional to the number of batches they contribute rather than uniformly), curriculum learning (training on the tasks serially in increasing order of difficulty), and many others. While these simple 2D training matrices do not capture all the possible ways to frame an MTL/TL training procedure (e.g., task-specific loss scaling, “soft” parameter sharing, etc.), we find this paradigm useful for highlighting some of the differences (and many of the similarities) between these approaches, and are currently exploring its implementation in our prototype MTL system, Snorkel MeTaL [4,5].
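To sketch how such a matrix might actually drive training, the loop below reads one schedule column per step and sets the per-group learning rates accordingly, reusing the hypothetical HardSharingMTL model and round_robin matrix from the sketches above; the base learning rate and the random batches are placeholders, and a multiplier of zero simply approximates freezing that group for the step.

```python
import torch

# Reuses the hypothetical HardSharingMTL model and the round_robin
# schedule matrix from the earlier sketches.
model = HardSharingMTL()
optimizer = torch.optim.SGD(
    [
        {"params": model.shared.parameters(), "lr": 0.0},  # row 0: shared
        {"params": model.head_a.parameters(), "lr": 0.0},  # row 1: head A
        {"params": model.head_b.parameters(), "lr": 0.0},  # row 2: head B
    ],
    lr=0.0,
)
base_lr = 0.1
loss_fn = torch.nn.CrossEntropyLoss()

for step in range(round_robin.shape[1]):
    # Read this step's column and set each parameter group's learning
    # rate accordingly; a multiplier of 0.0 effectively freezes the group.
    for group, multiplier in zip(optimizer.param_groups, round_robin[:, step]):
        group["lr"] = base_lr * float(multiplier)

    # Round-robin over tasks, with fake data standing in for real batches.
    task = "A" if step % 2 == 0 else "B"
    x = torch.randn(32, 100)
    y = torch.randint(0, 2, (32,))

    optimizer.zero_grad()
    loss = loss_fn(model(x, task), y)
    loss.backward()
    optimizer.step()
```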

Bringing More Signal Into the Fold: Multi-Task Weak Supervision

Current MTL approaches generally take advantage of existing hand-labeled training sets. However, labeling training datasets has become one of the most significant bottlenecks in real-world ML development pipelines; this is exacerbated in MTL settings where multiple such datasets are required.

One increasingly popular approach to tackling this training data annotation bottleneck is to use weaker, higher-level supervision—for example, labels generated by noisy sources and/or specified programmatically [2,11]. So far, systems for managing weak supervision resources—such as our system Snorkel [2] (snorkel.stanford.edu)—have mostly focused on modeling the accuracies and correlations of noisy label sources for a single task. But the increasing prevalence of MTL scenarios invites the question: what happens when our noisy, possibly correlated label sources now label multiple related tasks? Can we benefit by modeling the supervision for these tasks jointly?

Fig. 4: A diagram of our new multi-task supervision approach (featured at AAAI ‘19), where weak supervision sources s1, s2, s3 output noisy multi-task label vectors (left), which we model and combine using a new matrix completion approach (middle), and finally use to train a multi-task model [5].

We tackle these questions in a new multi-task-aware version of Snorkel, Snorkel MeTaL [4,5], which can support multi-task weak supervision sources that provide noisy labels for one or more related tasks. We handle this with a new modeling approach, which we recently presented at AAAI ‘19 [5]. One example we consider is the setting of label sources with different granularities. For example, suppose we are aiming to train a fine-grained named entity recognition (NER) model to tag mentions of specific types of people and locations, and we have some noisy labels that are fine-grained, e.g., labeling “Lawyer” vs. “Doctor” or “Bank” vs. “Hospital”, and some that are coarse-grained, e.g., labeling “Person” vs. “Location”. By representing these sources as labeling different hierarchically-related tasks, we can jointly model their accuracies, and reweight and combine their multi-task labels to create much cleaner, intelligently aggregated multi-task training data that improves the end MTL model performance.
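As a toy illustration of this setting (and emphatically not the matrix completion approach of [5]), the snippet below shows sources labeling at two granularities and a simple hierarchy-aware vote that keeps a fine-grained label only when it agrees with the coarse-grained majority; all class names, sources, and the aggregation rule are invented for illustration.

```python
# Toy illustration of multi-granularity weak supervision (this is NOT
# the matrix completion model of [5]; names and rules are made up).
COARSE_OF = {"Lawyer": "Person", "Doctor": "Person",
             "Bank": "Location", "Hospital": "Location"}   # task hierarchy

# Noisy labels from three sources over four example mentions;
# 0 means the source abstained on that mention.
fine_source = ["Lawyer", "Bank", 0, "Hospital"]             # fine-grained source
coarse_source_1 = ["Person", "Person", "Location", "Location"]
coarse_source_2 = ["Person", "Person", 0, "Location"]

def combine(fine, coarse_sources):
    """Hierarchy-aware vote: keep a fine-grained label only when it is
    consistent with the majority vote of the coarse-grained sources."""
    combined = []
    for i, fine_label in enumerate(fine):
        votes = [src[i] for src in coarse_sources if src[i] != 0]
        coarse_majority = max(set(votes), key=votes.count) if votes else 0
        if fine_label != 0 and COARSE_OF[fine_label] == coarse_majority:
            combined.append(fine_label)       # keep the consistent fine label
        else:
            combined.append(coarse_majority)  # fall back to the coarse label
    return combined

print(combine(fine_source, [coarse_source_1, coarse_source_2]))
# -> ['Lawyer', 'Person', 'Location', 'Hospital']
```

In Snorkel MeTaL, of course, the sources’ unknown accuracies are estimated and used to reweight and combine the labels, rather than applying a fixed rule like this one.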

Development on Snorkel MeTaL is ongoing, as are collaborations with application partners to try this approach of weakly-supervised MTL on a range of problems involving electronic health records, parole reports, HTML data, and more. If this sounds interesting to you, please drop us a line—we’re excited to talk!

Building Data Management Systems for the Massively Multi-Task Regime

We believe that the most exciting aspects of building data management systems for MTL will revolve around handling what we refer to as the massively multi-task regime, where tens to hundreds of weakly-supervised (and thus highly dynamic) tasks interact in complex and varied ways. While most MTL work to date has considered tackling a handful of tasks, defined by static hand-labeled training sets, the world is quickly advancing to a state where organizations—whether large companies [9], academic labs, or online communities—maintain tens to hundreds of weakly-supervised, rapidly changing, and interdependent modeling tasks. For example, even a standard text information extraction application today (Fig. 5) contains multiple models that have complex data dependencies (e.g. the output of one model is used both as weak supervision and as input features for a subsequent model) and share representations (e.g. using transfer or multi-task learning). Moreover, because these tasks are weakly supervised, developers can add, remove, or change tasks (i.e. training sets) in hours or days, rather than months or years, potentially necessitating retraining of the entire model.

Fig. 5: A growing problem for machine learning workflows in the massively multi-task regime is the increasing number of interacting tasks and models, which introduces complex dependencies and data management issues. For example, in a relation extraction application, the outputs of one model often serve as the supervision, input, and/or features of another (green arrows), and models often share common representations (orange nodes), such as word embeddings or pretrained networks used for fine-tuning.

Handling this new massively multi-task regime requires building data management systems that address a range of questions: for example, how to track and manage the complex dependencies between tasks, models, and training sets; when and what to retrain as weakly-supervised tasks are added, removed, or changed; and how to store, version, and share the representations and supervision signal that tasks have in common.

In a recent paper (presented at CIDR ’19 [6]), we outlined some initial thoughts in response to the above questions, envisioning a massively multi-task setting where MTL models effectively function as a central repository for training data that is weakly labeled by different developers, and then combined in a central “mother” multi-task model (Fig. 6).

Fig. 6: An envisioned massively multi-task workflow where developers effectively pool and share information in a central multi-task model, which can then be deployed via smaller distilled [10] single-task models (CIDR ‘19) [6].

Regardless of the exact form factor, it is clear that there is a great deal of exciting progress ahead for MTL techniques—not just new model architectures, but also increasing unification with transfer learning approaches, new weakly-supervised approaches, and new software development and systems paradigms. We’ll continue to post our thoughts and code at snorkel.stanford.edu and the Snorkel MeTaL repo—feedback is always welcome!

  1. R. Caruana. Multitask learning. Machine Learning, 28(1), 41-75. 1997. (http://reports-archive.adm.cs.cmu.edu/anon/1997/CMU-CS-97-203.pdf)
  2. A. Ratner, S. Bach, H. Ehrenberg, J. Fries, S. Wu, C. Ré. Snorkel: Rapid Training Data Creation with Weak Supervision. VLDB 2018. (https://arxiv.org/abs/1711.10160)
  3. S. Ruder. An overview of multi-task learning in deep neural networks. arXiv preprint, 2017. (https://arxiv.org/abs/1706.05098)
  4. A. Ratner, B. Hancock, J. Dunnmon, R. Goldman, and C. Ré. Snorkel MeTaL: Weak supervision for multi-task learning. DEEM 2018. (https://dl.acm.org/citation.cfm?id=3209898)
  5. A. Ratner, B. Hancock, J. Dunnmon, F. Sala, S. Pandey, and C. Ré. Training complex models with multi-task weak supervision. AAAI 2019. (https://arxiv.org/abs/1810.02840)
  6. A. Ratner, B. Hancock, and C. Ré. The Role of Massively Multi-Task and Weak Supervision in Software 2.0. CIDR 2019. (http://cidrdb.org/cidr2019/papers/p58-ratner-cidr19.pdf)
  7. J. Howard and S. Ruder. Universal Language Model Fine-tuning for Text Classification (ULMFiT). ACL 2018. (https://arxiv.org/pdf/1801.06146.pdf)
  8. V. Sanh, T. Wolf, and S. Ruder. A Hierarchical Multi-task Approach for Learning Embeddings from Semantic Tasks. AAAI 2019. (https://arxiv.org/pdf/1811.06031.pdf)
  9. S. Bach, D. Rodriguez, Y. Liu, C. Luo, H. Shao, C. Xia, ..., and R. Kuchhal. Snorkel DryBell: A Case Study in Deploying Weak Supervision at Industrial Scale. arXiv preprint, 2018. (https://arxiv.org/abs/1812.00417)
  10. G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint, 2015. (https://arxiv.org/abs/1503.02531)
  11. A. Ratner, S. Bach, P. Varma, J. Fries, S. Wu, C. Ré. Weak Supervision: The New Programming Paradigm for Machine Learning. (https://hazyresearch.github.io/snorkel/blog/ws_blog_post.html)