Building models that understand and generate natural language well is one of the grand goals of machine learning (ML) research and has a direct impact on building smart systems for everyday applications. Improving the quality of language models is a key target for researchers to make progress toward such a goal.
The most common paradigms for building and training language models use either autoregressive decoder-only architectures (e.g., PaLM or GPT-3), where the model is trained to predict the next word for a given prefix phrase, or span corruption-based encoder-decoder architectures (e.g., T5, ST-MoE), where the training objective is to recover the subset of words masked out of the input. On the one hand, T5-like models perform well on supervised fine-tuning tasks, but struggle with few-shot in-context learning. On the other hand, autoregressive language models are great for open-ended generation (e.g., dialog generation with LaMDA) and prompt-based learning (e.g., in-context learning with PaLM), but may perform suboptimally on fine-tuning tasks. Thus, there remains an opportunity to create an effective unified framework for pre-training models.
In “Unifying Language Learning Paradigms”, we present a novel language pre-training paradigm called Unified Language Learner (UL2) that improves the performance of language models universally across datasets and setups. UL2 frames different objective functions for training language models as denoising tasks, where the model has to recover missing sub-sequences of a given input. During pre-training it uses a novel mixture-of-denoisers that samples from a varied set of such objectives, each with different configurations. We demonstrate that models trained using the UL2 framework perform well in a variety of language domains, including prompt-based few-shot learning and models fine-tuned for downstream tasks. Additionally, we show that UL2 excels in generation, language understanding, retrieval, long-text understanding and question answering tasks. Finally, we are excited to publicly release the checkpoints for our best performing UL2 20 billion parameter model.
Background: Language Modeling Objectives and Architectures
Common objective functions for training language models can mostly be framed as learning data transformations that map inputs to targets. The model is conditioned on different forms of input to predict target tokens. To this end, different objectives utilize different properties of the inputs.
The standard causal language modeling objective (CausalLM) is trained to predict full sequence lengths and so only recognizes tokens in the target output. The prefix language modeling objective (PrefixLM) modifies this process by randomly sampling a contiguous span of k tokens from the given tokenized text to form the input of the model, referred to as the “prefix”. The span corruption objective masks contiguous spans from the inputs and trains the model to predict these masked spans.
In the table below, we list the common objectives on which state-of-the-art language models are trained, along with different characteristics of the input, i.e., how it is presented to the model. Moreover, we characterize the example efficiency of each objective in terms of the ability of the model to exploit supervision signals from a single input, e.g., how many of the input tokens contribute to the calculation of the loss.
Objective Function | Inputs (Bi-directional) | Targets (Causal) | Input Properties | Example Efficiency |
CausalLM | none | text | N/A | full seq_len |
PrefixLM | text (up to position k) | text (after position k) | contiguous | seq_len - k |
Span corruption | masked text | masked_tokens | non-contiguous, may be bi-directional | typically lower than others |
Common objectives used in today’s language models. Throughout, “text” indicates tokenized text. |
UL2 leverages the strengths of each of these objective functions through a framework that generalizes over each of them, which makes it possible to reason about and unify common pre-training objectives. Based on this framework, the main task for training a language model is to learn the transformation of a sequence of input tokens to a sequence of target tokens. All the objective functions introduced above can then be reduced to different ways of generating input and target tokens. For instance, the PrefixLM objective can be viewed as a transformation that moves a segment of k contiguous tokens from the inputs to the targets. Meanwhile, the span corruption objective is a data transformation that corrupts spans (subsequences of tokens in the input), replacing them with mask tokens that are shifted to the targets.
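The view of objectives as input-to-target transformations can be made concrete with a small sketch. The Python below is illustrative only and is not the UL2 implementation: the function names, the `<extra_id_i>` sentinel names, and the convention of passing spans as sorted, non-overlapping `(start, end)` index pairs are all assumptions made for this example. PrefixLM follows the table above (text up to position k as input, text after position k as target).

```python
from typing import List, Tuple

Tokens = List[str]


def causal_lm(tokens: Tokens) -> Tuple[Tokens, Tokens]:
    """CausalLM: no bidirectional input; every token is a target."""
    return [], list(tokens)


def prefix_lm(tokens: Tokens, k: int) -> Tuple[Tokens, Tokens]:
    """PrefixLM: tokens up to position k are the input, the rest are targets."""
    return list(tokens[:k]), list(tokens[k:])


def span_corruption(tokens: Tokens, spans: List[Tuple[int, int]]) -> Tuple[Tokens, Tokens]:
    """Span corruption: each [start, end) span is replaced by a sentinel in the
    input; the target holds each sentinel followed by the tokens it replaced.
    Spans are assumed to be sorted and non-overlapping (sketch assumption)."""
    inputs: Tokens = []
    targets: Tokens = []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"  # hypothetical sentinel name
        inputs.extend(tokens[cursor:start])  # keep the uncorrupted text
        inputs.append(sentinel)              # mark where the span was removed
        targets.append(sentinel)
        targets.extend(tokens[start:end])    # the removed span moves to the target
        cursor = end
    inputs.extend(tokens[cursor:])
    return inputs, targets
```

For the sentence "the quick brown fox jumps over the lazy dog", `prefix_lm(tokens, 4)` yields the first four words as input and the remaining five as targets, so five tokens contribute to the loss, while `causal_lm(tokens)` puts all nine tokens in the target.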
It is worth noting that one can decouple the model architecture from the objective function with which it is trained. Thus, it is possible to train different architectures, such as the common single-stack decoder-only and two-stack encoder-decoder models, with any of these objectives.
Mixture of Denoisers
The UL2 framework can be used to train a model on a mixture of pre-training objectives and supply it with capabilities and inductive bias benefits from different pre-training tasks. Training on the mixture helps the model leverage the strengths of different tasks and mitigates the weaknesses of others. For instance, the mixture-of-denoisers objective can strongly improve the prompt-based learning capability of the model compared with a span corruption-only T5 model.
UL2 is trained using a mixture of three denoising tasks: (1) R-denoising (or regular span corruption), which emulates the standard T5 span corruption objective; (2) X-denoising (or extreme span corruption); and (3) S-denoising (or sequential PrefixLM). During pre-training, we sample from the available denoising tasks based on user-specified ratios (i.e., different combinations of the R, X, and S-denoisers) and prepare the input and target appropriately. Then, a paradigm token is appended to the input (one of [R], [X], or [S]) indicating the denoising task at hand.
An overview of the denoising objectives used in UL2’s mixture-of-denoisers. |
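The sampling procedure described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the UL2 training code: the `make_example` function, the `<mask>` token, and the span lengths in `SPAN_LENGTH` are hypothetical, and "extreme" corruption is modelled here simply as a longer masked span. The sketch masks a single span per example and assumes the sequence has at least two tokens.

```python
import random
from typing import Dict, List, Tuple

Tokens = List[str]

# Hypothetical span lengths for illustration only; not the values used for UL2.
SPAN_LENGTH = {"R": 2, "X": 6}
MASK = "<mask>"  # hypothetical mask token


def make_example(tokens: Tokens, ratios: Dict[str, float],
                 rng: random.Random) -> Tuple[Tokens, Tokens]:
    """Sample a denoiser by its ratio, build (input, target), tag the input."""
    names = list(ratios)
    task = rng.choices(names, weights=[ratios[n] for n in names], k=1)[0]

    if task == "S":
        # Sequential PrefixLM: split the sequence at a random position k.
        k = rng.randint(1, len(tokens) - 1)
        inputs, targets = tokens[:k], tokens[k:]
    else:
        # R or X: mask one contiguous span; X masks a longer one in this sketch.
        length = min(SPAN_LENGTH[task], len(tokens) - 1)
        start = rng.randint(0, len(tokens) - length)
        inputs = tokens[:start] + [MASK] + tokens[start + length:]
        targets = [MASK] + tokens[start:start + length]

    # The paradigm token tells the model which denoising task is at hand.
    return inputs + [f"[{task}]"], targets
```

Calling `make_example(tokens, {"R": 0.5, "X": 0.25, "S": 0.25}, random.Random(0))` returns one training pair whose input ends in `[R]`, `[X]`, or `[S]`; the ratios here are placeholders for whatever mixture a user specifies.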
Improving Trade-Offs Across Learning Paradigms
Many commonly used language learning paradigms typically excel at one type of task or application, such as fine-tuning performance or prompt-based in-context learning. In the plot below, we show baseline objective functions on different tasks compared to UL2: CausalLM (referred to as GPT-like), PrefixLM, Span Corrupt (also referred to as T5 in the plot), and a baseline objective function proposed by UniLM. We use these objectives for training decoder-only architectures (green) and encoder-decoder architectures (blue) and evaluate different combinations of objective functions and architectures on two main sets of tasks:
- Fine-tuning, by measuring performance on SuperGLUE (y-axis of the plot below)
- In-context learning, by measuring performance of the model on a suite of 1-shot GEM tasks (e.g., XSUM, SGD or Schema guided dialog and TOTTO) (x-axis of the plot below).
For most of the existing language learning paradigms, there is a trade-off between the quality of the model on these two sets of tasks. We show that UL2 bridges this trade-off across in-context learning and fine-tuning.
UL2 for Few-Shot Prompting and Chain-of-Thought Reasoning
We scale up UL2 and train a 20 billion parameter encoder-decoder model on the public C4 corpus and demonstrate some impressive capabilities of the UL2 20B model.
UL2 is a powerful in-context learner that excels at both few-shot and chain-of-thought (CoT) prompting. In the table below, we compare UL2 with other state-of-the-art models (e.g., T5 XXL and PaLM) for few-shot prompting on the XSUM summarization dataset. Our results show that UL2 20B outperforms the PaLM and T5 models that are in the same ballpark of compute cost.
Model | ROUGE-1 | ROUGE-2 | ROUGE-L |
LaMDA 137B | – | 5.4 | – |
PaLM 62B | – | 11.2 | – |
PaLM 540B | – | 12.2 | – |
PaLM 8B | – | 4.5 | – |
T5 XXL 11B | 0.6 | 0.1 | 0.6 |
T5 XXL 11B + LM | 13.3 | 2.3 | 10.7 |
UL2 20B | 25.5 | 8.6 | 19.8 |
Comparison of UL2 with T5 XXL, PaLM and LaMDA 137B on 1-shot summarization (XSUM) in terms of ROUGE-1/2/L (higher is better), which captures quality by comparing the generated summaries with the gold summaries as reference. |
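For readers unfamiliar with the metric, ROUGE-N measures n-gram overlap between a generated summary and a reference. The sketch below is a simplified illustration, not the scorer used for the table: it assumes lowercased whitespace tokenization with no stemming, and it returns precision, recall and F1 as fractions between 0 and 1. The function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate: str, reference: str, n: int) -> Dict[str, float]:
    """Simplified ROUGE-N: clipped n-gram overlap between two strings."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}
```

For example, comparing "the cat sat on the mat" against the reference "the cat lay on the mat" shares five of six unigrams and three of five bigrams.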
Most CoT prompting results have been obtained using much larger language models, such as GPT-3 175B, PaLM 540B, or LaMDA 137B. We show that reasoning via CoT prompting can be achieved with UL2 20B, which is both publicly available and several times smaller than prior models that leverage chain-of-thought prompting. This opens an avenue for researchers to conduct research on CoT prompting and reasoning at an accessible scale. In the table below, we show that for UL2, CoT prompting outperforms standard prompting on math word problems with a range of difficulties (GSM8K, SVAMP, ASDiv, AQuA, and MAWPS). We also show that self-consistency further improves performance.
Chain-of-thought (CoT) prompting and self-consistency (SC) results on five arithmetic reasoning benchmarks. |
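Self-consistency is commonly described as sampling several chain-of-thought completions for the same question and taking a majority vote over their final answers. The sketch below shows only the voting step, under the assumption that final answers have already been extracted from each sampled completion (with `None` for completions where no answer could be parsed). The function name is hypothetical and this is not the evaluation code behind the reported results.

```python
from collections import Counter
from typing import List, Optional


def self_consistent_answer(final_answers: List[Optional[str]]) -> Optional[str]:
    """Return the most frequent final answer across sampled reasoning paths."""
    # Ignore unparsed completions and normalize surrounding whitespace.
    votes = Counter(a.strip() for a in final_answers if a is not None and a.strip())
    if not votes:
        return None
    # most_common(1) yields the answer with the highest vote count.
    return votes.most_common(1)[0][0]
```

With five sampled paths ending in "18", "18", "20", an unparsed completion, and " 18 ", the vote selects "18".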
Conclusion and Future Directions
UL2 demonstrates superior performance on a plethora of fine-tuning and few-shot tasks. We publicly release checkpoints of our best performing UL2 model with 20 billion parameters, which we hope will inspire faster progress in developing better language models in the machine learning community as a whole.
Acknowledgements
It was an honor and privilege to work on this with Vinh Q. Tran, Xavier Garcia, Jason Wei, Xuezhi Wang, Hyung Won Chung, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Denny Zhou, Neil Houlsby and Donald Metzler. We further acknowledge Alexey Gritsenko, Andrew M. Dai, Jacob Devlin, Jai Gupta, William Fedus, Orhan Firat, Sebastian Gerhmann, Nan Du, Dave Uthus, Siamak Shakeri, Slav Petrov and Quoc Le for support and discussions. We thank the Jax and T5X team for building such wonderful infrastructure that made this research possible.