"Show Your Work: Scratchpads for Intermediate Computation with Language Models", Anonymous et al 2021 {Google} (LaMDA)
Abstract: Large pre-trained language models perform remarkably well on tasks that can be done "in one pass", such as generating realistic text or synthesizing computer programs. However, they struggle with tasks that require unbounded multi-step computation, such as adding integers or executing programs. Surprisingly, we find that these same models are able to perform complex multi-step computations --- even in the few-shot regime --- when asked to perform the operation "step by step", showing the results of intermediate computations. In particular, we train transformers to perform multi-step computations by asking them to emit intermediate computation steps into a "scratchpad". On a series of increasingly complex tasks ranging from long addition to the execution of arbitrary programs, we show that scratchpads dramatically improve the ability of language models to perform multi-step computations.
Sounds great! This is similar to adding an "inner voice", as Starspawn0 had described before (or, failing that, adding recursion, though an inner voice is easier since you don't have to change the architecture).
That should make the language model a lot "smarter".
Addendum: This sounds great, and shows that language models are a lot more powerful when you add a scratchpad (or inner-voice):
• We introduce (Section 2) the notion of a “scratchpad” for Transformers, in order to make them better at performing complex discrete computations without modifying the underlying architecture.
• We show (Section 3) that scratchpads help Transformers learn to perform long addition in the fine-tuning regime, and in particular that they improve out-of-distribution generalization to larger problem instances.
• We also find (Section 4) that scratchpads help Transformers perform a somewhat higher level task: polynomial evaluation. This is true in both the few-shot and fine-tuning regimes.
• Finally, we move to a much more general context and show (Section 5) that training Transformers to emit full program traces line by line annotated with local variables dramatically improves their ability to predict the result of executing a given computer program on a particular input. This application in some sense subsumes the others.
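To make the long-addition case concrete, here is a rough Python sketch of what a scratchpad training target could look like. The function name `addition_scratchpad` and the exact wording of each step are my own assumptions for illustration, not the paper's actual format; the only idea taken from the paper is that the intermediate steps (digit sums and carries) are written out between scratch tags before the final answer.

```python
def addition_scratchpad(a: int, b: int) -> str:
    """Build a digit-by-digit trace for a + b (non-negative ints).

    Hypothetical format: one line per column, right to left,
    wrapped in <scratch> tags, followed by the final answer.
    """
    xs, ys = str(a), str(b)
    width = max(len(xs), len(ys))
    xs, ys = xs.zfill(width), ys.zfill(width)  # pad the shorter number with zeros

    lines = ["<scratch>"]
    carry = 0
    digits = []  # result digits, least significant first
    for i in range(width - 1, -1, -1):
        total = int(xs[i]) + int(ys[i]) + carry
        digit, new_carry = total % 10, total // 10
        lines.append(
            f"{xs[i]} + {ys[i]} + carry {carry} = {total} "
            f"-> write {digit}, carry {new_carry}"
        )
        digits.append(str(digit))
        carry = new_carry
    if carry:
        digits.append(str(carry))  # leftover carry becomes the leading digit

    lines.append("</scratch>")
    lines.append("".join(reversed(digits)))  # final answer after the scratchpad
    return "\n".join(lines)


print(addition_scratchpad(29, 57))
```

The point is that the model only ever has to do one small step per line, instead of producing the whole sum in a single pass.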
Also, with a scratchpad / inner-voice, I wouldn't be surprised if language models do a much better job at theorem-proving.
Addendum 2: As they say in section 7, the method is limited by the context window size, and then they discuss possibly coaxing the model into using a scratchpad without direct supervision. I'd say a context window of size about 100,000 to 1 million tokens would be enough to achieve an AGI-like system, provided it used a scratchpad / inner-voice. Perhaps the scratchpad / inner-voice could even be used to keep track of important information, which might improve upon what attention already provides -- e.g. having it write something like this:
... {window of text} ... <scratch> Important! </scratch>...
to signal that the preceding text should be remembered.
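As a toy illustration of that idea, here is a small Python sketch that pulls out the text preceding each marker so it could be kept around later. Everything here is hypothetical (the marker format, the `flagged_passages` name, and the fixed character window); it is only meant to show the bookkeeping, not how a real model would do it.

```python
import re
from typing import List

# Hypothetical marker the model would emit after text worth remembering.
MARKER = re.compile(r"<scratch>\s*Important!\s*</scratch>")


def flagged_passages(text: str, window: int = 200) -> List[str]:
    """Return up to `window` characters of text preceding each marker."""
    passages = []
    for match in MARKER.finditer(text):
        start = max(0, match.start() - window)
        passages.append(text[start:match.start()].strip())
    return passages


sample = "filler. The key is 42. <scratch> Important! </scratch> more text"
print(flagged_passages(sample, window=15))
```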
If they can get a model with a 1 million token window to use a scratchpad / inner-voice effectively, generalizing to new contexts after each use, then I'd say that's pretty much an AGI. If something like a trillion tokens are used to train it, there's no telling just how intelligent it might be.