Speculation on the future of language models with long-term memory
Given that people are now starting to give language models the ability to "ponder", as in this recent work (using a scratchpad / inner-voice), and are seeing success, perhaps the next major obstacle on the path towards AGI systems is the need for long-term memory. Currently, language models are limited to a context window of a few thousand tokens, which is too short to hold "memories" of anything from an appreciably long time in the past. There have been proposals for building language models with much longer context windows; but unless that means hundreds of thousands to millions of tokens, it probably won't be enough for AGI.
One solution, perhaps, is to build in a separate "memory module". Ideally, one wouldn't have to fiddle much with existing language model architectures, so that all the training already invested in GPT-3-scale models can be reused. Furthermore, at least for modeling working memory, current context window lengths seem adequate; so it's probably not a good idea to replace them with some more general type of "memory".
I could see machine learning engineers keeping language models mostly as they are and making only minimal changes to greatly expand their memory, without needing to extend the context window much, or at all. One path they might try is something like this: split the context window into an initial segment of, say, 200 vectors, and let the rest hold the text stream. Those 200 vectors would represent the section of memory currently under consideration.

Initially, the vectors might represent a lossy-compressed version of all the tokens that have ever passed through the model, in chronological order (e.g. the first vector represents a compressed version of the first 1,000 tokens the system ever saw; the second represents tokens 501 through 1,500; the third, tokens 1,001 through 2,000; and so on). Rather than compressing at the token level, it would probably work better to use some kind of average over the embeddings of those tokens -- something the model could learn to use with less additional training, since it should be easier for it to pick out that a memory block is relevant using features rather than raw tokens. When the system sees the vector in each of those first 200 slots, it gets some vague idea of what happened in the corresponding window of time. When it needs greater precision about a memory, it might write
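A minimal numpy sketch of that compression step, to make the scheme concrete. Mean-pooling here is just a stand-in for "some kind of average over embeddings" (a learned compressor could replace it); the window and stride numbers come from the 1-1,000 / 501-1,500 / 1,001-2,000 example above, and all names are hypothetical:

```python
import numpy as np

def build_memory_vectors(token_embeddings, window=1000, stride=500):
    """Lossy-compress a token stream into memory vectors.

    Each memory vector is the mean of the embeddings in one window of
    `window` tokens; consecutive windows overlap by `window - stride`
    tokens, mirroring the 1-1,000, 501-1,500, 1,001-2,000, ... scheme.
    """
    n = len(token_embeddings)
    vectors = []
    for start in range(0, max(n - window + 1, 1), stride):
        chunk = token_embeddings[start:start + window]
        vectors.append(chunk.mean(axis=0))
    return np.stack(vectors)

# Toy stream: 5,000 tokens with 64-dim embeddings.
embs = np.random.randn(5000, 64)
memory = build_memory_vectors(embs)
print(memory.shape)  # one memory vector per 500-token stride
```

In a real system the 200 slots would fill up over the model's lifetime, so the compression ratio per vector would have to grow with the length of the history.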
<scratch> Zoom in on the vector 11.</scratch>
That would then cause the "memory manager" to replace the entire set of 200 memory vectors with a finer-grained compression of tokens 5,001 through 6,000 -- the span that vector 11 covered under the 500-token stride described above. At this point, the model might have pinpointed a relevant memory to help it solve some problem it was asked about.
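A sketch of what that "memory manager" step might look like, under the same assumptions as before (mean-pooled embeddings, 1,000-token windows with a 500-token stride, 200 slots; the `<scratch>` parsing syntax and all function names are hypothetical):

```python
import re
import numpy as np

def parse_zoom(scratch_text):
    """Extract the target slot index from a scratchpad command like
    '<scratch> Zoom in on the vector 11.</scratch>'."""
    m = re.search(r"Zoom in on the vector (\d+)", scratch_text)
    return int(m.group(1)) if m else None

def zoom(token_embeddings, slot, n_slots=200, window=1000, stride=500):
    """Re-tile one coarse memory slot across all 200 slots.

    Slot i (1-indexed) covers tokens starting at (i-1)*stride, so
    slot 11 covers tokens 5,001-6,000 as in the example above. That
    span is split into n_slots equal sub-windows, each mean-pooled,
    giving the model a 5-tokens-per-vector view of that period.
    """
    start = (slot - 1) * stride
    span = token_embeddings[start:start + window]
    sub = len(span) // n_slots  # 1,000 tokens / 200 slots = 5 tokens each
    fine = [span[i * sub:(i + 1) * sub].mean(axis=0) for i in range(n_slots)]
    return np.stack(fine)

embs = np.random.randn(10000, 64)
slot = parse_zoom("<scratch> Zoom in on the vector 11.</scratch>")
fine_memory = zoom(embs, slot)
print(slot, fine_memory.shape)
```

The same operation could recurse (zoom in again on one of the 200 fine slots) or run in reverse to zoom back out to the full chronological view.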
Fine-tuning might be used periodically to update the model's skills (arithmetic, theorem-proving, physics reasoning, etc.), and also to train it to use the scratchpad / inner-voice: to zero in on past memories, to think through problems in greater depth, and to plan ahead (exploring possibilities exhaustively via backtracking). Fine-tuning would act as a kind of procedural memory update, operating at various levels.
Thus, perhaps, one doesn't need to wait for breakthroughs in extending the context window, or for fancy new Transformer models (or even post-Transformer models). As with adding a scratchpad, maybe a few minor tweaks are all that's needed. Just imagine what these language models would be capable of if all the stars line up and what I have described happens...
And remember my friend, future events such as these will affect you in the future