Deep Learning’s Diminishing Returns
DEEP LEARNING IS NOW being used to translate between languages, predict how proteins fold, analyze medical scans, and play games as complex as Go, to name just a few applications of a technique that is now becoming pervasive. Success in those and other realms has brought this machine-learning technique from obscurity in the early 2000s to dominance today.
Although deep learning's rise to fame is relatively recent, its origins are not. In 1958, back when mainframe computers filled rooms and ran on vacuum tubes, knowledge of the interconnections between neurons in the brain inspired Frank Rosenblatt at Cornell to design the first artificial neural network, which he presciently described as a "pattern-recognizing device." But Rosenblatt's ambitions outpaced the capabilities of his era—and he knew it. Even his inaugural paper was forced to acknowledge the voracious appetite of neural networks for computational power, bemoaning that "as the number of connections in the network increases...the burden on a conventional digital computer soon becomes excessive."
gwern
My comment: this is a rehash of that MIT arXiv paper which was circulating a while ago. The paper in question uses a very dumb methodology: instead of doing actual scaling-law research (where you directly measure how much compute it takes to improve a specific model's performance on some error metric), they just dump a bunch of random arXiv papers (all with different researchers, model architectures, goals, etc.) into a blender and try to deduce some sort of trend between compute and error rates. Unsurprisingly, because people are always publishing very disparate papers examining many different topics or aspects of something (many quite bad), this approach implies you need ~∞ compute to do much better. They do not recover known scaling laws and, at least in the version I read, completely ignore the entire scaling literature. Garbage. (That Marcus is cheering on Twitter tells you everything you need to know.)
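To make the contrast concrete, here is a minimal sketch of what "direct" scaling-law measurement looks like: train one model family at several compute budgets, record the held-out error, and fit a power law in log-log space. All numbers are made up for illustration and are not from the paper or the article.

```python
# Minimal sketch of direct scaling-law measurement (illustrative numbers only):
# train the same model family at several compute budgets, record the held-out
# error, and fit error ≈ a * compute**slope in log-log space.
import numpy as np

compute = np.array([1e18, 1e19, 1e20, 1e21])   # training FLOPs (hypothetical)
error   = np.array([0.30, 0.21, 0.15, 0.105])  # held-out error rate (hypothetical)

# Least-squares fit: log(error) = intercept + slope * log(compute).
slope, intercept = np.polyfit(np.log(compute), np.log(error), 1)
print(f"error ≈ {np.exp(intercept):.3g} * compute^{slope:.3f}")

# Under this fit, how much more compute does halving the error take?
print(f"halving error needs ~{0.5 ** (1 / slope):.0f}x more compute")
```

The point is that the exponent comes from controlled measurements on a single architecture and task, not from regressing across unrelated published results.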
starspawn0
Seems like I've been reading this worry for years now, but the limits haven't yet been reached. And systems are doing pretty well so far: speech recognition is pretty good, for example; so are machine translation and image recognition. How much more improvement do we really need on these tasks?
Regarding systems built on "expert knowledge" that use less compute: that comparison hides the amount of effort it took for humans to discover the rules those models use. It's not fair to leave that out! Human effort plus compute cycles may add up to more than for a system trained from scratch with much less human effort, even though the latter learns inefficiently.
Quote:
Our analysis of this phenomenon also allowed us to compare what's actually happened with theoretical expectations. Theory tells us that computing needs to scale with at least the fourth power of the improvement in performance. In practice, the actual requirements have scaled with at least the ninth power.
This ninth power means that to halve the error rate, you can expect to need more than 500 times the computational resources.
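The arithmetic behind that "more than 500 times" figure is just exponentiation (a quick check of the quoted numbers, not code from the article): halving the error rate is a 2x improvement in performance, so if compute scales with the k-th power of the improvement, the cost is 2^k.

```python
# Quick check of the quoted exponents: a 2x performance improvement
# (halving the error rate) costs 2**k times the compute if compute
# scales with the k-th power of the improvement.
for k in (4, 9):
    print(f"exponent {k}: halving error needs ~{2 ** k}x the compute")
# exponent 4: ~16x   (the theoretical lower bound cited)
# exponent 9: ~512x  (the observed scaling -> "more than 500 times")
```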
It's heavily dependent on the type of data being used, though, and probably also on the choice of loss function. I seem to recall some people from OpenAI giving a talk where they said that the scaling curves for image processing changed as you changed the resolution of the images. Also, if the data has less noise, learning is usually quicker. Finally, there's the possibility of new datasets arriving with the coming boom in brain-computer interfaces (BCIs) -- these may produce even better scaling curves still.
I would need to look at the article again, but I don't recall any mention of transfer learning. Meta-learning is mentioned, but not transfer learning (I read it yesterday and may have forgotten). Transfer learning could drastically reduce the amount of compute needed to learn new tasks.
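For concreteness, here is a minimal sketch of the kind of transfer learning meant here (my own illustrative PyTorch example, not anything from the article): reuse a pretrained backbone, freeze it, and train only a small task-specific head, so learning a new task costs a tiny fraction of the original training compute.

```python
# Transfer-learning sketch (illustrative): freeze a pretrained backbone and
# train only a new task head, so most of the pretraining compute is reused.
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # downloads pretrained weights
for p in backbone.parameters():          # freeze the pretrained weights
    p.requires_grad = False

num_classes = 10                         # hypothetical new task
backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)  # new trainable head

optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)  # only the head is updated
loss_fn = nn.CrossEntropyLoss()

# One illustrative training step on dummy data.
x = torch.randn(8, 3, 224, 224)          # batch of fake images
y = torch.randint(0, num_classes, (8,))  # fake labels
optimizer.zero_grad()
loss = loss_fn(backbone(x), y)
loss.backward()
optimizer.step()
```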
I don't think shrinking the neural nets is the answer. That will just make them less robust -- and adding symbolic processing won't help either. The human brain still uses far more compute than any of these neural-net models, and if we want to build AI that emulates it, we are probably going to have to use at least as much compute in our AI models.