So, what if I told you that I worked on the basis for the control system for robots like the ones you see in I, Robot, back in 2017?
Well, as a STEM professor myself (not in robotics; a full professor at a tier-1 research university), I am not the least bit intimidated by people with technical talent. There's even a chance you were one of my students at some point in the past! (I am not young.)
(And, yes, I realize those problems are a bit out of reach at the moment -- mainly due to their length -- but not too far out of reach.)
> As far as I can tell, this system has no possibility of ever being human-level.
Define what you mean by "human-level", and be precise. I have little doubt that you know nowhere near as many basic facts as the system can pull up; in that respect, it is already "superhuman". Not only has it memorized facts, it can also blend them together. Even if you had access to the internet to look up facts, there might be some very simple tests or tasks that it could beat you at.
I thought it might be worth mentioning what OpenAI might work on next. It looks like they want to continually improve their API, which can already do some amazing things. However, it has been reported -- or at least suggested, in a Tech Review article -- that they also want to train something like GPT-3, but multimodal (GPT-4? Something else?). In other words, it would build its internal representations of knowledge not only from text, but also from audio, video, and images.
Let's say that images can be compressed down to about 10,000 bytes each. You can do fairly well at that level of compression -- it captures the major details.
So, then, if they train it with 100 million images, that adds up to 1 terabyte.
And now, let's say audio clips are compressed to maybe 10,000 to 100,000 bytes, depending on length, using e.g. neural-net-based compression.
Video is harder to compress, but it can be done.
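As a sanity check on those numbers, here is the back-of-envelope arithmetic in Python. The per-item sizes are just the assumptions stated above, and the audio clip count is my own invention, purely for illustration:

```python
# Back-of-envelope dataset sizes, using the compressed-size assumptions
# stated above (these are guesses, not measured figures).
IMAGE_BYTES = 10_000        # ~10 KB per compressed image
AUDIO_BYTES = 50_000        # midpoint of the 10-100 KB per-clip range
N_IMAGES = 100_000_000      # 100 million images, as in the text
N_CLIPS = 10_000_000        # hypothetical clip count, for illustration only

TB = 10**12                 # decimal terabyte
print(f"images: {N_IMAGES * IMAGE_BYTES / TB:.1f} TB")  # -> 1.0 TB
print(f"audio:  {N_CLIPS * AUDIO_BYTES / TB:.1f} TB")   # -> 0.5 TB
```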
OK, so maybe they train a system on several terabytes of images, audio, and video. How would they do it? Well, they would probably have to alter the architecture. Gwern suggests they may add new attention mechanisms -- and that's probably the least of the modifications. And maybe they will make the context window a lot longer, so that it's over 100,000 tokens, instead of 2,048.
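A quick calculation (mine, not Gwern's) shows why a 100,000-token context would force such modifications: vanilla self-attention builds an n-by-n score matrix, so its memory cost grows quadratically with context length.

```python
# Memory for one head's n x n attention score matrix, assuming fp16
# (2 bytes per entry) -- a rough illustration, not a real profiler run.
def attn_matrix_bytes(context_len: int, bytes_per_entry: int = 2) -> int:
    return context_len * context_len * bytes_per_entry

for n in (2_048, 100_000):
    print(f"context {n:>7,}: {attn_matrix_bytes(n) / 2**30:.2f} GiB per head per layer")
# context   2,048: 0.01 GiB per head per layer
# context 100,000: 18.63 GiB per head per layer
```

Hence the interest in sparse or otherwise restructured attention mechanisms.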
It might make sense to mix the modalities together, as they appear in the real world. For example, when someone writes a post to Reddit, they might link to a jpeg image in the middle of it. One way to represent that post is as a combination of text, with special tokens in the middle to represent the compressed form of the image. That might not actually work -- but there's a chance that it would.
Do the same for audio and video, too.
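To make the "special tokens in the middle" idea concrete, here is a minimal sketch of how such a mixed-modality sequence might be assembled. The sentinel tokens and the image encoder are placeholders I made up; nothing here is an actual OpenAI design:

```python
from typing import List, Union

IMG_START, IMG_END = "<img>", "</img>"   # hypothetical sentinel tokens

def encode_image(jpeg_bytes: bytes, n_tokens: int = 256) -> List[int]:
    """Placeholder for a learned image tokenizer that maps a compressed
    image to a short sequence of discrete codes. Faked for illustration."""
    return [b % 8192 for b in jpeg_bytes[:n_tokens]]

def build_sequence(parts: List[Union[str, bytes]]) -> List[Union[str, int]]:
    """Interleave text with image codes, in the order they appear --
    the way a Reddit post might embed a jpeg mid-paragraph."""
    seq: List[Union[str, int]] = []
    for part in parts:
        if isinstance(part, bytes):
            seq += [IMG_START, *encode_image(part), IMG_END]
        else:
            seq += part.split()   # stand-in for a real BPE tokenizer
    return seq

post = ["Check out my setup:", b"\xff\xd8...jpeg bytes...", "Pretty neat, right?"]
print(build_sequence(post)[:12])
```

A real system would presumably use a learned discrete image tokenizer (VQ-VAE-style) and a proper text tokenizer; the point is just that text and image codes end up in one ordered token stream.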
Now, train a giant model with many more parameters than the current one. Maybe this takes $20 million... or even $50 million... or more.
What might one do with it?...
Here are some suggestions -- one unified system to do them all:
1. Let's say you are a business owner, and you have some old records, in different formats, that you scanned with your phone over the years and saved onto a computer. You've got thousands and thousands of jpeg files of digitized documents. They're not in a uniform font; some have little logos embedded; and so on. You can't pay a company to automatically map it all to text for you. But maybe you could feed them into some next-gen version of GPT-3 and say, "Make a list of the names of the customers in all these records," and then list out the records (as compressed image files). Maybe you have to give it some examples first, to prime it, so it knows what you want (see the first sketch after this list).
In the process of learning to predict tokens (including tokens in images), maybe it learns such good representations of image content that it picks up Optical Character Recognition implicitly. In fact, you could even write characters up the side of the page, rotate them, use different fonts, and it would still work. You could put distractors in the image, include little photos, and it would still work.
2. Let's say you have some bar charts, pie charts, and other kinds of charts, and, just for your own amusement (not for any serious application), you ask the computer to map them to spreadsheets. So, it has to recognize the number scale being used (by reading the numbers on the chart), recognize that the height of a bar indicates how large a number is, read off the names of the items at the bottom of the chart, and so on; and it has to tolerate lots of different kinds of charts. And it has to know how to map all that to, say, a CSV file. It's a lot of work! -- but there are going to be millions of graphs like that in the training data, from which the system will have learned useful features.
3. Maybe you want to test its ability to interpret diagrams? So... you take a photo of a circuit diagram and write to the API, "What is the resistance between terminal A and terminal B?" And then you add the photo. Now, the model will have seen many thousands of diagrams like that, with accompanying text describing the circuits; so it will have representations for that, too. So, it may look at the image, recognize the squiggly resistor symbol, see the numbers written next to it, and then report that. It's possible that it will even be able to do some rudimentary Ohm's Law calculations, and answer questions that must be derived from the image (see the second sketch after this list).
4. Let's say you give it a couple of examples of audio clips (that include singing and multiple instruments), each followed by a short MIDI file representing just the core melody in a single "voice" (no harmony). Then you give it one more audio clip and ask it to complete the pattern... and it will output the MIDI for the core melody.
5. Let's say you have a couple of short video clips of people preparing some basic meals. There are several different ingredients involved, and several steps in the cooking process. Just for your own amusement (not that you actually want to use it for anything), you show the system some examples of "here is the video" and "here is the recipe". Then, you show it one more video, and it completes the pattern... outputting a plausible recipe. It figures out what you want it to do.
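For item 1, the interaction might look roughly like the few-shot sketch below. To be clear, the client, endpoint, and image-attachment format are all invented for illustration -- today's API has no such multimodal call -- but the prompt structure mirrors how GPT-3 is primed with examples. The same pattern would cover items 2, 4, and 5, swapping in charts, audio, or video:

```python
import base64

def load_image(path: str) -> str:
    # Placeholder: a real version would read and base64-encode the
    # scanned jpeg; here we fake the bytes so the sketch runs as-is.
    return base64.b64encode(f"<bytes of {path}>".encode()).decode()

# Entirely hypothetical request format -- nothing like this exists today.
request = {
    "model": "hypothetical-multimodal-gpt",
    "prompt": [
        "Make a list of the names of the customers in these records.",
        # A couple of solved examples to prime it, as suggested above:
        {"image": load_image("record_001.jpg")}, "Customer: Jane Doe",
        {"image": load_image("record_002.jpg")}, "Customer: John Smith",
        # The record we actually want transcribed:
        {"image": load_image("record_003.jpg")},
    ],
    "max_tokens": 32,
}
# response = api.complete(**request)  # imagined call; no real client here
```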
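And for item 3, here is the kind of rudimentary Ohm's-law reasoning the model would have to perform once it has read the component values off the diagram -- the example circuit is made up:

```python
# Equivalent resistance between two terminals of a simple resistor
# network: the sort of derivation item 3 expects from the model.
def series(*rs: float) -> float:
    return sum(rs)

def parallel(*rs: float) -> float:
    return 1.0 / sum(1.0 / r for r in rs)

# Hypothetical circuit read off the photo: a 100-ohm resistor in series
# with a pair of 200-ohm resistors in parallel, between A and B.
r_ab = series(100.0, parallel(200.0, 200.0))
print(f"R(A,B) = {r_ab:.0f} ohms")  # -> 200 ohms
```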
Now, if you have a system that can do those things -- and those are only a tiny snapshot of its capabilities -- just think how much more you could do with OpenAI's API when it gets that capable...