Near the end, beginning 1 hour, 17 minutes, 29 seconds in:
Brockman: Ilya and I are kicking off a new team, called the "reasoning team", and this is to really try to tackle, "How do you get neural networks to reason?" And we think this will be a long-term project. It's one we are very excited about.
Lex: In terms of reasoning -- a super-exciting topic -- what kind of benchmarks, what kind of tests of reasoning would you envision?
He also mentions how "programming", "security analysis", and theorem-proving all capture the same core processes of reasoning. (In fact, if you can get a machine to prove theorems, you can transform the solution into a method to write code. I've explained this before.)
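One common way to read the "proofs become code" remark is the Curry-Howard correspondence, where a constructive proof is itself a program. That reading is my assumption here, not something stated in the interview. A minimal Lean 4 sketch, with every name in it (`Sketch`, `swapProof`, `swapPair`, `doubleWitness`) made up for illustration:

```lean
namespace Sketch

-- Propositions as types: a proof of `A ∧ B → B ∧ A` is a function
-- that takes a pair of proofs and returns them in the other order.
theorem swapProof (A B : Prop) : A ∧ B → B ∧ A :=
  fun ⟨a, b⟩ => ⟨b, a⟩

-- The same term, read at the level of data, is an ordinary program.
def swapPair {α β : Type} : α × β → β × α :=
  fun (a, b) => (b, a)

-- A constructive existence statement packaged with its witness:
-- the value carries both the computed number and a proof that it is correct.
def doubleWitness (n : Nat) : { m : Nat // m = n + n } :=
  ⟨n + n, rfl⟩

#eval (doubleWitness 3).val  -- 6
#eval swapPair (1, "a")      -- ("a", 1)

end Sketch
```

The point of the toy example is only that the proof term and the program have the same shape; a prover that can find the former for a rich enough specification has, in effect, written the latter.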
Coincidentally, DeepMind released a dataset just today on getting a machine to solve basic math problems:
Today we're releasing a large-scale extendable dataset of mathematical questions, for training (and evaluating the abilities of) neural models that can reason algebraically.
Reasoning algebraically (i.e. solving math exam problems) is very, very far from actual theorem-proving, which is orders of magnitude harder; but it's a step in that direction.
It seems like there is something in the air -- teams are now attempting to tackle this very hard problem with Deep Learning... a problem that skeptics have said it shouldn't be able to solve!
As I've said before, mathematical theorem-proving is an excellent benchmark task for training machines to do complex reasoning. It's great, because it's domain-limited, and doesn't contain all the messiness we see in the real world. AND, if they can get a machine to solve IMO-style problems, say, then probably they can tweak their program to solve problems in pretty much any axiomatic system -- including ones important to biology, physics, medicine, you name it. This won't quite get us to an "automated scientist", though, since a large part of science is dealing with the messy, real world, where you first have to come up with a reasonable axiom system.
(You could argue, perhaps, that coming up with good axioms is, itself, a reasoning process that has its own meta-system of axioms (quantifying over a higher-order set of objects); but it might be really difficult to figure out what these are -- and it might involve a lot of them, whereas many mathematical systems have only very few axioms.)
Still, if mathematical theorem-proving becomes the new "Atari benchmark" for OpenAI and DeepMind, it will be a benchmark that actually gives real-world benefits. It will make scientific progress go a lot faster.
Another thing worth pointing out from that interview with Brockman: he was asked whether just scaling up GPT-2 would lead to a system that can pass a Turing Test, and was a little evasive. He said that he thinks a true Turing Test would require the system not just to hold a conversation, but also to do reasoning of an indeterminate length (GPT-2 does bounded-depth logic / "reasoning"), and to learn as you converse with it. He also mentioned that it's not clear what the limits of its world-modeling / commonsense reasoning are, based on pure text training. Reasoning and learning (not at the training phase, but during the conversation phase) will be things that have to be added.
I suspect, though, that just given the bounded-depth logical inference and limited learning ability language models currently possess, with enough data, existing language models would go pretty far towards convincing the average human that they aren't talking to a machine.
He mentions that in 2019 OpenAI wants to scale up language modelling 100x to 1000x to see what will happen. By the sound of it, they want to build far larger models than even GPT-2-large -- say, GPT-2-HUGE.