> Its easier to see the flaws in this one because its more general, so spread out more thinly.

I really think this is due to Gato's very limited parameter count: about 1.2B versus 175B for GPT-3. They intentionally restricted the model size so that it could control a robot arm (!) in real time.

> these models need a sense of self and relational categories.

The places where I personally see GPT-3 getting hung up on higher-level structure seem closely related to its limited context window. It can't remember more than a few pages at most, so it essentially has to infer the plot from whatever text is still in view. If that's not possible, it either flails (at higher temperatures) or outputs boring, safe completions that are unlikely to be contradicted (at lower temperatures).
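For anyone unfamiliar with what temperature does, here is a minimal sketch of temperature-scaled softmax over next-token scores. This is the generic textbook formulation, not GPT-3's actual sampling code, and the `logits` values and the function name are made up for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next tokens (assumed values).
logits = [2.0, 1.0, 0.0]

low = softmax_with_temperature(logits, 0.2)   # sharply peaked on the top token
high = softmax_with_temperature(logits, 2.0)  # much flatter distribution

print("low temperature: ", [round(p, 3) for p in low])
print("high temperature:", [round(p, 3) for p in high])
```

At low temperature nearly all the probability mass lands on the highest-scoring token, which matches the "safe" completions; at high temperature the unlikely tokens get sampled far more often, which is where the flailing comes from.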