You don't need RL for the conclusion "trained to predict the next token => only thinks one token ahead" to be wrong. After all, the LLM is predicting that next token from something: a context that's many tokens long. Human text isn't arbitrary or random; there are statistical patterns in our speech, writing, and thinking that span words, sentences, and paragraphs, and even for next-token prediction, predicting correctly means learning those same patterns. It's not hard to imagine that the model generating token N is already "thinking about" tokens N+1 through N+100, because the statistical patterns set up by the preceding hundred tokens shift with each subsequent token choice.
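To make that concrete, here's a minimal sketch of teacher-forced next-token training on a toy causal model. The model, sizes, and random token data are all illustrative stand-ins (nobody's real training setup); the point is just that the logits at position t are computed from every token up to t, so driving the per-token loss down requires capturing long-range patterns in the context, not just the immediately preceding word.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, dim, seq_len = 100, 32, 16

class TinyCausalLM(nn.Module):
    """Toy causal LM: each position can only attend to earlier positions."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab)

    def forward(self, x):
        # Causal mask: position t sees tokens 0..t and nothing after.
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.encoder(self.embed(x), mask=mask)
        return self.head(h)

model = TinyCausalLM()
tokens = torch.randint(0, vocab, (8, seq_len))   # stand-in for real human text
inputs, targets = tokens[:, :-1], tokens[:, 1:]

logits = model(inputs)                           # (8, seq_len-1, vocab)
loss = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))
loss.backward()                                  # one "next-token prediction" step
```

Even though the loss is scored one token at a time, every prediction is conditioned on the whole preceding context, which is where the longer-range structure gets learned.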
True. See the comment from one of Anthropic's researchers for a great example of that. It's likely that "planning" inherently exists in the raw LLM and RL is just bringing it to the forefront.
I just think it's helpful to understand that all of these models people are interacting with were trained with the _explicit_ goal of maximizing the probabilities of responses _as a whole_, not just maximizing probabilities of individual tokens.
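One way to picture that sequence-level objective is a REINFORCE-style policy-gradient step, sketched below. This is illustrative only, not any lab's actual RLHF pipeline: `model` is assumed to be a causal LM like the toy one above, and `reward_fn` is a hypothetical scorer that judges the entire sampled response with a single number. The key contrast with per-token cross-entropy is that the gradient scales the log-probability of the whole response by that one reward.

```python
import torch

def sequence_level_step(model, prompt_ids, reward_fn, max_new_tokens=64):
    # Sample a full response, token by token, from the current policy.
    ids = prompt_ids.clone()
    logps = []
    for _ in range(max_new_tokens):
        logits = model(ids)[:, -1, :]                  # next-token distribution
        dist = torch.distributions.Categorical(logits=logits)
        tok = dist.sample()
        logps.append(dist.log_prob(tok))
        ids = torch.cat([ids, tok.unsqueeze(1)], dim=1)

    response = ids[:, prompt_ids.size(1):]
    reward = reward_fn(prompt_ids, response)           # one score for the whole response
    seq_logp = torch.stack(logps, dim=1).sum(dim=1)    # log P(response | prompt)

    # REINFORCE: increase the joint probability of responses that scored well.
    loss = -(reward.detach() * seq_logp).mean()
    loss.backward()
```

So even though sampling still happens one token at a time, the quantity being optimized is the probability of the response as a whole, which is exactly the distinction the comment above is drawing.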