To be fair, there is some degree of "hand curation" of the data, so while "it is the internet", the actual training data is a derivation of that.
in a mild but productive analogy:
I could hand someone a copy of the K&R C programming book plus lots of specs and say "this is the Linux source code" (the raw data from which all observations were made, aka "the internet")
...or just send them the actual kernel source code (the refined training data, after a LOT of manual work)
...that a compiler consumes to generate the kernel (the open-weights model, which is what they actually shared).
Mildly related rant: honestly, it's a bit shit to say "open source model" about an "open weights" model. It's like saying World of Warcraft is open source because they gave you an executable of the game (you can still change it, but in more restricted ways).
give me the training data?