Everything from simple sentiment analysis, to archive.org, to another mirror. I hope that does not discourage you from releasing the data.
Edit: I see the other comment about Archive Team already collecting and releasing this data for free in an open format. I think that will be a good first source as well.