N-grams from Google
An entry from Google Labs showed up in my RSS reader and got me excited: they are sharing a massive amount of data from their n-gram model research:
"That's why we decided to share this enormous dataset with everyone. We processed 1,011,582,453,213 words of running text and are publishing the counts for all 1,146,580,664 five-word sequences that appear at least 40 times. There are 13,653,070 unique words, after discarding words that appear less than 200 times. "
That is some data! (It ships on 6 DVDs.) I rushed to the linked Linguistic Data Consortium and, well, that was the end of the story for me: they have two subscriptions for commercial members, priced at $20,000 (USD) - $25,000 (USD).
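
For a sense of what the thresholds in the quote mean in practice, here is a minimal Python sketch of the same kind of counting on a toy scale. It is not Google's pipeline: the function name `count_ngrams`, its parameters, and the choice to replace rare words with an `<UNK>` placeholder are my own assumptions for illustration.

```python
from collections import Counter


def count_ngrams(tokens, n=5, min_word_count=200, min_ngram_count=40, unk="<UNK>"):
    """Count n-word sequences, keeping only the frequent ones.

    Assumption: words seen fewer than min_word_count times are replaced
    by a placeholder token instead of being dropped from the text.
    """
    # First pass: how often does each word occur?
    word_counts = Counter(tokens)

    # Replace rare words with the placeholder token.
    filtered = [w if word_counts[w] >= min_word_count else unk for w in tokens]

    # Second pass: slide a window of n words over the text and count each one.
    ngram_counts = Counter(
        tuple(filtered[i:i + n]) for i in range(len(filtered) - n + 1)
    )

    # Keep only the sequences that appear often enough.
    return {gram: c for gram, c in ngram_counts.items() if c >= min_ngram_count}


if __name__ == "__main__":
    text = "the cat sat on the mat the cat sat on the rug".split()
    # Tiny thresholds so the toy sentence produces something.
    for gram, count in count_ngrams(text, n=2, min_word_count=2, min_ngram_count=2).items():
        print(" ".join(gram), count)
```

With the defaults (n=5, 200, 40) this mirrors the numbers in the quote, but holding every count in memory like this would not get anywhere near a trillion words; it is only meant to show the idea.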