
N-grams from Google

An entry from Google Labs showed up in my RSS reader and got me excited: they are sharing a massive amount of data from their n-gram model research:

"That's why we decided to share this enormous dataset with everyone. We processed 1,011,582,453,213 words of running text and are publishing the counts for all 1,146,580,664 five-word sequences that appear at least 40 times. There are 13,653,070 unique words, after discarding words that appear less than 200 times."
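To get a feel for what "counts for five-word sequences that appear at least 40 times" means, here is a minimal sketch of n-gram counting with a frequency cutoff. It is not Google's pipeline: the function name `ngram_counts`, the whitespace tokenization, and the toy sentence are all my own assumptions, and the threshold is scaled down so the example fits on a screen.

```python
from collections import Counter


def ngram_counts(tokens, n, min_count=1):
    """Count every run of n consecutive tokens and keep the runs
    seen at least min_count times (hypothetical helper)."""
    # zip over n shifted views of the token list yields each n-gram as a tuple
    grams = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(grams)
    return {gram: c for gram, c in counts.items() if c >= min_count}


# Toy corpus; naive whitespace splitting is an assumption for illustration
text = "the cat sat on the mat and the cat sat on the rug"
tokens = text.split()

# Same idea as the quote, with n=5 and a cutoff of 2 instead of 40
five_grams = ngram_counts(tokens, n=5, min_count=2)
bigrams = ngram_counts(tokens, n=2, min_count=2)
```

The cutoff is what keeps the published table manageable: most word sequences in running text show up only once or twice, so dropping the rare ones discards most of the distinct entries.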

That is some data! (It ships on 6 DVDs.) I rushed to the linked Linguistic Data Consortium and, well, end of story for me: they have two subscriptions for commercial members, priced at $20,000 and $25,000 (USD).


About

This page contains a single entry from the blog posted on August 9, 2006 7:47 PM.
