Resources Archives

August 9, 2006

N-grams from Google

An entry from Google Labs appeared in my RSS reader and got me excited. They are sharing a massive amount of data from their n-gram models research:

"That's why we decided to share this enormous dataset with everyone. We processed 1,011,582,453,213 words of running text and are publishing the counts for all 1,146,580,664 five-word sequences that appear at least 40 times. There are 13,653,070 unique words, after discarding words that appear less than 200 times. "

That is some data! (It is available on 6 DVDs.) I rushed to the linked Linguistic Data Consortium page and, well, that was the end of the story for me: they list two subscriptions for commercial members, at $20,000 (USD) and $25,000 (USD).
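
For a feel of what the counts in the quote mean, here is a minimal Python sketch of n-gram counting with the two cutoffs it mentions (a word-frequency floor and an n-gram-frequency floor). This is only an illustration, not Google's actual pipeline: the function name `ngram_counts`, its parameters, and the `<UNK>` placeholder for rare words are my own assumptions, since the quote does not say how discarded words are handled.

```python
from collections import Counter


def ngram_counts(tokens, n, min_word_count=1, min_ngram_count=1, unk="<UNK>"):
    """Count n-grams in a token list, applying two frequency cutoffs.

    Hypothetical helper: names and the <UNK> convention are assumptions.
    """
    # First pass: how often does each word occur?
    word_freq = Counter(tokens)

    # Replace words below the word cutoff with a placeholder token.
    kept = [t if word_freq[t] >= min_word_count else unk for t in tokens]

    # Second pass: count every run of n consecutive tokens.
    counts = Counter(tuple(kept[i:i + n]) for i in range(len(kept) - n + 1))

    # Keep only the n-grams that meet the n-gram cutoff.
    return {gram: c for gram, c in counts.items() if c >= min_ngram_count}


# Toy run: bigrams, with both cutoffs set to 2 instead of 200 and 40.
sample = "a b a b a b c".split()
result = ngram_counts(sample, n=2, min_word_count=2, min_ngram_count=2)
```

With the numbers from the quote, the equivalent call would use `n=5`, `min_word_count=200` and `min_ngram_count=40`. Of course, a trillion words will not fit in one in-memory list, so the real thing needs distributed counting.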
