BigEst corpus

Automatically sentence-split, tokenized, PoS-tagged and morphologically analyzed Estonian texts. Sentences that contain no letters or consist of a single token have been filtered out.

CommonCrawl, Est, part-1 (17.7M sentences, 198.3M tokens)
CommonCrawl, Est, part-2 (19.1M sentences, 232.3M tokens)
CommonCrawl, Est, part-3 (19.7M sentences, 293.0M tokens)
CommonCrawl, Est, part-4 (16.5M sentences, 208.3M tokens)
EtTenTen (20.0M sentences, 314.0M tokens)
Estonian Reference Corpus (18.9M sentences, 260.2M tokens)
Total: 110.0M sentences, 1.51G tokens
Uploaded on September 18th, 2017

Format: one sentence per line, "Word|lemma|pos-tag|morph-analysis word2|lemma2|etc"
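
The format above can be read with a few lines of Python. The following is only a minimal sketch, assuming that tokens are separated by single spaces, that every token carries exactly four "|"-separated fields, and that the lemma, tag and analysis fields contain neither spaces nor "|". The names Token and parse_sentence and the sample line with its POS1/FORM1 placeholders are hypothetical and are not taken from the corpus.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """One token of a corpus line (hypothetical helper type)."""
    word: str
    lemma: str
    pos: str
    morph: str


def parse_sentence(line: str) -> List[Token]:
    """Split one corpus line into tokens of four '|'-separated fields.

    Assumption: tokens are whitespace-separated and the last three
    fields never contain '|'. Splitting from the right keeps a word
    form that itself contains '|' in one piece.
    """
    tokens = []
    for chunk in line.split():
        fields = chunk.rsplit("|", 3)
        if len(fields) != 4:
            raise ValueError(f"expected 4 '|'-separated fields, got: {chunk!r}")
        tokens.append(Token(*fields))
    return tokens


# Made-up line; POS1/FORM1 etc. are placeholders, not real tag values.
sample = "Kass|kass|POS1|FORM1 magab|magama|POS2|FORM2"
lemmas = [token.lemma for token in parse_sentence(sample)]
```

Since each file holds one sentence per line, the same function can be applied line by line while streaming a file, so a whole part never has to be loaded into memory.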

Original sources:
CommonCrawl, deduplicated, sorted into languages: click
EtTenTen: cluck
Estonian Reference Corpus: clack
Estonian text processed with: EstNLTK

TempEst corpus

Download, released on May 16th, 2011

Estonian sentences submitted as input to the Estonian-English online SMT system, translated by Liisi Pool with up to 4 alternative translations each

Number of sentences: 2650

Number of words: