DepCC: A Dependency-Parsed Text Corpus from the Common Crawl

Together with our colleagues at the University of Hamburg, we have just released DepCC, a new web-scale dependency-parsed corpus based on the Common Crawl. DepCC is a large, linguistically analyzed English corpus comprising 365 million documents, 14.3 billion sentences, 252 billion tokens, and 7.5 billion named entity occurrences.

You can find the corpus here:

A description is available in this paper: