DepCC: A Dependency-Parsed Text Corpus from the Common Crawl

Together with our colleagues at the University of Hamburg, we have just released DepCC, a new web-scale dependency-parsed corpus based on the Common Crawl. DepCC is a large, linguistically analyzed English corpus comprising 365 million documents, 14.3 billion sentences, 252 billion tokens, and 7.5 billion named entity occurrences.

You can find the corpus here: https://commoncrawl.s3.amazonaws.com/contrib/depcc/CC-MAIN-2016-07/index.html

A description is available in this paper: https://arxiv.org/abs/1710.01779