SYN v11: corpus of contemporary written Czech

Publication

Abstract

SYN v11 is a corpus of contemporary written Czech comprising 5.0 billion running words (i.e. 6.0 billion tokens). It mostly covers the period 1990-2021.

SYN v11 features rich metadata, including detailed bibliographical information and a revised text-type classification. Although it contains a wide range of text types (fiction, non-fiction, newspapers), newspapers clearly predominate.

The corpus is lemmatized and morphologically annotated using a combination of stochastic and rule-based methods. Compared with its predecessor, SYN v10, the main differences are an updated newspaper part (150 million running words of texts from 2021 were added) and improved lemmatization and morphological tagging.