SYN v10: corpus of contemporary written Czech

Publication

Abstract

A corpus of contemporary written Czech containing almost 4.9 billion running words (i.e. 5.9 billion tokens). It mostly covers the period 1990-2020.

SYN v10 features rich metadata, including detailed bibliographical information and a revised text-type classification. Although it contains a wide range of text types (fiction, non-fiction, newspapers), newspapers clearly predominate.

The corpus is lemmatized and morphologically annotated using a combination of stochastic and rule-based methods. The main differences from its predecessor, SYN v9, are the updated newspaper part (newly added texts from 2020 amounting to more than 150 million running words) and improved lemmatization and morphological tagging.