Charles Explorer logo
🇬🇧

Koditex: A corpus of diversified texts

Publication

Abstract

Koditex is a synchronic, representative and reference 9-million-word corpus (excl. punctuation) compiled for the purpose of conducting a multidimensional analysis (MDA) of Czech. When compiling the corpus, the primary goal was for it to be as diverse and representative as possible, reflecting the variability of Czech in all of its modes and ranges of use (written, spoken, online communication) and featuring rich annotation (the texts were lemmatized, morphologically tagged using two different systems, and furthermore they were annotated for phrasemes and so-called named entities).

As far as writtenness and spokenness are concerned, the Koditex is a mixed corpus.