Corpus Defects

Publication at Faculty of Mathematics and Physics | 2007

Abstract

The article proposes a typology of errors that occur in morphologically annotated corpora, illustrated on the SYN2000 version of the Czech National Corpus. SYN2000 is a morphologically annotated corpus with three attributes: word form, lemma, and morphological tag.

Word forms come from the original texts acquired from various providers; the other two attributes are added by the corpus builders during annotation. The article explains the process of morphological annotation and its three phases: morphological analysis, the guesser, and disambiguation.
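
As a rough illustration of the three attributes and the three annotation phases, the following minimal Python sketch may help. It is not the procedure used for SYN2000: the toy dictionary, the simplified tags (`NOUN`, `VERB`, `UNKNOWN`), and all function names are hypothetical, and the disambiguation step simply takes the first candidate, whereas the article refers to statistical and rule-based disambiguation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Reading = Tuple[str, str]  # (lemma, tag)


@dataclass
class Token:
    form: str                    # word form, taken from the original text
    lemma: Optional[str] = None  # added during annotation
    tag: Optional[str] = None    # added during annotation


# Hypothetical toy dictionary: word form -> candidate (lemma, tag) readings.
# The tags are simplified placeholders, not real corpus tags.
TOY_DICTIONARY = {
    "vidím": [("vidět", "VERB")],
    "ženu": [("žena", "NOUN"), ("hnát", "VERB")],  # ambiguous form
}


def analyze(form: str) -> List[Reading]:
    """Phase 1: morphological analysis by dictionary lookup."""
    return list(TOY_DICTIONARY.get(form.lower(), []))


def guess(form: str) -> List[Reading]:
    """Phase 2: guesser for forms missing from the dictionary (placeholder)."""
    return [(form.lower(), "UNKNOWN")]


def disambiguate(candidates: List[Reading]) -> Reading:
    """Phase 3: choose one reading; here, naively, the first candidate."""
    return candidates[0]


def annotate(forms: List[str]) -> List[Token]:
    """Run the three phases over a list of word forms."""
    tokens = []
    for form in forms:
        candidates = analyze(form) or guess(form)
        lemma, tag = disambiguate(candidates)
        tokens.append(Token(form, lemma, tag))
    return tokens


if __name__ == "__main__":
    for token in annotate(["Vidím", "ženu", "xyz"]):
        print(token.form, token.lemma, token.tag)
```

Even in this toy form, each phase is a place where a wrong lemma or tag can enter: a faulty dictionary entry, a bad guess for an unknown form, or a wrong choice between competing readings.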

It describes the types of errors that can arise in each phase and why they arise, and it discusses how they might be removed.

There are three main categories of errors: original errors inherited from the source texts; coding errors that can arise when various texts are recoded into one common format; and annotation errors caused by faults in the morphological dictionary and imperfections in the disambiguation, both statistical and rule-based. All types of corpus defects are docume