Morphological Tags in Parallel Corpora

Publication at Faculty of Arts | 2010

Abstract

Tagsets used to annotate corpora often classify word classes and morphological categories according to different criteria, even within a single language. Texts tagged in disparate ways are harder to search and to process automatically.

For a parallel corpus, a single "harmonized" tagset could be designed (as in the MULTEXT-East project) or, even better, the information from all tagsets could be encoded in a morphosyntactic "interlingua" (see Dan Zeman's Interset). The parallel with natural languages is appropriate: problems with missing equivalents occur in the translation of tags just as they do in the translation of words.

Thus we propose a tagset interlingua as a hierarchy (lattice) of categories corresponding to language-specific tags. A tag missing in a language can be substituted by a more general tag or by a disjunction of more specific tags.
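The two substitution strategies can be illustrated with a minimal Python sketch. It is not the implementation proposed here: the category names, the `PARENT` table, and the functions `generalize` and `specialize` are all hypothetical, and the hierarchy is simplified to a tree, whereas a lattice would allow a category to have several parents.

```python
from typing import Dict, List, Optional, Set

# Hypothetical interlingua hierarchy: category -> more general category.
# Simplified to a tree; a real lattice may give a category several parents.
PARENT: Dict[str, str] = {
    "noun": "nominal",
    "adjective": "nominal",
    "pronoun": "nominal",
    "personal_pronoun": "pronoun",
    "possessive_pronoun": "pronoun",
    "nominal": "word",
    "verb": "word",
}


def children(cat: str) -> List[str]:
    """Direct, more specific subcategories of `cat`, in alphabetical order."""
    return sorted(c for c, p in PARENT.items() if p == cat)


def generalize(cat: str, available: Set[str]) -> Optional[str]:
    """Nearest more general category that the target tagset provides."""
    node = PARENT.get(cat)
    while node is not None:
        if node in available:
            return node
        node = PARENT.get(node)
    return None


def specialize(cat: str, available: Set[str]) -> List[str]:
    """Disjunction of more specific categories that the target tagset provides.

    On each branch the search stops at the first available category.
    Coverage may be partial if some branch has no available category.
    """
    found: List[str] = []
    for child in children(cat):
        if child in available:
            found.append(child)
        else:
            found.extend(specialize(child, available))
    return found


# Tagset A has no general "pronoun" tag, only its subtypes.
tagset_a = {"noun", "adjective", "personal_pronoun", "possessive_pronoun", "verb"}
# Tagset B is coarse and has a single tag for all nominals.
tagset_b = {"nominal", "verb"}

print(specialize("pronoun", tagset_a))           # disjunction of specific tags
print(generalize("personal_pronoun", tagset_b))  # fallback to a general tag
```

In this sketch a query for "pronoun" against tagset A expands into the disjunction of its two pronoun subtypes, while a query for "personal_pronoun" against tagset B falls back to the coarser "nominal".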

As with multilingual lexical databases, the methods of Formal Concept Analysis can be used.
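As a rough illustration of how Formal Concept Analysis yields such a hierarchy, the following Python sketch enumerates the formal concepts of a tiny context by brute force. Everything in it is assumed for the example: the tags (abbreviated and prefixed with a language code), the feature names, and the functions `extent`, `intent`, and `concepts`. The exhaustive enumeration over attribute subsets is only practical for very small contexts.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Set, Tuple

# Hypothetical formal context: language-specific tags (objects)
# described by morphological features (attributes).
CONTEXT: Dict[str, Set[str]] = {
    "cs:NNMS1": {"noun", "masc", "sg"},
    "cs:NNFS1": {"noun", "fem", "sg"},
    "en:NN": {"noun", "sg"},
    "en:NNS": {"noun", "pl"},
}

ALL_ATTRS: Set[str] = set().union(*CONTEXT.values())


def extent(attrs: Set[str]) -> FrozenSet[str]:
    """All tags that carry every feature in `attrs`."""
    return frozenset(o for o, feats in CONTEXT.items() if attrs <= feats)


def intent(objs: FrozenSet[str]) -> FrozenSet[str]:
    """All features shared by every tag in `objs` (all features if empty)."""
    if not objs:
        return frozenset(ALL_ATTRS)
    return frozenset(set.intersection(*(CONTEXT[o] for o in objs)))


def concepts() -> Set[Tuple[FrozenSet[str], FrozenSet[str]]]:
    """Enumerate (extent, intent) pairs by closing every attribute subset."""
    attrs = sorted(ALL_ATTRS)
    result = set()
    for size in range(len(attrs) + 1):
        for combo in combinations(attrs, size):
            ext = extent(set(combo))
            result.add((ext, intent(ext)))
    return result


# Print the concepts from the most general (largest extent) downwards.
for ext, feats in sorted(concepts(), key=lambda c: (-len(c[0]), sorted(c[1]))):
    print(sorted(ext), sorted(feats))
```

In this toy context the concept with the features "noun" and "sg" groups the English singular noun tag with both Czech singular noun tags, so the English tag sits above the two gender-specific Czech tags and can be rendered on the Czech side as their disjunction.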