Multilingual parallel corpora can be annotated with monolingual tools, such as morphosyntactic taggers. However, even taggers for typologically similar languages often use incompatible tagsets, so the tags within a single corpus vary both conceptually and formally.
Retraining taggers on data annotated with a common tagset is not a realistic option. Differences between tagsets are often rooted in different linguistic perspectives rather than in real distinctions between the languages, so there is a good chance of finding common ground.
Moreover, a different perspective may provide additional information missing in one tagset but present in another. Our first goal is to delegate the task of dealing with multiple tagsets to an abstract interlingual representation of linguistic categories.
Ideally, each tag in every language-specific tagset used in the corpus is linked to a position in a tangled hierarchy of concepts. To accommodate the different perspectives, the hierarchy takes three views of word class.
For example, the Czech tag for a relative pronoun is decoded as a category with the properties of an inflectional adjective, a syntactic noun, and a semantic pronoun, each with its appropriate morphological characteristics. Comparing different tagsets reveals mismatches, where tags are ambiguous with respect to the concepts.
Such mismatches are properly represented, which allows for a principled mapping strategy between language-specific tagsets and for intuitive and underspecified queries. The hierarchy can be built, and the mismatches partially resolved, using Formal Concept Analysis (Ganter & Wille, 1999).
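As a rough illustration of how Formal Concept Analysis can yield such a hierarchy, the following minimal sketch treats tags as objects and their decoded properties as attributes, and enumerates the resulting formal concepts. The tag labels (`cs:RelPron`, `en:WP`, and so on), the property names, and the toy context itself are invented for this example and are not taken from any actual tagset or from the implementation described here.

```python
# Toy formal context: tag -> set of properties.
# All tag and property labels are hypothetical, chosen only for illustration.
CONTEXT = {
    "cs:RelPron": frozenset({"infl:adj", "syn:noun", "sem:pron"}),
    "cs:Noun": frozenset({"infl:noun", "syn:noun", "sem:noun"}),
    "en:WP": frozenset({"syn:noun", "sem:pron"}),
    "en:NN": frozenset({"syn:noun", "sem:noun"}),
}


def extent(intent, context):
    """Return all tags whose properties include every property in `intent`."""
    return frozenset(tag for tag, props in context.items() if intent <= props)


def formal_concepts(context):
    """Enumerate the (extent, intent) pairs of the concept lattice.

    The intents are exactly the intersections of the tags' property sets,
    together with the full property set (the intent of the empty extent).
    """
    all_props = frozenset().union(*context.values())
    intents = {all_props}
    for props in context.values():
        # Close the collected intents under intersection with this tag.
        intents |= {props & known for known in intents}
    return {(extent(intent, context), intent) for intent in intents}


CONCEPTS = formal_concepts(CONTEXT)

if __name__ == "__main__":
    for tags, props in sorted(CONCEPTS, key=lambda c: (len(c[1]), sorted(c[1]))):
        print(sorted(tags), sorted(props))
```

In this toy context the hypothetical English tag says nothing about the inflectional view, so it shares a concept with the Czech tag only at the level of the syntactic and semantic properties. A mapping or an underspecified query could target that shared concept rather than either language-specific tag.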
Our second goal is to refine existing morphosyntactic annotation by projecting distinctions in one tagset onto a conceptually different tagset. The hierarchy and automatic word-to-word alignment are used to learn from word tokens in another language.
We show results of an experiment for different languages and tagsets, including untagged texts.
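The projection of distinctions across word-aligned tokens can be pictured with the following minimal sketch. It is one possible reading under stated assumptions, not the procedure used in the experiment: the decoding tables, the tag labels, the `project` function, and the choice to project only the syntactic and semantic views (on the assumption that inflectional properties are language-specific) are all hypothetical.

```python
# Hypothetical decodings of tags into views of word class.
SOURCE_DECODE = {"cs:RelPron": {"infl": "adj", "syn": "noun", "sem": "pron"}}
TARGET_DECODE = {"xx:PRON": {"sem": "pron"}}


def project(source_tags, target_tags, alignment, views=("syn", "sem")):
    """Fill in views missing on target tokens from aligned source tokens.

    `alignment` is a list of (source_index, target_index) pairs.
    A target tag of None stands for an untagged token.
    Values already present on the target side are never overwritten.
    """
    refined = [dict(TARGET_DECODE.get(tag, {})) for tag in target_tags]
    for src_i, tgt_i in alignment:
        source_props = SOURCE_DECODE.get(source_tags[src_i], {})
        for view in views:
            if view in source_props:
                refined[tgt_i].setdefault(view, source_props[view])
    return refined


if __name__ == "__main__":
    # A coarsely tagged target token gains the syntactic view.
    print(project(["cs:RelPron"], ["xx:PRON"], [(0, 0)]))
    # An untagged target token gains both projected views.
    print(project(["cs:RelPron"], [None], [(0, 0)]))
```

A realistic setup would presumably aggregate such evidence over many aligned tokens and use the concept hierarchy to decide which distinctions are compatible, rather than copying values from a single alignment link.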