InterCorp - A Multilingual Parallel Corpus of the Czech National Corpus

Publication

Abstract

This review describes and evaluates InterCorp, a multilingual parallel corpus with referential character, developed by the Institute of the Czech National Corpus and the Institute of Theoretical and Computer Linguistics at Charles University (Prague). In its current version 10, published in 2017, it comprises 2 108 703 589 tokens of language data in 40 different languages.

It is developed according to the translation principle, with Czech as its pivot language. Therefore, each integrated text is available in Czech and at least one other language.

A substantial part of the corpus, the core, which comprises mostly fiction, is aligned manually in the project itself. Other parts of the corpus, the so-called collections, are integrated from other projects, where they have been aligned automatically.

Besides a detailed description of the structure and content of InterCorp, this review focuses on its accessibility via the online corpus manager KonText and assesses the value of the corpus for research questions that do not primarily focus on Czech.

Keywords

InterCorp, Multilingual Parallel Corpus, Czech National Corpus