Czech Named Entity Corpus and SVM-based Recognizer

Publication at Faculty of Mathematics and Physics

2009

Abstract

This paper addresses the recognition of named entities in Czech texts. We present a recently released corpus of Czech sentences with manually annotated named entities, labeled according to a rich two-level classification scheme.

The corpus contains around 6,000 sentences with roughly 33,000 marked named entity instances. We use the data to train and evaluate a named entity recognizer based on the Support Vector Machine classification technique.

The presented recognizer improves on the results previously reported for named entity (NE) recognition in Czech.
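
The abstract does not say how the recognizer is scored. A common choice for NE recognition is span-level precision, recall, and F1, and with a two-level scheme the scores can be computed at either level. The sketch below is only an illustration under stated assumptions: the function name `span_prf`, the `(start, end, label)` span format, the example labels, and the idea that the first character of a label gives the coarse level are all hypothetical and not taken from the paper.

```python
def span_prf(gold, predicted, key=lambda span: span):
    """Span-level precision, recall and F1.

    gold, predicted: iterables of (start, end, label) tuples.
    key: maps a span to the form in which spans are compared, so the same
         function can score the fine-grained or the coarse level.
    """
    gold_set = {key(span) for span in gold}
    pred_set = {key(span) for span in predicted}
    correct = len(gold_set & pred_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


def coarse(span):
    """ASSUMPTION: the first character of the label encodes the coarse type."""
    start, end, label = span
    return (start, end, label[:1])


# Hypothetical annotations for one sentence (token offsets, illustrative labels).
gold = [(0, 1, "pf"), (1, 2, "ps"), (3, 4, "gu")]
predicted = [(0, 1, "pf"), (1, 2, "ps"), (3, 4, "gc")]  # fine type wrong on last span

fine_scores = span_prf(gold, predicted)            # exact label must match
coarse_scores = span_prf(gold, predicted, coarse)  # only the coarse type must match
```

In this toy case the last span counts as an error at the fine level but as correct at the coarse level, which is why results for a two-level scheme are best reported per level.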

Keywords

Czech Named Entity Corpus, SVM-based Recognizer