The ssj500k Training Corpus for Slovene Language Processing

Publication

Abstract

This paper presents recent developments and the content of the ssj500k training corpus, the largest and most widely used open-source collection of training data for Slovene language processing. The corpus has been manually annotated for segmentation, tokenisation, lemmatisation, JOS morphosyntax and dependency syntax, Universal Dependencies, semantic role labelling, named entities and verbal multi-word expressions. After a short history of the development of the corpus, we give an overview of the dataset as a whole and the details of each annotation layer, including a survey of existing natural language processing tools that used it for training.

Most ssj500k annotations were carried out using the dedicated Q-CAT querying-supported corpus annotation tool, which is also presented. Finally, directions for the future development of the corpus are discussed.

Keywords

NLP, training corpus, Slovene language