Towards a Polish Question Answering Dataset (PoQuAD)

Publication

Abstract

This paper presents the efforts towards creating PoQuAD, a dataset for training automatic question answering models in Polish. It explains why native data is vital for training accurate question answering systems.

PoQuAD broadly follows the methodology of SQuAD 2.0 (including impossible questions), but departs from it in a few aspects. The first of these is a reduced annotation density, which broadens the range of topics included.

The second is the inclusion of a generative answer layer to better suit the needs of a morphologically rich language. PoQuAD is a work in progress and so far consists of over 29,000 question-answer pairs with contexts extracted from Polish Wikipedia.

The planned size of the dataset is over 50,000 such entries. The paper describes the annotation process and the guidelines given to annotators to ensure the quality of the data.

The collected data is analyzed to shed light on its linguistic properties and on the difficulty of the task.