We present a dataset of telephone conversations in English and Czech, developed to train acoustic models for automatic speech recognition (ASR) in spoken dialogue systems (SDSs). The data comprise 45 hours of speech in English and over 18 hours in Czech.
All audio data and a large part of the transcriptions were collected using crowdsourcing; the rest was transcribed by hired transcribers. We release the data together with scripts for data pre-processing and for building acoustic models using the HTK and Kaldi ASR toolkits.
We also publish the trained models described in this paper. The data are released under the CC-BY-SA 3.0 license; the scripts are licensed under Apache 2.0.
In the paper, we report on the methodology used to collect the data, on the size and properties of the data, and on the scripts and their use. We verify the usability of the datasets by training and evaluating acoustic models using the presented data and scripts.