Spoken Czech corpora and their potential use in observing the differing worlds of spoken and written language

Publication at Faculty of Mathematics and Physics, Faculty of Arts | 2011

Abstract

The reconstruction of standardized text from the Prague Dependency Treebank of Spoken Czech removes features specific to spoken language: non-standard elements from Common Czech and dialects, superfluous demonstrative pronouns and connectors, filler words, subjective word order, repetitions, repairs, restarts, and so on. For example, the spoken utterance "takže jako tam byla dobrá parta a dlouho tedy no" ("so like there was a good crowd and for a long time yeah") becomes the standardized sentence "Byla tam dlouho dobrá parta" ("For a long time, there was a good crowd there.").

This enables a vivid and interesting comparison of authentic spoken expressions with their standardized counterparts. It also raises the question of how to categorize the standardized texts: are they still spoken texts, only now correct and standard? Or does standardization transform spoken utterances into written texts? If we leave aside phonetic and morphological phenomena and concentrate on the syntactic transformations which occur during stan