Creating a sociologically balanced spoken corpus

Publication at the Faculty of Arts (Filozofická fakulta)

2019

Abstract

The article presents corpora of spoken Czech that were created for linguistic research and are publicly accessible. Because these corpora capture private, spontaneous dialogues, they were compiled according to sociological criteria recorded for each speaker.

From the beginning, these corpora have been balanced on a binary basis in the categories of gender, age and highest level of education attained. Later, a further category was added: the dialect region in which the speaker spent their childhood.

Combining all of these criteria is quite difficult when recording longer interviews. Full balancing across all categories is achieved in the ORTOFON corpus.

Keywords

spoken corpus, metadata, balancing