We present a large-scale corpus study of the distributions of inflected nouns in contemporary Czech. Using SYN2015, a representative corpus of written Czech, we extract grammatical profiles (case and number) for all lemmata tagged as nouns in the corpus (159,804 items).
To identify lemmata with similar profiles, we then perform a hierarchical cluster analysis (Levshina 2015), which splits the dataset into nominal subtypes on the basis of high-profile intralemmatic tokens. The resulting clusters are analyzed from several perspectives.
We focus especially on the relationship between the clusters and grammatical gender and number. We also study the patterning of personal pronouns, names, and common masculine nouns, all of which are marked for animacy, because we predict the animacy hierarchy (Comrie 1989 [1981]) to be one of the main contributing factors.
The clusters are then contrasted with the traditional declension classes to see whether there is any relationship or level of similarity between the lexemes belonging to the individual classes with respect to their usage patterns.
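
As a rough illustration of the profile-and-cluster procedure described above, the following Python sketch turns per-lemma counts into a 14-slot case-by-number profile and groups lemmata with a naive agglomerative clustering. It rests on several assumptions not stated in the abstract: the Euclidean distance, the average linkage, the target of two clusters, the function names, and the placeholder lemmata with their invented counts are all hypothetical and are not taken from SYN2015 or from the study itself.

```python
from collections import Counter
from itertools import combinations
from math import dist

# Seven Czech cases times two numbers gives 14 profile slots.
CASES = ["nom", "gen", "dat", "acc", "voc", "loc", "ins"]
NUMBERS = ["sg", "pl"]
SLOTS = [(case, number) for number in NUMBERS for case in CASES]


def grammatical_profile(slot_counts):
    """Return the relative frequency of each (case, number) slot for one lemma."""
    counts = Counter(slot_counts)
    total = sum(counts[slot] for slot in SLOTS)
    return [counts[slot] / total for slot in SLOTS]


def average_linkage(profiles, k):
    """Merge the two closest clusters repeatedly until k clusters remain.

    Distance between clusters is the mean pairwise Euclidean distance
    (average linkage); this choice is an assumption made for the sketch.
    """
    clusters = [[lemma] for lemma in profiles]

    def cluster_distance(a, b):
        pairs = [(x, y) for x in a for y in b]
        return sum(dist(profiles[x], profiles[y]) for x, y in pairs) / len(pairs)

    while len(clusters) > k:
        a, b = min(combinations(clusters, 2), key=lambda pair: cluster_distance(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
    return clusters


# Invented counts for four placeholder lemmata (not corpus data).
toy_counts = {
    "lemma_a": {("nom", "sg"): 50, ("acc", "sg"): 30, ("gen", "sg"): 20},
    "lemma_b": {("nom", "sg"): 45, ("acc", "sg"): 35, ("gen", "sg"): 20},
    "lemma_c": {("loc", "sg"): 60, ("gen", "sg"): 30, ("nom", "sg"): 10},
    "lemma_d": {("loc", "sg"): 55, ("gen", "sg"): 35, ("nom", "sg"): 10},
}

profiles = {lemma: grammatical_profile(c) for lemma, c in toy_counts.items()}
clusters = average_linkage(profiles, k=2)
print(clusters)
```

The naive merging loop is adequate only for toy input; for a dataset of the size reported here, an optimized library implementation of hierarchical clustering would be the more realistic choice.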