This pilot study addresses a methodological issue in n-gram-based research: determining the most informative n-gram length for examining a specific genre in English and in Czech. We also outline some n-gram-based characteristics of parliamentary debates in the two languages.
The material comes from British and Czech parliamentary proceedings (Hansard, CzechParl corpus). N-grams of lengths n = 2-10 are extracted and compared for each language separately, then categorised by grammatical structure and discourse function.
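The extraction step can be illustrated with a minimal sketch. It assumes whitespace-tokenised text and plain frequency counting; the function names, the `min_freq` threshold, and the toy input (the Czech formula quoted in this abstract, repeated twice) are illustrative assumptions, not the study's actual pipeline or corpus data.

```python
from collections import Counter


def extract_ngrams(tokens, n_min=2, n_max=10):
    """Count contiguous n-grams for every length from n_min to n_max."""
    counts = {}
    for n in range(n_min, n_max + 1):
        # If n exceeds len(tokens), the range is empty and the Counter stays empty.
        counts[n] = Counter(
            tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return counts


def frequent_ngrams(counts, n, min_freq=2):
    """Return (n-gram string, frequency) pairs of length n seen at least min_freq times."""
    return [
        (" ".join(gram), freq)
        for gram, freq in counts[n].most_common()
        if freq >= min_freq
    ]


# Toy input (hypothetical): one voting formula repeated twice.
formula = "zahajuji hlasování ptám se kdo je pro kdo je proti".split()
tokens = formula * 2

counts = extract_ngrams(tokens)
recurring_10grams = frequent_ngrams(counts, 10)
```

On this toy input, the only 10-gram that recurs is the full formula, which hints at how a fixed clause-length formula can surface at high values of n.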
The typological factor seems relevant. Unlike in the English material, long n-grams (n = 9, 10), often corresponding to a clause, occur in the Czech material.
The material displays the specificities of parliamentary discourse, allowing for a functional classification of some discourse-specific n-grams. These differ between the two languages, demonstrating that typological characteristics and genre are closely interrelated in n-gram-based analysis.
The English parliamentary discourse contains highly specific honorifics (my hon friend), while the Czech debates are characterised by discourse-specific performative formulae (zahajuji hlasování ptám se kdo je pro kdo je proti).