An n-gram-based analysis of Czech and English parliamentary debates: In search of optimal n-gram length

Publication at Faculty of Arts | 2018

Abstract

The study focuses on a crucial methodological issue in n-gram-based research: determining the most informative length of n-grams for examining a given genre (Biber 2009, Hyland 2008). As shown in contrastive studies, the optimum length of n-grams is not only genre- but also language-specific (Granger 2014, Cvrček&Václavík 2015, Hasselgård 2017).

Applying the n-gram method to the study of typologically different languages, such as English and Czech, seems to be particularly "challenging" (Čermáková&Chlumská 2017: 76). We examine the relatively well-defined genre of parliamentary debates, relying on comparable corpora of English and Czech parliamentary discourse (Hansard, CzechParl corpus).

The study is corpus-driven: it proceeds from the identification of 2-10-grams (i.e. continuous "recurring strings, with or without linguistic integrity" (Lindquist&Levin 2008: 144)) to the qualitative functional description of n-grams of various lengths. The optimal n-gram length was determined for each language separately, considering the frequency of n-grams of a particular length and the amount of genre-specific information (structural and functional) they provide.
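The identification step described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the whitespace tokenization, the toy sentence, and the function names `ngram_counts` and `frequent_ngrams` are assumptions made for illustration, and the sketch ignores sentence and speaker boundaries. It counts contiguous n-grams for n = 2 to 10 and keeps those at or above a minimum frequency per million words (the abstract reports a threshold of 20 pmw).

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count contiguous n-grams (as tuples of tokens) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def frequent_ngrams(tokens, n, min_pmw=20.0):
    """Map each n-gram to its frequency per million words (pmw),
    keeping only n-grams at or above min_pmw."""
    total = len(tokens)
    if total == 0:
        return {}
    scale = 1_000_000 / total  # converts a raw count to pmw
    return {
        gram: count * scale
        for gram, count in ngram_counts(tokens, n).items()
        if count * scale >= min_pmw
    }


# Toy input (hypothetical); a real corpus would be tokenized properly.
tokens = (
    "i thank my hon friend the member for leeds "
    "and my hon friend the member for york"
).split()

# One frequency table per n-gram length, 2 through 10.
by_length = {n: frequent_ngrams(tokens, n) for n in range(2, 11)}
```

On a corpus this small every n-gram clears the threshold, since a single occurrence in 17 tokens already corresponds to tens of thousands of occurrences per million words; the filter only becomes meaningful on corpora of realistic size.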

The optimum length of n-grams appears to be different in Czech and in English. While the differences can be accounted for primarily by the typological differences between the languages (Czech is predominantly a synthetic language with rich inflection), some specific features of English and Czech parliamentary debates can also be pointed out.

The number of n-grams (min. frequency 20 per million words, pmw) is similar in both languages, except for bigrams, which occur more frequently in English.

The structure of the English bigrams reveals the analytic character of the language: they typically comprise a combination of function words (of/in/to/on/for the, that/and the, it is, I am). Lexical words are few, mostly honorifics (hon friend, member for).

In Czech, the representation of lexical words is much higher, comprising more varied forms of address (vážená paní, pane ministře), discourse-organizing and stance bigrams (děkuji za 'I-thank for', myslím že 'I-think that') and aboutness words (e.g. numerals). Among the function words, demonstratives and other deictic words are particularly frequent, indicating ties to the immediate co(n)text.

The structure of trigrams displays higher variability in Czech than in English, where the structures a * of, of the *, the * of together form about 7% of 3-gram tokens. When exploring the content of parliamentary debates, longer n-grams seem more informative (at least 3-grams for Czech and 4-grams for English).
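A figure such as the share of trigram tokens covered by a few slot patterns could be computed roughly as follows. This is a hedged sketch rather than the study's actual procedure: the function name `pattern_share`, the use of `*` as a one-word wildcard, and the toy counts are all assumptions made for illustration.

```python
from collections import Counter


def pattern_share(trigram_counts, patterns):
    """Return the proportion of trigram tokens matching any pattern.

    Each pattern is a 3-tuple in which "*" matches any single word.
    A trigram matching several patterns is counted only once.
    """
    total = sum(trigram_counts.values())
    if total == 0:
        return 0.0
    matched = 0
    for gram, count in trigram_counts.items():
        if any(
            all(slot == "*" or slot == word for slot, word in zip(pattern, gram))
            for pattern in patterns
        ):
            matched += count
    return matched / total


# The three English structures named in the abstract.
patterns = [("a", "*", "of"), ("of", "the", "*"), ("the", "*", "of")]

# Toy trigram counts (hypothetical, not taken from the corpora).
toy = Counter({
    ("a", "number", "of"): 3,
    ("of", "the", "house"): 2,
    ("the", "end", "of"): 1,
    ("i", "thank", "the"): 4,
})

share = pattern_share(toy, patterns)
```

Counting tokens rather than types matches the wording "3-gram tokens" in the abstract; a type-based share would instead divide the number of distinct matching trigrams by the number of distinct trigrams.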

While n-gram-based research often focuses on n-grams of up to 5 words, the study shows that the register of parliamentary debates relies on longer n-grams to a large extent (6-8 words in Czech, 5-6 in English). The English parliamentary discourse contains highly specific honorifics (My hon. Friend the Member for) and other politeness markers, while the Czech debates are characterised by a high frequency of discourse-specific performative formulae (zahajuji hlasování ptám se kdo je pro kdo je proti - 'I-open the-vote who is for who is against').