The study sets out to address two issues raised by previous studies dealing with phraseology and children's literature. The first question is methodological, focussing on "the potential contribution" of an n-gram-based approach to language comparison (Granger, 2014).
N-grams have proved a useful starting point when comparing languages which are linguistically close, and a rather "challenging" one when dealing with typologically different languages (Čermáková & Chlumská, 2017; Hasselgård, 2017; Ebeling & Ebeling, 2013), such as predominantly analytical English and inflectional Czech, which are compared in this study. We test the advantages and limitations of the n-gram method, considering the possibilities of combining it with other quantitative methods (POS-grams, frequency word/lemma lists) as a first step in a comparative analysis.
The frequency and types of n-grams are highly sensitive to language as well as to register (Biber & Conrad, 2009: 6). We examine imaginative fiction written for a child and teenage audience as a register delimited primarily by its intended audience but also by its linguistic features, which serve specific communicative functions (Hunt, 2005; Thompson & Sealey, 2007).
The study aims to explore to what extent n-grams can help characterise English and Czech children's fiction and point out differences between them. It relies on comparable English and Czech corpora of children's fiction: two small corpora of approximately 650,000 words each, and two large corpora of approximately 2,700,000 words each, namely the children's literature sub-corpora of the Czech National Corpus (SYN) and the British National Corpus.
For technical reasons, we restrict the queries to 250,000 hits in the large corpora. Since this stage of the study is a preliminary probe, we consider the limitation acceptable for the time being: the large corpora offer a unique opportunity to use an otherwise inaccessible dataset containing a wide range of children's fiction.
The two small corpora allowed for a detailed examination, whereas the large ones served to test and verify our findings based on the small corpora, supplementing them with lemma and POS queries. We extracted 2- to 5-grams (i.e. continuous sequences of 2-5 words excluding punctuation) from the smaller English and Czech corpora (with the minimum range set at 2 texts, and the frequency cut-off point at 50, 10, 5 and 3 tokens, respectively).
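The extraction step described above can be sketched as follows. This is a minimal illustration, not the tool used in the study: the tokenisation scheme (lowercasing, letters only, apostrophes kept inside words) and the decision to let n-grams run across sentence boundaries once punctuation is dropped are assumptions, and the function names are hypothetical. Only the range and frequency thresholds are taken from the text.

```python
import re
from collections import Counter

# Frequency cut-offs per n-gram length, as stated in the text (2- to 5-grams).
CUTOFFS = {2: 50, 3: 10, 4: 5, 5: 3}
MIN_RANGE = 2  # an n-gram must occur in at least this many texts


def tokenize(text):
    """Lowercase the text and keep word tokens only (assumed scheme).

    Punctuation and digits are dropped; apostrophes inside words
    (e.g. "we've") are kept. Letters with diacritics are matched as well.
    """
    return re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())


def ngrams(tokens, n):
    """Yield all contiguous n-token sequences as tuples."""
    return zip(*(tokens[i:] for i in range(n)))


def recurrent_ngrams(texts, n, min_freq, min_range=MIN_RANGE):
    """Return {n-gram: frequency} for n-grams passing both thresholds."""
    freq = Counter()        # total occurrences across all texts
    text_range = Counter()  # number of texts each n-gram occurs in
    for text in texts:
        grams = [" ".join(g) for g in ngrams(tokenize(text), n)]
        freq.update(grams)
        text_range.update(set(grams))
    return {
        gram: count
        for gram, count in freq.items()
        if count >= min_freq and text_range[gram] >= min_range
    }


# Toy usage with two invented "texts" and a lowered cut-off of 3.
sample_texts = [
    "For the first time he saw it. For the first time.",
    "It was for the first time.",
]
sample_result = recurrent_ngrams(sample_texts, n=4, min_freq=3)
```

On real data the call would be `recurrent_ngrams(texts, n, CUTOFFS[n])` for each n from 2 to 5; the number of keys gives the type count and the sum of the values the token count above the threshold.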
The numbers of n-grams (types and tokens) above the threshold are consistently higher in English; the difference is statistically significant at p < .001 (Table 1). The ratios suggest a much larger extent of recurrent patterning in analytical English than in Czech, which is characterised by high morphological variability and free word order (cf. the Czech 4-grams: se nedá nic dělat, nedá se nic dělat, nedalo se nic dělat).
The slightly higher type/token ratios in Czech again point to the higher variability of Czech as compared to English. Another difference between the two corpora lies in the representation of verbs and nouns within the most frequent n-grams.
Based on the small corpora, the percentage of n-grams comprising a verb appeared higher in Czech than in English. This was verified using the larger tagged corpora: the most frequent 3-5-grams comprise verbs in Czech (e.g. pronoun-verb-preposition-noun, se vydal na cestu), while the most frequent English ones include prepositions and nouns (e.g. preposition-determiner-adjective-noun, for a long time).
This is again in accord with typological expectations: Czech generally prefers (finite) verbal expression, whereas English is more 'nominal'. The POS observations highlighted not only the importance of verbs for Czech but also their high morphological variability as a potential hindrance to the use of the n-gram approach.
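The POS-based comparison can be illustrated with a small sketch. It assumes the corpus is already tagged and available as (word, tag) pairs; the tag labels (PRON, VERB, PREP, NOUN, DET, ADJ) are simplified placeholders, not the tagsets of the Czech National Corpus or the British National Corpus, and the function names are hypothetical.

```python
from collections import Counter


def pos_grams(tagged_tokens, n):
    """Count contiguous sequences of n part-of-speech tags."""
    tags = [tag for _, tag in tagged_tokens]
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))


def share_with_tag(pos_gram_counts, tag="VERB"):
    """Proportion of POS-gram tokens that contain the given tag."""
    total = sum(pos_gram_counts.values())
    if total == 0:
        return 0.0
    hits = sum(c for gram, c in pos_gram_counts.items() if tag in gram)
    return hits / total


# The two example 4-grams from the text, tagged by hand with placeholder labels.
czech_example = [("se", "PRON"), ("vydal", "VERB"), ("na", "PREP"), ("cestu", "NOUN")]
english_example = [("for", "PREP"), ("a", "DET"), ("long", "ADJ"), ("time", "NOUN")]

czech_pos = pos_grams(czech_example, 4)
english_pos = pos_grams(english_example, 4)
```

Applied to the full tagged corpora, `share_with_tag` would give a rough measure of how 'verbal' the recurrent sequences of each language are.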
Frequent 3-5-grams identified in the small corpora were classified semantically. Both languages contain n-grams which fall into the categories of time (for the first time, od rána do večera), space (the edge of the, na všechny strany) and modality (we've got to, zdálo se mu).
These categories seem to be essential for the purposes of narrative fiction (Thompson & Sealey, 2007: 21). In addition, the English n-grams contained members of other semantic categories, such as verbs of communication or thinking (I'll tell you, I don't think).
The absence of these verbs from the Czech n-grams was surprising as these verbs are to be expected in fiction. Therefore, we looked into frequency lists of verb lemmas occurring in the Czech corpus.
They indeed contained verbs of communication or thinking (říci - 'say', vědět - 'know'). However, these verbs were not present in the n-grams due to the morphological diversity of the Czech verb forms (e.g. říci: a řekl jim, a řekl mu, a řekla mu).
This confirms that to examine Czech, a combination of methods is required, including partial lemmatization and perhaps identification of patterns on the basis of n-grams (Ebeling & Ebeling, 2013; Gries, 2008). For Czech, frequent 3-5-grams also include idioms in the traditional (taxonomic) sense (než bys řekl švec) as well as phraseological units (to je dost že jdete), which were not found in the English material (cf. Altenberg, 1998: 105-6).
To conclude, the n-gram method proved to be a useful corpus-driven starting point in a contrastive analysis of large quantities of text.
While highlighting typological characteristics of the languages compared, it also pointed to semantic similarities and contrasts within the given genre. Complemented by semantic analysis, n-grams effectively reveal the basic categories present in a narrative text.
The n-gram method has more limitations in Czech due to the inflectional character of the language. Therefore, a combination of methods seems beneficial for the description of Czech, including frequency lists, partial lemmatization and n-gram-based patterns.
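The effect of partial lemmatization can be shown on the říci example given earlier (a řekl jim, a řekl mu, a řekla mu). The sketch below is illustrative only: the lookup table is a hypothetical stand-in for a real lemmatizer or tagged corpus, and replacing only verb forms with their lemma is one possible reading of "partial lemmatization".

```python
from collections import Counter

# Hypothetical toy lookup (verb form -> lemma); a real study would take
# lemmas from the corpus annotation instead.
VERB_LEMMAS = {"řekl": "říci", "řekla": "říci", "řekli": "říci"}


def partially_lemmatize(tokens, verb_lemmas=VERB_LEMMAS):
    """Replace listed verb forms with their lemma; keep other tokens as-is."""
    return [verb_lemmas.get(token, token) for token in tokens]


def trigram_counts(sequences):
    """Count 3-grams over an iterable of token lists."""
    counts = Counter()
    for tokens in sequences:
        counts.update(zip(tokens, tokens[1:], tokens[2:]))
    return counts


surface = [s.split() for s in ["a řekl jim", "a řekl mu", "a řekla mu"]]

# Surface forms: three different 3-grams, each occurring once.
surface_counts = trigram_counts(surface)

# Verb forms reduced to the lemma: the masculine and feminine variants pool.
lemma_counts = trigram_counts(partially_lemmatize(s) for s in surface)
```

With surface forms every variant falls below any frequency cut-off on its own; after the verb is lemmatized, the variants differing only in verb morphology are counted together and the pattern has a better chance of surfacing.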