Researchers compiling learner corpora frequently face the problem of establishing the learners' proficiency, which has in this context been described as a "fuzzy variable" (Carlsen, 2012). Researchers either rely on institutional definitions of proficiency (Ortega & Byrnes, 2008), which may be unreliable, or on more sophisticated metrics, such as post-hoc evaluations by professional exam raters (Huang et al., 2018), which are costly in both time and money.
Informed by studies exploring the predictive power of reading skills (Cilibrasi et al., 2019), we included a reading-aloud task when compiling a spoken learner corpus, with a view to exploring how closely reading skills are related to proficiency. A carefully selected text containing a variety of linguistic features was read aloud by 68 Taiwanese speakers who participated in the construction of a large spoken corpus of L2 English across proficiency levels.
The speakers performed a variety of oral tasks which served as the basis for post-hoc proficiency rating by professional IELTS raters; the resulting ratings ranged from A1 to B2.
The performances on the reading passage were analysed for reading rate (words per minute), disfluencies (false starts, self-corrections, repeats, filled pauses), and mispronounced or omitted words. Differences among the proficiency groups were tested using ANOVA and regression analysis.
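As a minimal sketch of the kind of analysis described above (not the authors' actual pipeline), reading rate can be computed per speaker and compared across proficiency groups with a one-way ANOVA; all speaker data below are hypothetical placeholders, and the group labels merely mirror the CEFR levels mentioned in the study.

```python
# Illustrative sketch: reading rate (words per minute) compared across
# proficiency groups with a one-way ANOVA. Data are invented for illustration.
from scipy.stats import f_oneway

def words_per_minute(word_count, seconds):
    """Reading rate: words read divided by elapsed reading time in minutes."""
    return word_count * 60.0 / seconds

# Hypothetical per-speaker readings of a 300-word passage: (words, seconds).
a1 = [words_per_minute(w, s) for w, s in [(300, 240), (300, 260), (300, 250)]]
b1 = [words_per_minute(w, s) for w, s in [(300, 180), (300, 200), (300, 190)]]
b2 = [words_per_minute(w, s) for w, s in [(300, 140), (300, 150), (300, 160)]]

# One-way ANOVA: does mean reading rate differ across proficiency groups?
f_stat, p_value = f_oneway(a1, b1, b2)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```

A regression analysis, as mentioned above, could additionally treat reading rate as a predictor of the IELTS-derived proficiency score, but that step is omitted here.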
The analyses showed that reading rate is a strong, though not infallible, predictor of proficiency, since every group contained some fast and some slow readers. The pronunciation-error analysis also returned significant results, revealing that certain linguistic features are read with difficulty by lower-proficiency speakers.
The study demonstrates that including a carefully selected reading passage among the tasks used in compiling spoken learner corpora may be an efficient way of collecting valuable data on learner speech performance.