Research project & corpus


In the last 15 years or so, corpus-based research into learner language has contributed to a much clearer picture of advanced interlanguage. These studies have yielded substantial empirical evidence that, for example, texts produced by advanced learners and native speakers differ in terms of frequencies of certain words, phrases and syntactic structures. This research also shows that learners use features that are more typical of speech than of academic prose, which suggests that they are largely unaware of register differences.

Moreover, there is evidence that learners of various language backgrounds have similar problems and face similar challenges on their way to near-native proficiency. For example, advanced learners still struggle with the acquisition of linguistic phenomena that are optional and/or highly L2-specific, often located at the interfaces of linguistic subfields (e.g. syntax-semantics, syntax-pragmatics). Also, many of the difficulties observed in learners' academic prose appear to stem from a lack of understanding of the rules surrounding academic writing, or from a lack of practice, rather than from interference from first-language academic conventions. Due to these similarities, the project refers to the interlanguage of these learners as advanced learner varieties (ALVs).


The project has four main aims:

  1. create an electronic corpus for a detailed, empirical, quantitative and qualitative description of ALVs, the Corpus of Academic Learner English (CALE);
  2. produce detailed case studies examining individual determinants of lexico-grammatical variation (or the interplay of several), e.g. weight/complexity, information status and animacy, as well as genre and writing proficiency;
  3. develop a corpus-driven, text-centred method based on linguistic criteria for the assessment of writing proficiency in the academic register;
  4. apply the findings to teaching academic writing at advanced levels (e.g. English for Academic Purposes).

The corpus

The Corpus of Academic Learner English (CALE) is a specialised corpus of academic learner writing for a detailed, empirical, quantitative and qualitative description of advanced learner writing in the academic register. The corpus includes a variety of academic texts written in university content classes in (applied) linguistics by students of English as a Foreign Language. Currently, the CALE primarily consists of texts produced by German EFL learners, but texts produced by learners of other first-language backgrounds, e.g. Turkish, Lithuanian and Russian, are available and will be added to enable cross-linguistic and typological comparison. We are currently negotiating cooperation agreements with several partners. If you are interested in joining the project, please send an e-mail to the principal investigator. Download our information leaflet.

The CALE comprises seven academic text types that are typically produced in university content courses, e.g. research papers, reading reports, abstracts and reviews. Download the text classification.

The native-speaker control corpora for CALE are the Michigan Corpus of Upper-Level Student Papers (MICUSP) and the British Academic Written English corpus (BAWE).

The research project “Lexico-grammatical variation in advanced learner varieties”, in the context of which we started compiling the corpus, was funded from January 2014 to January 2017 by the Central Research Development Fund of the University of Bremen, Germany, one of eleven “Universities of Excellence” in Germany that compete internationally in top-level research.