BEYOND THE IMITATION GAME: EVALUATING AI-GENERATED OLD ENGLISH FOR TEACHING MATERIALS
Abstract
This study is part of the broader Disinventing Old English project, an ongoing initiative aimed at developing a communicative, level-graded method for the teaching of Old English. The project moves beyond traditional philological approaches, instead adapting frameworks commonly used in contemporary second language acquisition and foregrounding interaction, progressive input sequencing, and functional communicative competence. Initial teaching units have already been produced and tested, with linguistic content selected and validated through corpus-driven analysis. However, scaling this approach requires access to a wider range of texts and exercise materials calibrated to learner proficiency levels—resources that are currently scarce for Old English.

To support this expansion, we are now integrating AI-assisted language generation as a controlled means of producing novel and pedagogically appropriate Old English input. The generation and assessment procedures discussed in this presentation therefore directly inform the ongoing development of the learning modules, enabling the creation of new texts, dialogues, and task-based activities aligned with communicative teaching goals.

This presentation advances an evaluation framework for Large Language Models (LLMs) generating Old English (OE), combining a Turing Test-inspired discrimination task with parallel-corpus alignment and expert linguistic assessment. Framing our inquiry with Turing’s (1950) notion of indistinguishability, we test whether generalist LLMs can produce OE text that is grammatically accurate, semantically coherent, and stylistically faithful without relying on back-translation.
Our workflow proceeds in four stages: (1) segmentation of Tolkien’s Sellic Spell into syntactic units to create a reference scaffold; (2) prompt-engineered generation of original OE fragments aligned one-to-one with the reference; (3) segment-level alignment in a parallel corpus to enable controlled comparison; and (4) qualitative expert evaluation, with scores for inflection, word order/syntax, lexical choice, and semantic coherence. Results indicate high average acceptability in morphology and basic syntax, with recurrent weaknesses in idiomatic collocations and in the maintenance of discourse coherence (especially in dialogue), where semantic drift is most pronounced. Ultimately, this framework offers a plausible workflow for expanding the curriculum, ensuring that the growing body of materials is both linguistically reliable and pedagogically suitable within contemporary language-teaching environments.
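The four-stage workflow above can be sketched in code. The sketch below is purely illustrative: the segment data structure, the one-to-one alignment assumption, the rubric dimension names, and the 1–5 rating scale are our assumptions for exposition, not the project's actual tooling.

```python
from dataclasses import dataclass, field
from statistics import mean

# Rubric dimensions from the qualitative evaluation (stage 4);
# the labels here are illustrative shorthand for the four criteria.
DIMENSIONS = ("inflection", "syntax", "lexis", "coherence")

@dataclass
class Segment:
    ref: str                      # reference unit from the scaffold (stage 1)
    gen: str                      # LLM-generated OE fragment (stage 2)
    scores: dict = field(default_factory=dict)  # expert ratings, e.g. 1-5

def align(reference_units, generated_units):
    """Stage 3: one-to-one alignment into a parallel corpus.

    Assumes generation was prompted segment-by-segment, so both
    lists have equal length and matching order.
    """
    if len(reference_units) != len(generated_units):
        raise ValueError("reference and generated segment counts differ")
    return [Segment(r, g) for r, g in zip(reference_units, generated_units)]

def aggregate(segments):
    """Stage 4: mean score per rubric dimension across all segments."""
    return {d: mean(s.scores[d] for s in segments) for d in DIMENSIONS}
```

A usage pass would populate each segment's ``scores`` from expert ratings and then call ``aggregate`` to obtain the per-dimension averages reported in the results.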