Highly accurate children’s speech recognition for interactive reading tutors using subword units
Andreas Hagen, Bryan Pellom *, Ronald Cole
Center for Spoken Language Research, University of Colorado at Boulder, 1777 Exposition Drive, Suite #171, Boulder, CO 80301, USA
Received 15 December 2005; received in revised form 20 February 2007; accepted 9 May 2007
Abstract
Speech technology offers great promise in the field of automated literacy and reading tutors for children. In such applications, speech recognition can be used to track the reading position of the child, detect oral reading miscues, assess comprehension of the text being read by estimating whether the prosodic structure of the speech is appropriate to the discourse structure of the story, or engage the child in interactive dialogs to assess and train comprehension. Despite this promise, speech recognition systems exhibit higher error rates for children due to variability in vocal tract length, formant frequency, pronunciation, and grammar. When children read out loud, these problems are compounded by speech production behaviors that arise from difficulties in recognizing printed words, such as pauses, repeated syllables, and other phenomena. To overcome these challenges, we present advances in speech recognition that improve accuracy and modeling capability in the context of an interactive literacy tutor for children. Specifically, this paper focuses on a novel set of speech recognition techniques that can be applied to improve oral reading recognition. First, we demonstrate that speech recognition error rates for interactive read-aloud can be reduced by more than 50% through a combination of advances in both statistical language modeling and acoustic modeling. Next, we propose extending our baseline system by introducing a novel token-passing search architecture targeting subword unit based speech recognition. The proposed subword unit based speech recognition framework is shown to provide accuracy equivalent to that of a whole-word based speech recognizer while enabling detection of oral reading events and finer-grained speech analysis during recognition.
The efficacy of the approach is demonstrated using data collected from children in grades 3–5: 34.6% of partial words with reasonable evidence in the speech signal are detected at a low false alarm rate of 0.5%.
© 2007 Elsevier B.V. All rights reserved.
Keywords: Literacy tutors; Subword unit based speech recognition; Language modeling; Reading tracking
* Corresponding author. Tel.: +1 303 735 5382; fax: +1 303 735 5072. E-mail addresses: email@example.com (A. Hagen), pellom@cslr.colorado.edu (B. Pellom), firstname.lastname@example.org (R. Cole). 0167-6393/$ - see front matter. doi:10.1016/j.specom.2007.05.004
1. Introduction
In recent years, automated reading tutors that utilize speech recognition technology to track and assess a child’s reading ability have become more feasible due to increased computer power and advances in accurate and efficient methods for speech recognition (Mostow et al., 1994; Cole et al., 2003). Previous studies have considered acoustic analysis of children’s speech (Lee et al., 1997; Lee et al., 1999; Li and Russell, 2002). This work has shed light onto the challenges faced by systems that will be developed to automatically recognize and effectively model children’s speech patterns. For example, it has been shown that children below the age of 10 exhibit a wider range of vowel durations relative to older children and adults, larger spectral and suprasegmental variations, and wider variability in formant locations and fundamental frequencies in the speech signal. In recent years, several studies have attempted to address these issues by adapting the acoustic features of children’s speech to match that of acoustic models trained from...