Information Sciences 173 (2005) 115–139 www.elsevier.com/locate/ins
Investigating spoken Arabic digits in speech recognition setting
Yousef Ajami Alotaibi
Computer Engineering Department, College of Computer and Information Sciences, King Saud University, P.O. Box 57168, Riyadh 11574, Saudi Arabia
Received 3 October 2003; received in revised form 18 May 2004; accepted 14 July 2004
E-mail address: email@example.com
0020-0255/$ - see front matter © 2004 Elsevier Inc. All rights reserved. doi:10.1016/j.ins.2004.07.008
Abstract
Arabic is a Semitic language that differs in many ways from European languages such as English. One of these differences is the pronunciation of the 10 digits, zero through nine. Except for zero, all Arabic digits are polysyllabic words. In this paper, Arabic digits are investigated from the point of view of the speech recognition problem. An artificial neural network based speech recognition system was designed and tested on automatic Arabic digit recognition. The system is an isolated whole-word speech recognizer, and it was implemented in both multi-speaker and speaker-independent modes. During recognition, noise was removed from the digitized speech by band-pass filters; the signal was then pre-emphasized, blocked into frames, and windowed with a Hamming window. A time alignment algorithm was used to compensate for differences in utterance length and for misalignments between phonemes. Frame features were extracted as mel-frequency cepstral coefficients (MFCCs) to reduce the amount of information in the input signal. Finally, the neural network classified the unknown digit. The system achieved 99.5% correct digit recognition in multi-speaker mode and 94.5% in speaker-independent mode. This paper also investigates Arabic digits as "patterns on paper", using spectrogram and waveform information to cross-check the results of the digit recognition system and to try to locate the causes of misrecognized digits. All Arabic digits are described by showing their constituent phonemes and syllables. All possible pairs of digits are also compared, with comments linked to the output of the digit recognition system. An understanding of the causes of automatic digit recognition errors may help in building digit recognition systems that are simple, cheap, and fast.
Keywords: Neural network; Arabic digits; Speech; Recognition; Spectrogram
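The pre-emphasis and Hamming-window framing steps named in the abstract can be sketched as follows. This is a minimal illustration, not the author's implementation: the pre-emphasis coefficient `alpha=0.97`, the frame length, and the hop size are assumed values, since the text available here does not state them, and the function names are hypothetical.

```python
import math

def pre_emphasize(signal, alpha=0.97):
    """First-order high-pass filter: y[n] = x[n] - alpha * x[n-1].
    alpha=0.97 is an assumed, commonly used value (not from the paper)."""
    if not signal:
        return []
    out = [signal[0]]
    for n in range(1, len(signal)):
        out.append(signal[n] - alpha * signal[n - 1])
    return out

def hamming(length):
    """Hamming window: w[n] = 0.54 - 0.46 * cos(2*pi*n / (length - 1))."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]

def frame_and_window(signal, frame_len, hop):
    """Block the signal into overlapping frames and apply a Hamming window.
    Trailing samples that do not fill a whole frame are dropped."""
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append([signal[start + i] * window[i] for i in range(frame_len)])
    return frames
```

Each windowed frame would then be passed to the feature extraction stage; MFCC computation itself is omitted from this sketch.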
1. Introduction
1.1. Arabic language
Arabic is a Semitic language, and it is one of the oldest languages in use today. It is the fifth most widely used language. Arabic is the first language of the Arab world, and the Arabic alphabet is used in several other languages, such as Persian and Urdu. Standard Arabic has 34 basic phonemes, of which six are vowels and 28 are consonants. A phoneme is the smallest unit of speech that signals a difference in meaning between words or sentences. Arabic has fewer vowels than English: it has three long and three short vowels, while American English has at least 12 vowels. Arabic phonemes include two distinctive classes, the pharyngeal and the emphatic phonemes. These two classes are found only in Semitic languages such as Hebrew [2,4].
The allowed syllables in Arabic are CV, CVC, and CVCC, where V indicates a (long or short) vowel and C indicates a consonant. Arabic utterances can only start with a consonant. All Arabic syllables must contain at least one vowel. Arabic vowels cannot occur word-initially; they occur either between two consonants or at the end of a word. Arabic syllables can be classified as short or long: the CV type is short, while all others are long. Syllables can also be classified as open or closed: an open syllable ends with a vowel, while a closed syllable ends with a consonant. For Arabic, a vowel always forms a...
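The syllable rules stated in this subsection (allowed patterns CV, CVC, and CVCC; CV is short and the rest are long; open syllables end in a vowel, closed ones in a consonant) can be illustrated with a small sketch. The function name and the returned dictionary layout are hypothetical choices made for this example only.

```python
# Syllable patterns the text lists as allowed in Arabic.
ALLOWED_PATTERNS = {"CV", "CVC", "CVCC"}

def classify_syllable(pattern):
    """Classify a C/V syllable pattern by length and openness.

    Follows the rules stated in the text: CV is short, every other
    allowed pattern is long; a syllable ending in V is open, one
    ending in C is closed.
    """
    if pattern not in ALLOWED_PATTERNS:
        raise ValueError(f"not an allowed Arabic syllable pattern: {pattern!r}")
    return {
        "length": "short" if pattern == "CV" else "long",
        "openness": "open" if pattern.endswith("V") else "closed",
    }

for p in sorted(ALLOWED_PATTERNS):
    print(p, classify_syllable(p))
```

Under these rules the only open syllable type is CV, so every long syllable is also closed.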
References:
[1] M. Al-Zabibi, An Acoustic–Phonetic Approach in Automatic Arabic Speech Recognition, The British Library in Association with UMI, 1990.
[2] A. Muhammad, Alaswaat Alaghawaiyah, Daar Alfalah, Jordan, 1990 (in Arabic).
[3] J. Deller, J. Proakis, J.H. Hansen, Discrete-Time Processing of Speech Signals, Macmillan, NY, 1993.
[4] M. Elshafei, Toward an Arabic text-to-speech system, The Arabian Journal for Science and Engineering 16 (4B) (1991) 565–583.
[5] Y.A. El-Imam, An unrestricted vocabulary Arabic speech synthesis system, IEEE Transactions on Acoustics, Speech, and Signal Processing 37 (12) (1989) 1829–1845.
[6] E. Hagos, Implementation of an Isolated Word Recognition System, Master thesis, University of Petroleum and Minerals, Dhahran, Saudi Arabia, 1985.
[7] W. Abdulah, M. Abdul-Karim, Real-time spoken Arabic recognizer, International Journal of Electronics 59 (5) (1984) 645–648.
[8] A. Al-Otaibi, Speech Processing, The British Library in Association with UMI, 1988.
[9] G. Pullum, W. Ladusaw, Phonetic Symbol Guide, The University of Chicago Press, 1996.
[10] R. Lippmann, Review of Neural Networks for Speech Recognition, Neural Computation, MIT Press, 1989, pp. 1–38.
[11] S. Haykin, Neural Networks: A Comprehensive Foundation, second ed., Prentice-Hall, Englewood Cliffs, NJ, 1999.
[12] T.H. Nong, J. Yunus, S.H. Salleh, Classification of Malay speech sounds based on place of articulation and voicing using neural networks, in: Proceedings of IEEE Electrical and Electronic Technology, TENCON, 2001, pp. 170–173.
[13] S.A. Selouani, D. O'Shaughnessy, Hybrid architectures for complex phonetic features classification: a unified approach, in: International Symposium on Signal Processing and its Applications (ISSPA), Kuala Lumpur, Malaysia, August 2001, pp. 719–722.
[14] M. Salam, D. Mohamad, S. Salleh, Neural network speaker dependent isolated Malay speech recognition system: handcrafted vs. genetic algorithm, in: International Symposium on Signal Processing and its Applications (ISSPA), Kuala Lumpur, Malaysia, August 2001, pp. 731–734.
[15] L. Rabiner, M. Sambur, An algorithm for determining the endpoints of isolated utterances, The Bell System Technical Journal 54 (2) (1975) 297–315.
[16] N. Nocerino, F. Soong, L. Rabiner, D. Klatt, Comparative study of several distortion measures for speech recognition, Speech Communication 4 (1985) 317–331.
[17] S. Davis, P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, IEEE Transactions on Acoustics, Speech, and Signal Processing ASSP-28 (4) (1980) 357–366.
[18] Y.A. Alotaibi, A simple and effective time-alignment algorithm for spoken Arabic digits, unpublished.
[19] P.C. Loizou, A.S. Spanias, High-performance alphabet recognition, IEEE Transactions on Speech and Audio Processing 4 (6) (1996) 430–445.