Improved Weighted Matching for
Ozan Mut, Mehmet Göktürk
The speaker identification process consists of two main phases: enrollment (training) and identification (matching). In the enrollment phase, samples from all speakers are used for training and the resulting models are stored in a database. The goal of training is to create a reference model for each speaker, which is then used to classify unknown utterances in the identification phase.
In this paper, a closed-set, text-independent speaker identification system is reviewed and a modified algorithm for the matching stage is introduced.
Abstract—Matching algorithms are of significant importance in speaker recognition. As the last step of recognition, the feature vectors of the unknown utterance are compared to the feature vectors of the modeled speakers, and a similarity score is computed for every model in the speaker database. Depending on the type of speaker recognition, these scores are used to determine the speaker of the unknown speech sample. For speaker verification, the similarity score is tested against a predefined threshold and either an acceptance or a rejection result is obtained. For speaker identification, the result depends on whether the identification is open-set or closed-set. In closed-set identification, the model that yields the best similarity score is accepted. In open-set identification, the best score is tested against a threshold, so there is one more possible output: the speaker is not one of the registered speakers in the existing database. This paper focuses on closed-set speaker identification using a modified version of a well-known matching algorithm. The new matching algorithm showed better performance on the YOHO international speaker recognition database.
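The three decision rules described above (verification, closed-set identification, open-set identification) can be illustrated with a minimal sketch. It assumes that a higher score means a closer match; if the scores were distances, the comparisons would be reversed. The function names, speaker labels, and threshold values are hypothetical and are not taken from the paper.

```python
def verify(score, threshold):
    """Speaker verification: accept the claimed identity if the
    similarity score reaches the predefined threshold."""
    return score >= threshold


def identify_closed_set(scores):
    """Closed-set identification: the model with the best score wins.
    `scores` maps a speaker label to its similarity score."""
    return max(scores, key=scores.get)


def identify_open_set(scores, threshold):
    """Open-set identification: test the best score against a threshold.
    Returns None when the speaker is judged not to be registered."""
    best = identify_closed_set(scores)
    return best if scores[best] >= threshold else None


# Hypothetical similarity scores for one unknown utterance.
scores = {"alice": 0.62, "bob": 0.91, "carol": 0.47}
closed_result = identify_closed_set(scores)
open_result = identify_open_set(scores, threshold=0.95)
```

With these example values the closed-set rule always returns a registered speaker, while the open-set rule can also return the extra "not registered" outcome.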
II. FEATURE EXTRACTION
This stage is often referred to as the speech processing front end. The primary goal of feature extraction is to simplify recognition by summarizing the vast amount of speech data and extracting the acoustic properties that define speaker individuality. MFCC (Mel Frequency Cepstral Coefficients) is one of the most widely used feature extraction techniques. Since the speech signal varies over time, it is more appropriate to analyze it in short time intervals over which the signal is approximately stationary. To compute the MFCCs, the signal is split into short frames and a windowing function is applied to each frame to reduce the effect of discontinuities at the frame edges. The windowed signal is then converted to the frequency domain by taking the FFT (Fast Fourier Transform), and a Mel-scale filter bank is applied to the resulting spectra. The average human ear has a nonlinear frequency response: previous research indicates that the perceived frequency scale is approximately linear up to 1 kHz and logarithmic above that frequency. The Mel-scale (Melody Scale) filter bank, which characterizes the frequency response of the human ear, is shown in Fig. 2.1. It is used as a bank of band-pass filters during the first phase of identification.
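The front-end steps described above (framing, windowing, transform to the frequency domain, Mel-scale filter bank) can be sketched in plain Python as follows. Several choices are assumptions rather than details from the paper: the Hamming window, the widely used mapping mel = 2595 log10(1 + f/700), the triangular filter shape, and all function names and parameter values. The final logarithm and DCT are the standard way MFCCs are completed and are not described in the text above. A naive DFT stands in for the FFT to keep the sketch dependency-free.

```python
import cmath
import math


def hz_to_mel(f_hz):
    """Assumed mel mapping: roughly linear below 1 kHz, logarithmic above."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def split_frames(signal, frame_len, hop):
    """Split the signal into overlapping short frames."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]


def hamming(frame):
    """Taper the frame edges to reduce discontinuities."""
    n = len(frame)
    return [x * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
            for i, x in enumerate(frame)]


def power_spectrum(frame):
    """Naive O(n^2) DFT; a real system would call an FFT routine."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        s = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(frame))
        spec.append(abs(s) ** 2 / n)
    return spec


def mel_filter_bank(num_filters, n_fft, sample_rate):
    """Triangular band-pass filters with edges equally spaced in mel."""
    top = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(top * i / (num_filters + 1))
                for i in range(num_filters + 2)]
    bins = [hz * n_fft / sample_rate for hz in edges_hz]  # fractional bins
    bank = []
    for m in range(1, num_filters + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        row = []
        for k in range(n_fft // 2 + 1):
            if lo < k <= mid:
                row.append((k - lo) / (mid - lo))   # rising slope
            elif mid < k < hi:
                row.append((hi - k) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mfcc(frame, bank, num_coeffs):
    """Window -> power spectrum -> mel energies -> log -> DCT-II."""
    spec = power_spectrum(hamming(frame))
    energies = [sum(w * p for w, p in zip(row, spec)) for row in bank]
    logs = [math.log(e + 1e-10) for e in energies]  # offset avoids log(0)
    m = len(logs)
    return [sum(v * math.cos(math.pi * c * (j + 0.5) / m)
                for j, v in enumerate(logs))
            for c in range(num_coeffs)]


# Hypothetical usage: a 440 Hz test tone sampled at 8 kHz.
sample_rate = 8000
signal = [math.sin(2.0 * math.pi * 440.0 * t / sample_rate) for t in range(800)]
frames = split_frames(signal, frame_len=256, hop=128)
bank = mel_filter_bank(num_filters=20, n_fft=256, sample_rate=sample_rate)
features = [mfcc(f, bank, num_coeffs=12) for f in frames]
```

Each entry of `features` is one feature vector per frame; these are the vectors that the matching stage would compare against the stored speaker models.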
Keywords— Automatic Speaker Recognition, Voice
Recognition, Pattern Recognition, Digital Audio Signal Processing.
SPEAKER recognition can be classified into two categories: Speaker Verification (SV) and Speaker Identification (SI). Speaker verification is the task of accepting or rejecting the identity claimed by a speaker. Speaker identification is the task of finding the identity of an unknown speaker among a stored database of speakers. Speaker identification can be performed in closed-set or open-set form. In closed-set form, the unknown speaker is known to be one of the speakers in the database. In open-set form, on the other hand, the speaker may not be one of the registered speakers in the database; an open-set identification system therefore has one additional possible output, rejection. Yet, there is another...