IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-34, NO. 4, AUGUST
Speech Analysis/Synthesis Based on a Sinusoidal Representation
Abstract-A sinusoidal model for the speech waveform is used to develop a new analysis/synthesis technique that is characterized by the amplitudes, frequencies, and phases of the component sine waves. These parameters are estimated from the short-time Fourier transform using a simple peak-picking algorithm. Rapid changes in the highly resolved spectral components are tracked using the concept of "birth" and "death" of the underlying sine waves. For a given frequency track a cubic function is used to unwrap and interpolate the phase such that the phase track is maximally smooth. This phase function is applied to a sine-wave generator, which is amplitude modulated and added to the other sine waves to give the final speech output. The resulting synthetic waveform preserves the general waveform shape and is essentially perceptually indistinguishable from the original speech. Furthermore, in the presence of noise the perceptual characteristics of the speech as well as the noise are maintained. In addition, it was found that the representation was sufficiently general that high-quality reproduction was obtained for a larger class of inputs including: two overlapping, superposed speech waveforms; music waveforms; speech in musical backgrounds; and certain marine biologic sounds. Finally, the analysis/synthesis system forms the basis for new approaches to the problems of speech transformations including time-scale and pitch-scale modification, and midrate speech coding [SI, .
I. INTRODUCTION
One approach to the problem of representation of speech signals is to use the speech production model in which speech is viewed as the result of passing a glottal excitation waveform through a time-varying linear filter that models the resonant characteristics of the vocal tract. In many speech applications it suffices to assume that the glottal excitation can be in one of two possible states, corresponding to voiced or unvoiced speech. In attempts to design high-quality speech coders at the midband rates, generalizations of the binary excitation model have been developed. One such approach that is currently popular is multipulse [1]. In this paper the goal is also to generalize the model for the glottal excitation; but instead of using impulses as in multipulse, the excitation waveform is assumed to be composed of sinusoidal components of arbitrary amplitudes, frequencies, and phases.
A number of other approaches to analysis/synthesis that are based on sine-wave models have been discussed in the literature. Hedelin proposed a pitch-independent sine-wave model for use in coding the baseband signal for speech compression. The amplitudes and frequencies of the underlying sine waves are estimated using Kalman filtering techniques, and each sine-wave phase is defined to be the integral of the associated instantaneous frequency. Another sine-wave-based speech compression system is being developed by Almeida and Silva. In contrast to Hedelin's approach, their system uses a pitch estimate to establish a harmonic set of sine waves. The sine-wave phases are computed at the harmonic frequencies. To compensate for any errors that might be introduced as a result of the harmonic sine-wave representation, a residual waveform is coded along with the underlying sine-wave parameters.
In this paper a sinusoidal model for the speech waveform is derived that leads to a new analysis/synthesis technique that is characterized by the amplitudes, frequencies, and phases of the component sine waves. In Section II the glottal excitation is represented in terms of
Manuscript received April 1, 1985; revised January 1986. This work was supported by the U.S. Department of the Air Force. The views expressed are those of the authors and do not reflect the official policy or position of the U.S. Government. The authors...