Experimental analysis of fixed multipulse excitation patterns in PC synthesis

MUMOLO, ENZO
1988

Abstract

Text-to-speech systems based on the concatenation of acoustic units coded with linear prediction require that the synthesis filter be excited by an artificial source signal. For voiced speech sounds, the periodic source is represented by pulses spaced at the required period. Speech reproduced with this source has a fuzzy and tense quality. A very effective method for reproducing the speech signal with high quality using linear prediction is the multipulse method. It has been shown experimentally that fixed multipulse patterns, chosen by analysing the voice of a speaker, yield a significant improvement in the synthetic signal, even when the patterns are used to reproduce other utterances by the same speaker, or even other speakers. This result is not easy to explain: perception tests on signals synthesized with sources of specific amplitude and phase spectral characteristics show that the amplitude of the source harmonics is a critical parameter for correct reproduction. It is also important that these components reproduce a natural phase distribution that varies with frequency and time. Fixed multipulse sequences, however, do not satisfy these conditions. This paper tests a hypothesis about the reasons for the improvement obtained with multipulse patterns and briefly considers the implementation characteristics of a real-time text-to-speech system using a multipulse source.
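The two excitation schemes contrasted in the abstract can be illustrated with a minimal sketch. It is not taken from the paper: the filter coefficients, the pitch period and the fixed pattern below are hypothetical values chosen only for illustration, and the function names are assumptions. The sketch builds a voiced excitation either from one pulse per period or from a fixed multipulse pattern repeated every period, then passes it through an all-pole synthesis filter.

```python
def pulse_train(n_samples, period, pattern=(1.0,)):
    """Place a copy of `pattern` at the start of every period.

    With the default one-sample pattern this is the classic single-pulse
    voiced source; a longer pattern gives a fixed multipulse source.
    """
    excitation = [0.0] * n_samples
    for start in range(0, n_samples, period):
        for offset, amplitude in enumerate(pattern):
            if start + offset < n_samples:
                excitation[start + offset] += amplitude
    return excitation


def all_pole_filter(excitation, a):
    """All-pole synthesis filter: y[n] = x[n] - sum_{k=1..p} a[k] * y[n-k]."""
    y = []
    for n, x in enumerate(excitation):
        acc = x
        for k, coeff in enumerate(a, start=1):
            if n - k >= 0:
                acc -= coeff * y[n - k]
        y.append(acc)
    return y


# Hypothetical values (assumptions, not from the paper).
A = [-1.2, 0.8]                              # stable two-pole resonator
PERIOD = 80                                  # samples per pitch period
FIXED_PATTERN = (1.0, 0.0, -0.4, 0.0, 0.25)  # illustrative multipulse pattern

single_pulse_speech = all_pole_filter(pulse_train(240, PERIOD), A)
multipulse_speech = all_pole_filter(pulse_train(240, PERIOD, FIXED_PATTERN), A)
```

In a real system the coefficients would come from linear prediction analysis of each acoustic unit, and the pattern from multipulse analysis of recorded speech; here both are placeholders so that the difference between the two sources is easy to inspect.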
Use this identifier to cite or link to this document: http://hdl.handle.net/11368/2796731