Abstract Demenko DA - Speech Synthesis

The history of speech synthesis began in the 12th century, when the first attempts were made to build a mechanical "speaking" head. Today the achievements in this area are considerable. At first sight it may even seem that the problem of speech synthesis is already completely solved, and there is some truth in this: we often have to "communicate" with robot secretaries, and household appliances "talk" to us in different languages. However, this is not so. Such "speaking machines" use a predetermined set of phrases and cannot say anything beyond what is stored in their memory. "True" speech synthesizers were developed as early as the 1980s. Both software and hardware implementations exist, but all of them have serious shortcomings and, as a rule, sound like a parody of human speech: the "iron" accent and choppiness let the listener only guess at the meaning of the text, and not always even that. Developing a speech synthesizer whose "pronunciation" cannot be distinguished from a human's therefore remains a relevant problem.

Speech synthesis today comprises many approaches. The main ones are:

Each approach has its merits and drawbacks. The simplest appears to be compilative (concatenative) synthesis, which joins prerecorded "pieces" into continuous speech. Despite its apparent simplicity, the method is hard to implement: discontinuities are audible at the concatenation points, and using large units (words and phrases instead of syllables) as the phonetic base is impossible because of system constraints. In addition, conveying intonation with this approach is very awkward. Parametric synthesis makes it possible to model the emotional coloring of the text by varying its characteristics, but it is difficult to get rid of the metallic timbre of the speech. Parametric synthesis is more flexible because it is parameterized on small phonetic units (allophones, diphones, syllables …). However, the results achieved with this method are as yet far from perfect in every respect. An analysis of the main methods leads to the following conclusions about the existing problems in speech synthesis [4]:

  - Artificiality of the speech;
  - Lack of emotional coloring;
  - Low noise robustness of the synthesized speech.
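
The audible discontinuities at the joins in compilative synthesis, mentioned above, can be illustrated with a small sketch. It is not the method of this work: the sine bursts stand in for recorded units, and the names `tone`, `concat_hard`, `concat_crossfade` and `max_jump`, the 16 kHz sampling rate and the linear crossfade are all assumptions made for illustration.

```python
import math

SAMPLE_RATE = 16000  # assumed sampling rate in Hz


def tone(freq_hz, dur_s, phase=0.0):
    """Stand-in for a recorded speech unit: a sine burst (hypothetical data)."""
    n = int(SAMPLE_RATE * dur_s)
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE + phase)
            for i in range(n)]


def concat_hard(units):
    """Naive concatenation: units are appended without any smoothing."""
    out = []
    for unit in units:
        out.extend(unit)
    return out


def concat_crossfade(units, overlap):
    """Join units with a linear crossfade over `overlap` samples.

    The tail of the signal built so far is faded out while the head of the
    next unit is faded in, which softens the jump at the join.
    """
    out = list(units[0])
    for unit in units[1:]:
        if overlap > len(out) or overlap > len(unit):
            raise ValueError("overlap is longer than a unit")
        start = len(out) - overlap
        for i in range(overlap):
            w = (i + 1) / (overlap + 1)  # fade-in weight of the next unit
            out[start + i] = (1 - w) * out[start + i] + w * unit[i]
        out.extend(unit[overlap:])
    return out


def max_jump(signal):
    """Largest sample-to-sample step, a crude indicator of a click."""
    return max(abs(b - a) for a, b in zip(signal, signal[1:]))


units = [tone(200, 0.05), tone(300, 0.05, phase=math.pi / 2)]
hard = concat_hard(units)
soft = concat_crossfade(units, overlap=160)  # 10 ms overlap at 16 kHz
print(max_jump(hard), max_jump(soft))
```

The crossfade only hides the amplitude jump; it does nothing for mismatched pitch or spectrum at the join, which is one reason the method remains hard to implement well.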

The problem of artificiality is that, despite the seemingly good pronunciation of speech synthesizers, such speech is hard for a person to perceive. Speech synthesis technology relies on a prerecorded phonetic base, words are assembled by statistical calculation on the principle of maximum likelihood of phonetic compatibility, and the gaps and defects are filled in by the human brain. In other words, a reasonably good synthesizer with a well-chosen phonetic base is easy to listen to for 15 to 20 minutes, but after that the overwhelming majority of people stop grasping the meaning of what is said. This happens because listening to synthesized speech engages additional processing centers of the brain, and the brain simply gets tired. The brain thus does not perceive synthesized speech as natural speech, which is processed directly in the speech center. Many people have experienced a similar effect themselves when studying a foreign language.

The next problem is the lack of emotional coloring, that is, of the reader's personal perception of the text being spoken. When a person reads a text aloud, they inevitably pass its meaning through themselves, and their attitude to what is read can be felt in the intonation and nuances. Modern programs cannot do this, although the most advanced of them try to simulate intonation by modulating timbre, phoneme duration and pauses. This is still only an imitation, so the brain quickly tires of correcting the flaws of reproduction, and the listener loses the thread of the narration. Solving this problem evidently requires artificial intelligence methods for "extracting the meaning" from the text being spoken. Such synthesizers should therefore be built with the results of interdisciplinary research taken into account.

The third problem is the low noise robustness of synthesized speech. Experiments have shown, and continue to show, that even slight noise is enough for the listener to stop grasping the meaning of the text spoken by a synthesizer. The explanation again lies in neurophysiology. Since the brain uses additional centers to process synthesized speech, in the presence of extraneous noise or conversation, or when the listener has to perform some task, the brain simply cannot cope ("is overloaded"), and the person stops understanding what is said. This interference effect substantially limits the use of a synthesizer in real conditions of man-made and natural noise [3].
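
Listening experiments of the kind referred to above need noise added to synthesized speech at a controlled level. The following is a minimal sketch of one common way to do that, assuming the level is expressed as a signal-to-noise ratio in decibels; the function names `power`, `mix_at_snr` and `measure_snr_db`, the test tone and the Gaussian noise are illustrative assumptions, not part of the described work.

```python
import math
import random


def power(signal):
    """Mean power of a sampled signal."""
    return sum(s * s for s in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that 10*log10(P_speech / P_noise) equals `snr_db`,
    then add it to `speech`. Returns the mixture and the scaled noise."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    target_noise_power = power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / power(noise))
    scaled = [gain * n for n in noise]
    mixed = [s + n for s, n in zip(speech, scaled)]
    return mixed, scaled


def measure_snr_db(speech, noise):
    """Signal-to-noise ratio in dB for a known speech/noise pair."""
    return 10 * math.log10(power(speech) / power(noise))


rng = random.Random(0)
# Hypothetical stand-ins: a 200 Hz tone for "speech", white noise for the hindrance.
speech = [math.sin(2 * math.pi * 200 * i / 16000) for i in range(1600)]
noise = [rng.gauss(0.0, 1.0) for _ in range(1600)]
mixed, scaled = mix_at_snr(speech, noise, snr_db=10.0)
print(measure_snr_db(speech, scaled))
```

Sweeping `snr_db` downwards and recording where listeners stop understanding the mixture would give a simple measure of noise robustness for a given synthesizer.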

The aim of the work is to study and find the optimal algorithm for synthesizing human speech and then to implement it in software. To achieve this aim, the following tasks are set and solved in the master's work:

The scientific novelty of the research already carried out and planned in this work is expected to consist in the following:

At the current stage, auxiliary software is being developed to help build the phonetic base, the characteristics of the output signal are being analyzed, and experiments on the naturalness of the resulting sound wave are being conducted. The main work is planned for the 11th semester, when the synthesizer will be developed.

Some conclusions can be drawn from the research and experiments carried out. The most promising solutions in speech synthesis today are based on statistical models whose parameters are calculated from an annotated text-phonetic database. To account for the multifactorial nature of this prosodic phenomenon, the ideal is a database that contains a statistically significant volume of information, with the list of parameters considered extended as far as possible to include all significant factors (semantic, syntactic, phonetic, punctuation). As the basis for further development, an approach has been chosen that combines compilative synthesis with formant synthesis by rules; it will underlie the construction of a text-to-speech system with a context-dependent grammar as part of a voice control channel.
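
As a rough illustration of a statistical model whose parameters are computed from an annotated database, the sketch below estimates phoneme durations as context-dependent means with a simple back-off. The records in `CORPUS`, the two context factors (stress and position before a pause) and the names `train` and `predict` are hypothetical; a real text-phonetic database would carry many more factors and far more data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical annotated records: (phoneme, stressed, before_pause, duration_ms)
CORPUS = [
    ("a", True, False, 95), ("a", True, False, 105),
    ("a", False, False, 64), ("a", True, True, 140),
    ("t", False, False, 45), ("t", False, True, 70),
]


def train(corpus):
    """Compute mean durations per full context, per phoneme, and overall."""
    by_context = defaultdict(list)
    by_phoneme = defaultdict(list)
    for phoneme, stressed, before_pause, duration in corpus:
        by_context[(phoneme, stressed, before_pause)].append(duration)
        by_phoneme[phoneme].append(duration)
    return {
        "context": {key: mean(vals) for key, vals in by_context.items()},
        "phoneme": {key: mean(vals) for key, vals in by_phoneme.items()},
        "global": mean(record[3] for record in corpus),
    }


def predict(model, phoneme, stressed, before_pause):
    """Predict a duration, backing off to coarser statistics for unseen cases."""
    key = (phoneme, stressed, before_pause)
    if key in model["context"]:
        return model["context"][key]
    if phoneme in model["phoneme"]:
        return model["phoneme"][phoneme]
    return model["global"]


model = train(CORPUS)
print(predict(model, "a", True, False))   # context seen in the corpus
print(predict(model, "t", True, False))   # unseen context, falls back to phoneme mean
```

The back-off shows why the volume of the database matters: the fewer contexts are observed, the more often the prediction degrades to a coarse average that ignores the factors listed above.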

Dutoit T., An introduction to text-to-speech synthesis. - Boston-London, 1997. - 269 p.
Galunov V.I., Noise robustness as a system-forming factor of speech. Problems and methods of experimental phonetic research. - St. Petersburg, 2002. - 327 p. (in Russian)
D.Klatt, Speech perception. J.Phonetics, 1979, p. 279-312.
P.K.Kuhl, P.Iverson, Linguistic experience and the "perceptual magnet effect". In W.Strange (Ed.), Speech perception and linguistic experience, 1995, p. 121-154.
C.A.Fowler, An event approach to the study of speech perception from a direct-realist perspective. J.Phonetics, 1986, p. 3-28.
K.N.Stevens, On the quantal nature of speech. J.Phonetics, 1989, p. 3-15.
D.V.Razumikhin, Using neural networks at the semantic level in a speech recognition system. IV All-Russian conference "Neurocomputers and their application", p. 208-210. (in Russian)
D.Razumikhin, A.Solovyov, Automatic speech recognition systems with different models of dialogue organization. XIII session of the Russian Acoustical Society, vol. 3, p. 141-144. (in Russian)

Demenko Denis Anatolievich

Theme of master's work: "Automatic synthesis of speech signals for intelligent text output through its voice representation"

Scientific adviser: Ph. D. Fedyaev н.І.

Content