Modern computing systems can not only perform calculations and process information but also recognize and reconstruct data. In recent years such systems have attracted great interest in both scientific and practical areas of human activity.
One of the current tasks for such systems is speech synthesis.
Speech synthesis is the process of reconstructing the waveform of a speech signal from its parameters.
Modern speech synthesizers are frequently designed around a sound database [1]. Such a database contains single phonemes, combinations of phonemes, or full recordings of words or phrases, depending on the synthesis approach. In the early stages of building a synthesizer the database can be filled manually. However, this approach is time-consuming because the required segments must be located by hand, it depends on subjective human perception of sound, and its results are difficult to reproduce.
This raises the problem of automating the filling process by means of speech segmentation methods.
There are many speech segmentation methods, based on different mathematical tools such as the discrete wavelet transform [2], dynamic programming [3], and the logarithmic transformation of the spectrum [4]. They give reproducible results, with varying quality and segmentation accuracy. A specific requirement for segmentation methods is speaker independence, since this property makes it possible to build sound databases for various voices without retuning the method's parameters for each speaker.
Modern speech synthesis systems are required to be intelligible and natural-sounding. Intelligibility means that a listener correctly recognizes all the words of the synthesized speech. Most modern speech synthesis systems demonstrate good intelligibility, approaching that of natural speech. Naturalness is evaluated by measures of similarity between synthesized and naturally spoken speech.
By speech with intonation we mean speech that expresses the reader's attitude to the content of the text and to the audience. Usually the style and meaning of the text dictate the choice of speaking style. In speech with intonation, individual words are emphasized, certain sections of the text are set off by pauses, and so on.
1. Relevance of the topic
Speech synthesis can be used in communication engineering, information systems, military and space technology, and robotics. It can also be used to deliver information about technological processes and in help systems. In the long term, the development of high-quality speech synthesis systems is a necessary step towards closer communication between humans and computers. Speech synthesis may be needed wherever the recipient of information is a human.
Based on the above, it can be concluded that the master's work addresses a relevant scientific task: speech synthesis.
2. Goal and tasks of the research, planned results
The goal of the master's work is to develop software for synthesizing words and phrases of Russian speech with modeling of intonational coloring.
To achieve this goal, the following tasks must be solved:
- Segmentation of the speech signal for automatic filling of the synthesizer's database with different sound combinations.
- Analysis of the input text.
- Building a transcription of the text.
- Joining sound combinations from the synthesizer's database without clicks.
- Determination of the intonational constructions of the text.
- Fitting the speech signal to a given melodic contour.
Research object: speech as a sequence of sounds that are realizations of specific phonemes.
Research subject: algorithms for modeling the melodic contour in speech synthesis and their software implementation.
A software implementation of a speech synthesizer is planned for the experimental evaluation of the theoretical results obtained.
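One of the tasks listed above is joining database fragments without clicks. Clicks typically appear when two fragments meet with a jump in amplitude, and a common remedy is to overlap the fragments and crossfade between them. The following is a minimal sketch of that idea, not the method used in this work: the function name `crossfade_join`, the raised-cosine weighting, and the `overlap` parameter are all assumptions made for illustration.

```python
import math


def crossfade_join(left, right, overlap):
    """Join two lists of samples, blending `overlap` samples at the seam.

    Hypothetical helper: the last `overlap` samples of `left` are mixed
    with the first `overlap` samples of `right` using a raised-cosine
    weight, so the result has no abrupt jump at the joining point.
    """
    if overlap <= 0 or overlap > min(len(left), len(right)):
        raise ValueError("overlap must be between 1 and the shorter fragment length")

    head = left[:len(left) - overlap]   # part of `left` that is kept unchanged
    tail = right[overlap:]              # part of `right` that is kept unchanged

    blended = []
    for i in range(overlap):
        # Weight of the right fragment rises smoothly from near 0 to near 1.
        w = 0.5 - 0.5 * math.cos(math.pi * (i + 0.5) / overlap)
        a = left[len(left) - overlap + i]
        b = right[i]
        blended.append((1.0 - w) * a + w * b)

    return head + blended + tail


# Usage: join two short sine fragments of different frequency.
rate = 8000
frag_a = [math.sin(2 * math.pi * 200 * n / rate) for n in range(400)]
frag_b = [math.sin(2 * math.pi * 260 * n / rate) for n in range(400)]
joined = crossfade_join(frag_a, frag_b, overlap=80)
print(len(joined))
```

The overlap length trades smoothness against smearing of the transition; in a real synthesizer it would be chosen relative to the pitch period, which this sketch does not attempt.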
3. Synthesis of the intonational component of a speech signal using spline interpolation
In this approach [17] to compilative synthesis of a speech signal, we need to obtain a formal description of its phonetic and intonational properties. As part of the proposed description, intonational characteristics must be specified for all phonemes. These characteristics include the number of control points of the parametric curves. The parameters of neighboring phonemes should be smoothly matched.
Figure 1 shows the main stages of compilative synthesis of a speech signal using smooth parametric curves defined by a limited set of control points.
To achieve high-quality synthesis, it is important to adjust the following parameters of the speech signal smoothly:
- The pitch frequency contour, which is the main component of speech intonation.
- The amplitude envelopes, whose primary purpose is to adjust the signal amplitude dynamically. A joint increase in the amplitude and frequency of the signal leads to an increase in its volume.
Figure 1. Stages of imposing an intonational construction on the speech signal (animation: 8 frames, 5 repeat cycles, 145 KB)
In compilative synthesis [8], the required form of the audio signal is achieved through various algorithmic manipulations. The required form of the speech signal depends on many factors: the language, the speaker's voice, the text, the required intonation, the speed and volume of pronunciation, and so on.
A speech signal that has been prepared, normalized in phoneme duration and overall amplitude level, and smoothly joined from different fragments is fed to the parameter control system. Depending on the required intonational characteristics, a pitch frequency contour is formed and superimposed on the original speech signal. The amplitude envelopes are then superimposed on the signal.
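Superimposing an amplitude envelope amounts to multiplying each sample by a time-varying gain. The sketch below illustrates this step under simplifying assumptions: the envelope is piecewise-linear rather than the smooth parametric curve used in the approach described here, and the names `envelope_gains` and `apply_envelope`, as well as the representation of control points as `(position, gain)` pairs with positions in [0, 1], are hypothetical.

```python
def envelope_gains(points, n):
    """Sample a piecewise-linear envelope at n evenly spaced positions.

    `points` is a list of (position, gain) pairs sorted by position;
    the first position must be 0.0 and the last 1.0 (assumed convention).
    """
    if len(points) < 2 or points[0][0] != 0.0 or points[-1][0] != 1.0:
        raise ValueError("envelope must start at position 0.0 and end at 1.0")

    gains = []
    for i in range(n):
        t = i / (n - 1) if n > 1 else 0.0
        # Find the envelope segment that contains t and interpolate linearly.
        for (t0, g0), (t1, g1) in zip(points, points[1:]):
            if t0 <= t <= t1:
                frac = (t - t0) / (t1 - t0) if t1 > t0 else 0.0
                gains.append(g0 + frac * (g1 - g0))
                break
        else:
            gains.append(points[-1][1])
    return gains


def apply_envelope(signal, points):
    """Multiply every sample of `signal` by the envelope gain at its position."""
    gains = envelope_gains(points, len(signal))
    return [s * g for s, g in zip(signal, gains)]


# Usage: fade a constant signal in and out again.
shaped = apply_envelope([1.0] * 5, [(0.0, 0.0), (0.5, 1.0), (1.0, 0.0)])
print(shaped)
```

Replacing the linear interpolation with a smooth curve through the same control points changes only `envelope_gains`; the multiplication step stays the same.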
A limited number of control points is allocated to define each curve, chosen so that the original function of the control parameter is approximated as closely as possible. The extrema of the approximating function are selected as the initial reference points.
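The idea of describing a contour by a few control points can be sketched as follows. This is an illustration only, under several assumptions: the extrema are taken from a sampled pitch contour, the endpoints are always kept as control points, and the smooth curve is a cubic Hermite spline with finite-difference tangents rather than the specific parametric curve of [17]. The names `local_extrema` and `hermite_contour` and the sample values are hypothetical.

```python
def local_extrema(values):
    """Return indices of the endpoints and of strict interior local extrema."""
    idx = [0]
    for i in range(1, len(values) - 1):
        before = values[i] - values[i - 1]
        after = values[i + 1] - values[i]
        if before * after < 0:          # slope changes sign: a peak or a valley
            idx.append(i)
    if len(values) > 1:
        idx.append(len(values) - 1)
    return idx


def hermite_contour(points, xs):
    """Evaluate a smooth curve through (x, y) control points at positions xs.

    Uses cubic Hermite segments whose tangents are finite differences of
    the neighboring control points. Each x must lie within the x range of
    the control points, which must be sorted by strictly increasing x.
    """
    n = len(points)
    if n < 2:
        raise ValueError("at least two control points are required")
    px = [p[0] for p in points]
    py = [p[1] for p in points]

    slopes = []
    for k in range(n):
        lo, hi = max(k - 1, 0), min(k + 1, n - 1)
        slopes.append((py[hi] - py[lo]) / (px[hi] - px[lo]))

    result = []
    for x in xs:
        seg = 0
        while seg < n - 2 and x > px[seg + 1]:
            seg += 1
        h = px[seg + 1] - px[seg]
        t = (x - px[seg]) / h
        h00 = 2 * t**3 - 3 * t**2 + 1   # Hermite basis functions
        h10 = t**3 - 2 * t**2 + t
        h01 = -2 * t**3 + 3 * t**2
        h11 = t**3 - t**2
        result.append(h00 * py[seg] + h10 * h * slopes[seg]
                      + h01 * py[seg + 1] + h11 * h * slopes[seg + 1])
    return result


# Usage: pitch values in Hz per analysis frame (made-up numbers).
f0 = [100.0, 110.0, 120.0, 115.0, 105.0, 108.0, 112.0]
control = [(i, f0[i]) for i in local_extrema(f0)]
smooth = hermite_contour(control, range(len(f0)))
print(control)
```

The curve passes exactly through the control points and varies smoothly between them; whether a few extrema approximate a real contour well enough would have to be checked against the approximation error, which this sketch does not measure.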
Speech synthesis is a relevant task. The following results were obtained while working on this essay:
- an algorithm for speech signal segmentation was developed for automatic filling of the synthesizer's sound database;
- intonational constructions of the Russian language were examined;
- existing approaches to creating intonational coloring were reviewed.
Further research will focus on the following aspects:
- determination of the intonational constructions of the text;
- fitting the speech signal to a given melodic contour.
At the time of writing this essay, the master's work had not yet been completed. The final completion date is December 2012. The full text of the work and materials on the subject can be obtained from the author or his supervisor after that date.
