Verenich Ivan

Theme of master's degree dissertation: Analysis of methods of construction of the systems of speech recognition using of hybrid of the hidden markovskoy model and network

Leader of work: Ph.D. Oleg Fedyaev

RUS

Last years with multiplying the productivity of calculable machines an actual task is become by development of new, more simple, clear and friendly interfaces of the programs with users. In particular is a synthesis and human speech recognition. Such interface will help a man, to not having skills of work with a computer, quick than him to master, and also will save time from simplification of serve of commands. Also technology of speech recognition will be irreplaceable and for men-invalids with violations of the oporno-dvigatel'noy system. So a man can execute some work, remaining in place.

The process of recognition of voice passes in a few stages. On each of the stages for the vocal signal processing a number of different methods is used. The process of recognition of voice can be broken up on three stages:

receipt of vocal signal and preliminary treatment of speech;
recognition of phonemes and words;
understanding of speech.

Receipt of vocal signal and preliminary treatment of speech. The receipt of vocal signal or diskretizatsiya of voice is determined as a process of receipt and acoustic signal shaping. Voice appears as acoustic pressure fluctuations in a microphone, characterized relatively nizkochastotnymi signals in a range approximately ot 0 to 4 kgts. There are two types of sounds: ringings and deaf. Ringings are generated the vibration of vocal cords at passing air. This acoustic signal is modulated tension of vocal cords. Vibrations resound in a vocal channel (it is a nose, throat and cavity of mouth). A blast, creating a sound, is named a «wave, formed in a glottis». This signal is kvaziperiodical, and his period is named the period of basic tone. The resonance signal of ringings sounds usually consists of four frequency components, urgent formantami. Formanty serve as the «vocal seal» of different sounds, producible the vocal vehicle of man. Deaf sounds appear at passing of air through a vocal channel in default of vibrations of vocal cords. Both types of sounds are examined as temporal rows of information, taken for regular time domains. For the isolation of necessary interval spatial windows are used. Some functions of windows expect middle amplitude, number of zero crossings and transformation of Fur'e of signal for an interval. For the removal of noise the different methods of filtration are used.

Recognition of phonemes and words. For recognition of phonemes, groups of phonemes and words such methods, as hidden markovskaya model or NMM (hidden Markov modelling), artificial networks of neurons or their combinations, are used.

Understanding of speech. To «understand» speech — it most difficult. On this stage of sequence of words (suggestions) must be regenerate in the pictures of tom, that wanted to say talking. It is well known that understanding of speech leans against the enormous volume of linguistic and cultural knowledges. Greater part of the systems of recognition of voice takes into account knowledges about a human language and concrete circumstances here. A task, related to recognition of voice, is recognition talking, I.e. «who talks» the process of automatic determination on the basis of incoming in a vocal signal individual information. Thus the question can is about authentication or about verifikatsii of talking. Authentication is finding in the known great number of phrases of controls of copy, proper the manner of this announcer to talk. Verifikation of announcer is determination of identity of talking: tot is it a man? Technology of recognition of announcer allows to use voice for providing of access control; for example, telephone access to bank services, to the databases, to the systems of electronic commerce or vocal mail, and also access to the secret equipment. Both technologies require, that an user was «added to the system», I.e. he must leave the standard of speech, on which the system can build a template. Attempts to develop hardware representation of the systems of recognition of voice were undertaken. Some products provide both golosonezavisimoe and golosozavisimoe speech recognition on one chip. A chip is supported by golosozavisimoe recognition on the base of dictionary, kept in permanent data storages of chip (ROM, read only memory). The dictionaries of the golosozavisimykh systems are kept out of chip and can be loaded during work of the system.

In work, the followings decide 3 tasks: Primary sound signal processing, application to the got signal of vehicle of the hidden markovskikh models, application of neyroseti for a receipt out signal

Hidden Markov Models - HMM.

Notations:

- number of distinct observation objects (colored balls, for example);

- a discrete set of all possible object observations (for example, the balls of color , color and so on)

- the number of states in the model (for example, number of urns with the colored balls);

- a set of states of model (urn number, number and so on);

- state at the time (i.e. - one of the );

- object, observed at the time (i.e. - one of the);

- observed sequence;

- length of the observation sequence ;

- initial state distribution, i.e. is the probability that system at time will be in the state ;

- transition probability from state to state ; ; it usually time independent ;

- the state transition probability distribution, square matrix ;

- the probability of observation of object   in the state , i.e. - matrix .

Hidden Markov Model (HMM1) is the set .

In order to understand how HMM1 can generate observation sequence , let us consider the following example. There are urns containing a large number of colored balls. A person has to take balls from urns, selected in random manner. The result of experiment is the consequence of colors of observed balls (see the following pictures).

Animation consists 11 shots, size of file is 268 Kb

First urn is chosen in accordance with initial state distribution . The first ball is then chosen from this urn at random, it will have a color with the probability . After the color of the ball is observed, the ball is replaced in the same urn. Further a new urn is selected in accordance with the probability : and second ball of color will be choosing with the probability . This random process is repeated times, after that we will receive an observation sequence of balls , while state sequence is hidden (it is the reason of model name).

There are three key problems that must be solved for this model to be useful in practical applications:

Problem 1.

Given the observation sequence and the model . The problem is how to efficiently compute , the probability of the observation sequence, produced by the model?

Solution to problem 1.

The most direct way is calculation of the number of every possible state sequences . The probability of such fixed state sequence for given model is

.

The probability of the observation sequence for the state sequence is

.

The searched probability is the summation of all joint probabilities over all possible state sequences:

Let us examine the number of computations involved in this algorithm. From the last formula follows that we have to fulfilled multiplications for each of state sequence . Hence, the total number of multiplications is order of . Even for small values, for example and , this number is approximately multiplications

Forward-backward algorithm.

There is exists more efficient algorithm to solve Problem 1; it is called the forward-backward algorithm. It has a two modification whether uses forward or backward variables, - forward algorithm and backward algorithm.

Forward algorithm.

Let us define the forward variable :

,

which is the probability of the partial observation sequence and state at time , given the model .

This probability can be calculated inductively, as follows:

1. Initialization:

,

2.Induction: for all ;

3.Termination:
.



Backward algorithm.

Let us consider the backward variable :

,

which is the probability of the partial observation sequence from to the final time , given state at time and the model .

We can solve backward variable in the same manner:

1.  Initialization:

, .

2. Induction: for all the ;        :

,

3.   Termination:

.

The calculation of probability for forward-backward algorithm involves of the order of multiplications. For selected for example values and this means approximately 500 multiplications, it is 2000 times less then for direct method.

Problem 2.

Given the observation sequence and the model . How do we choose the corresponding state sequence , which is generates sequencewith the maximum probability ?

The most efficient method to solve Problem 2 is the Viterbi Algorithm.

This algorithm is the modification of dynamic programming for Markov model.

Let us define the following variables:

,

- the highest probability which accounts for the first observations and ends at time in the state , and for memorizing of arguments, maximized .

The Viterbi procedure for finding the best state sequence can be state as follows:

1.        Initialization:

,   .

.

2.        Induction:

,   ,    ,

.

3.        Termination:

- is the highest probability to observe reached along optimal sequence , for which is known only final state:

.

4.        Optimal path backtrañking:

,

Problem 3.

Given the observation sequence and the model . How do we adjust the model parameters in order to maximize ?

Baum-Welsh Algorithm.

Let us define variable

which is the probability of being in state at time and state at time , given the model and observation sequence.

From the definition of forward and backward variables, we can rewrite in the form:

.

The posterior probability of being in state at time , given the observation sequence and model, is:

.

Those variables has the following properties:

is expected number of transitions from state ;

is expected number of transitions from to .

Using the above formulas, we can give a method for re-estimation of the parameters of HMM1:

,

,

.

It was proved by Baum and his colleagues that either:

1.        - the initial model defines a critical point of the likelihood function.

2.          - model is more likely then model .

Three basic barriers cost speech recognition on the way of development of the systems:

large volumes of dictionaries;
templates of continuous speech;
different accents and pronunciations.

It is basic obstacles for the automated systems of recognition of voice, but there are yet and other problems — understanding of semantics of speech. The volumes of dictionaries are determined by the degree of complication, requirements to calculable power and reliability of the systems of recognition of voice. It is possible to adjust to the continuous stream of speech, but there are yet and strict semantic rules which must follow, that the system was able to understand semantics of combinations of words in suggestions. It is necessary to continue sound researches, only it will allow to «manage» with such descriptions of speech, as morphology, accents, height of sound, speed, volume, meetings words, context, articulation, linguistic information, synonyms and etc, is Expected, that basic direction of development will be become by the design of languages for the use in the systems of speech recognition.

Not decided finally and problem of separation of vocal signal ot a noise background. Presently the users of the systems of recognition of voice are forced either to work in the conditions of minimum noise background or carry send with a microphone at a mouth. In addition, users have to «inform» a computer that they speak to him. For this purpose it is usually necessary to push the button or do something like that. Certainly, it is the best not variant of user interface. Working out these problems began, and much-promising results are already got. One of long-awaited developments in area of recognition of voice are the cheloveko-mashinnye interactive systems; engaged in such systems in many university laboratories of researches. The systems are «able» to work with a continuous vocal stream and with unknown announcers, to understand the values of fragments of speech (in narrow areas) and undertake the actions of returns. These systems work in real time and able to execute five functions by phone:

recognition of speech is transformation of speech to the text, consisting of separate words;
understanding is a grammatical analysis of suggestions and recognition of semantic value;
renewal of information is a receipt of information from operative sources on the basis of the got semantic value;
a generation of linguistic information is a construction of suggestions, presenting findings, in language chosen an user;
a synthesis of speech is transformation of suggestions to the speech synthesized a computer.

A dialog interface in such systems allows a man to speak with a machine, create and get information, decide the tasks. The systems with a dialog interface differentiate on the level of initiativeness of man or computer. Researches focused on the «mixed initiative» systems, in which both a man and computer act identically active part in achieving a purpose by means of dialog. With appearance of the systems of recognition of voice the idea of «talking» computer left off to be fantasy.

References:

L. Rabiner, B.-H. Juang. Fundamentals of Speech Recognition. Prentice Hall, 1995, 507 p.
X. D. Huang, Y. Ariki, M. A. Jack. Hidden Markov Models for Speech Recognition. Edinburgh University Press, 1990, 275 p.
C. D. Manning, H. Schutze. Foundations of Statistical Natural Language Processing. MIT Press, 1999, 680 p.