Abstract on the topic of the final work
Content
- Introduction
- 1. Analysis of text information extraction from audio files
- 1.1 Word recognition in continuous speech
- 1.2 Recognition of isolated words
- 1.3 The problem of automatic speech recognition
- 1.4 Block diagram of a device for extracting features of speech signals
- 1.5 Development of a block diagram of a device for determining the number of sounds in an isolated word of speech
- Conclusions
- List of sources
Introduction
Currently, the scientific community is investing a huge amount of money in know-how development and research aimed at solving the problems of automatic speech recognition and understanding. This is stimulated by practical requirements associated with the creation of military and commercial systems. Without touching on the former, it can be pointed out that in the European Community alone, sales of civilian systems amount to several billion dollars. At the same time, attention should be paid to the fact that the systems which are, for some reason, considered the pinnacle of automatic speech recognition development are absent from practical use. These are systems that can be called demonstration systems and which fifty years ago were called "phonetic typewriters". Their purpose is to translate speech into the corresponding written text.
If we consider the classical scheme "science – technology – practical systems", then, first of all, it is necessary to determine the conditions in which a practical system of automatic speech recognition or understanding will work. The most serious problems arise under the following conditions:
-an arbitrary, naive user;
-spontaneous speech accompanied by agrammatism and speech "garbage";
-the presence of acoustic interference and distortion, including changing ones;
-the presence of speech interference.
On the other hand, it is necessary to determine the importance of the task, its scientific and applied fundamentality, and its connection with other fields of knowledge. At the same time, it is necessary to take into account the state of scientific and industrial potential and its capabilities. It's no secret that a properly set task is already half the solution.
Currently, among "speech scientists" there is a notion that the ultimate and highest goal is to create a "phonetic typewriter", and that the universal method for solving all speech problems is hidden Markov models (HMMs).
Let us focus on the capabilities and shortcomings of the corresponding automatic speech recognition systems (which today claim the ability to recognize hundreds or even thousands of words with reliability of up to 98%).
The user is required to pre-train the system on his or her voice by reading prepared texts aloud, which takes from several tens of minutes to several hours.
Since even the words of a carefully and accurately pronounced text turn out to be floating in an ocean of homonymy, the number of word errors increases roughly fivefold. Casual tracking of such errors, except where they produce absurd text, is already difficult, and the error-correction facilities of most demonstration systems are poorly debugged.
It has been reported that even for well-organized spontaneously spoken texts, the probability of correct word recognition does not exceed one third.
Finally, the processing time of the entered speech segment in such systems can take minutes.
All of this suggests that the proposed speech-to-text demonstration systems are unlikely to be of interest as the ultimate goal. This does not exclude the possibility of using them as a testing ground for evaluating scientific ideas, but in that case it should be clearly stated which models are embedded in a given automatic recognition system and how their practical prospects are to be verified. Thus, we move to the opposite end of the triad "practical systems – speech technologies – speech science".
The purpose of this work is the recognition of speech information in control systems by means of computer-based automatic speech-command recognition built on hidden Markov models (HMMs). With the hardware base of such recognition systems fixed as of today and taking into account the trends of its development in the near future, one of the most important blocks of such systems is considered: the block that trains the HMM on training sequences. The quality of the recognition system depends directly on how successfully the problem of training the Markov model is solved. At the moment, there are two serious difficulties in the HMM training problem: the standard methods for solving it (the Baum-Welch method, i.e. the EM procedure) are local-optimization methods (that is, they are unable to escape local extrema of the objective function) and they depend strongly on the initial parameters.
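The practical effect of this initialization sensitivity can be illustrated with a minimal sketch (not the implementation developed in this work): Baum-Welch is run from several random initializations, and the model with the highest training log-likelihood is kept. The sketch assumes the third-party hmmlearn library and a hypothetical feature matrix X of concatenated per-utterance feature vectors with their lengths.

```python
import numpy as np
from hmmlearn import hmm

def train_word_model(X, lengths, n_states=5, n_restarts=10):
    """Train a Gaussian HMM for one word, keeping the best of several random restarts."""
    best_model, best_score = None, -np.inf
    for seed in range(n_restarts):
        model = hmm.GaussianHMM(n_components=n_states,
                                covariance_type="diag",
                                n_iter=100,
                                random_state=seed)   # a different random initialization each time
        model.fit(X, lengths)                        # Baum-Welch (EM) training
        score = model.score(X, lengths)              # log-likelihood of the training data
        if score > best_score:
            best_model, best_score = model, score
    return best_model
```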
In search of a solution to this problem, software for a speech-command recognition system is developed in this work.
To achieve this goal, the following main tasks have been solved:
• HMM training algorithms using training sequences have been investigated.
• Methods have been developed aimed at further improving the efficiency and quality of this algorithm in the context of the problem under consideration.
Work on speech recognition has not only not lost its relevance, but is developing on a broad front and finding many areas of practical application. At present, four relatively distinct areas can be identified in the field of speech technology:
1. Speech recognition is the transformation of an acoustic speech signal into a chain of symbols or words. These systems can be characterized by a number of parameters. First of all, the vocabulary size: small vocabularies of up to 20 words, large ones of thousands and tens of thousands. The number of speakers: from one to arbitrary. The speaking style: from isolated commands to continuous speech, and from read to spontaneous speech. The branching factor, i.e. the value that determines the number of hypotheses at each recognition step: from small values (<10–15) to large ones (>100–200). The signal-to-noise ratio: from high (>30 dB) to low (<10 dB). The quality of the communication channel: from a high-quality microphone to a telephone channel. The quality of speech recognition systems is usually characterized by the reliability of word recognition or, equivalently, by the word error rate.
2. Determination of the speaker's identity. These systems are divided into two classes: speaker verification (i.e. confirmation of the speaker's identity) and speaker identification (i.e. determination of the speaker's identity from a predefined set of people). Both classes can be further divided into text-dependent and text-independent ones. The next characteristic parameter is the length of the passphrase. The other two parameters (as in speech recognition) are the signal-to-noise ratio and the quality of the communication channel. The quality of speaker verification/identification systems is characterized by two values: the probability of not recognizing "one's own" speaker and the probability of mistaking a "foreign" speaker for one's own.
3. Speech synthesis. There are practically two classes:
1) Playback of a limited number of messages recorded in one form or another;
2) Text-to-speech synthesis. Synthesizers are characterized by the following parameters: intelligibility (verbal or syllabic), naturalness of sound, noise immunity.
4. Speech compression. The main (and only) classification feature of these systems is the degree of compression: from low (32–16 kbit/s) to high (1200–2400 bit/s and below). The quality of speech compression systems is characterized primarily by the intelligibility of the compressed speech. Additional characteristics that are very important in a number of applications are the recognizability of the speaker's voice and the ability to determine the speaker's stress level.
In this paper, we consider systems of the first group, speech recognition systems, and their special case, speech-command recognition systems, i.e. recognition of isolated words rather than continuous speech. Such systems are very useful in practice, and the increased need for them is primarily due to the emergence of a large number of diverse devices available to humans (personal, mobile and handheld computers, communicators and mobile phones, gaming and multifunctional multimedia devices with sufficient computing power), combined with the rapid development of telecommunications in the modern world. The importance of introducing new interfaces for human interaction with technical systems on a mass scale is growing, since traditional interfaces have in many respects reached their perfection, and with it their limits. Given the traditionally high importance of information arriving through the organs of vision, whose share of all sensory information is considered to be about 85%, this channel of human perception is becoming heavily overloaded, and communication via the acoustic channel is seen as the main alternative. In addition, speech recognition (as well as synthesis) systems are extremely important for people with impaired vision, and this application niche is actively developing, primarily in mobile telephony as well as in household appliances (for controlling a variety of home devices). To help such people, manufacturers equip their devices with control by voice commands and with voice duplication of on-screen information. Such products require, first of all, recognition of a limited set of user commands rather than continuous speech with a large or unlimited vocabulary. Owing to the standardization of phone platforms and operating systems, the range of third-party developers of software products with this functionality is expanding.
The hardware base of such systems can also be very diverse and have a noticeable impact on the overall effectiveness of the recognition system as a whole. The hardware of recognition systems is no longer the bottleneck and is capable of performing high-quality digitization of the speech signal with the required parameters, as well as providing the required computing power to implement the necessary algorithms for preprocessing and working with word models.
1. Analysis of text information extraction from audio files
The traditional automatic speech recognition (ASR) model assumes that by tracking acoustic parameters and using some search procedure over a set of reference phonemic segments, phonemic sequences can be established. These sequences can then be used to carry out linguistic analysis at a higher level, identifying words, phrases and the meaning of utterances. Successful understanding of spoken sentences (phrases) involves the use of a particular linguistic structure in combination with the most reliable acoustic information.
In automatic speech recognition, the processes of detecting and identifying certain groups of phonemes present great difficulties.
1.1 Word recognition in continuous speech
Two different approaches to word recognition in continuous speech have been tested. In the first, global approach, the word to be recognized is compared with each word in the dictionary, usually using a spectral representation of each word. Among the various methods of this type, the dynamic programming method has given good results.
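As an illustration of the dynamic programming comparison, a minimal dynamic time warping (DTW) sketch is given below; it assumes each word is already represented by a sequence of spectral feature vectors (one per analysis frame) and is not tied to any particular system.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between feature sequences a (N x d) and b (M x d)."""
    n, m = len(a), len(b)
    d = np.full((n + 1, m + 1), np.inf)
    d[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])   # local spectral distance
            d[i, j] = cost + min(d[i - 1, j],             # insertion
                                 d[i, j - 1],             # deletion
                                 d[i - 1, j - 1])         # match
    return d[n, m]

# Recognition: pick the dictionary word whose template is closest to the input, e.g.
# best_word = min(templates, key=lambda w: dtw_distance(input_features, templates[w]))
```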
In the second, analytical approach, each word or group of words is first segmented into smaller units. The segments are syllable-like or phoneme-like units. This allows recognition to be performed at the syllabic or phonemic level and, at the same time, allows parameters (duration, energy, etc.) related to prosody and useful later to be stored in memory. Segmentation can be based on finding the vowels of an utterance, which are often located near the maxima of the integral energy of the spectrum. With this approach, the first criterion for segmentation is the change of energy over time. Some consonants, such as m, n, l, sometimes have the same energy as vowels. Therefore, additional parameters must be introduced to determine the presence of a vowel sound in each previously defined segment.
To identify consonants, as a rule, plosive and non-plosive consonants are first separated. This is achieved by detecting a pause (occlusion) corresponding to the closure before the burst is realized. The task becomes more complicated at the beginning of an utterance, where the occlusion is relatively easy to detect only for voiced plosives. After the occlusion is detected, the change in the spectrum and the type of change are determined. To establish each category of sounds, ordered rules are usually used, based on information that depends on the acoustic and phonetic contexts. In continuous speech, the phonetic realization of a particular utterance depends on several factors, including dialect, the rate of speech, the speaker's manner of pronunciation, and others.
1.2 Recognition of isolated words
The main features of isolated word recognition are a hierarchical multi-tier structure and control of each tier by appropriate grammars whose symbols are fuzzy linguistic variables.
The recognition strategy is based on grouping speech units into broad phonetic classes, followed by classification into more detailed groups.
Difficulties arise when recognizing continuous speech: it is much harder to recognize than separately pronounced words, primarily because of the implicit boundaries between words. As a result, it is difficult to determine the beginning and end of the correspondence between the phonemic chain of a dictionary word and the recognized phonemic chain. The acoustic-phonetic analysis system for continuous speech is usually considered as part of a general system for its automatic recognition.
Preliminary segmentation and classification of sound elements includes identifying vowel-like and fricative-like sounds, plosive consonants, and pauses. The segmentation problem, understood as the problem of dividing the speech stream into functionally significant segments, is solved in different ways. When developing speech recognition systems, the importance of the first stage of acoustic signal processing, associated with the operation of the acoustic processor, is taken into account. The process of automatic segmentation is inseparably linked with the labeling of the sound sequence. The development of automatic segmentation and labeling is driven by the need to involve large acoustic-phonetic databases and the desire to make speech analysis objective.
1.3 The problem of automatic speech recognition
The ASR problem can be solved in stages. At the first stage, the recognition task consists in the external identification of internally identified and only superficially characterized classes of acoustic events. At the second stage, the crucial point is the generalization of external classification criteria to internally unidentified classes, which makes it possible to predict the characteristics of an unknown signal.
In automatic speech recognition, it is first of all necessary to establish whether the signal is actually phonetic (speech). The division of the speech stream into micro- and macro-segments is well known. The boundary between two macro-segments (syntagmas, phrases) is, as a rule, distinct, whereas the boundary between two micro-segments (sub-sound units, sounds, syllables) is blurred. Sounds change their suprasegmental (duration, intensity, fundamental frequency) and segmental (spectral) characteristics under the influence of units of other tiers. For example, an increase in the duration of a vowel in the speech stream may indicate the semantic prominence of a word, the position of stress relative to this vowel, information about the preceding and following phonemes, etc. Therefore, to predict, for example, the duration of a sound, a number of linguistic factors have to be taken into account.
Knowledge of the compatibility of phonemes at the junctions of words also plays an important role in speech perception. The delimiting means of sounding speech are a complex phenomenon consisting of a variety of components related to phonotactic features, syntactic and semantic factors, and the rhythm of the formation of a speech utterance.
It is necessary to focus on some segmentation problems related to the specifics of the phonetic level. One difficulty is the automatic recognition of nasal and liquid phonemes in continuous speech. Uncertainties arising from the limitations of any speech processing system, and often caused by poor pronunciation, are treated as sources of information for stochastic grammars or fuzzy-set grammars.
Currently available methods of speech micro-segmentation (segmentation into sub-sound units, sounds, syllables) can be classified as follows:
- using the degree of stability over time of some acoustic parameters of the speech signal, such as the concentration of energy in the frequency spectrum;
- applying acoustic labels to the speech signal at regularly repeated short intervals;
- comparing speech signal samples in short time windows, taken at regular intervals, with samples of prototype phonemes.
There are context-dependent and context-independent segmentation methods. The simplest method of context-independent labeling is comparison with reference templates. For this, a model of each possible dictionary unit must be stored in memory. Context-dependent segmentation allows the set of features and thresholds used to be tied to the phonetic context.
To solve the problem of segmentation of sounding speech, it is of great importance to refer to the syllable. At the same time, in modern linguistics, phonetic and phonological types of a syllable are conditionally distinguished.
Phonological criteria should be used when defining and delimiting a syllable. In the most general terms, a syllable is a speech segment consisting of a nucleus, i.e. a vowel (or syllabic consonant), and the articulatorily related adjacent consonants. The syllable makes it possible to move both down to the sound level and up to higher language levels, using information about the phonotactics of morphemes and words. Most methods of segmentation into syllables are based on changes in the total signal intensity, i.e. energy. Since in theory each syllable should contain only one vowel, and vowels usually have greater intensity than the surrounding consonants, it can be assumed that most local maxima correspond to vowels. Obviously, the syllable boundaries then lie at the minimum point between two maxima. However, this approach runs into difficulty because, in the presence of a sonorant, for example, false maxima may appear.
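A minimal sketch of this energy-based segmentation is given below: local maxima of a smoothed intensity contour are treated as vowel (syllable) nuclei, and weak maxima are rejected by a threshold. The frame size, hop and threshold are illustrative assumptions rather than values from this work.

```python
import numpy as np

def syllable_nuclei(signal, frame=256, hop=128, rel_threshold=0.2):
    """Return frame indices of assumed vowel nuclei (local maxima of smoothed energy)."""
    energy = np.array([np.sum(signal[i:i + frame] ** 2)
                       for i in range(0, len(signal) - frame, hop)])
    energy = np.convolve(energy, np.ones(5) / 5, mode="same")      # smooth the intensity contour
    return [i for i in range(1, len(energy) - 1)
            if energy[i] > energy[i - 1] and energy[i] > energy[i + 1]
            and energy[i] > rel_threshold * energy.max()]          # reject weak (false) maxima
```

Syllable boundaries can then be placed at the energy minima between neighbouring nuclei.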
Segmentation can be carried out in two stages: first into syllables, and then into the sounds that make them up, as a result of which the boundaries between syllables are refined. The relations between the segments with respect to a number of parameters make it possible to reveal the internal structure of the syllabic unit.
In phonetics, the view on the acoustic marking of the boundaries of a phonetic word (rhythmic structure) has undergone a number of changes. The complete denial of acoustic word boundaries was replaced by the assertion that it is quite realistic to rely on objective criteria when determining the boundaries of a phonetic word in the flow of speech: the acoustic characteristics of sounds at the junction of phonetic words and their allophonic variability. When dividing the speech flow into phonetic words, the acoustic characteristics of the junction sounds must be taken into account in all cases: both without a pause and in the presence of one.
The probability of a pause in speech depends on the nature of the combinations of sounds of the rhythmic structure of neighboring words (for example, if the first word ends with a stressed syllable, and the next one also begins with a stressed one, then the appearance of a pause between these words is more likely than in the case when the stressed syllable of the first phonetic word is followed by an unstressed syllable of the second phonetic word) and the place of the junction in question in the phrase.
In the flow of speech, determining the boundaries of a phonetic word is associated with a number of difficulties arising from the belonging of the utterance to the pronunciation style and type of pronunciation; the position of the phonetic word in the text, syntagma and phrase.
Some implementations of phonetic word boundaries do have their own acoustic features, others do not. The task should not be limited solely to the search for physical and auditory signs of neighboring sounds, but should be aimed at determining the hierarchy (subordination) of these signs.
Stress information is undoubtedly also used to determine the number of phonetic words in a message. The most important information used by a person when dividing the speech stream, however, is information about the types of the most frequent phonetic words (rhythmic structures). When dividing continuous speech into semantically significant segments, information from various language levels is used, from phonological to semantic. When developing programs for automatic text division, this information (about the types of rhythmic structures, the number and degree of stresses, etc.) should of course be taken into account. However, ambiguous language situations arise in continuous speech, and resolving them may require additional information about the acoustic cues of division. Vowels and consonants at word junctions have certain acoustic characteristics whose changes depend on the nature of the connection between them.
In cases where access to a speech recognition system should be provided to any user, it is advisable to switch to non-adaptive (speaker-independent) automatic recognition systems. These systems are much easier to implement for languages whose phonetic structure is more studied (for Russian, Japanese, English) and much more difficult for tonal languages (Vietnamese, Chinese, French).
When creating automatic speech recognition systems, experiments in the field of speech perception are of great importance. The results of such experiments often underlie the functioning of a particular system. Computers that recognize speech often copy not only some of the analyzing functions of the human ear, but also memory and logical functions of the human brain.
Continuous improvement of the forms of dialogue between the human operator and the computer should lead to optimization of communication between them. The human-machine dialogue in natural language involves the use of both appropriate technical methods and certain linguistic knowledge. The study of the problem of the role of the language of communication between humans and computers and the development of automated systems with a natural human language of communication are at the stage of further development.
1.4 Block diagram of a device for extracting features of speech signals
The block diagram of the proposed device for extracting features of speech signals is shown below (Figure 1.1).
It consists of the following blocks:
1 – microphone;
2 – envelope extraction block;
3 – block for determining the beginning and end of a word;
4 – finite-difference extraction block;
5 – block for determining the number of sounds;
6 – delay line;
7 – interval selection block;
8 – analysis block;
9 – data block;
10 – printing device.
Figure 1.1 – Block diagram of the device for extracting features of speech signals
The task of speech recognition can be reduced to the task of recognizing individual sounds, followed by the use of algorithms that take into account the peculiarities of pronunciation, word structure and phrase construction of individual speakers.
In this case, the task of isolating speech sounds can be treated as a pattern recognition task in which the number of patterns is limited, although it reaches several dozen. The task of classifying the presented sound samples can then be reduced to multi-alternative hypothesis testing. The speech sound recognition system can be built on the principle of "supervised learning", i.e. a database of classified reference data is assembled in advance, and the signals received for analysis are compared against it. The procedure for recognizing speech sounds should take into account the specifics of their realizations: first, the realizations of each sound have their own characteristic shape; second, they have a limited duration in time.
The methods of analyzing speech signals can be considered using a model in which the speech signal is the response of a system with slowly changing parameters to a periodic or noise-like excitation (Figure 1.2).
The output signal of the vocal tract is determined by the convolution of the excitation function and the impulse response of a linear, time-varying filter modeling the vocal tract. Thus, the speech signal s(t) is expressed as follows:
s(t) = ∫ e(τ) h(t, τ) dτ  (integration over τ up to the moment t),
where e(t) is the excitation function and h(t, τ) is the response of the vocal tract at time t to a delta function applied to its input at time τ.
Figure 1.2 – Diagram of the functional model of speech formation
The speech signal can be modeled by the response of a linear system with variable parameters (the vocal tract) to the corresponding exciting signal. With the shape of the vocal tract unchanged, the output signal is equal to the convolution of the exciting signal and the impulse response of the vocal tract. However, all the variety of sounds is obtained by changing the shape of the vocal tract. If the shape of the vocal tract changes slowly, then in short time intervals it is logical to approximate the output signal by convolution of the exciting signal and the impulse response of the vocal tract. Since the shape of the vocal tract changes when creating different sounds, the envelope of the speech signal spectrum will, of course, also change over time. Similarly, when the period of the signal exciting sonorous sounds changes, the frequency difference between the harmonics of the spectrum will change. Therefore, it is necessary to know the type of speech signal in short periods of time and the nature of its change over time.
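A minimal sketch of this source-filter view is shown below: within a short frame the vocal tract is approximated by a fixed all-pole filter, and a voiced sound is modeled as the convolution of a periodic pulse train with its impulse response. The filter coefficients and the pitch period are illustrative assumptions.

```python
import numpy as np
from scipy.signal import lfilter

fs = 8000                               # sampling rate, Hz
t0 = 80                                 # pitch period in samples (100 Hz fundamental)
excitation = np.zeros(fs // 4)          # one short frame of excitation
excitation[::t0] = 1.0                  # periodic pulses model voiced excitation

a = [1.0, -1.3, 0.8]                    # toy all-pole vocal-tract filter (assumed)
frame = lfilter([1.0], a, excitation)   # output = excitation convolved with the impulse response
```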
In speech signal analysis systems, they usually try to separate the excitatory function and the characteristics of the vocal tract. Further, depending on the specific method of analysis, parameters describing each component are obtained.
In the frequency domain, the spectrum of short segments of a speech signal can be represented as a product of an envelope characterizing the state of the vocal tract and a function describing the fine structure that characterizes the exciting signal. Since the main parameter of the signal that excites a sonorous sound is the spacing of the harmonics of the fundamental tone, and the characteristics of the vocal tract are determined with sufficient completeness by the frequencies of the formants, it is very convenient to proceed from the representation of speech in the frequency domain. When creating different sounds, the shape of the vocal tract and the exciting signal change, while the spectrum of the speech signal also changes. Therefore, the spectral representation of speech should be based on the short-term spectrum obtained from the Fourier transform.
Consider a sampled speech signal represented by the sequence s(n). Its short-term Fourier transform is defined as
S_n(e^(jω)) = Σ_m s(m) h(n − m) e^(−jωm),   (1.1)
where the summation is over all m.
This expression describes the Fourier transform of a weighted segment of a speech oscillation, and the weight function h(n) is shifted in time.
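A minimal sketch of this short-term transform: each weighted segment (the signal multiplied by the shifted window h(n)) is passed to the DFT, giving one short-term spectrum per window position. The frame length and window choice are illustrative assumptions.

```python
import numpy as np

def stft(s, frame=256, hop=128):
    """Short-term Fourier transform: one spectrum per shifted, windowed segment."""
    window = np.hamming(frame)                           # weight function h(n)
    segments = [s[i:i + frame] * window
                for i in range(0, len(s) - frame, hop)]  # weighted segments of the speech signal
    return np.array([np.fft.rfft(seg) for seg in segments])
```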
Linear prediction is one of the most effective methods of analyzing speech signals. This method has become dominant in estimating the main parameters of speech signals, such as the pitch period, formants and spectrum, as well as in compact representations of speech for low-rate transmission and economical storage. The importance of the method is due to the high accuracy of the resulting estimates and the relative simplicity of the computations.
The basic principle of the linear prediction method is that the current sample of the speech signal can be approximated by a linear combination of preceding samples. The prediction coefficients are determined uniquely by minimizing, over a finite interval, the mean square of the difference between the actual speech samples and their predicted values. The prediction coefficients are the weights used in this linear combination. The linear prediction method can be used to reduce the volume of a digital speech signal.
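A minimal sketch of the autocorrelation formulation of linear prediction is shown below: the coefficients minimizing the mean-square prediction error over a windowed frame are obtained from a Toeplitz system built on the short-term autocorrelation. The prediction order and window are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc(frame, order=10):
    """Linear prediction coefficients of one windowed speech frame (autocorrelation method)."""
    frame = frame * np.hamming(len(frame))
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]   # autocorrelation r[0], r[1], ...
    a = solve_toeplitz(r[:order], r[1:order + 1])                  # normal equations R a = r
    return a   # s[n] is approximated by a[0]*s[n-1] + ... + a[order-1]*s[n-order]
```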
The main purpose of processing speech signals is to obtain the most convenient and compact representation of the information contained in them. The accuracy of the representation is determined by the information that needs to be preserved or extracted. For example, digital processing can be used to determine whether a given oscillation is a speech signal at all. A similar but somewhat more difficult task is to classify segments into voiced speech, unvoiced speech, and pause (noise).
Most speech processing methods are based on the idea that the properties of a speech signal change slowly over time. This assumption leads to short-term analysis methods in which segments of the speech signal are isolated and processed as if they were short sections of individual sounds with different properties.
One of the most well-known methods of speech analysis in the time domain is the method proposed by L. Rabiner and R. Schafer in /3/. It is based on measuring the short-term average value of the signal and the short-term function of the average number of zero crossings. As noted above, the amplitude of the speech signal varies significantly over time. Such amplitude changes are well described by the short-term energy function of the signal. In the general case, the energy function can be defined as
E_n = Σ_m [s(m) h(n − m)]²,   (1.2)
where the summation is over all m.
The choice of the impulse response h(n), i.e. of the window, forms the basis for describing the signal using the energy function.
To understand how the choice of the time window affects the short-term energy function of the signal, assume that h(n) in (1.2) is sufficiently long and has constant amplitude; then the value of E_n will change only slightly over time. Such a window is equivalent to a narrow-band low-pass filter. The low-pass filter band should not be so narrow that the output signal becomes constant. To describe rapid amplitude changes, a narrow window (short impulse response) is desirable; however, too small a window width can lead to insufficient averaging and, consequently, insufficient smoothing of the energy function. The effect of the width of the time window on the accuracy of measuring the short-term average value (average energy) is as follows:
- if N (the window width in samples) is small (of the order of the pitch period or less), then E_n will change very quickly, following the fine structure of the speech oscillation;
- if N is large (of the order of several pitch periods), then E_n will change slowly and will not adequately describe the changing features of the speech signal.
This means that there is no single value of N that fully satisfies the listed requirements, since the pitch period ranges from about 10 samples (at a 10 kHz sampling rate) for high-pitched children's and female voices up to 250 samples for very low male voices. We choose N equal to 100, 200 and 300 samples at a sampling rate of 8 kHz.
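As a minimal sketch, the short-term energy E_n from (1.2) with a rectangular window can be computed for the chosen window widths as follows (an 8 kHz sampling rate is assumed, as above):

```python
import numpy as np

def short_term_energy(s, N):
    """Short-term energy E_n with a rectangular window of N samples."""
    window = np.ones(N)                                 # rectangular h(n)
    return np.convolve(s ** 2, window, mode="same")     # sum of squared samples inside the window

# energies = {N: short_term_energy(signal, N) for N in (100, 200, 300)}
```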
The main purpose of E_n is that this value makes it possible to distinguish voiced speech segments from unvoiced ones: the value of the short-term average function for unvoiced segments is significantly smaller than for voiced segments.
A characteristic feature of the speech signal analysis method considered here is the binary quantization of the input speech signal. The possibility of extracting the parameters of signals subjected to binary quantization is shown in /4/. The mathematical model of the speech signal used has the form
s(t) = A(t)·cos Φ(t),
where A(t) is the law of variation of the amplitude of the speech signal and Φ(t) is the complete phase function of the speech signal.
The law of amplitude variation is not a sufficiently informative parameter for evaluating a speech message, since it is not constant for the same word or phrase pronounced with different intonation and volume. The proposed method therefore relies on the complete phase function of the speech signal as its informative characteristic. The complete phase function is represented as a Taylor series expansion:
Φ(t) = φ₀ + ω₁t + ω₂t² + …
Only the first three expansion coefficients are kept. The first coefficient φ₀, which is the initial phase of the speech signal, is assumed to be zero as uninformative. The full phase function then becomes
Φ(t) = ω₁t + ω₂t²,
where ω₁ is the expansion coefficient corresponding to the average frequency of the speech signal, and ω₂ is the expansion coefficient describing the change (deviation) of the speech signal frequency.
The parameters ω₁ and ω₂ are the characteristics used to describe a speech message. In the "sliding window" processing mode, the first finite difference of the complete phase function is calculated; it is the short-term function of the average number of zero crossings of the speech signal and gives a rough estimate of the speech signal frequency ω₁, with an error that depends on the frequency change ω₂. To determine ω₂, the second finite difference of the complete phase function should be calculated, which is the rate of change of the function of the average number of zero crossings.
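A minimal sketch of these estimates is given below: the number of zero crossings in a sliding 100-sample window serves as the first finite difference (a rough estimate of the average frequency), and the difference between the current and a 100-sample-delayed window gives the second finite difference (the frequency change). The window placement and the sign-based crossing detector are illustrative choices.

```python
import numpy as np

def zero_crossing_count(s, N=100):
    """Number of zero crossings of s inside a sliding window of N samples."""
    crossings = (np.abs(np.diff(np.sign(s))) > 0).astype(int)   # 1 wherever the signal crosses zero
    return np.convolve(crossings, np.ones(N), mode="same")

def phase_finite_differences(s, N=100):
    d1 = zero_crossing_count(s, N)   # first finite difference: rough estimate of the average frequency
    d2 = d1[N:] - d1[:-N]            # second finite difference: change of that estimate over one window
    return d1, d2
```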
1.5 Development of a block diagram of a device for determining the number of sounds in an isolated word of speech
The block diagram of the device being developed, analyzing the information signs of speech signals and determining the beginning and end of a sound in a word, is shown in Figure 1.3. It consists of the following blocks:
1 – first shaper;
2 – digital delay line (DDL);
3 – first reversible counter;
4 – second reversible counter;
5 – first adder;
6 – third reversible counter;
7 – fourth reversible counter;
8 – second adder;
9 – fifth reversible counter;
10 – sixth reversible counter;
11 – third adder;
12 – first modulus calculator;
13 – second modulus calculator;
14 – third modulus calculator;
15 – first threshold device;
16 – second threshold device;
17 – third threshold device;
18 – second shaper;
19 – third shaper;
20 – fourth shaper;
21 – OR gate.
Figure 1.3 – Block diagram of the device for determining the number of sounds
The speech signal spoken by a person is picked up by the microphone, which converts the acoustic waves excited by the human vocal tract into electrical oscillations.
To form a binary quantized signal from the analog speech signal, a single-bit ADC is used. A comparator can serve as such an ADC. The amplitude characteristic of the comparator is shown in Figure 1.4.
Figure 1.4 – Amplitude characteristic of the comparator
The task of the comparator is to monitor whether the input speech signal exceeds a certain threshold Uthr (for the negative half-wave of the signal, –Uthr). When the speech signal at the comparator input is small (lies within the range from –Uthr to +Uthr), a logical "0" is formed at the comparator output; when the threshold is exceeded, a logical "1" is formed.
Thus, a signal is generated at the comparator output in the form of a sequence of binary quantized samples, that is, a sequence of logical "0"s and "1"s. The appearance of samples at the comparator output is determined by the rate of the strobing pulses arriving at its strobe input. The repetition frequency of the strobing pulses, which is also the sampling frequency of the input speech signal, is chosen so as to satisfy Kotelnikov's (sampling) theorem, that is, it must be at least 2Fmax, where Fmax is the maximum frequency in the spectrum of the speech signal.
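A minimal sketch of this single-bit quantization: each sample taken at the strobing rate is mapped to a logical "1" when its magnitude exceeds the threshold and to "0" otherwise. The threshold value in the usage line is an illustrative assumption.

```python
import numpy as np

def comparator(samples, u_thr):
    """Single-bit quantization: 1 outside the band [-u_thr, +u_thr], 0 inside it."""
    return (np.abs(samples) > u_thr).astype(np.uint8)

# bits = comparator(speech_samples, u_thr=0.05)   # sequence of binary quantized samples
```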
From the comparator output, the digitized signal is fed to the first digital delay line, which delays the signal by 100 samples, and to the summing input of the first reversible counter. The parameter extracted by the reversible counter is called the first finite difference of the complete phase function of the speech signal, or the function of the average number of zero crossings. The circuit calculating the first finite difference consists of a delay line and a reversible counter. It operates in the "sliding window" mode, with a window width of 100 samples. The code at the output of the reversible counter shows the number of zero crossings within an interval of 100 samples. As the "sliding window" shifts by one sample, a new code is produced showing the number of zero crossings.
The second DDL and the second reversible counter also calculate the first finite difference, but delayed by 100 samples relative to the one calculated by the first DDL and the first reversible counter. Having the first finite difference at the current moment and its value delayed by 100 samples, one can estimate the change of the speech signal frequency over time, i.e. calculate the rate of change of the function of the average number of zero crossings.
The operation of finding the second finite difference is performed in the first adder, which subtracts from the first finite difference at the current time the first finite difference delayed by the length of the time window of 100 samples.
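A minimal software sketch of this chain (100-sample delay line, reversible counter, adder and modulus calculator) is shown below; it operates on the binary sequence produced by the comparator and is an illustration rather than the hardware implementation itself.

```python
import numpy as np

def device_chain(bits, N=100):
    """First finite difference (zero-crossing count per window) and |second finite difference|."""
    transitions = (np.abs(np.diff(bits.astype(int))) > 0).astype(int)     # reversible counter input
    first_diff = np.convolve(transitions, np.ones(N), mode="full")[:len(transitions)]
    delayed = np.concatenate([np.zeros(N), first_diff[:-N]])              # 100-sample delay line
    second_diff = np.abs(first_diff - delayed)                            # adder + modulus calculator
    return first_diff, second_diff
```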
The subsequent blocks in the circuit (four more reversible counters and two adders) perform the same operations for window widths of 200 and 300 samples.
Since the second finite difference can take negative values, the outputs of the first, second and third adders are passed to the first, second and third modulus calculators, then to the first, second and third threshold devices and to the shapers, and finally to the OR gate.
Conclusions
As a result of the work, a literature review was conducted in order to find existing methods of speech analysis. The method of analyzing speech signals based on time domain signal processing has no analogues today. A feature of the proposed method is the representation of the speech signal model not in an additive form, as in spectral analysis methods, but in a multiplicative one. This explains the use of the Taylor series in decomposing the full phase function of a speech signal into components, rather than the Fourier series. A characteristic feature of this method is the allocation of the rate of change in the frequency of the speech signal as an informative parameter.
The results obtained showed the possibility of using the selected parameters of speech signals for speech recognition.
List of sources
1. Ivanov A.S., Sokolova O.A., Popov D.V. Speech recognition using generative adversarial networks. 2019. 365 p.
2. Oppenheim A.V., Schafer R.W. Digital Signal Processing. Moscow: Radio and Communications, 1979. 347 p.
3. Rabiner L.R., Schafer R.W. Digital Processing of Speech Signals. Moscow: Radio and Communications, 1981. 258 p.
4. Lityuk V.I. Methodical manual No. 2231, part 3: Methods of calculation and design of digital multiprocessor devices for processing radio signals. Taganrog, 1995. 48 p.
5. Kuznetsov V., Ott A. Automatic Speech Synthesis. Tallinn: Valgus, 1989. 135 p.
6. Methods of Automatic Speech Recognition / Ed. by W. Lee. Moscow: Mir, 1983. 716 p.
7. Zinder L.R. General Phonetics. Moscow: Higher School, 1979. 312 p.
8. Zlatoustova L.V., Potapova R.K., Trunin-Donskoy V.N. General and Applied Phonetics. Moscow: Moscow State University, 1986. 304 p.
9. Potapova R.K. Speech Control of a Robot. Moscow: Radio and Communications, 1989. 248 p.
10. Mikhailov K.V., Sharov S.V. Speech Recognition Using Deep Reinforcement Learning. Moscow: Radio and Communications, 2012. 468 p.