DonNTU Masters' portal

Abstract

Contents

Objective

The objective of the dissertation is to develop and study an automatic speech recognition system, based on the Sphinx toolkit, for intelligent source code input.

Tasks

Relevance of the topic

The dissertation addresses a topical problem: making the human-machine interface more intelligent. Modern consumers use a wide variety of mobile devices with very limited (slow) typing capabilities. Solving this problem will let programmers enter program text by voice.

Originality of the dissertation

The originality of the dissertation lies in estimating the efficiency of the hidden Markov models used in the Sphinx system on a real-world problem.

The practical value lies in the development of an application for entering program text by voice using the Sphinx toolkit.

Expected practical results

The expected practical results are:

1 Research and development overview

1.1 Algorithms and methods of speech recognition

There are 3 main speech recognition methods:

The following master's students have researched speech recognition at DonNTU:


The main articles on speech recognition are listed in Table 1:


 

Table 1 – DonNTU articles devoted to speech recognition

| Title | Source | Authors |
|---|---|---|
| Usage of a vocal component in interfaces of programmed systems | Interactive Systems: The Problems of Human – Computer Interaction. Proceedings of the International Conference, 23-27 September 2001. Ulyanovsk: UISTU, 2001, pp. 26-28. | Gladunov S.A., Fedyaev O.I. |
| Organizing a speech input of information based on neural phonemes approximation | Interactive Systems: Problems of Human – Computer Interaction. Proceedings of the International Conference, 23-27 September 2003. Ulyanovsk: UISTU, 2003, pp. 198-203. | Gladunov S.A., Fedyaev O.I. |
| Isolated Speech Word Recognition Based on Fuzzy Pattern Matching with Optimal Temporal Alignment | Proc. of the 13th International Conference «Speech and Computer» SPECOM'2009, St. Petersburg, Russia, 2009, pp. 454-457. | Bondarenko I.Yu., Fedyaev O.I. |
| Segment-Holistic Speech Recognition System Based on Model of Brain Hemispheric Interaction in Speech Perception | Proceedings of the 11th International Conference on Pattern Recognition and Information Processing PRIP'2011. Minsk: Belarusian State University of Informatics and Radioelectronics, 2011, pp. 226-230. | Bondarenko I.Yu., Fedyaev O.I. |

1.2 Text input systems based on automatic speech recognition

At present, the leading system for speech-to-text input is considered to be Dragon NaturallySpeaking, originally developed by Dragon Systems [1]. It is reported to achieve a very high accuracy of about 99%.

Sakrament [2] is one of the leading developers in the field of sound and speech processing. Its products, which include speech synthesis (text-to-speech) and speech recognition, are offered for the Russian and English languages. The reported recognition accuracy is about 95%.

In Ukraine, speech recognition research is led by the Department of Recognition and Synthesis of the International Center for Science and Education in Information Technologies and Systems [3], located in Kyiv, and by the Ukrainian Association for Information Processing and Pattern Recognition (UAsIPPR). They are the undisputed leaders in Ukrainian speech recognition and synthesis.

1.3 Toolkits for building speech recognition systems

HTK [4] is a portable toolkit for building and manipulating hidden Markov models (HMMs). HMMs can model any time series, which makes HTK a flexible toolkit. It is primarily used for speech recognition research, although it has also been applied in many other areas, including speech synthesis, character recognition and DNA sequencing.
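To illustrate what decoding with a hidden Markov model involves, the following is a minimal sketch of Viterbi decoding for a toy discrete HMM. It uses neither HTK nor Sphinx; the state names, observation symbols and probabilities are invented for illustration only.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (best_state_path, log_probability) for a discrete HMM."""

    def log(p):
        # Log-probabilities avoid numeric underflow on long sequences.
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s]: best log-score of any state path that ends in s at time t.
    best = [{s: log(start_p[s]) + log(emit_p[s][observations[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + log(emit_p[s][observations[t]])
            back[t][s] = prev

    # Trace the best path backwards from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


# Hypothetical two-state model with three observation symbols.
states = ["s1", "s2"]
start_p = {"s1": 0.6, "s2": 0.4}
trans_p = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
emit_p = {
    "s1": {"a": 0.5, "b": 0.4, "c": 0.1},
    "s2": {"a": 0.1, "b": 0.3, "c": 0.6},
}

path, log_prob = viterbi(["a", "b", "c"], states, start_p, trans_p, emit_p)
print(path)  # ['s1', 's1', 's2']
```

In a real recognizer the states would correspond to parts of phonemes and the observations to acoustic feature vectors with continuous emission densities, but the dynamic-programming search follows the same idea.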

Sphinx-4 [5] is one of the most popular and flexible open-source speech recognition systems available today. It was created through a collaboration between the Sphinx group at Carnegie Mellon University, Sun Microsystems Laboratories, Mitsubishi Electric Research Labs (MERL), and Hewlett Packard (HP), with contributions from the University of California at Santa Cruz (UCSC) and the Massachusetts Institute of Technology (MIT).

2 Sphinx architecture

The Sphinx-4 framework has been designed with a high degree of flexibility and modularity. Figure 1 shows the overall architecture of the system. Each labeled element in Figure 1 represents a module that can be easily replaced, allowing researchers to experiment with different module implementations without needing to modify other portions of the system. Like most speech recognition systems, Sphinx-4 has a large number of configurable parameters, such as search beam size, for tuning system performance. The Sphinx-4 ConfigurationManager is used to configure such parameters. Unlike other systems, however, the configuration manager also gives Sphinx-4 the ability to dynamically load and configure modules at run time, yielding a flexible and pluggable system. To give applications and developers the ability to track decoder statistics such as word error rate, runtime speed, and memory usage, Sphinx-4 provides a number of Tools. As with the rest of the system, the Tools are highly configurable, allowing users to perform a wide range of system analysis. Furthermore, the Tools also provide an interactive runtime environment that allows users to modify the parameters of the system while it is running, allowing for rapid experimentation with various parameter settings [6].

Figure 1 – Functional diagram of a speech recognition system based on Sphinx (SWF animation, 52.0 KB)
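The idea of swapping modules through configuration can be sketched in a few lines of Python. This is not the Sphinx-4 API: the class names, the registry and the configuration keys below are hypothetical and only show how a component can be chosen and parameterized at run time without changing the calling code.

```python
class BeamSearch:
    """Hypothetical search module with a tunable beam size."""

    def __init__(self, beam_size=100):
        self.beam_size = beam_size


class ExhaustiveSearch:
    """Hypothetical alternative module with no pruning."""

    def __init__(self):
        self.beam_size = None


# Maps a type name used in the configuration to an implementation class.
REGISTRY = {"beam": BeamSearch, "exhaustive": ExhaustiveSearch}


def build_component(config, name):
    """Instantiate the component described under `name` in the config."""
    spec = dict(config[name])          # copy so the config is not mutated
    cls = REGISTRY[spec.pop("type")]   # pick the implementation by name
    return cls(**spec)                 # remaining keys become parameters


config = {"searchManager": {"type": "beam", "beam_size": 500}}
search = build_component(config, "searchManager")

# Replacing the module only requires editing the configuration.
config["searchManager"] = {"type": "exhaustive"}
other = build_component(config, "searchManager")
```

The caller only knows the component name, so a different implementation or a different beam size can be tried by editing the configuration alone.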

3 Preliminary acoustic model quality estimation using spoken Pascal programming language words

The first study was devoted to estimating model quality. Several experiments were carried out with 5 dictionaries of different sizes: 20, 40, 60, 80 and 100 words. The dictionaries consisted of English words that are lexemes of the Pascal programming language. Two acoustic models were used: the speaker-independent VoxForge model [7] and our own speaker-dependent model, trained on a single speaker. For the speaker-dependent model, each word was spoken 5 times, in isolation. The test audio data was the same for all dictionaries: it consisted of the words of the 20-word dictionary, each spoken 4 times. The results are shown in Figure 2.

Figure 2 – Word error rate as a function of dictionary size

The model trained on one specific speaker is more accurate than the speaker-independent model, because the system copes better with the speaker it was trained on. In addition, the VoxForge model was trained on American speakers, whereas the test audio was recorded by a Russian speaker.
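For reference, word error rate is commonly defined as the number of substitutions, deletions and insertions needed to turn the recognized word sequence into the reference, divided by the number of reference words. The sketch below computes it with a standard edit-distance recurrence; the example words are illustrative and are not data from the experiments above.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / len(reference)."""
    if not reference:
        raise ValueError("reference must contain at least one word")

    # prev[j]: edit distance between the reference prefix processed so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of an extra word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(reference)


# Illustrative example: one of four isolated words is misrecognized.
reference = ["begin", "end", "var", "while"]
hypothesis = ["begin", "and", "var", "while"]
print(word_error_rate(reference, hypothesis))  # 0.25
```

For isolated-word tests such as the one described here, where each utterance yields exactly one recognized word, this reduces to the fraction of misrecognized utterances.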

Conclusion

The research and development overview confirmed that the topic of speech recognition is relevant. The absence of comparable systems that let the user dictate program text underlines the originality of the dissertation.

The analysis of the first version of the system shows that the speech recognition accuracy is rather low. Further work will therefore aim to increase recognition accuracy by improving HMM training and adding a grammar.

References

  1. Dragon Speech Recognition Software. Available at: http://nuance.com/dragon/index.htm
  2. Speech synthesis and recognition. Available at: http://www.sakrament.com/
  3. Website on speech recognition and synthesis in Ukraine. Available at: http://speech.com.ua
  4. What is HTK? [Electronic resource]. Available at: http://htk.eng.cam.ac.uk/
  5. CMU Sphinx Open Source Toolkit For Speech Recognition Evaluation [Electronic resource]. Available at: http://cmusphinx.sourceforge.net/
  6. Sphinx-4: A Flexible Open Source Framework for Speech Recognition [Electronic resource]. Available at: http://cmusphinx.sourceforge.net/sphinx4/doc/Sphinx4Whitepaper.pdf
  7. Welcome – Russian Evaluation [Electronic resource]. Available at: http://www.voxforge.org/ru