Introduction to Speaker Identification. Key Problems in Speaker Identification.



Author: Chris Pasich

Source: http://cnx.org/content/m14199/latest/




Introduction

Modern-day security systems are wide-ranging and usually have multiple layers that must be defeated before they can be cracked. Aside from the standard locks, deadbolts, and alarm systems, there are very complex methods for protecting important material. Many of these methods allow or deny a specific individual access to the material: a computer system has to be able to detect a fingerprint, read an individual's eye patterns, or determine the true identity of a speaker. This last point, speaker identification, is the focus of our project.

Summary

Our project aims to determine the true identity of a specific speaker. The speaker says a word to the system, and that word can be any word: the system is text-independent, meaning no specific word is required. The system determines the identity of a user by examining the vowel sounds in the input speech signal. The vowel sounds are analyzed in the frequency domain, specifically by looking at the peaks, or formants, of the frequency response of the signal. These formants are compared with the formants of all of the group members previously stored in the system's database. The group member with the highest resulting value after the comparison is the one the system identifies as the speaker. If no user reaches the set threshold value, the system responds that there is no match for the given speaker.
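The idea of reading formants off the peaks of a frequency response can be sketched as follows. This is a deliberately simplified illustration, not the project's implementation: it uses a naive discrete Fourier transform and treats the strongest local maxima of the magnitude spectrum as the peaks, whereas practical formant estimators usually work on a smoothed spectral envelope. The function names magnitude_spectrum and strongest_peaks, and the synthetic two-tone test signal, are assumptions made for the example.

```python
import cmath
import math


def magnitude_spectrum(samples):
    """Return DFT magnitudes for bins 0..N/2 using a naive O(N^2) transform."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):
        total = sum(
            samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        spectrum.append(abs(total))
    return spectrum


def strongest_peaks(samples, sample_rate, count=2):
    """Return the frequencies (Hz) of the `count` largest spectral peaks, ascending."""
    mags = magnitude_spectrum(samples)
    n = len(samples)
    # A bin is a peak if it is higher than its left neighbour and not lower
    # than its right neighbour.
    peaks = [
        k
        for k in range(1, len(mags) - 1)
        if mags[k] > mags[k - 1] and mags[k] >= mags[k + 1]
    ]
    peaks.sort(key=lambda k: mags[k], reverse=True)
    return sorted(k * sample_rate / n for k in peaks[:count])


# Synthetic "vowel": two sinusoids standing in for the first two formants.
rate = 8000
tone = [
    math.sin(2 * math.pi * 700 * t / rate)
    + 0.6 * math.sin(2 * math.pi * 1200 * t / rate)
    for t in range(400)
]
first_two = strongest_peaks(tone, rate)
```

With 400 samples at 8000 Hz the bins are 20 Hz apart, so both test tones fall exactly on a bin; real speech would need windowing and smoothing before the peaks are this clean.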

Terminology

The task our group performed is called speaker identification, and it is often confused with similar terms. The exact definitions of some of these terms are given below.

Speaker recognition: Determining who is doing the speaking. Generally has two different applications – speaker identification and speaker verification. Also referred to as voice recognition.

Speaker identification: Identifying the exact person who is speaking. The speaker is initially unknown and must be determined by comparison against templates. The number of templates involved can be very large, which is part of what makes correctly identifying a speaker difficult.

Speaker verification: Determining whether the speaker is who he or she claims to be. The speaker's voice is compared to only one template, that of the person he or she claims to be.

Speech recognition: Recognizing the actual words being said, in other words, recognizing what is being said rather than who is speaking. Often confused with voice recognition, which recognizes an individual speaker.

The Questions

The issues with speech recognition in general are complex and wide-ranging. One of the main problems lies in the complexity of the actual speech signal itself. In such signals, as in signal 1 below, it is very difficult to interpret the large amounts of information presented to a system.

One of the more evident problems is the jaggedness of the signal. A natural speech signal is not smooth; instead, it fluctuates almost nonstop. Another naturally occurring property of speech is fluctuation in the volume, or amplitude, of the signal. Different people emphasize different syllables, letters, or words in different ways. If two signals have different volume levels, they are very difficult to compare. Speech signals also have a very large number of peaks in a short period of time. These peaks correspond to the syllables in the words being spoken. Comparing two signals becomes much more difficult as the number of peaks increases, because results are easily skewed by a higher peak and then interpreted incorrectly. The speed at which the input signal is given is also an important issue. A user saying their name at a speed different from the speed at which they normally speak can change the results: two versions of the same pattern are being compared, but they are spoken over different lengths of time, and that difference must be accounted for. Finally, in the context of speaker verification, an individual may attempt to mimic the speech of another person. If the imitation is good enough, the imitator could be accepted by the system.

The Answers

  • How do you deal with the jaggedness of the signal and the noise introduced to the signal through the environment?
  • To account for this, all of the signals are passed through a smoothing filter. The filter accomplishes two tasks: first, it removes excess noise. Second, it removes the high-frequency jaggedness in the signal and leaves behind only the magnitude of the signal. The result is a clean signal that is fairly easy to process.
  • How do you account for the different volumes of speakers?
  • The signals must all be normalized to the same volume before they are examined. Each signal is normalized about zero such that all of the signals will have the same relative maximum and minimum values, and so that comparing two signals with different volumes is the same as comparing the same two signals if they were to have the same volume.
  • How do you examine each of the individual peaks?
  • Just after the signal is smoothed by the filter, we use an envelope function to detect all of the peaks of the signal. This ensures that any part of the signal that passes a certain threshold will be examined and compared with the corresponding signal in the database. The analysis is not an analysis of the entire signal, but rather a formant analysis. The formants of the individual vowel sounds in the signal are examined, and those are used to verify the speaker.
  • How does the system handle varying speeds of inputs?
  • Both the formant analysis and the envelope functions will be used to help with varying input speeds. The envelope of the peak will determine which vowels are available, and the actual formants themselves will be relatively unchanged. It is difficult to handle very high speed voices, but most other voices can be handled effectively.
  • How can you account for imitating speech patterns?
  • Once again, the formants of the individual signals are analyzed to determine whether a speaker is who he or she claims to be. In most cases, the imitator's formants do not match up closely with those stored in the database, and the imitator will be denied by the system.
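The preprocessing steps in the answers above (smoothing, volume normalization, and thresholding the envelope) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the project's code: a rectify-and-moving-average filter stands in for the unspecified smoothing filter, normalization simply scales to a peak magnitude of 1, and the names smooth_envelope, normalize, and active_segments, along with the window and threshold values, are hypothetical.

```python
def smooth_envelope(signal, window=5):
    """Rectify the signal, then moving-average it to get a magnitude envelope."""
    rectified = [abs(x) for x in signal]
    half = window // 2
    envelope = []
    for i in range(len(rectified)):
        lo = max(0, i - half)
        hi = min(len(rectified), i + half + 1)
        envelope.append(sum(rectified[lo:hi]) / (hi - lo))
    return envelope


def normalize(values):
    """Scale values so the largest magnitude is 1; a silent input is unchanged."""
    peak = max((abs(v) for v in values), default=0.0)
    if peak == 0:
        return list(values)
    return [v / peak for v in values]


def active_segments(envelope, threshold=0.5):
    """Return (start, end) index pairs where the envelope is at or above threshold."""
    segments = []
    start = None
    for i, value in enumerate(envelope):
        if value >= threshold and start is None:
            start = i
        elif value < threshold and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(envelope)))
    return segments


# A loud burst followed by a quieter one, separated by silence.
signal = [0.0] * 10 + [2.0, -2.0] * 5 + [0.0] * 10 + [1.0, -1.0] * 5 + [0.0] * 10
envelope = normalize(smooth_envelope(signal, window=3))
segments = active_segments(envelope, threshold=0.4)
```

Each returned segment marks a stretch of the signal that would go on to formant analysis; anything below the threshold is skipped.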

Formant Comparisons and Identifying the Speaker

After everything is broken down, all that is left for the system to do is the easy part: a simple comparison between the input formants and the formants in the database. The first step is determining which vowel is actually being spoken. This is simply an examination of the location of the first two formant peaks. If they both fall within the range of a specific vowel's first two formants, they represent that vowel. That range is stored within the database. These ranges are very well defined for each individual vowel and are adjusted to the members of the group. For example, the range for the first formant of a vowel extends from just below the lowest-frequency first formant in the group to just above the highest-frequency first formant in the group. If the input does not fall in the range of a vowel, that vowel is not the correct one, and the system tries the next vowel. It repeats this process until it either finds a vowel or goes through all vowel sounds in the database. If the formants do not fall within any vowel's range, the vowel sound is ignored.
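The range lookup described above can be sketched like this. The vowel labels, the frequency values in VOWEL_RANGES, and the 50 Hz margin are illustrative placeholders, not the values stored in the project's database, and the function names are hypothetical.

```python
# Hypothetical ranges in Hz: vowel -> ((F1 low, F1 high), (F2 low, F2 high)).
VOWEL_RANGES = {
    "ee": ((200, 400), (2000, 2600)),
    "ah": ((600, 900), (1000, 1400)),
    "oo": ((250, 450), (700, 1100)),
}


def range_from_group(formants, margin=50):
    """Build a range reaching just below the group's lowest formant and just above its highest."""
    return (min(formants) - margin, max(formants) + margin)


def classify_vowel(f1, f2, ranges=VOWEL_RANGES):
    """Return the first vowel whose F1 and F2 ranges both contain the input, else None."""
    for vowel, ((f1_lo, f1_hi), (f2_lo, f2_hi)) in ranges.items():
        if f1_lo <= f1 <= f1_hi and f2_lo <= f2 <= f2_hi:
            return vowel
    return None  # no range matched, so the vowel sound is ignored


vowel = classify_vowel(300, 2300)
```

Here classify_vowel tries each vowel in turn, mirroring the "try the next vowel" loop in the text, and returns None when no range matches.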

The second step is the actual comparison. The frequency response of the input vowel sound is multiplied in a dot product with each member's previously stored frequency response for the vowel determined in the first step. The dot products produce a matrix of scores, one per member, each ranging from 0 to 1, with 1 being a perfect match and 0 being an entirely incorrect match.
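One way a dot product can yield a score between 0 and 1 is to divide it by the lengths of the two vectors (cosine similarity); for non-negative magnitude responses the result then stays in that range. The source does not say how the responses were normalized, so the normalization below, like the name match_score, is an assumption made for illustration.

```python
import math


def match_score(input_response, stored_response):
    """Normalized dot product of two equal-length magnitude responses."""
    if len(input_response) != len(stored_response):
        raise ValueError("responses must have the same number of frequency bins")
    dot = sum(a * b for a, b in zip(input_response, stored_response))
    norm = math.sqrt(sum(a * a for a in input_response)) * math.sqrt(
        sum(b * b for b in stored_response)
    )
    if norm == 0:
        return 0.0  # an all-zero response matches nothing
    # Clamp to guard against floating-point results slightly above 1.
    return min(1.0, dot / norm)


# Hypothetical stored responses for one vowel, one entry per group member.
stored = {"member_a": [0.1, 0.9, 0.3, 0.0], "member_b": [0.8, 0.1, 0.0, 0.4]}
input_response = [0.2, 1.8, 0.6, 0.0]  # same shape as member_a, twice as loud
scores = {name: match_score(input_response, resp) for name, resp in stored.items()}
```

Because the score is normalized, a louder copy of the same response still scores essentially 1 against the stored template.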

This process is repeated for each vowel sound in the word. The matrices are then added together, and the system identifies the speaker as the individual with the highest score. If, however, that individual does not pass a threshold value, then the system determines there is no match.
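The final decision can be sketched as summing the per-vowel scores for each member and accepting the top scorer only if the total reaches the threshold. The data layout (one dictionary of member scores per vowel), the name identify_speaker, and the threshold values are assumptions; the source does not state the threshold that was used.

```python
def identify_speaker(per_vowel_scores, threshold):
    """Sum each member's scores over all vowels; return the best member or None."""
    totals = {}
    for scores in per_vowel_scores:
        for member, score in scores.items():
            totals[member] = totals.get(member, 0.0) + score
    if not totals:
        return None  # no vowels were recognized in the word
    best = max(totals, key=totals.get)
    return best if totals[best] >= threshold else None


# Hypothetical scores for a word containing two recognized vowel sounds.
word_scores = [
    {"member_a": 0.9, "member_b": 0.6},
    {"member_a": 0.8, "member_b": 0.7},
]
speaker = identify_speaker(word_scores, threshold=1.5)
```

Raising the threshold trades false acceptances for more "no match" results, which matters most when the speaker is not in the database at all.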

Conclusions

At the start of the experiment, the plan was to build a text-dependent system, or a speaker verification system, or something that could actually determine what word the user was speaking to the system. It became painfully clear that this would be a very difficult task to accomplish, and would require much more time, effort, and background on the subject than we could possibly acquire in a short period of time. Our system was, for what it did, relatively successful: it found vowels with regularity, and it identified speakers at a rate of almost 70%, a very good rate for a basic system.

Just as obvious, however, is how much deeper speech recognition goes than the scope of our project. Being able to determine what is spoken or who a speaker is with near-perfect accuracy is an extremely formidable task. Preventing another individual from breaking into the system can be just as difficult, as it requires a text-dependent system that will not accept anything other than what it specifies. Our initial idea of being able to determine what word was being spoken was, at best, naive, and at worst not at all feasible. With that said, the end results were very acceptable.