The quality of a speech recognition system can be expressed by several figures. Besides recognition speed, usually specified as a real-time factor (RTF), recognition quality can be measured as word accuracy or as word recognition rate.
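As a rough illustration of these figures, the following sketch computes them from a reference transcript and a recognizer's output. It assumes the commonly used definitions: word recognition rate = (N - S - D) / N and word accuracy = (N - S - D - I) / N, where N is the number of reference words and S, D, I are the substitutions, deletions and insertions of a minimum-edit alignment. It also assumes the real-time factor is processing time divided by audio duration. The function names are hypothetical.

```python
def align_counts(reference, hypothesis):
    """Return (substitutions, deletions, insertions) of a minimum-edit word alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # cost[i][j] = (total_errors, S, D, I) for aligning ref[:i] with hyp[:j]
    cost = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, i, 0)          # only deletions
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, 0, 0, j)          # only insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            e, s, d, n = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (e, s, d, n)            # match, no error
            else:
                diagonal = (e + 1, s + 1, d, n)    # substitution
            e, s, d, n = cost[i - 1][j]
            deletion = (e + 1, s, d + 1, n)
            e, s, d, n = cost[i][j - 1]
            insertion = (e + 1, s, d, n + 1)
            # Tuples compare by total error count first.
            cost[i][j] = min(diagonal, deletion, insertion)
    _, s, d, n = cost[len(ref)][len(hyp)]
    return s, d, n


def word_recognition_rate(reference, hypothesis):
    """(N - S - D) / N: share of reference words recognized correctly."""
    total = len(reference.split())
    s, d, _ = align_counts(reference, hypothesis)
    return (total - s - d) / total


def word_accuracy(reference, hypothesis):
    """(N - S - D - I) / N: like the recognition rate, but also penalizes insertions."""
    total = len(reference.split())
    s, d, n = align_counts(reference, hypothesis)
    return (total - s - d - n) / total


def real_time_factor(processing_seconds, audio_seconds):
    """Processing time relative to audio length; below 1.0 is faster than real time."""
    return processing_seconds / audio_seconds
```

For the reference "call the billing department" and the output "call billing department now", the alignment contains one deletion and one insertion, so the recognition rate is 0.75 while the accuracy drops to 0.5.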
If grammars are used, they are usually context-free grammars. In that case, however, every word must be assigned its function within the grammar. Such systems are therefore usually used only for a limited vocabulary and special applications, not in current speech recognition software for PCs.
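To make the grammar-based approach concrete, here is a minimal sketch of a context-free grammar for a very small command vocabulary, together with a recognizer that accepts only word sequences the grammar can derive. The grammar, its vocabulary and the function names are invented for illustration. The recognizer is a simple top-down search that works only because this grammar has no left recursion.

```python
# Hypothetical toy grammar: every word has exactly one role (VERB, ARTICLE, OBJECT).
GRAMMAR = {
    "COMMAND": [["VERB", "OBJECT"], ["VERB", "ARTICLE", "OBJECT"]],
    "VERB": [["check"], ["pay"]],
    "ARTICLE": [["my"], ["the"]],
    "OBJECT": [["balance"], ["bill"]],
}


def derive(symbol, words, start):
    """Yield every end index such that words[start:end] can be derived from symbol."""
    if symbol not in GRAMMAR:
        # Terminal symbol: it must match the word at the current position.
        if start < len(words) and words[start] == symbol:
            yield start + 1
        return
    for production in GRAMMAR[symbol]:
        positions = {start}
        for part in production:
            # Advance every candidate position through the next part of the production.
            positions = {end for pos in positions for end in derive(part, words, pos)}
        yield from positions


def accepts(sentence):
    """True if the whole sentence can be derived from COMMAND."""
    words = sentence.lower().split()
    return len(words) in set(derive("COMMAND", words, 0))
```

With this grammar, accepts("check my balance") and accepts("pay bill") hold, while "balance check" is rejected because no production assigns the words a role in that order. Any word outside the vocabulary is rejected as well, which is why the approach suits narrow applications.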
There are also early applications of speech recognition systems to evaluating the intelligibility of pathological speech. To understand how a speech recognition system works, one must first be clear about the challenges it has to deal with.
In an everyday spoken sentence, the individual words are pronounced without a perceptible pause between them. A human listener intuitively identifies the transitions between words, something that earlier speech recognition systems were incapable of doing. They required discrete (discontinuous) speech, with artificial pauses between words.
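The dependence on pauses can be illustrated with a small sketch that cuts a signal into word candidates wherever the energy stays low for several frames. This is an assumed, simplified technique (energy thresholding over precomputed frame energies), not a description of how any particular early system worked. The function name, the threshold and the minimum pause length are hypothetical.

```python
def split_on_pauses(frame_energies, threshold=0.1, min_pause_frames=3):
    """Return (start, end) frame ranges, end exclusive, separated by long enough pauses."""
    segments = []
    start = None       # first frame of the segment currently being collected
    silent_run = 0     # consecutive low-energy frames seen inside that segment
    for i, energy in enumerate(frame_energies):
        if energy >= threshold:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_pause_frames:
                # The pause is long enough: close the segment before the silence began.
                segments.append((start, i - silent_run + 1))
                start = None
                silent_run = 0
    if start is not None:
        # Close a segment that runs to the end, dropping trailing silence.
        segments.append((start, len(frame_energies) - silent_run))
    return segments
```

When the speaker pauses between words, each word comes out as its own segment. With continuous speech the energy never stays low long enough, so the whole utterance is returned as a single segment and the word boundaries remain unknown.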
Modern systems, however, are also capable of understanding continuous (fluent) speech. In many languages there are words or word forms that have different meanings but are pronounced the same. Such words are called homophones. Because a speech recognition system, unlike a human, usually has no knowledge of the world, it cannot choose between the possible candidates on the basis of their meaning.
The question of upper or lower case also falls into this area. At the acoustic level, the position of the formants plays a particular role: the frequency components of spoken vowels typically concentrate around certain frequencies, called formants.
The two lowest formants are especially important for distinguishing vowels: the lower one lies in the range of 200 to 800 hertz, the higher one in the range of 800 to 2400 hertz. The position of these frequencies makes it possible to tell the individual vowels apart.
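How two formant values can identify a vowel is sketched below with a nearest-neighbor lookup. The reference values are rough, textbook-style averages for an adult male speaker and serve only as illustration. Real formants vary considerably between speakers, and the plain Euclidean distance in hertz is a simplifying assumption. The table and function names are hypothetical.

```python
import math

# Approximate (F1, F2) pairs in hertz; illustrative values, not measurements.
REFERENCE_FORMANTS = {
    "i": (270, 2290),   # as in "beet"
    "e": (530, 1840),   # as in "bet"
    "a": (730, 1090),   # as in "father"
    "u": (300, 870),    # as in "boot"
}


def classify_vowel(f1_hz, f2_hz):
    """Return the reference vowel whose (F1, F2) pair is closest to the measured pair."""
    return min(
        REFERENCE_FORMANTS,
        key=lambda vowel: math.dist((f1_hz, f2_hz), REFERENCE_FORMANTS[vowel]),
    )
```

A measured pair such as (280, 2250) falls closest to "i", while (700, 1100) falls closest to "a". Both formants of every entry stay within the ranges given above.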
Consonants are comparatively difficult to detect; certain consonants (the so-called plosives), for example, can only be identified through the transition to the adjacent sounds. Within a spoken word, the consonant p (more precisely, the closure phase of the phoneme p) is in fact recognized only by its transitions to the neighboring vowels.
Other consonants can be recognized quite well by their spectral patterns. It is noteworthy that most of the information relevant for distinguishing these sounds lies outside the spectral range transmitted in telephone networks (up to about 3.4 kHz).