What Is a Speech Synthesis Server?
A speech synthesis server is a networked computer that uses speech to prompt a human user for input. These servers rely on several underlying technologies, such as text-to-speech, voice synthesis and voice recognition. Learning how these technologies work together can give you a better appreciation of the inner workings of a speech synthesis server.
Speech Synthesis Server
Many call centers use speech synthesis servers to route callers from the main menu to a human operator who works in a specific department. These servers are usually capable of producing speech as well as understanding it. Speech synthesis servers are also used in Web applications to make them more accessible and interactive.
Speech Synthesis Engine
A speech synthesis engine accepts input in the form of preprogrammed text or real-time commands and outputs intelligible speech. Engines that process preprogrammed text often perform a single task, such as informing visitors that a certain area is off-limits. Engines that accept real-time commands give people with speech impairments a means to communicate, and are also used in telecommunications systems. The voice that you hear from these systems is a combination of recorded human voice samples and algorithms that join the samples together to create the illusion of smooth speech.
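One way to picture how recorded samples are joined smoothly is a crossfade at each seam. The sketch below is only an illustration under simplifying assumptions: audio clips are plain lists of floating-point sample values, and the function name `crossfade_concat` and its `overlap` parameter are hypothetical, not part of any particular engine.

```python
def crossfade_concat(clips, overlap):
    """Join audio clips, blending up to `overlap` samples at each seam.

    Each clip is a list of float sample values (a toy stand-in for audio).
    """
    if not clips:
        return []
    out = list(clips[0])
    for clip in clips[1:]:
        # Never blend more samples than either side actually has.
        n = min(overlap, len(out), len(clip))
        start = len(out) - n
        for i in range(n):
            # Weight of the incoming clip rises across the seam.
            fade_in = (i + 1) / (n + 1)
            out[start + i] = out[start + i] * (1 - fade_in) + clip[i] * fade_in
        # Append whatever part of the clip was not blended.
        out.extend(clip[n:])
    return out


if __name__ == "__main__":
    loud = [1.0] * 4
    quiet = [0.0] * 4
    print(crossfade_concat([loud, quiet], overlap=2))
```

Without the blend, the jump from one recording to the next would be audible as a click; real engines use more elaborate smoothing, but the idea of overlapping the seams is the same.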
Voice Recognition
Voice recognition systems work in the opposite direction: they interpret a human's speech and convert it to text. The interpretation is based on probabilities. For example, in a simple voice recognition system where the only acceptable inputs are "yes" or "no," the computer computes the probability that the user said one or the other. This is possible because the system can compare the phonetic sounds of each input against a database of samples. The principle is the same for systems that accept many inputs, though the likelihood of error is higher.
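The yes/no example can be sketched in a few lines. This is a toy model, not how a production recognizer is built: the two-number feature vectors stand in for real phonetic measurements, the `SAMPLES` database is invented for illustration, and turning distances into probabilities with a softmax is one assumed choice among many.

```python
import math

# Hypothetical sample database: toy feature vectors standing in for
# phonetic measurements of each acceptable input.
SAMPLES = {
    "yes": [(0.9, 0.2), (0.8, 0.3)],
    "no": [(0.1, 0.8), (0.2, 0.9)],
}


def word_probabilities(features, samples=SAMPLES):
    """Return a probability for each known word given one input vector."""
    # Score each word by the distance to its closest stored sample
    # (negated, so that closer means a higher score).
    scores = {
        word: -min(math.dist(features, example) for example in examples)
        for word, examples in samples.items()
    }
    # Softmax turns the scores into probabilities that sum to 1.
    peak = max(scores.values())
    weights = {word: math.exp(score - peak) for word, score in scores.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}


def recognize(features, samples=SAMPLES):
    """Pick the most probable word and report its probability."""
    probabilities = word_probabilities(features, samples)
    best = max(probabilities, key=probabilities.get)
    return best, probabilities[best]


if __name__ == "__main__":
    print(recognize((0.85, 0.25)))
```

Adding more words to the database means more candidates compete for each input, which is why the chance of picking the wrong one grows with the size of the vocabulary.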
Text-to-Speech
Text-to-speech is a technology that converts human-readable text into its phonetic equivalent, then converts that into sound played through speakers. A large part of this process is concerned with interpreting the text and breaking it down into pieces. Each piece can consist of several words and represents an individual phrase. In this way, the text-to-speech engine can render speech that sounds natural to human listeners. Sophisticated text-to-speech engines further break these phrases into individual syllables, each with its own pitch and duration information.
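The two-stage breakdown described above can be sketched as follows. Everything here is an assumption made for illustration: phrases are split on punctuation, words are cut into rough syllable-like units after each run of vowels (real engines use pronunciation dictionaries), pitch simply falls in a straight line across the phrase, and duration is proportional to the number of letters. The names `Unit`, `split_phrases`, `split_units`, `plan_phrase` and `plan_text` are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Unit:
    text: str          # syllable-like chunk of a word
    pitch_hz: float    # target pitch for this chunk
    duration_ms: int   # how long the chunk should last


def split_phrases(text):
    """Break text into phrases at punctuation that usually marks a pause."""
    parts = re.split(r"[,.;:!?]+", text)
    return [part.strip() for part in parts if part.strip()]


def split_units(word):
    """Naively cut a word after each run of vowels (not true syllabification)."""
    word = word.lower()
    units = re.findall(r"[^aeiouy]*[aeiouy]+(?:[^aeiouy]+$)?", word)
    return units or [word]


def plan_phrase(phrase, start_hz=120.0, end_hz=100.0, ms_per_letter=60):
    """Attach a falling pitch and a length-based duration to each unit."""
    units = [unit for word in phrase.split() for unit in split_units(word)]
    steps = max(len(units) - 1, 1)
    return [
        Unit(unit, start_hz + (end_hz - start_hz) * i / steps, ms_per_letter * len(unit))
        for i, unit in enumerate(units)
    ]


def plan_text(text):
    """Return one list of units per phrase in the text."""
    return [plan_phrase(phrase) for phrase in split_phrases(text)]


if __name__ == "__main__":
    for phrase_plan in plan_text("For sales, press one."):
        print(phrase_plan)
```

Planning each phrase separately lets the pitch contour restart at every pause, which is a large part of what makes the rendered speech sound less flat.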