Helping machines sound more human…
Core business: Speech synthesis
Employees: 12 full-time engineers including the management team
When Matthew Aylett answers the phone, callers may wonder whether it is really him. After all, his company, CereProc (www.cereproc.com), is one of the world's leading developers of speech synthesis systems, and he may just be trying out one of its latest creations.
There are many bad examples of speech synthesis, from speak-your-weight machines to Daleks. Text-to-speech (TTS) technology can also have its funny side – e.g. the American editor who didn't realise Stephen Hawking was English because of his "American" accent. Nor is voice synthesis anything new: Alexander Graham Bell also tried to develop a "Speaking Machine." But according to Aylett, robot-like voices will soon be a thing of the past as machines learn to speak more like humans, thanks to recent technological advances.
“We give machines a personality, character and emotion,” says Aylett. “We can also clone human voices, duplicating them so a machine can use the voice, or humans can replace their own voices.”
Aylett, who is CereProc's chief technical officer and co-founded the company with Paul Welham and Christopher Pidcock six years ago, is a self-confessed evangelist for "characterful" voices for machines. "I want to get the industry to move past conventional command and control systems," Aylett explains.
It is ironic that when CereProc first developed its interest in voice synthesis, many industry pundits did not see the appeal of more human-like voices. The orthodox opinion was that machines should sound neutral, but Aylett questioned the value of the “unearthly” voices of so many systems.
Aylett's interest in creating emotional voices started eight years ago, when his sister was investigating "bullying" as part of the "FearNot" project at Heriot-Watt University and asked him if he could produce five "expressive," characterful child voices to help her research. Aylett, then a senior engineer at Rhetorical Systems, a TTS company which spun out of Edinburgh University in 2000, realised that the technology of the time needed to be improved to achieve this goal.
Rhetorical was later bought by Nuance (formerly ScanSoft), and after a spell working at the International Computer Science Institute (ICSI) in Berkeley, California, Aylett teamed up with Pidcock and Welham to form CereProc, "lamenting the lack of innovation by speech technology companies around the world."
Academics, says Aylett, tend to look for publishable, theoretically interesting ideas, while commercial enterprises need end results. CereProc, he says, has taken a "mixed approach" to advance its technology in its quest to develop more characterful voices. The classic industry approach was to think simply in terms of male or female voices, and there was little incentive to create something better. But CereProc's founders had different ideas.
Initially, the company's resources were ploughed into developing the “voice-building process,” so they could produce a wide variety of voices easily and quickly to meet different client requirements. Choosing the right accent for particular uses can be an interesting challenge, says Aylett, because many accents are “loaded” with their own associations. Realistic-sounding speech can be important, but if an automated machine is stating your bank balance, for example, neutral speech may be more appropriate. On the other hand, to read out an entire paragraph, the voice has to have a more “natural” tone to convey complex meaning – or it will sound very boring.
To explain this process, Aylett refers to the “emotional continuum” of voice synthesis systems, including the basics of “happy or sad, positive or negative, active or passive” (see sidebar). To make a voice sound human, it should be capable of speaking in a natural manner that seems to appreciate context and meaning, etc., but Aylett also cautions that the aim is not to misdirect or “trick” the listener but simply make the machine sound more “normal.”
The growth of the voice synthesis market has also been driven by changes in other technologies such as smartphones and tablet computers, which come with built-in microphones and are starting to popularise speech recognition solutions. A few years ago, it was thought the killer application would be automated travel agents, “speaking” to customers over the phone, but visual presentation (e.g. websites) for some applications will always be much more effective.
Like the Daleks, voice synthesis once had a bad reputation, says Aylett, and the technology was "a design-free zone," but this is now rapidly changing. Voice synthesis is not always the most appropriate interface, however: we don't need our toasters to tell us the toast is burned, for example. But voice synthesis systems are now going through a renaissance, because they are being used in more appropriate ways. "The problem is designing it so it is right," Aylett says.
Speech synthesis systems are also beginning to spread into more and more aspects of life, not just electronic games and voice-response systems but mobile and virtual devices, including “smart homes”, where technology “speaks” to people with disabilities such as impaired vision.
Another application gaining in importance is “aggregated data” systems combined with TTS – for example, a “personalised radio” system which can search the web for news alerts and then read them out, in the voice of your choice, as a customised “package,” while you are driving. Other alerts could be set up for air fares or special promotions – much the same as “live” traffic announcements. Radio, says Aylett, is making a comeback, and “push-down” (as opposed to “pull-down”) data systems could be part of this resurgence. “We are wading through a sea of information,” says Aylett, and using TTS could turn it into a “rich audio experience.”
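The "personalised radio" idea described above can be sketched in a few lines: gather text alerts, join them into a single spoken bulletin, and hand the result to a TTS engine. This is only an illustrative assumption of how such a pipeline might be wired up – the alert data and the `speak()` placeholder are hypothetical, not CereProc's actual API.

```python
# Sketch of an aggregated-data-to-TTS pipeline (illustrative only).
# The alert items and speak() are hypothetical stand-ins for a real
# news/price feed and a real text-to-speech engine.

def build_audio_package(alerts: list) -> str:
    """Join fetched alerts into one script for the TTS engine to read out."""
    lines = [f"{a['source']}: {a['headline']}" for a in alerts]
    return "Here is your personalised bulletin. " + " Next item. ".join(lines)

def speak(text: str) -> None:
    # Placeholder: a real system would pass this text to a synthesis voice.
    print(text)

alerts = [
    {"source": "Travel", "headline": "Fares to Barcelona drop 20 percent"},
    {"source": "Traffic", "headline": "Delays on the city bypass"},
]
speak(build_audio_package(alerts))
```

The point of joining everything into one "package" first, rather than speaking each alert as it arrives, is that the bulletin can then be read in a single consistent voice of the listener's choice, as the article describes.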
Aggregated data was one of three CereProc projects funded by Scottish Enterprise under the SMART: Scotland programme, starting in 2006, when the company received a grant of £50,000 to develop a semi-automated voice synthesis system. Voice cloning is another exciting area for CereProc – and not just for fun. If people are about to lose their voices because of a medical problem, their voices can be sampled and then reproduced so they sound "just like themselves." A humorous example of this kind of technology can be heard at http://www.idyacy.com/cgi-bin/bushomatic.cgi, where you can type in words and listen to President Bush say whatever you want him to say.
CereProc has sometimes swum against the tide as it developed its speech synthesis systems, but has now established itself as an industry leader. In Aylett's view, the future of the company depends on "staying at the cutting edge of the technology," because for such a relatively small business to survive, it has to be one step ahead of its rivals. CereProc's success so far has come from the freedom small companies have to pursue their own vision, and Aylett believes this has enabled it to be more innovative – and will continue to help it in future.
CereProc works with several universities in the UK and abroad, including the University of Edinburgh (particularly the Centre for Speech Technology Research in the School of Informatics), Stanford University, the Fundació Barcelona Media and Heriot-Watt University, which is using CereProc text-to-speech technology in the development of robots with character. Its commercial partners include Interactive Digital, ReadSpeaker, SpeechConcept and JayBee.
The emotional continuum
CereProc uses two separate methods to simulate emotional states. The first is to select a tense or calm voice quality, which corresponds closely to the perception of negative and positive emotional states (and to some extent active or passive ones). The second is to use digital signal processing (DSP) techniques to shift the speech towards active or passive states: active states involve a faster speech rate, higher volume and higher pitch; passive states involve a slower speech rate, lower volume and lower pitch.
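The two-dimensional mapping described above can be sketched as a small function: valence (negative/positive) selects the voice quality, and arousal (passive/active) scales the DSP parameters. The parameter names and scale factors here are illustrative assumptions, not CereProc's actual interface.

```python
# Minimal sketch of mapping the emotional continuum to synthesis settings.
# Voice quality and the 0.2/0.15/0.1 scale factors are assumptions for
# illustration; they are not CereProc's real parameters.

from dataclasses import dataclass

@dataclass
class SynthesisParams:
    voice_quality: str  # "tense" (negative valence) or "calm" (positive)
    rate: float         # speech-rate multiplier (1.0 = neutral)
    volume: float       # volume multiplier
    pitch: float        # pitch multiplier

def emotion_to_params(valence: float, arousal: float) -> SynthesisParams:
    """Map a point on the continuum to DSP settings.

    valence: -1.0 (negative) .. +1.0 (positive)
    arousal: -1.0 (passive)  .. +1.0 (active)
    """
    quality = "tense" if valence < 0 else "calm"
    # Clamp arousal: pushing the DSP too far introduces the artefacts
    # the sidebar warns about.
    arousal = max(-1.0, min(1.0, arousal))
    # Active: faster, louder, higher; passive: slower, quieter, lower.
    return SynthesisParams(
        voice_quality=quality,
        rate=1.0 + 0.2 * arousal,
        volume=1.0 + 0.15 * arousal,
        pitch=1.0 + 0.1 * arousal,
    )

# "Sad" = negative valence + passive arousal: tense quality, slow and low.
sad = emotion_to_params(valence=-0.8, arousal=-0.7)
```

Keeping the multipliers modest reflects the limitation discussed next: the further the parameters are pushed from neutral, the more artificial the result sounds.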
The stronger the emotion, the harder it can be to simulate. This is because the DSP "tricks" used to simulate emotions begin to include "artefacts." For example, slowing the speech rate and lowering the pitch with a negative voice quality will make the voice sound sad – and the slower, the sadder – until the speech sounds artificially slowed down and unnatural.
Thus, although we can simulate a wide variation in the underlying emotion of a voice, we do not have complete control: we cannot make a voice sound furious or in agony.