Google’s digital voice assistant to sound more like humans


FE Team | Published: December 28, 2017 14:56:59 | Updated: December 31, 2017 14:45:10



Google is making its voice assistant sound more like a human than a robot. The company is working on a new text-to-speech system called Tacotron 2, a neural network architecture that synthesises speech directly from text.

“The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesise time-domain waveforms from those spectrograms,” Google explains in a study titled ‘Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions’.
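The mel scale mentioned in the quote spaces frequencies by perceived pitch rather than by raw hertz. As a minimal sketch, the snippet below uses one widely used conversion formula; the study may rely on a different variant, and the function names and sample frequencies are illustrative assumptions.

```python
import math

def hz_to_mel(freq_hz):
    """Convert a frequency in hertz to mels (one common formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# Arbitrary example frequencies, not values taken from the study.
for freq in (100, 1000, 8000):
    print(freq, "Hz ->", round(hz_to_mel(freq), 1), "mel")
```

Equal steps in hertz shrink in mels as frequency rises, which is why a mel-scale spectrogram gives more resolution to the lower frequencies.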

“Our model achieves a mean opinion score (MOS) of 4.53 comparable to a MOS of 4.58 for professionally recorded speech. To validate our design choices, we present ablation studies of key components of our system and evaluate the impact of using mel spectrograms as the input to WaveNet instead of linguistic, duration, and F0 features. We further demonstrate that using a compact acoustic intermediate representation enables significant simplification of the WaveNet architecture.”
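To make the metric concrete: a mean opinion score is the arithmetic mean of listener ratings, typically given on a scale from 1 (bad) to 5 (excellent). The sketch below uses made-up ratings and a hypothetical helper, mean_opinion_score; it is not Google's evaluation code.

```python
from statistics import mean

def mean_opinion_score(ratings):
    """Return the average of listener ratings on a 1-5 scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    return mean(ratings)

# Hypothetical ratings from eight listeners for one synthesised clip.
ratings = [5, 4, 5, 4, 5, 5, 4, 4]
print(f"MOS: {mean_opinion_score(ratings):.2f}")
```

A gap as small as the one quoted (4.53 against 4.58) means listeners rated the synthesised and recorded speech almost identically on average.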

Google has set up a dedicated website to demonstrate what Tacotron 2 can do. The new voice AI can handle words or names that are difficult to pronounce and change its tone depending on the structure of a sentence, reports Quartz. For instance, it can stress capitalised words to indicate their importance, as humans would in conversation.

Tacotron 2 is part of Google’s ongoing efforts to improve its digital assistant, which has moved beyond smartphones and now powers smart home speakers. The company has already made significant progress in making AI voices sound more human. Back in October, Google announced an updated version of its WaveNet, “a new deep neural network for generating raw audio waveforms that is capable of producing better and more realistic-sounding speech than existing techniques.” WaveNet is one of the technologies behind Google’s voice assistant, according to Hindustan Times.

The updated version of WaveNet results in more natural-sounding voices for the Assistant, Google said.

The majority of text-to-speech (TTS), or speech synthesis, systems are based on “concatenative TTS, which uses a large database of high-quality recordings, collected from a single voice actor over many hours. However, these systems can result in unnatural sounding voices and are also difficult to modify because a whole new database needs to be recorded each time a set of changes, such as new emotions or intonations, are needed,” Google said in a blog post.
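The limitation described in the blog post can be illustrated with a toy sketch. It assumes a tiny word-level "database" with invented sample values and a hypothetical synthesise function; production concatenative systems work with far smaller speech units and many hours of audio.

```python
# Toy "database": each recorded unit maps to a list of audio samples.
# Real systems store many hours of recordings; these numbers are made up.
RECORDED_UNITS = {
    "hello": [0.1, 0.3, 0.2],
    "world": [0.4, 0.1],
}

def synthesise(words, database):
    """Join the stored samples for each word, in order."""
    samples = []
    for word in words:
        if word not in database:
            # A unit that was never recorded cannot be produced.
            raise KeyError(f"no recording for {word!r}")
        samples.extend(database[word])
    return samples

print(synthesise(["hello", "world"], RECORDED_UNITS))
```

Requesting a word or speaking style that is missing from the database raises an error here, which mirrors the point in the quote: new material requires new recordings.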
