[DECtalk] Intelligibility/Listenability criteria

Don Text_to_Speech at GMX.com
Sun Jul 21 16:41:01 EDT 2019


Hi Jayson,

On 7/21/2019 12:10 AM, Jayson Smith wrote:
> I personally prefer formant-based speech synthesizers as opposed to systems
> using pre-recorded human speech. I'm really sorry that formant speech synthesis
> seems to have fallen by the wayside in recent decades, and the focus seems to
> be trying to make things sound as human as possible.

I suspect there are a few reasons for this.

First, a bigger user population, many of whom don't RELY on speech
synthesis to interact with machines but who are coaxed into using
speech by service providers trying to economize on their operations.
E.g., your bank would much rather have you interact with an IVR
than a real person.  And, the IVR would be easier to implement if
it could be driven by text scripts -- instead of having to round up
the voice artist who recorded the original messages!

Second, resources are much cheaper, now.  The original KRM (Kurzweil
Reading Machine) ran on a Nova minicomputer the size of a dishwasher.
It had 32KB (?) of memory and ran at a whopping 500KHz (i.e., half a
megahertz).

And, cost $8000.

I can get 100 times that performance running in something the size
of a sugarcube for $30, today.

Third, those resources make a lot of the "tough problems" go away.
Instead of trying to decompose letter strings into sounds, you can
simply store all of the sounds associated with a particular letter
string -- a pronunciation dictionary.  Then, just look up each word
that you encounter in your input text!
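
To make that concrete, here's a minimal sketch of dictionary-based
lookup in Python.  The entries and phoneme spellings are made up for
illustration; a real engine would have on the order of 100,000 entries
and would fall back to letter-to-sound rules for words it doesn't know.

    # Toy pronunciation dictionary: each word maps to a phoneme string.
    # (Entries are illustrative, not from any real dictionary.)
    PRONUNCIATIONS = {
        "hello": "HH AH L OW",
        "world": "W ER L D",
    }

    def words_to_phonemes(text):
        """Look up each word; unknown words would need letter-to-sound rules."""
        phonemes = []
        for word in text.lower().split():
            phonemes.append(PRONUNCIATIONS.get(word, "<unknown: %s>" % word))
        return " ".join(phonemes)

    print(words_to_phonemes("Hello world"))   # HH AH L OW W ER L D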

Finally, synthesizing the voice with a crude model of a vocal tract
is computationally more expensive than just recording ACTUAL utterances
and building a "dictionary" of those utterances (a diphone inventory).
Then, just pick the right utterances and glue them together.  This also
has the benefit of being able to replicate (instead of modeling!) any
speaker's voice -- they just have to record a bunch of sample text
from which the diphones can be extracted (a couple hours of the
speaker's time).
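
As a rough sketch of that concatenation step (assuming the diphone
waveforms have already been cut out of the recordings -- the names and
the simple end-to-end joining below are illustrative; real systems
smooth the joins and adjust pitch and duration):

    import numpy as np

    # Toy diphone inventory: each name maps to a short waveform.
    # The silence placeholders stand in for real recorded transitions.
    DIPHONES = {
        "k-ey": np.zeros(800),   # /k/ into /ey/
        "ey-k": np.zeros(800),   # /ey/ into /k/
    }

    def synthesize(diphone_names):
        # Glue the selected waveforms end to end; a real synthesizer
        # would cross-fade (e.g., PSOLA) at each join instead.
        return np.concatenate([DIPHONES[name] for name in diphone_names])

    audio = synthesize(["k-ey", "ey-k"])   # roughly "cake"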

> Don't get me wrong, there
> are some great natural voices out there, Alex on MacOS and iOS being one
> example, as well as Amazon Alexa.

But they don't really have to deal with unconstrained input.  Alexa
doesn't have to speak "Your MAC address is 2C:24:55:6C:3F:08".
Or, "Please contact Dr Xerces at x3-4438 in the Hadron Accelerator
building".  Or, "for (i = 1; i < MAXCOUNT; i++)"

And, does Alexa reside IN the home appliance?  Or, is all of the
synthesis done remotely and Alexa just acts like a "dumb speaker"??

> But it seems no matter how hard the
> developers try, there's always going to be some odd case now and then where
> they can't quite match up the bits of recorded speech just right to make it
> sound completely natural.
>
> My favorite TTS systems are DECtalk and Eloquence. I absolutely cannot stand
> ESpeak, there's just something about its sound that gets on my nerves!

But, is it the rendering of the speech that annoys (the "voice")?
Or, the content being "off" that irritates?

For example, the Votrax units had a "voice" that you could FEEL, like
someone dragging a rasp across your eardrum.  Nothing about it was
really "natural".  But, with exposure to it, you could understand what
the input text was, despite its synthesis shortcomings.

If you wanted it to read a recipe for baking a chocolate cake, there was
very little chance of you misunderstanding its intent.  And, you'd
only have to tolerate it for a few paragraphs.

On the other hand, having a book read to you would leave you exhausted.
The sound wasn't pleasant and the work that you had to do to decipher
its intent from the noises it made was just too taxing.


