[DECtalk] Intelligibility/Listenability criteria

Tue Jul 23 09:38:53 EDT 2019

Hi,

A few points here.
First, I think it's a bit of both: the pronunciation and the voice
itself get on my nerves with ESpeak.

Second, I'd argue that Alex and Alexa do have to contend with
unconstrained input. If I go into my Alexa web portal and put
"Sfjsaofhdsahbfiuewfbhifgbfvbiuewqfbirewqfbiwfbiubifdsava" on my
shopping list, then ask her to read my shopping list, she's going to
have to deal with that horrible mess of text.
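
To make that concrete, here is a rough Python sketch of how a text
front end might cope with input like this before it ever reaches the
voice. Everything in it is an assumption for illustration: the tiny
LEXICON, the consonant-run heuristic, and the function names are made
up, and I have no idea how Alexa or Alex actually handle it.

```python
import re

# Hypothetical mini-lexicon; a real system would use a large
# pronunciation dictionary instead.
LEXICON = {"milk", "eggs", "bread"}

# Six colon-separated hex pairs, e.g. 2C:24:55:6C:3F:08
MAC_RE = re.compile(r"^[0-9A-Fa-f]{2}(:[0-9A-Fa-f]{2}){5}$")
VOWELS = set("aeiouy")  # treating 'y' as a vowel is a simplification


def looks_pronounceable(word, max_consonant_run=5):
    """Crude heuristic: reject words with no vowel at all, or with
    more than max_consonant_run consonants in a row."""
    run = 0
    saw_vowel = False
    for ch in word.lower():
        if ch in VOWELS:
            saw_vowel = True
            run = 0
        else:
            run += 1
            if run > max_consonant_run:
                return False
    return saw_vowel


def normalize_token(token):
    """Return a speakable form of one whitespace-delimited token."""
    if MAC_RE.match(token):
        # Read a MAC address digit by digit, with a pause between pairs.
        return ", ".join(" ".join(pair.upper()) for pair in token.split(":"))
    if token.lower() in LEXICON:
        return token
    if token.isalpha() and looks_pronounceable(token):
        # Unknown but plausible word: leave it to letter-to-sound rules.
        return token
    # Last resort: spell the token out character by character.
    return " ".join(token.upper())
```

With this sketch, "milk" passes through untouched, the MAC address
from Don's example comes out as "2 C, 2 4, 5 5, 6 C, 3 F, 0 8", and my
shopping-list mess gets spelled out one letter at a time. The point is
only that some fallback has to exist; the engine can't refuse the text.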

Jayson

On 7/21/2019 4:41 PM, Don wrote:
> Hi Jayson,
>
> On 7/21/2019 12:10 AM, Jayson Smith wrote:
>> I personally prefer formant-based speech synthesizers as opposed to
>> systems using pre-recorded human speech. I'm really sorry that formant
>> speech synthesis seems to have fallen by the wayside in recent decades,
>> and the focus seems to be trying to make things sound as human as
>> possible.
>
> I suspect there are a few reasons for this.
>
> First, a bigger user population, many of whom don't RELY on speech
> synthesis to interact with machines but who are coaxed into using
> speech by service providers trying to economize on their operations.
> E.g., your bank would much rather have you interact with an IVR
> than a real person.  And, the IVR would be easier to implement if
> it could be driven by text scripts -- instead of having to round up
> the original "voice artist" who recorded the original messages!
>
> Second, resources are much cheaper, now.  The original KRM ran
> on a Nova minicomputer the size of a dishwasher.  It had 32KB (?)
> of memory and ran at a whopping 500 kHz (i.e., half a megahertz).
>
> And, cost $8000.
>
> I can get 100 times that performance running in something the size
> of a sugarcube for $30, today.
>
> Third, those resources make a lot of the "tough problems" go away.
> Instead of trying to decompose letter strings into sounds, you can
> simply store all of the sounds associated with a particular letter
> string -- a pronunciation dictionary.  Then, just look up each word
> that you encounter in your input text!
>
> Finally, synthesizing the voice by a crude model of a vocal tract
> is computationally more expensive than just recording ACTUAL utterances
> and building a "dictionary" of those utterances (diphone inventory).
> Then, just pick the utterances and glue them together.  This also
> has the benefit of being able to replicate (instead of modeling!) any
> speaker's voice -- they just have to record a bunch of sample text
> from which the diphones can be extracted (a couple of hours of the
> speaker's time).
>
>> Don't get me wrong, there
>> are some great natural voices out there, Alex on MacOS and iOS being one
>> example, as well as Amazon Alexa.
>
> But they don't really have to deal with unconstrained input. Alexa
> doesn't have to speak "Your MAC address is 2C:24:55:6C:3F:08".
> Or, "Please contact Dr Xerces at x3-4438 in the Hadron Accelerator
> building".  Or, "for (i = 1; i < MAXCOUNT; i++)"
>
> And, does Alexa reside IN the home appliance?  Or, is all of the
> synthesis done remotely and Alexa just acts like a "dumb speaker"??
>
>> But it seems no matter how hard the developers try, there's always
>> going to be some odd case now and then where they can't quite match up
>> the bits of recorded speech just right to make it sound completely
>> natural.
>>
>> My favorite TTS systems are DECtalk and Eloquence. I absolutely cannot
>> stand ESpeak, there's just something about its sound that gets on my
>> nerves!
>
> But, is it the rendering of the speech that annoys (the "voice")?
> Or, the content being "off" that irritates?
>
> For example, the Votrax units had a "voice" that you could FEEL, like
> someone dragging a rasp across your eardrum.  Nothing about it was
> really "natural".  But, with exposure to it, you could understand what
> the input text was, despite its synthesis shortcomings.
>
> If you wanted to read a recipe for baking chocolate cakes, there was
> very little chance of you misunderstanding its intent.  And, you'd
> only have to tolerate it for a few paragraphs.
>
> On the other hand, having a book read to you would leave you exhausted.
> The sound wasn't pleasant and the work that you had to do to decipher
> its intent from the noises it made was just too taxing.
>