[DECtalk] Intelligibility/Listenability criteria

Sun Jul 21 17:01:12 EDT 2019

Hi Damien,

On 7/21/2019 1:17 AM, Damien Garwood wrote:
> Here are my criteria:
>
> 1. Understandability
> As a screen reader user who has to listen to speech synthesis on a constant
> basis while using a computer, understandability is first and foremost. If the
> synthesiser can't be understood, then you're not going to get the feedback you
> need. In my opinion, ESpeak ticks every box, except this, so I can't use it.

But understandability is a nebulous term.  If a synthesizer says "bizfut"
every time it encounters the word "Fred", I would consider it highly
understandable.  It would just be tedious to have to make that translation
in your mind each time!

On the other hand, if it's making noises that aren't recognizable as
an F sound followed by an R sound followed by... then I wouldn't care
how accurately it THINKS it knows what to say.

> 2. Responsiveness. Again, because the speech is reading everything for me, I
> don't want a synthesiser that acts sluggishly with any kind of latency, whether
> that be a second, or 50 milliseconds, whether through lack of performance
> optimisation or through audio silence. When I press a key, I want instant
> feedback. This automatically rules out most natural-sounding synthesisers.

I'd not considered this aspect of synthesis.  I've been focused on the
"output" and its ability to faithfully reflect the input provided.

By the way, 300 milliseconds is typically considered the threshold of pain
when it comes to interaction with a user.  At that point, the user has already
moved on from "waiting for a response" to "taking action to reinitiate the
response"... hitting the button a SECOND time!

Moral of the story:  if you're approaching 300 milliseconds, your product sucks!
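
A rough sketch of how one might check a synthesizer against that figure, by timing how long it takes for the first chunk of audio to appear.  Everything here is an assumption made for illustration: `fake_synth` is a hypothetical stand-in for a real engine, and the 300 ms budget is just the number quoted above, not a measured property of any product.

```python
import time

LATENCY_BUDGET_S = 0.300  # the 300 ms figure mentioned above


def time_to_first_chunk(synthesize, text):
    """Seconds elapsed until synthesize(text) yields its first audio chunk."""
    start = time.perf_counter()
    next(iter(synthesize(text)), None)  # pull only the first chunk
    return time.perf_counter() - start


def within_budget(latency_s, budget_s=LATENCY_BUDGET_S):
    """True if the measured latency is under the interaction budget."""
    return latency_s < budget_s


def fake_synth(text):
    """Hypothetical engine: a short setup delay, then one chunk per word."""
    time.sleep(0.01)          # pretend front-end processing
    for _word in text.split():
        yield b"\x00" * 160   # placeholder audio samples


if __name__ == "__main__":
    latency = time_to_first_chunk(fake_synth, "hello world")
    print(f"{latency * 1000:.1f} ms, within budget: {within_budget(latency)}")
```

The point of measuring the first chunk rather than the whole utterance is that the user perceives the delay before speech starts, not the time needed to finish rendering it.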

> 3. Accuracy: It needs to be able to read text accurately for the language it is
> designed for. It's not enough simply to have a phonetics dictionary, but it
> also needs to be able to distinguish between words (Present noun versus present
> verb, for instance).

The latter is the problem of heterophones (also called heteronyms) -- same
spelling, different pronunciations.

Present your child with a birthday present.
Polish the Polish silverware.
Don't desert your friends in the desert.
I object to you treating me as an object!
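
A toy sketch of how a front end might pick between the two readings in sentences like these.  It is only an illustration under loud assumptions: the respellings are made up for readability, and the "is there a determiner within the two preceding words?" test is a crude stand-in for real part-of-speech tagging, not how any particular synthesizer does it.

```python
import re

# Hypothetical respellings, for illustration only.
HETERONYMS = {
    "present": {"noun": "PREZ-ent", "verb": "pri-ZENT"},
    "polish":  {"noun": "POH-lish", "verb": "PAH-lish"},
    "desert":  {"noun": "DEZ-ert",  "verb": "di-ZERT"},
    "object":  {"noun": "OB-jekt",  "verb": "ob-JEKT"},
}
DETERMINERS = {"a", "an", "the"}


def readings(sentence):
    """Return (word, guessed role, respelling) for each known heteronym."""
    words = re.findall(r"[a-z']+", sentence.lower())
    found = []
    for i, word in enumerate(words):
        if word in HETERONYMS:
            # Crude guess: a determiner just before the word suggests the
            # noun-like reading; anything else is treated as a verb.
            context = words[max(0, i - 2):i]
            role = "noun" if DETERMINERS & set(context) else "verb"
            found.append((word, role, HETERONYMS[word][role]))
    return found


if __name__ == "__main__":
    for s in ("Present your child with a birthday present.",
              "Don't desert your friends in the desert."):
        print(readings(s))
```

A heuristic this small breaks easily ("the man will present..." would be misread as a noun), which is rather the point: getting these right takes genuine syntactic context, not just a pronunciation table.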

English is particularly fraught with problems from foreign words creeping
into everyday usage.  So, rules have to accommodate lots of exceptions.
(Apparently Swedish, if I'm remembering the right language, is highly regular.)

Then, there are mispronunciations which have become the norm.
How do YOU pronounce almond?

> 4. Flexibility: The voice timbres should be available to the user, and for the
> most part should adjust smoothly to the change. This is important if a user has
> specialist needs and cannot use the synth in its default state. Speed and pitch
> are definitely a must. Again, this rules out natural synths, since due to the
> nature of recorded samples they start to begin to sound unnatural if you
> attempt to adjust the speed and pitch. The bigger the change, the more unnatural.

I recognize that this is an important part of the user experience when
it comes to synthesis.  But, I've been focused more on the up-front
aspects of the TTS process.  For example, a different voice will still
suffer from the same flaws that occur upstream in the process.  I think
I could more easily tolerate a crappy voice "saying the right things" than
a delightful voice that leaves me wondering what the original text
actually was!

> Like Jason, I also prefer formant synths. My favourite by far is Keynote, which
> to me is the most understandable, but I do love DECTalk for its flexibility. I
> also like Eloquence and the synthetic version of Orpheus.

My preference for formant synthesizers driven by LTS (letter-to-sound)
algorithms (instead of dictionaries) has to do with implementation economies.
I can implement a DECtalk-style synthesizer with far fewer resources than
something like Festival.

But, if the resulting audio "product" sucks, that's a foolish economy!
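
To illustrate the economy argument, here is a minimal sketch of the letter-to-sound idea: a small table of grapheme rules, matched longest-first, covers any input word, whereas a dictionary needs one entry per word.  The rule table and the ARPAbet-like symbols are assumptions chosen for the example; real LTS rule sets are context-sensitive and much larger, and this is not a description of DECtalk's actual rules.

```python
# Toy grapheme-to-phoneme rules (illustrative, not a real rule set).
RULES = {
    "sh": "SH", "ch": "CH", "th": "TH", "ee": "IY", "oo": "UW",
    "a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH", "f": "F",
    "g": "G", "h": "HH", "i": "IH", "j": "JH", "k": "K", "l": "L",
    "m": "M", "n": "N", "o": "AA", "p": "P", "q": "K", "r": "R",
    "s": "S", "t": "T", "u": "AH", "v": "V", "w": "W", "x": "K S",
    "y": "Y", "z": "Z",
}
MAX_LEN = max(len(grapheme) for grapheme in RULES)


def letter_to_sound(word):
    """Greedy longest-match conversion of a word into phoneme symbols."""
    word = word.lower()
    phonemes = []
    i = 0
    while i < len(word):
        for size in range(MAX_LEN, 0, -1):
            chunk = word[i:i + size]
            if chunk in RULES:
                phonemes.extend(RULES[chunk].split())
                i += len(chunk)
                break
        else:
            i += 1  # no rule for this character (digit, punctuation): skip it
    return phonemes


if __name__ == "__main__":
    for w in ("Fred", "ship", "box"):
        print(w, letter_to_sound(w))
```

The whole table fits in a few dozen entries, which is the appeal on small hardware.  The cost shows up as soon as a word ignores the rules, and that is exactly where the "foolish economy" worry comes in.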