[DECtalk] Intelligibility/Listenability criteria

Don Text_to_Speech at GMX.com
Sun Jul 21 22:13:19 EDT 2019


Hi Blake,

On 7/21/2019 6:46 AM, Blake Roberts wrote:
> Don, For me, whether a speech synthesizer is tolerable or not depends on a
> few factors. 1. How realistic the voice sounds, naturalness.

By naturalness I'm assuming you mean prosody and lack of audible
artifacts (pops, squeals, clicks, etc.).

But, would your opinion change if it had other "warts"?  For example,
if it always pronounced "read" as "reed", regardless of context?
Or, if all numbers were read as strings of digits -- so "2019" was
pronounced "two zero one nine"?  Or, if it mispronounced a large
number of words (but did so with a pleasant voice)?
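To make the "2019" case concrete, here is a minimal sketch of the two readings.  It is only an illustration, not how DECtalk or any other synthesizer actually normalizes numbers; the function names (as_digits, as_year, two_digits) are hypothetical, and as_year assumes a string of exactly four digits.

```python
# Illustrative sketch only: names and rules here are assumptions,
# not taken from any real text-to-speech front end.
ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty",
        "fifty", "sixty", "seventy", "eighty", "ninety"]


def as_digits(text):
    """Naive reading: speak every digit separately."""
    return " ".join(ONES[int(ch)] for ch in text)


def two_digits(n):
    """Spell out an integer in the range 0..99."""
    if n < 10:
        return ONES[n]
    if n < 20:
        return TEENS[n - 10]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else TENS[tens] + " " + ONES[ones]


def as_year(text):
    """Year-style reading of a four-digit string (assumed input format)."""
    if text[1:] == "000":
        return ONES[int(text[0])] + " thousand"
    hi, lo = int(text[:2]), int(text[2:])
    if lo == 0:
        return two_digits(hi) + " hundred"
    if lo < 10:
        return two_digits(hi) + " oh " + ONES[lo]
    return two_digits(hi) + " " + two_digits(lo)


print(as_digits("2019"))  # two zero one nine
print(as_year("2019"))    # twenty nineteen
```

Even this toy version shows where the hard part lies: the code cannot tell whether "2019" is a year, a quantity, or part of a phone number.  That decision needs context from the surrounding text.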

I find that each unexpected encounter causes my attention to be
diverted into contemplating what it might have been trying to say
instead of continuing to listen to the text that follows.  This gets
frustrating -- like taking two steps forwards and one step back.

> 2. Whether the synthesizer can handle the amount of text given to it by the
> screen reading software  without crashing.

Ah, that's simply unacceptable.  If it doesn't work, then it's broken
and I don't want to have my time wasted by it.  As with the previous
comment, I'd rather make progress at a reduced rate or with lower quality
audio than to have to keep "starting over" each time the algorithm gags
or the application crashes.

> 3. If I can listen to the synthesizer for a long period without getting ear fatigue.

Understood.  But, hard to put a number on that.  And, what you consider
fatigue may not be a problem for someone else.

On the other hand, you can count the number of mispronunciations
and that number doesn't change, regardless of listener.
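As a rough sketch of what such a count could look like: compare the phonemes a synthesizer produces for each word against a reference lexicon and tally the mismatches.  Everything below is assumed for illustration.  The function name, the ARPAbet-style phoneme strings, and the "synthesizer output" are made-up sample data, not measurements of any real product.

```python
# Illustrative sketch only: the lexicon and the synthesizer output
# below are invented sample data, not results from a real synthesizer.
def count_mispronunciations(words, synth_output, reference):
    """Return (mismatches, words_checked) for words present in the reference."""
    mismatches = 0
    checked = 0
    for word in words:
        expected = reference.get(word)
        if expected is None:
            continue  # no reference pronunciation, so it cannot be scored
        checked += 1
        if synth_output.get(word) != expected:
            mismatches += 1
    return mismatches, checked


# Hypothetical reference pronunciations (ARPAbet-style notation).
reference = {
    "cat": "K AE T",
    "yacht": "Y AA T",
    "colonel": "K ER N AH L",
}

# Hypothetical output of a letter-to-sound front end.
synth_output = {
    "cat": "K AE T",
    "yacht": "Y AE CH T",
    "colonel": "K AA L AH N EH L",
}

wrong, total = count_mispronunciations(
    ["cat", "yacht", "colonel"], synth_output, reference)
print(wrong, "of", total, "words mispronounced")  # 2 of 3 words mispronounced
```

The count depends only on the word list and the reference lexicon, so two people running the same test get the same number.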

> Let me provide two examples. Years ago I purchased AT&T voices from
> nextup.com for use with the TextAloud program. Since AT&T voices are
> SAPI5 compatible, I chose to use them with my screen reader. That was a
> mistake. The voices are so large in size that they would consistently crash
> after being my JAWS screen reader voice for a minute or two. To me, the

Is that because of a lack of sufficient resources in your machine?
More RAM, faster processor?  Annoying to see how much bloat modern
software requires for a job that was handled decades ago by relatively
simple hardware!

> AT&T voices I purchased also sound monotone, so I could not tolerate
> listening to AT&T voices for hours on end in any event. I think there is a
> newer version of the AT&T voices from Wizard Software which NextUp does not
> have access to/does not sell. I can only share my perspective based on the
> voices which I have.
>
> On my Windows 10 system at home, I prefer either Eloquence or Microsoft
> Mark. When I am using JAWS Professional Edition on my work laptop, I prefer
> Microsoft Mark or the Vocalizer British English Vocalizer voice Malcolm
> although I happen to reside in the U.S. Malcolm sounds natural, does not
> crash and I enjoy listening to him for hours.

So your focus really seems to be on the "voice".  But, then again,
you're listening to it for prolonged periods of time so one would
assume you would want it to be as "easy on the ears" as possible.

I'm amused that no one has pointed to more technical issues regarding
the conversion process -- upstream of the voicing.

> These are my thoughts. I know that some people evaluate a synthesizer voice
> on how fast it can talk. I do not use that criteria myself as an end-user
> because I prefer slow or medium speed. If a voice is set too fast, I cannot
> understand it.

I think a lot depends on the content.  For me, comprehension seems
to vary inversely with reading speed.  I'd want to read a contract
more slowly and carefully than a pulp novel or the day's news!
