[DECtalk] Intelligibility/Listenability criteria

Don Text_to_Speech at GMX.com
Tue Jul 23 11:05:53 EDT 2019


Hi Blake,

On 7/23/2019 5:44 AM, Blake Roberts wrote:
> When I referenced naturalness previously, I was summarizing my individual
> criteria. Yes, I do consider how a specific voice pronounces words
> consistently when judging its naturalness. For example, the old Nuance U.K.
> English voice Emily consistently says "restart" as "restert", as in when one
> restarts a computer. I know that word is a voice-specific issue because
> other U.K. voices, Daniel and Malcolm, also from Nuance, pronounce the word
> correctly.

OK.  Similar to my "maysh" and "mash" example.  That's really no
different from encountering a person who happens to speak differently,
like my are-kansas friend.

If it's some number of select words, then you make allowances for that
idiosyncrasy.  If it's a pattern of pronouncing certain types of words
or phonemes "oddly", you might treat it as an accent, of sorts.

But, either way, they interfere with your comprehension: they require
your brain to "stick" on them, even momentarily, before grasping their
intent.

> Reading through your questions to various members of this list, I feel like
> an individual who is taking a survey. Are you a researcher or scientist? I
> ask due to your reference to measurable criteria.

I've been extensively automating my home as a showcase project to
demonstrate how one can approach "universal design", "design for
accessibility", etc.  My goal has been to make it possible for
blind, deaf, mute, paralyzed, developmentally disabled, etc.
individuals to be able to use it as effectively as so-called
"able-bodied" folks.  And, in doing so, to be able to explicitly
point to the "extra" efforts that were required to address each class
of potential user.  In doing so, to be able to lay a foundation for
other folks to design products with the same degree of accessibility.

For example, I can design for sighted individuals because folks
have developed frameworks that make it easy for me to throw together
all sorts of "displays" of varying degrees of complexity with very
little effort.  I can't do that for audio interfaces.  Or, haptic
ones.  Or...

My specific synthesizer question has two goals.  First, to help
me select a synthesis technology for users who elect "audio output"
as the means of receiving information from the system.

Because the user will want to have the interface device on his
person at all times -- to facilitate interacting with the house
regardless of where he is or what he's doing -- the devices must
be very small, unobtrusive, and have long battery life (you wouldn't
want to be trailing a cord around or having to recharge batteries
every three hours).

This poses another set of challenges:  the device will be severely
resource constrained.  No room for a big battery.  Very little
available power to waste on an oversized CPU required to tackle
complex processing tasks.  Etc.

I can work around this by making the device, in the case of audio output,
act as little more than a wireless speaker.  All of the heavy lifting
required for synthesizing the voice(s) necessary can be housed in
a box hiding in a closet, powered from the AC mains.  The "remote"
can then be little more than a bluetooth earpiece -- small, inexpensive,
convenient, unobtrusive, etc.
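That split might look something like this (a toy sketch in Python; the class names and fake "frames" are my illustration, not any real Bluetooth or TTS API):

```python
# A minimal sketch of the "big box / dumb earpiece" split described above.
# All names here are hypothetical.

class BigBox:
    """Runs the full synthesizer; lives in the closet on mains power."""
    def synthesize(self, text):
        # Stand-in for a real TTS engine: return fake PCM "frames".
        return [("frame", word) for word in text.split()]

class Earpiece:
    """Does no synthesis at all -- it only plays frames it receives."""
    def __init__(self):
        self.played = []
    def play(self, frames):
        self.played.extend(frames)

box = BigBox()
ear = Earpiece()
ear.play(box.synthesize("dinner is ready"))
```

The point is the asymmetry: all the expensive synthesis stays on mains power, and the earpiece just moves audio.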

But, this brings up the second need for the question:  what to do
if the earpiece can't communicate with that big box in the closet?
How do you "pair" the earpiece to the big box, and how do you
learn of any problems that might ensue?

When you pair an earpiece with a cell phone, you have the phone available
in your hands so the phone can interact directly with you, despite the
fact that your preferred means of interaction -- the earpiece -- is
not working.  The big box in the closet may not be as readily
accessible.  Or, it may be crippled in some way.  Or, it might not
be "yours" to mess with (imagine your employer having a big box that
services all of his employees at your place of business).

Or, there may be a big box in your neighbor's nearby house and you need
to be able to choose which of these to pair with.
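The discovery side of that choice can be sketched as a trivial selection step (the advertisement format here is invented for illustration; "12KK46" is borrowed from the fictional examples below):

```python
# Hypothetical discovery step: the earpiece hears several "big boxes"
# advertising themselves and must let the user pick one.

def choose_box(advertisements, preferred_name):
    """Pick the advertised box matching the user's choice, if present."""
    for box_id, name in advertisements:
        if name == preferred_name:
            return box_id
    return None  # no match -- the earpiece must report that, on its own

ads = [("12KK46", "my house"), ("77QT03", "neighbor")]
```

Note that the "no match" case is exactly the situation where the earpiece has to say something with no big box to help it.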

Or, the earpiece itself may need to tell you something (low battery,
no big boxes detected, volume at level 6, Ursula voice selected, etc.).

Or, you may need to configure the earpiece so that it tells the
big box how you would like the big box synthesizer to be configured,
when you are eventually granted access!  In that way, your voice
preferences travel with the earpiece instead of being captive
to the big box you normally use.
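A portable preference profile could be as simple as a small serialized record that the earpiece replays at every big box it meets (field names invented; "Ursula" and "level 6" borrowed from the earpiece examples above):

```python
# Sketch: voice preferences carried by the earpiece and handed to
# whatever big box it eventually connects to.

import json

preferences = {
    "voice": "Ursula",
    "rate_wpm": 180,
    "volume": 6,
}

# The earpiece would transmit this on connect; the big box applies it.
wire_message = json.dumps(preferences)
restored = json.loads(wire_message)
```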

If you can determine what ALL of the messages are going to be at
the time you design the earpiece, then you can "can" those messages
and embed them in the earpiece.  But, then the earpiece can never
say anything other than those messages without the help of the
"big synthesizer" located in the big box.  The big box to which it
can't yet connect!
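The canned-message approach, and its failure mode, in miniature (the IDs and strings are placeholders, drawn from the earpiece examples above):

```python
# Sketch of "canned" messages: recorded at design time, indexed by ID.
# The catch described above: an ID the designer never anticipated has
# no recording, so the earpiece is mute without the big box.

CANNED = {
    1: "low battery",
    2: "no big boxes detected",
    3: "volume at level 6",
}

def play_canned(msg_id):
    return CANNED.get(msg_id, None)  # None: nothing it can say
```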

So, embed a "little synthesizer" in the earpiece.  Then, it can
convey any messages from the big box that are significant.
Remember, the big box has to also be able to interact with folks
who are deaf or speak some other language.

Now you're stuck with the crippled resources of the little
earpiece, yet challenged to produce "good speech".  Regardless
of what the person administering the big box has decided to
"utter" at potential users who have not yet "connected".

And, if this is an event that only happens occasionally,
the user will be unaccustomed to this "crippled" speech
having grown to expect the "big box" speech, instead!

Some fictional examples:

"Users must obtain prior authorization from Dr Wahilaway
in room 246N of the Phoenix building at x3-2219 or
email wahilaway at bigbusiness.com"

"System ID 12KK46, node 13.12.101.6.  Please provide valid credential:"

"Sorry, the system is down for maintenance until 0200."

"Maximum number of users supported.  Please try again later."
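Messages like these are exactly where a little synthesizer stumbles: tokens like "246N" or "12KK46" need a normalization pass before any letter-to-sound rules can apply.  A toy version of one such rule (spell out mixed digit/letter tokens; the rule itself is my invention, not taken from any shipping synthesizer):

```python
# Toy text-normalization pre-pass: any token mixing digits and capital
# letters gets spelled out character by character before synthesis.

import re

def normalize(text):
    def spell(match):
        return " ".join(match.group(0))
    # Match a word containing at least one digit AND one capital letter.
    return re.sub(r"\b(?=\w*\d)(?=\w*[A-Z])\w+\b", spell, text)
```

A real pre-pass would need many more rules ("x3-2219" as an extension, "0200" as a time, "wahilaway at bigbusiness.com" as an address), which is the point: you can't anticipate them all.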

I've written several candidate synthesizers for this role.  But, I
always have to trade off some level of complexity or naturalness
to fit within the resources available.  And, as I can't guarantee
that the person administering the next big box that you encounter
will have correctly structured his messages with the challenges of
speech synthesis in mind, I can always come up with a real example
of a message that the crippled synthesizer will gag on.

Even if that's just a misspelling!

Finally, your interface to the crippled synthesizer is also
crippled!  You don't want to carry around a keyboard so you
can tell it to repeat that last message.  Or, replay it one
word at a time.  Or, spell out that word that you just can't
seem to understand.

So, you want the best quality crippled synthesizer that you
can get.  For that, you need to have some criteria by which
to evaluate the relative qualities of candidate synthesizers!
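As one example of the sort of measurable criterion I've been asking about, word error rate against listener transcriptions is a standard yardstick (this is textbook edit distance over words, not something specific to this project):

```python
# Word error rate: edits needed to turn what listeners heard into the
# reference text, divided by the reference length.

def wer(reference, hypothesis):
    r, h = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / len(r)
```

The "restart"/"restert" confusion from earlier in the thread would show up here as one substitution.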