[DECtalk] Intelligibility/Listenability criteria

Don Text_to_Speech at GMX.com
Tue Jul 23 22:56:10 EDT 2019


On 7/23/2019 10:47 AM, Jayson Smith wrote:
> One thought that just came to my mind is Stephen Hawking. For many years he
> used a speech synthesizer called the CallText 5010 by Speech Plus. But unlike
> most members on this list, this device was his voice, period. This obviously
> means people who were not accustomed to speech synthesis were exposed to his
> synthetic voice. This is especially true in the 80's and 90's, when synthetic
> speech wasn't nearly as mainstream as it is now, e.g., no Alexa/Google/Siri and
> no (or very few) synthetic-voiced robocalls reminding you of doctor appointments,
> etc. His voice sounded similar to DECtalk, so the two were often confused, to
> the point that the DECtalk article on Wikipedia claims that Hawking used a form
> of DECtalk.

I suspect there is a "listening bias" involved.  To me, most (formant)
synthesizers sound like the Votrax units of the 70's.  <shrug>

> The interesting thing about this particular synthesizer is that an
> emulation of it was completed, to be used by Hawking, a few months before his
> death. This brings this classic voice into the modern world, but unfortunately
> I assume it's likely that none of us will ever get a chance to play with it.
> According to an article, the source code was turned over to the Hawking estate,
> which certainly has reason to protect that particular voice. I know one friend
> who's told me that if this voice were made available, he'd use it as his
> default screen reader voice.

Adobe is working on an audio manipulation tool that would, conceivably, allow
recorded voice snippets of his utterances to be used to reconstruct a full
"voice", capable of saying anything.

The appeal, to me, of diphone synthesizers is their ability to map a REAL
human voice into the synthesis domain.  In the future, folks like Hawking
could proactively record their own voices before losing the ability to
speak (ALS takes that ability from you, eventually).  Then, a synthesizer
could be fitted with that voice so that you hear your original voice speaking
on your behalf.  I suspect this would help preserve some bit of dignity and
humanity that might otherwise be surrendered to the encroaching automation
on which the patient would be relying.
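
As a rough sketch of the mechanics (nothing vendor-specific -- the data
layout here is entirely hypothetical), the recorded voice becomes a bank
of phoneme-to-phoneme transitions that the synthesizer just strings
together:

    # Minimal diphone concatenation, in Python.  The "voice" is a bank
    # of short recordings, one per phoneme-to-phoneme transition,
    # captured from the real speaker while he could still speak.

    def phonemes_to_diphones(phonemes):
        """['HH','EH','L','OW'] -> ['HH-EH', 'EH-L', 'L-OW']"""
        return [f"{a}-{b}" for a, b in zip(phonemes, phonemes[1:])]

    def synthesize(phonemes, diphone_bank):
        """Glue the recorded samples for each transition end-to-end.
        A real engine would cross-fade the joins and adjust pitch and
        duration; this just concatenates."""
        audio = []
        for d in phonemes_to_diphones(phonemes):
            audio.extend(diphone_bank[d])
        return audio

    # Toy bank with made-up sample values, to show the flow:
    bank = {"HH-EH": [0.1, 0.2], "EH-L": [0.0], "L-OW": [-0.1]}
    assert synthesize(["HH", "EH", "L", "OW"], bank) == [0.1, 0.2, 0.0, -0.1]

Nothing in that pipeline cares WHOSE voice filled the bank -- swap in
your own recordings and the output is your own voice.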

I was disappointed by the news of Douglas Rain's recent passing (he was
the voice of HAL 9000).  It would have been delightful to capture his
complete voice, to be able to synthesize arbitrary statements in it!

> As I see it, the problem with your scenario of the crippled voice in the
> earpiece is that you can't please everyone no matter how hard you try.

I don't have to "please" people.  Recall that you will RARELY hear this
voice.

What I have to do is create a voice that is understandable, regardless of the
content it is being tasked with uttering.  And, to do so "economically", with
a minimum of interaction with the user.  You don't want the user to need
fancy controls to tell the synthesizer to "repeat last word", "go back one
word", "move to next word", "spell word", "pronounce punctuation", etc.


> Ideally,
> if the big box in the closet can send synthetic speech to the earpiece once the
> user is properly authenticated, that's an audio stream, so it should be able to
> send a pre-recorded file to unauthenticated/unregistered users. This
> pre-recorded file could be spoken by a more capable synthesizer, or even by an
> actual human, who can clearly state the message. If spoken by a human, in
> theory that human would know, or could find out, how to properly pronounce
> unusual names, places, etc.

There are lots of problems with that.

First, it places a higher burden on the "server" (the big box in the closet).
Recall that every user interacting with that server may have different output
preferences -- someone might want a text display, someone else spoken word,
someone else spoken word in a different language, yet another wants to drive
a brailler, etc.  So, the server has to query the user's device and then
decide which of many "message streams" to deliver.
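
Sketched in Python (renderer names are mine, purely illustrative), the
dispatch itself looks innocent -- but every new preference multiplies
the streams the server must produce and keep correct:

    # One logical message, N renderings; the server picks one per user.
    def render_text(msg):    return msg
    def render_speech(msg):  return f"[TTS] {msg}"   # stand-in for a TTS pass
    def render_braille(msg): return f"[BRL] {msg}"   # stand-in for a braille encoder

    RENDERERS = {"text": render_text,
                 "speech": render_speech,
                 "braille": render_braille}

    def deliver(msg, preference):
        try:
            return RENDERERS[preference](msg)
        except KeyError:
            # An output type nobody anticipated -- the latent-bug case
            # described below.
            raise ValueError(f"no message stream for {preference!r}")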

The cost of this interaction is then hard to predict ahead of time.
It's possible that someone could launch a denial-of-service attack on the
server by flooding it with a variety of such requests.  It also is more
likely to trigger a latent bug if the complete suite of message streams
hasn't been correctly specified.

It also means the administrator has to maintain multiple renderings of
the same message.  If he wants to change the message to "Sorry, server
temporarily off-line for maintenance.  Try again at 12:30P" he has to
create versions of this message in all forms.  Failing to do so leaves
him open to complaints like "I couldn't access the server, today.  Why
didn't you tell my brailler that it was down for maintenance?"

[Yes, you can try to automate all of this but it still increases the
complexity of the administrator's task -- for something that likely means
very little to the administrator (who may PERSONALLY not need any message
stream other than straight text!)  He's worried about getting a remote
access node back on-line and not overly concerned with whether the
brailler message has been properly updated]
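
Write the catalog down as data and the failure mode is obvious (layout
hypothetical):

    # One logical event, several hand-maintained renderings.  The admin
    # updated the text and re-recorded the audio, but forgot the
    # braille entry -- exactly the complaint above.
    OUTAGE_NOTICE = {
        "text":   "Sorry, server temporarily off-line for maintenance."
                  "  Try again at 12:30P",
        "speech": "outage_1230.wav",   # recorded by a human or a better TTS
        # "braille": ???               # stale/missing after the last edit
    }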

It also limits the forms that the user interface can assume.  If, for
example, someone wanted to interface an Optacon to the system, the
system would now have to include messages for the POSSIBLE Optacon
user(s).

And, it still doesn't address the "local interactions" between the
user and that device.  For example, setting the volume level, querying
the strength of the radio signals presently available to it, checking
battery level, verifying the device is actually "alive", etc.

What if the user had been connected to the system and things suddenly went
quiet?  Has the battery died?  Has he moved out of signal range?  Or,
has the radio link crapped out?  Has the big synthesizer crashed?  Has the
system crashed?  Or, become temporarily overloaded?  The user needs a way
of asking his intermediary -- the earpiece -- what it knows of the
situation.
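
That argues for some local smarts in the earpiece itself.  A sketch of
what it could answer on its own, big box unreachable (field names and
thresholds are mine):

    import time

    class Earpiece:
        # State the earpiece already knows without any server help.
        def __init__(self):
            self.last_packet_at = time.monotonic()  # last data from the server
            self.battery_pct = 87
            self.rssi_dbm = -60                     # radio signal strength

        def diagnose_silence(self):
            """Answer spoken locally by the 'crippled' synthesizer."""
            if self.battery_pct < 5:
                return "Battery nearly dead."
            if self.rssi_dbm < -90:
                return "Radio signal lost.  You may be out of range."
            if time.monotonic() - self.last_packet_at > 10:
                return "Server is not responding."
            return "Link is up.  The server may be busy."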

> This leaves our poor crippled synthesizer in the earpiece to deal with those
> situations where no big box can be reached at all, or a situation has arisen
> for which no canned recording is available. Once again, you can't please
> everybody. If you're designing the big boxes and writing their documentation,
> you can include best practices to ensure that when administrators write error
> messages, they make them as friendly to the particular synthesizer you've
> chosen as possible. But that doesn't mean the admins have to follow your
> recommendations.

There are guidelines for designing websites for accessibility.  Yet,
no "web police" to enforce those guidelines!

I don't want to design a system that targets "folks with special needs",
exclusively.  That leads to insanely high costs which translate into
high prices.

Instead, I want a system that "able-bodied" folks could just as easily
adopt.  That would subsidize the added costs of supporting these other
user needs.

So, the individual administering any single system will most likely NOT
have special interface needs.  If their earpiece stops working due to any
of the above scenarios, they can walk up to a video display and query the
system for the reason behind that failure.  They wouldn't be dependent on
the crippled synthesizer to regain access to the system!

People being basically lazy, this would inevitably lead to those other
message formats not being maintained.

The same holds true of deployments in businesses; if you don't have
any blind staff, you're probably not going to keep abreast of the
needs of those users -- until you're presented with a problem BY ONE
OF THEM!

[Imagine the administrator telling his boss that he needs some time,
money or other resource "to support blind users".  And, the pointy-headed
boss replying, "But we don't have any blind employees!"]

> And then maybe there's the newbie admin who is rushing through
> installation and configuration and just wants to get this up and running as
> quickly as possible, skims through your docs briefly, and some poor user who
> needs to register gets the following mess, spoken by their earpiece's crippled
> synthesizer:
>
> Sorry we dont recognise your device ID. Please clal dr Johnson at 8178446611
> Tahnk you

Exactly.  There are whole classes of similar "less than optimal" messages
that would likely be encountered.

I've been contacting various internet servers via text interfaces to
see what's under the hood in most interactions.  How much care do the
folks responsible for administering those servers put into the stuff
that is typically hidden from view?  Often, the
error messages are terse or cryptic.  Sometimes, playful -- and
unhelpful!  ("Oh my!  You broke the server!!  Shame on you...")

I then imagine how those messages would be conveyed to me via this crippled
synthesizer.  Would I be capable of understanding what was said?  Would
I understand the INTENT?  Could I resolve the apparent "problem" without
contacting the server's administrator?  (Imagine how that would play out
if the problem ended up being on MY end and not the server's!)

> Now for those situations where no big box can be found at all, assuming you're
> in control of the firmware for the earpieces, you know exactly how your
> crippled synthesizer works, and can work around any quirks it has in order to
> provide the most understandable messages possible.

Messages originating on the earpiece represent a limited domain application.
I know EVERYTHING that the synthesizer might be asked to say.  And, can
tailor the messages to work around any artifacts in the synthesizer's
implementation.  For example, if I want to ensure something is spelled out
instead of hoping the synthesizer can pronounce it correctly, I can simply
issue the message with those words broken down into their constituent
L E T T E R S.
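
A sketch of that work-around (names are mine; the point is that the
expansion happens when the fixed message catalog is authored, not at
run time):

    def spell_out(word):
        """'Xylia' -> 'X Y L I A' -- letters any formant engine can say."""
        return " ".join(word.upper())

    # Every earpiece-originated message is known ahead of time, so risky
    # tokens can be pre-expanded by the author of the catalog.
    def device_unknown_message(device_name):
        return f"Unrecognized device {spell_out(device_name)}."

    print(device_unknown_message("Xylia"))  # Unrecognized device X Y L I A.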

I could go further and eliminate the synthesizer completely and just
store prerecorded messages in a compressed form.

But, that means being able to anticipate EVERYTHING that the synthesizer
might be called upon to utter.  It's not a very "future-safe" approach
to the design.


