[DECtalk] DECtalk TTS licensing

Don Text_to_Speech at GMX.com
Mon Aug 30 10:18:27 EDT 2021


On 8/30/2021 6:40 AM, Devin Prater wrote:
> Man, this should be spread far and wide throughout TTS circles. Screen
> readers should do this too. But we've not even gotten past simple API
> readers almost.

That's because synthesizers are treated as *bags* that are bolted onto
(usually preexisting!) applications that weren't designed with speech
in mind.  So, *all* of the "conversion" from the initial medium to
speech has to be done *in* the synthesizer; it can't "ask" the application
what it is trying to do.

If you were reading a novel (or, a news story on the web) and you
encountered some double quotes, you would recognize this as a
literal quotation of someone's statement/speaking.  For example:

    The sheriff said "We believe we have apprehended the sole gunman
    in this heinous crime".  He added that the suspect was due to
    be booked into the local jail later that day.

Why can't the synthesizer switch from "narrator voice" to "speaker's
voice" when it encounters the first double quote?  And, then switch
back when it encounters the closing double quote?  Wouldn't this
make it easier to understand that you are now *in* a quoted statement?

Ah, but what if someone FORGOT a closing quote?  Does the synthesizer
stay in that voice, forever?

What if there was never intended to be a closing quote?  For example:

    Your password is 4HJ"/*Fred
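
One plausible guard is to reset to the narrator at every paragraph
break, so an unmatched quote can mis-voice at most the rest of its own
paragraph.  Here is a minimal sketch in Python -- nothing here is
DECtalk's own preprocessing; the toggling heuristic and the choice of
Paul for narration and Betty for quoted speech are just illustrative
assumptions -- using DECtalk's inline voice commands:

    # Sketch: tag quoted spans with DECtalk inline voice commands
    # ([:np] = Paul for narration, [:nb] = Betty for quoted speech).
    # Resetting to the narrator at each paragraph break keeps a
    # forgotten -- or never-intended -- closing quote from capturing
    # everything that follows.
    def annotate(text: str) -> str:
        paragraphs = []
        for paragraph in text.split("\n\n"):
            voice, parts = "[:np]", []   # narrator at paragraph start
            for chunk in paragraph.split('"'):
                if chunk:
                    parts.append(voice + chunk)
                # every double quote toggles narrator <-> speaker
                voice = "[:nb]" if voice == "[:np]" else "[:np]"
            paragraphs.append("".join(parts))
        return "\n\n".join(paragraphs)

    print(annotate('The sheriff said "We believe we have apprehended '
                   'the sole gunman in this heinous crime".'))

Run on the sheriff paragraph above, this speaks the narration as Paul
and the quotation as Betty; the stray quote in the password damages,
at worst, the remainder of its own paragraph.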

Not only can choice of voice convey information, but it also gives
your ears a break.  Ever listen to a newscast with ONE reporter
reading all of the stories?  Contrast that with newscasts where two or
more reporters/anchors alternate -- especially a male and a female.

> Well we now have OCR and image recognition in some of the
> screen readers, but TTS is still a sort of binary thing with "is speaking"
> "is not speaking" at a user-defined rate and pitch and volume and sometimes
> intonation. I'd love to see more screen readers become more like Emacspeak.

Emacspeak tries to integrate the "presentation" (speech) with the application.
But it has to be "taught" about every potential application, because
applications don't emit information that reveals their (current!) intent.

It is fairly common to design error and informational messages
in the form:
    <number> <explanation>
Like:
    404 page not found
A computer (a program!) that sees this message gets everything that it
needs to know from the leading number -- the text that follows is
just there for humans (the computer ignores it!).
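
To make that concrete, here is a sketch of how little work the
computer's side takes (the function name and the message are
illustrative, not taken from any particular system):

    # Sketch: a program reading "<number> <explanation>" messages
    # branches on the leading number; the explanation is for humans.
    def parse_status(message: str) -> tuple[int, str]:
        code, _, explanation = message.partition(" ")
        return int(code), explanation

    code, explanation = parse_status("404 page not found")
    if code == 404:
        pass  # the program acts on the number alone; it never
              # needs to read "page not found"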

So, folks *can* make interactions that convey additional information
for a particular type of consumer (computer vs. human).  But, there
is little pressure to do so.

How often do you see information conveyed by color?  Choice of "font"?
Bold?  Italic?

Will the synthesizer recognize that something presented in BOLD should
be treated differently than something in italic?  Or, do you throw
away that information -- in which case, why was it present in the
first place???
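
That information doesn't have to be thrown away.  As a sketch of the
alternative -- with SSML used only as a concrete, standard target, and
the particular mapping (bold becomes strong emphasis, italic becomes a
modest pitch rise) purely an assumption for illustration:

    # Sketch: carry bold/italic into the speech channel instead of
    # discarding them.
    import re

    def markup_to_ssml(text: str) -> str:
        # **bold** becomes strong emphasis...
        text = re.sub(r"\*\*(.+?)\*\*",
                      r'<emphasis level="strong">\1</emphasis>', text)
        # ...and *italic* becomes a modest pitch rise.
        text = re.sub(r"\*(.+?)\*",
                      r'<prosody pitch="+15%">\1</prosody>', text)
        return "<speak>" + text + "</speak>"

    print(markup_to_ssml("Do **not** remove the *ground* wire."))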

Why have red for stop and green for go -- if roughly 8% of men (and a
smaller share of women) can't distinguish between them?

