[DECtalk] DECtalk TTS licensing (long)

Wed Sep 1 04:05:53 EDT 2021

Lengthy reply, disregard if not interested.  OTOH, it tells you how
developers think about application domain problems!

On 8/31/2021 6:14 PM, Chime Hart wrote:
> Actually Don, I much prefer hearing numbers as single digits, as I did while
> running Vocal-Eyes for many years. However, my Linux screen-reader, Speakup has
> no exception dictionary for probably the reasons you stated. As far as dates,
> sure I prefer the sound of 0 8 3 1 2 0 2 1  but British style completely
> confuses me.

If I recognize a date (due to context of format), I will convert it
into a spoken date format.  So, 08/31/2021 would speak as August
31, 2021 -- and the year would be spoken as "twenty twenty-one".
I know which format is in use from the locale configured in
the synthesizer; so, 31-08-21 would give the same spoken result
if I were configured as a Brit.
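
A minimal sketch of that locale-driven date rule, in Python.  The
function names, the locale strings, the cardinal (not ordinal) day, and
the choice to read two-digit years as 20xx are all assumptions made for
illustration; this is not the synthesizer's actual code.

```python
import re

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def pair(n):
    """Speak 0..99 as words, e.g. 21 -> 'twenty-one'."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")

def speak_date(text, locale="en_US"):
    """Return the spoken form of a numeric date, or None if it is not one."""
    m = re.fullmatch(r"(\d{1,2})[/-](\d{1,2})[/-](\d{4}|\d{2})", text)
    if not m:
        return None
    a, b, year = int(m.group(1)), int(m.group(2)), int(m.group(3))
    # Assumption: "en_US" puts the month first, every other locale the day.
    month, day = (a, b) if locale == "en_US" else (b, a)
    if not (1 <= month <= 12 and 1 <= day <= 31):
        return None
    if len(m.group(3)) == 2:
        year += 2000  # Assumption: two-digit years mean 20xx.
    # The year is spoken as two pairs; years such as 2005 or 1900 would
    # need extra rules that this sketch leaves out.
    century, rest = divmod(year, 100)
    return f"{MONTHS[month - 1]} {pair(day)}, {pair(century)} {pair(rest)}"
```

Both speak_date("08/31/2021", "en_US") and speak_date("31-08-21",
"en_GB") come out as the same string, which is the point of keying the
field order off the locale.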

I'd likewise speak 3.01 as "three point oh one" in the U S;
3,01 would be spoken the same way elsewhere.  Note the use of "oh"
instead of "zero".
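
A small Python sketch of the decimal rule, assuming the decimal mark is
passed in from the locale.  The name speak_decimal and the limit of
0..99 for the whole part are illustrative choices only.

```python
ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def under_100(n):
    """Speak 0..99 as words."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + (" " + ONES[ones] if ones else "")

def speak_decimal(text, decimal_mark="."):
    """Speak e.g. '3.01' as 'three point oh one'; None if it does not fit."""
    whole, mark, frac = text.partition(decimal_mark)
    if not (mark and whole.isdecimal() and frac.isdecimal()):
        return None
    if int(whole) >= 100:
        return None  # Sketch only covers small whole parts.
    # Fraction digits are read one at a time, with "oh" standing in for zero.
    digits = " ".join("oh" if d == "0" else ONES[int(d)] for d in frac)
    return f"{under_100(int(whole))} point {digits}"
```

With decimal_mark="," the input "3,01" gives the same words as "3.01"
does with the default mark.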

A number, in the US, with interspersed commas would force itself
to be spoken as millions, thousands, etc. -- but, only if the commas
were in the correct places!   123,4,5678 wouldn't qualify!
A period takes the place of commas, elsewhere.
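
One way to express the "commas in the correct places" test is a regular
expression.  This Python sketch uses a hypothetical parse_grouped()
helper that takes the locale's separator as a parameter.

```python
import re

def parse_grouped(text, sep=","):
    """Return the integer value if `text` is correctly grouped, else None.

    Correct grouping means 1 to 3 leading digits followed by one or more
    groups of exactly 3 digits, each introduced by the separator.
    """
    pattern = rf"\d{{1,3}}(?:{re.escape(sep)}\d{{3}})+"
    if re.fullmatch(pattern, text) is None:
        return None
    return int(text.replace(sep, ""))
```

Here parse_grouped("1,234,567") returns 1234567, parse_grouped("123,4,5678")
returns None, and parse_grouped("1.234.567", sep=".") covers locales
that group with a period.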

Numbers larger than 999 million get spoken as groups of digits,
right justified -- so the last group always has 3 digits in it.

These varied rules are intended to allow the listener to gauge
the magnitude of a number with relative ease.  I.e., if you hear
"million", you know you're dealing with a total of 7 to 9 digits;
"thousand" is 4 to 6.  If you hear groups of digits, you can
mentally count the number of groups and largely ignore all but
the significant digits in the first group.
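
The right-justified grouping can be sketched in a few lines of Python.
The function name digit_groups is made up for this example, and how
each group is then voiced is left out.

```python
def digit_groups(digits):
    """Split a digit string into groups so that every group but the
    first has exactly 3 digits (right-justified grouping)."""
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected a non-empty string of digits")
    first = len(digits) % 3 or 3  # Size of the leading group: 1, 2 or 3.
    groups = [digits[:first]]
    groups += [digits[i:i + 3] for i in range(first, len(digits), 3)]
    return groups
```

For "25123456789" this yields ["25", "123", "456", "789"]: three
trailing groups after the leading "25".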

Numbers that look like they may be years -- because they
are 4 digits in the range of 1 or 2 thousand and do NOT
contain a comma (or "thousands separator" to accommodate other
locales) -- are spoken as a left pair and a right pair, with the
"hundred" that would separate them left unspoken.
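
A rough Python sketch of the year heuristic.  The names are
hypothetical, and the handling of years ending in 00 through 09 is an
assumption, since only the common case is described here.

```python
import re

ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def pair(n):
    """Speak 0..99 as words, e.g. 84 -> 'eighty-four'."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")

def looks_like_year(token):
    """Exactly 4 digits, starting with 1 or 2, and no separator."""
    return re.fullmatch(r"[12]\d{3}", token) is not None

def speak_year(token):
    """Speak a year as a left pair and a right pair, or return None."""
    if not looks_like_year(token):
        return None
    left, right = int(token[:2]), int(token[2:])
    if right == 0:
        return pair(left) + " hundred"           # Assumption: 1900 style.
    if right < 10:
        return f"{pair(left)} oh {ONES[right]}"  # Assumption: 1905 style.
    return f"{pair(left)} {pair(right)}"
```

A token such as "2,021" fails looks_like_year() because of the comma
and so would fall through to the ordinary number rules.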

All of these algorithms mimic the way sighted readers process
printed text -- a long string of digits with commas is parsed
as "thousands, millions, billions, trillions, too-big-to-care".
The actual digits are typically not of interest -- unless you
are reading a spreadsheet!  We don't conversationally speak of
years as "two thousand twenty one" or "one thousand, nine
hundred, eighty four" but, rather as "twenty twenty one" -- note
the absence of "hundred" -- or nineteen eighty four.

In cities, an address of 1034 would be referenced as "ten hundred
thirty four".  And 11934 would be "one hundred and nineteen
hundred thirty four" -- because folks think in terms of *blocks*.
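
The block-style reading can be sketched in Python by splitting off the
last two digits.  speak_address() is a hypothetical helper; anything
beyond the two examples above (for instance a lot number under ten) is
a guess.

```python
ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def under_100(n):
    """Speak 0..99 as words."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + (" " + ONES[ones] if ones else "")

def under_1000(n):
    """Speak 0..999 as words, with 'and' after the hundreds."""
    hundreds, rest = divmod(n, 100)
    if hundreds == 0:
        return under_100(rest)
    if rest == 0:
        return ONES[hundreds] + " hundred"
    return f"{ONES[hundreds]} hundred and {under_100(rest)}"

def speak_address(number):
    """Speak a street number as '<block> hundred <lot>'."""
    if not 0 <= number < 100000:
        raise ValueError("sketch only covers 0..99999")
    block, lot = divmod(number, 100)
    if block == 0:
        return under_100(lot)
    words = under_1000(block) + " hundred"
    return words + (" " + under_100(lot) if lot else "")
```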

In an I P address, one would speak each period as "dot" because
"one hundred ninety two <end-sentence pause> one hundred sixty
eight <end-sentence pause> one <pause> one hundred and one" would
be cumbersome.  Instead, "one ninety two (notice the hundred is
missing!) dot one sixty eight dot one dot one oh one" is more
natural and carries more information -- you mentally expect there
to be three "dots".
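
As a sketch of the I P address reading in Python: the standard
ipaddress module does the validation, and each octet is voiced without
"hundred".  The helper names are hypothetical, and the reading of round
octets such as 200 is an assumption.

```python
import ipaddress

ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def under_100(n):
    """Speak 0..99 as words."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + (" " + ONES[ones] if ones else "")

def speak_octet(n):
    """Speak 0..255 the way one reads an address octet aloud."""
    if n < 100:
        return under_100(n)
    hundreds, rest = divmod(n, 100)
    if rest == 0:
        return ONES[hundreds] + " hundred"       # Assumption for 100, 200.
    if rest < 10:
        return f"{ONES[hundreds]} oh {ONES[rest]}"
    return f"{ONES[hundreds]} {under_100(rest)}"

def speak_ipv4(text):
    """Return the spoken form of a dotted-quad address, or None."""
    try:
        addr = ipaddress.IPv4Address(text)
    except ValueError:
        return None
    # addr.packed is the 4-byte form; iterating it yields the octets.
    return " dot ".join(speak_octet(octet) for octet in addr.packed)
```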

A number prefaced by a dollar sign is treated as a monetary amount
and, assuming a whole number or decimal (with at most one decimal
point) follows, will have "dollars" appended -- even though the
punctuation mark -- the dollar sign -- preceded the digits.
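
The reordering itself is simple to sketch in Python.  This hypothetical
move_dollar_sign() only rearranges the token and leaves the number to
be voiced by whatever number rules apply.

```python
import re

def move_dollar_sign(token):
    """Turn '$25' into '25 dollars'; return None if the token is not a
    dollar sign followed by a whole number or a single-point decimal."""
    m = re.fullmatch(r"\$(\d+(?:\.\d+)?)", token)
    if m is None:
        return None
    return f"{m.group(1)} dollars"
```

So "$3.50" becomes "3.50 dollars", while "$1.2.3" has two decimal
points and is left alone.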

So, if you hear 25, followed by a group of three digits, you mentally
think 25 thousand.  After the next group of three digits -- regardless
of what they are -- you think 25 million.  Then billion.  Then trillion.
After that, you likely aren't interested in the actual magnitude as
it's a "really big number and the first two digits are 2 and 5".
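
To make the listener's running estimate concrete, here is a small
Python sketch.  The function models what the listener infers after
each group is heard; it is not something the synthesizer runs, and all
names in it are made up.

```python
SCALES = ["", "thousand", "million", "billion", "trillion"]

def running_estimate(groups):
    """Given the spoken digit groups, return the listener's estimate of
    the magnitude after the leading group and after each later group."""
    lead = groups[0]
    estimates = []
    for heard in range(len(groups)):  # Trailing groups heard so far.
        if heard < len(SCALES):
            estimates.append(f"{lead} {SCALES[heard]}".strip())
        else:
            estimates.append(f"really big number starting with {lead}")
    return estimates
```

For the groups ["25", "123", "456"] the estimates are "25", then
"25 thousand", then "25 million".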

If you want to review the actual digits, then you put the synthesizer
in "spell" mode where it announces each character, one at a time.
(So, you need to be able to tell it how long a pause to insert
between characters.)
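
Spell mode with a configurable pause might be sketched like this in
Python.  The event tuples, the character names, and the default pause
are all illustrative assumptions.

```python
NAMES = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
    " ": "space", ".": "period", ",": "comma", "-": "dash", "$": "dollar",
}

def spell(text, pause_ms=200):
    """Return a list of ('say', name) and ('pause', ms) events that
    announce `text` one character at a time."""
    events = []
    for index, char in enumerate(text):
        if index:  # A pause goes between characters, not before the first.
            events.append(("pause", pause_ms))
        events.append(("say", NAMES.get(char, char)))
    return events
```

Characters without an entry in NAMES are passed through unchanged, so
letters are simply handed on as themselves.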

Spelling takes place in my synthesizer, not in the application
driving it.  And, reviewing a word at a time speaks each word as it
was spoken *in* the sentence.  To change the pronunciation or
prosody would be disingenuous and hinder "training" the user's ear.
This is not a trivial task as what is spoken doesn't always
directly correspond to characters input -- think back on the
number discussions, above.

(If there is a better way of pronouncing it, then why didn't you
pronounce it that way when encountered in the sentence?)

A common exercise, for software developers, is to imagine reading
your code to another programmer over the phone.  You wouldn't want
to have to speak every punctuation mark -- the listener, familiar
with the language's syntax, should be able to fill them in automatically.
But, what you say should sound natural -- for reading a program's
statements!

Likewise, how I would read the contents of this email to you, over
the phone, should mimic what you hear from a synthesizer.  You
wouldn't require me to speak every punctuation mark or announce
each new paragraph!  It mimics how another person would TALK to you.

Speech is considerably slower than other means of "consuming input".
Most adults read at about 200-250 words per minute.  If you are
conversationally speaking, this is slightly faster than the rate at
which you'd deliver information to your listeners -- if you were
interested in conveying information.  This is consistent with
surveyed use of commercial synthesizers where comprehension is
of interest (you could likely read a romance novel at 500 words
per minute and not miss much!).

An auctioneer typically speaks at almost double this rate.  But,
comprehension can remain high because he's just blathering dollar
amounts -- each building on previous amounts.  Had he been listing
the names of people who would be participating in the next auction,
you'd be hard-pressed to notice anything other than whether or not
*your* name was mentioned.

I can read about 500-600 words per minute.  But, only for "light"
material.  That's about where DECtalk maxes out -- and intelligibility
at those rates is dubious.  And, it takes a fair bit of practice to
be able to reliably consume speech at that rate -- and not have to
stop the synthesizer to review some misspelled word, etc.  You learn
to ignore words that aren't immediately intelligible -- unless they
appear in a critical passage.

When I'm trying to absorb new material -- and the many details
presented in it -- then my reading rate drops by an order of
magnitude.

I only use my formant-based synthesizer (DECtalk-alike) as a
fallback.  E.g., when my device can't connect to the wireless
server and make use of *its* synthesizer.  This is because I've
found that people much prefer listenability to "size of computer
program".  Folks that I surveyed preferred Cepstral voices to
DECtalk, by a wide margin, in every test I threw at them.  And,
once you leave the constraints of a small, portable, battery
powered earpiece, processing power and storage are no longer
issues!  So, it's obvious that you'd opt for the preferred
presentation medium.

There's a reason Alexa and Siri don't sound like Perfect Paul!  <grin>