[DECtalk] Criteria, second attempt (long)

Don Text_to_Speech at GMX.com
Sun Jul 21 16:06:51 EDT 2019


Ugh.  Sorry, I guess what I've been thinking about when it comes to
synthesizer evaluation criteria is a bit different.  Let me try to
approach the synthesis problem in finer details to explain the areas
where the results can be somewhat less than we'd hope.

TL;DR... what follows is a rough breakdown of the process of going
from text to speech and the challenges that have to be addressed
in each step.  Each challenge represents an opportunity for a
particular synthesizer to screw up.  What I want to get a feel for
is how significant those screwups can be for the listener.

For example, I have a (human) friend who pronounces A R K A N S A S
as "are-kansas".  Every time she utters the word, my brain literally
slips out of gear as I have to sort out what she's actually saying.
Then, I have to play "catch up" to process what she's said while my
brain was sorting out this blip.  Think about each of the following
issues and how your synthesizer might screw up in each of these
areas and what that does to your comprehension and ease of use.

If we assume the synthesizer is challenged by "input" that hasn't been
created with a speech synthesizer in mind, the first problem lies in
the synthesizer "understanding " what the text is meant to say.

Written text contains lots of abbreviations that the synthesizer must
recognize and then determine how to pronounce.  Some will be expanded
in full -- "etc." becoming "etcetera".  Others may be rendered into a
completely different form -- "et al." becoming "and associates",
"e.g." becoming "for example", etc.

Some abbreviations may omit the period(s) that those of us
old-timers were taught were an essential part of an "abbreviation" --
like ASAP, wrt, aka, am/pm, etc.

And, some abbreviations have effectively become words in their own
right -- like "TV".

Others may be ambiguous:  does "Dr." mean "doctor" or "drive"?  Does
its interpretation change if the period is missing, as in "Dr"?  Or if
it is capitalized, as in "DR"?  Or lowercase, as in "dr"?  I alluded
to this in my example about Doctor Jones on Jones Drive...
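
Here's a hypothetical context rule for just that one case -- the
capitalization test is my own guess at a heuristic, and it will be
wrong for plenty of real sentences:

   # Guess "drive" when the preceding word is capitalized ("Jones Dr.")
   # and "doctor" when the following word is ("Dr. Jones").
   def expand_dr(prev_word, next_word):
       if prev_word and prev_word[0].isupper():
           return "drive"
       if next_word and next_word[0].isupper():
           return "doctor"
       return "doctor"    # default guess when context is no help

   expand_dr("Jones", "is")       # -> "drive"
   expand_dr("visit", "Jones")    # -> "doctor"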

This introduces the problem of heterophones, where part-of-speech
analysis is required to assign a pronunciation to a word:
   I read every day; I read two books yesterday.
   My project is to project our income for the next two years.
   I am going to the store at 9am.
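
One off-the-shelf way to get that part-of-speech information is a
statistical tagger -- NLTK's, for instance (this assumes the nltk
package plus its tokenizer and tagger data are installed, and the
tags in the comment are what I'd hope for, not a guarantee):

   # A past-tense (VBD) tag on "read" selects the "red" pronunciation;
   # a present-tense (VBP) tag selects "reed".
   import nltk

   tokens = nltk.word_tokenize("I read two books yesterday.")
   print(nltk.pos_tag(tokens))
   # hoped-for: [('I', 'PRP'), ('read', 'VBD'), ('two', 'CD'), ...]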

Are contractions treated as "special words" with their own
pronunciation rules?

Numbers can be rendered in a variety of ways.  We likely expect
"2019" to be read as "twenty nineteen" and not "two thousand and
nineteen".  Unless, of course, context suggests that it isn't!  <grin>

On the other hand, "2267" is likely "two thousand two hundred
and sixty seven".  When do you switch from the systematic approach
we were taught as kids for pronouncing large numbers where "1234567"
is "one million, two hundred thirty four thousand, five hundred and
sixty seven" to simply rattling off individual digits like "one two
three four five six seven"?
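
If I were sketching that switchover, it might look like the
following -- the year range and the digit cutoff are pure guesses
on my part:

   # Classify how a digit string should be spoken.
   def number_style(s):
       n = int(s)
       if len(s) == 4 and 1500 <= n <= 2099:
           return "year"        # "twenty nineteen"
       if len(s) <= 7:
           return "cardinal"    # "one million, two hundred thirty four..."
       return "digits"          # rattle off individual digits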

How is punctuation to be treated?  Do these marks just help
establish "breath groups"?  Do they need to be read to the user
lest he not realize the presence of a mark that may be significant?
(For those using screen readers, how annoying is it to hear all
of the punctuation that I've been using be spoken to you?)
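
Screen readers usually make that a user setting; a sketch of the
idea, with names and levels that are mine, not any real screen
reader's:

   # Speak punctuation names at full verbosity; otherwise treat the
   # mark as a silent breath-group boundary.
   PUNCT_NAMES = {",": "comma", ".": "period", "?": "question mark",
                  "!": "exclamation point", ";": "semicolon"}

   def speak_punct(mark, verbosity="some"):
       if verbosity == "all":
           return PUNCT_NAMES.get(mark, mark)
       return ""    # silent pause only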

What about "words" that really aren't words?  When should a
synthesizer abandon its notion of legitimate words in favor of
resorting to a fallback position?  "Syzygy" is a real word
but "npwwtr" is not; the first should be spoken while the last
should be spelled.
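
A crude "is this speakable?" test might look like this -- the vowel
and consonant-cluster rules are first guesses, nothing more:

   import re

   # Spell anything with no vowel at all, or with a run of consonants
   # that letter-to-sound rules would surely mangle.
   def should_spell(word):
       w = word.lower()
       if not re.search(r"[aeiouy]", w):
           return True     # "npwwtr" -> spell it
       if re.search(r"[^aeiouy]{5}", w):
           return True     # unpronounceable cluster
       return False        # "syzygy" -> speak it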

Foreign language words?  Proper names?  These tend to follow
different pronunciation rules than "most" words -- yet English is
littered with them.  How does one get "wooster" from "worcester"?
Or "bill-rick-uh" from "billerica"?

I've thus far ignored special domains that have their own
specialized terminology -- like medicine, music, etc.  Years
ago, you'd never expect to encounter something like
"192.168.1.10" in normal text.  And, the presence of a colon
between digits would almost be a guaranteed indicator of a time
of day or duration.  Yet, now you'll see "12:34:56:78:9A:BC"!
Cases where a special domain's terminology has bled into normal
usage.
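
One way to cope is to catch those domain patterns before the general
number rules ever fire; the regexes below are simplistic
placeholders:

   import re

   IPV4 = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")
   MAC  = re.compile(r"^[0-9A-Fa-f]{2}(:[0-9A-Fa-f]{2}){5}$")
   TIME = re.compile(r"^\d{1,2}:\d{2}(:\d{2})?$")

   def classify(token):
       if IPV4.match(token):
           return "ipv4"    # "192.168.1.10"
       if MAC.match(token):
           return "mac"     # "12:34:56:78:9A:BC"
       if TIME.match(token):
           return "time"    # "12:34:56"
       return "other"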

And, how are misspellings to be handled?  I'm reasonably sure
I've typed "teh" somewhere along the way when I meant to type
"the".

What about homophones like "their", "they're" and "there"?
While they are pronounced the same, they are likely to confuse
an algorithm that is trying to understand sentence structure
to help resolve other pronunciations or apply prosody.

Once the synthesizer knows what it wants to say, it has to sort
out how to actually pronounce it all.  Is "the" pronounced "thuh"
or "thee"?  Is "mash" pronounced "maysh" or "mash"?

Then, impose breath groups so it's not one long, continuous sequence
of utterances.
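
At its simplest, that's just splitting at major punctuation -- real
systems also have to break long punctuation-free clauses, which this
sketch ignores:

   import re

   def breath_groups(text):
       # Split at clause-level punctuation; keep the non-empty pieces.
       return [g.strip() for g in re.split(r"[,;:.?!]+", text) if g.strip()]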

And, prosody to avoid a monotone.

FINALLY, to apply some tonal structure to the utterances -- a "voice"
to suit the hearing preferences of the listener.

A synthesizer can economize on its implementation by taking shortcuts
in many of these places.  At an extreme, a synthesizer can just "spell"
everything that is presented to it and let the listener reassemble the
text in their meatware.  This would be insanely unusable but entirely
accurate and unambiguous!  One could expend a great deal of effort
making sure that the synthesizer says "see" very crisply when it
encounters a 'C' in the input stream!
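
That extreme is trivially codeable, which is exactly the problem:

   # The "spell everything" strategy: accurate, unambiguous, and
   # insanely unusable.  Letter names here are a partial sample.
   LETTER_NAMES = {"a": "ay", "b": "bee", "c": "see", "d": "dee"}

   def spell_out(text):
       return " ".join(LETTER_NAMES.get(ch.lower(), ch) for ch in text)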

<frown>

Or, skip prosody for fear of applying it incorrectly.

Or...  <whatever>

Which areas are most -- or least! -- important when it comes to
speech synthesis?  If a synthesizer can accurately normalize all
text presented to it (abbreviations, parts of speech resolution,
heterophone resolution, etc.) but speaks in a monotone, is that
better -- or worse -- than a synthesizer with a natural sounding
voice that "says the wrong words"?

