[DECtalk] Criteria, second attempt (long)

Tue Jul 23 09:33:36 EDT 2019

Hi,

Very interesting thoughts and questions here.

I guess my main preference would be for the sweet spot of both getting 
things right and pronouncing as much as possible correctly. Back in the 
'80s we had an Apple IIe, then an Apple IIgs computer, using the Echo II 
speech synthesizer. If you're not familiar with the Echo, it can be 
emulated in MAME today. It had sixty-three speech pitches, and within 
each of these pitch settings, there were three pitches it used for all 
text. In other words, it sounded very robotic. Not exactly monotone, but 
not very pleasant either. What's more, it got a lot of pronunciations 
wrong, but we really didn't care because it was all we had. When better 
came along, we were certainly ready to get away from Echo.

I guess what I'm saying is, I want it to have a natural inflection. I 
don't think I could read a novel with Echo or the old Braille 'N Speak 
type synthesizers and enjoy it very much, but then, being early systems, 
they also fail miserably on many pronunciations. Like I've said 
before, DECtalk and Eloquence are still my favorite synths because they 
have a much more natural inflection and do a pretty good job of getting 
most words right. Nothing is going to get absolutely everything correct 
because the English language is too nuanced, and too much depends on 
contextual clues which only a human can find.

As for a particular word or set of words being mispronounced in a 
specific way, I'd probably add them to an exception dictionary. If an 
exception dictionary is not an option, what would probably happen is 
that I'd get confused by them for a while, then after some time I'd get 
used to it and start doing the mental translation without thinking about it.
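
To make the exception dictionary idea concrete, here is a minimal sketch in Python. It is only an illustration of the general technique, not how DECtalk or any particular screen reader stores its dictionary. The names EXCEPTIONS and apply_exceptions are made up for this example, and the two respellings are the ones Don gives in the quoted message below.

```python
import re

# Hypothetical exception entries: written word -> respelling that the
# synthesizer is assumed to pronounce correctly.
EXCEPTIONS = {
    "worcester": "wooster",
    "billerica": "bill-rick-uh",
}

# Matches runs of letters and apostrophes, so punctuation is left alone.
_WORD = re.compile(r"[A-Za-z']+")


def apply_exceptions(text, exceptions=EXCEPTIONS):
    """Replace whole words found in the exception dictionary, ignoring case."""

    def swap(match):
        word = match.group(0)
        # Fall back to the original word when there is no entry for it.
        return exceptions.get(word.lower(), word)

    return _WORD.sub(swap, text)


print(apply_exceptions("I drove from Worcester to Billerica."))
```

The lookup is done on whole words before the text reaches the synthesizer, so an entry for one word cannot accidentally change a longer word that merely contains it.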

As for words that aren't really words, "npwwtr" isn't a very good 
example because it has no vowels, and I've never seen a synthesizer do 
anything other than spelling out a word like that. What I'd tend to say 
is "Sardoodledom," "Prosopopoeia," and "Humuhumunukunukuapua'a" are real 
words, but "Fasljoijoiqn," "Vlivmoiroirw," and "Voamfweemfwoomfj" are 
not. However, most synthesizers will treat all six as real words. I 
don't think I'd like a synthesizer that spelled any word it wasn't 
familiar with in some regard, simply because you're always going to run 
into strange words from time to time.
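
The speak-or-spell behavior described above can be sketched roughly as follows. This is an assumption about one possible rule (spell a word only when it contains no vowel letters, counting "y" as a vowel), not a description of any real synthesizer's logic. The function names should_spell and render are hypothetical.

```python
# Treat "y" as a vowel so that words like "Syzygy" are still spoken.
VOWELS = set("aeiouy")


def should_spell(word):
    """Return True when the word has letters but none of them is a vowel."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    return bool(letters) and not any(ch in VOWELS for ch in letters)


def render(word):
    """Spell vowel-less words letter by letter; pass everything else through."""
    if should_spell(word):
        return " ".join(ch.upper() for ch in word if ch.isalpha())
    return word


for sample in ["npwwtr", "Syzygy", "Sardoodledom", "Fasljoijoiqn"]:
    print(sample, "->", render(sample))
```

Under this rule only "npwwtr" is spelled out. The made-up words with vowels are handed to the pronunciation rules like any real word, which matches what most synthesizers seem to do.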

Hope this helps,

Jayson

On 7/21/2019 4:06 PM, Don wrote:
> Ugh. Sorry, I guess what I've been thinking about when it comes to
> synthesizer evaluation criteria is a bit different.  Let me try to
> approach the synthesis problem in finer details to explain the areas
> where the results can be somewhat less than we'd hope.
>
> TL;DR... what follows is a rough breakdown of the process of going
> from text to speech and the challenges that have to be addressed
> in each step.  Each challenge represents an opportunity for a
> particular synthesizer to screw up.  What I want to get a feel for
> is how significant those screwups can be for the listener.
>
> For example, I have a (human) friend who pronounces A R K A N S A S
> as "are-kansas".  Every time she utters the word, my brain literally
> slips out of gear as I have to sort out what she's actually saying.
> Then, I have to play "catch up" to process what she's said while my
> brain was sorting out this blip.  Think about each of the following
> issues and how your synthesizer might screw up in each of these
> areas and what that does to your comprehension and ease of use.
>
> If we assume the synthesizer is challenged by "input" that hasn't been
> created with a speech synthesizer in mind, the first problem lies in
> the synthesizer "understanding" what the text is meant to say.
>
> Written text contains lots of abbreviations that the synthesizer must
> recognize and then determine how to pronounce.  Some will be expanded
> in full -- "etc." becoming "etcetera".  Others may be rendered into a
> completely different form -- "et al." becoming "and associates",
> "e.g." becoming "for example", etc.
>
> Some abbreviations may omit the period(s) that we were taught were
> an essential part of an "abbreviation" (those of us old-timers) like
> ASAP, wrt, aka, am/pm, etc.
>
> And, some abbreviations have effectively become words in their own
> right -- like "TV".
>
> Others may be ambiguous:  is "Dr." to mean "doctor" or "drive"? Does
> its interpretation change if the period is missing like "Dr"?  Or, if
> it is capitalized like "DR"?  Or, in lowercase like "dr"?  I alluded
> to this in my example about doctor jones on jones drive...
>
> This introduces the problem of heterophones, where part-of-speech
> analysis is required to assign a pronunciation to a word:
>   I read every day; I read two books yesterday.
>   My project is to project our income for the next two years.
>   I am going to the store at 9am.
>
> Are contractions treated as "special words" with their own
> pronunciation rules?
>
> Numbers can be rendered in a variety of ways.  We likely expect
> "2019" to be read as "twenty nineteen" and not "two thousand and
> nineteen".  Unless, of course, context suggests that it isn't! <grin>
>
> On the other hand, "2267" is likely "two thousand two hundred
> and sixty seven".  When do you switch from the systematic approach
> we were taught as kids for pronouncing large numbers where "1234567"
> is "one million, two hundred thirty four thousand, five hundred and
> sixty seven" to simply rattling off individual digits like "one two
> three four five six seven"?
>
> How is punctuation to be treated?  Do these marks just help
> establish "breath groups"?  Do they need to be read to the user
> lest he not realize the presence of a mark that may be significant?
> (For those using screen readers, how annoying is it to hear all
> of the punctuation that I've been using be spoken to you?)
>
> What about "words" that really aren't words?  When should a
> synthesizer abandon its notion of legitimate words in favor of
> resorting to a fallback position?  "Syzygy" is a real word
> but "npwwtr" is not; the first should be spoken while the last
> should be spelled.
>
> Foreign language words?  Proper names?  These tend to follow
> different pronunciation rules than "most" words -- yet English is
> littered with them.  How does one get "wooster" from "worcester"?
> Or "bill-rick-uh" from "billerica"?
>
> I've thus far ignored special domains that have their own
> specialized terminology -- like medicine, music, etc.  Years
> ago, you'd never expect to encounter something like
> "192.168.1.10" in normal text.  And, the presence of a colon
> between digits would almost be a guaranteed indicator of a time
> of day or duration.  Yet, now you'll see "12:34:56:78:9A:BC"!
> Cases where a special domain's terminology has bled into normal
> usage.
>
> And, how are misspellings to be handled?  I'm reasonably sure
> I've typed "teh" somewhere along the way when I meant to type
> "the".
>
> What about homophones like "their", "they're" and "there"?
> While they are pronounced the same, they are likely to confuse
> an algorithm that is trying to understand sentence structure
> to help resolve other pronunciations or apply prosody.
>
> Once the synthesizer knows what it wants to say, it has to sort
> out how to actually pronounce it all.  Is "the" pronounced "thuh"
> or "thee"?  Is "mash" pronounced "maysh" or "mash"?
>
> Then, impose breath groups so it's not one long, continuous sequence
> of utterances.
>
> And, prosody to avoid a monotone.
>
> FINALLY, to apply some tonal structure to the utterances -- a "voice"
> to suit the hearing preferences of the listener.
>
> A synthesizer can economize on its implementation by taking shortcuts
> in many of these places.  At an extreme, a synthesizer can just "spell"
> everything that is presented to it and let the listener reassemble the
> text in their meatware.  This would be insanely unusable but entirely
> accurate and unambiguous!  One could expend a great deal of effort
> making sure that the synthesizer says "see" very crisply when it
> encounters a 'C' in the input stream!
>
> <frown>
>
> Or, skip prosody for fear of applying it incorrectly.
>
> Or...  <whatever>
>
> Which areas are most -- or least! -- important when it comes to
> speech synthesis?  If a synthesizer can accurately normalize all
> text presented to it (abbreviations, parts of speech resolution,
> heterophone resolution, etc.) but speaks in a monotone, is that
> better -- or worse -- than a synthesizer with a natural sounding
> voice that "says the wrong words"?