[DECtalk] Criteria, second attempt (long)
Jayson Smith
jaybird at bluegrasspals.com
Tue Jul 23 09:33:36 EDT 2019
Hi,
Very interesting thoughts and questions here.
I guess my main preference would be for the sweet spot of both getting
things right and pronouncing as much as possible correctly. Back in the
80's we had an Apple IIe then an Apple IIgs computer, using the Echo II
speech synthesizer. If you're not familiar with the Echo, it can be
emulated in MAME today. It had sixty-three speech pitches, and within
each of these pitch settings, there were three pitches it used for all
text. In other words, it sounded very robotic. Not exactly monotone, but
not very pleasant either. What's more, it got a lot of pronunciations
wrong, but we really didn't care because it was all we had. When better
came along, we were certainly ready to get away from Echo.
I guess what I'm saying is, I want it to have a natural inflection. I
don't think I could read a novel with Echo or the old Braille 'N Speak
type synthesizers and enjoy it very much, but then, being early systems,
they also fail miserably on many pronunciations. Like I've said
before, DECtalk and Eloquence are still my favorite synths because they
have a much more natural inflection and do a pretty good job of getting
most words right. Nothing is going to get absolutely everything correct
because the English language is too nuanced, and too much depends on
contextual clues which only a human can find.
As for a particular word or set of words being mispronounced in a
specific way, I'd probably add them to an exception dictionary. If an
exception dictionary is not an option, what would probably happen is
that I'd get confused by them for a while, then after some time I'd get
used to it and start doing the mental translation without thinking about it.
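To make the exception dictionary idea concrete, here is a minimal sketch in Python. It is not how DECtalk, Eloquence, or any particular screen reader stores its dictionary; the names EXCEPTIONS and apply_exceptions are made up for illustration, and the two entries are borrowed from the place names mentioned in the quoted message below.

```python
import re

# Hypothetical exception dictionary: lowercase word -> respelling that the
# synthesizer's normal letter-to-sound rules are assumed to read correctly.
EXCEPTIONS = {
    "worcester": "wooster",
    "billerica": "bill-rick-uh",
}

# A "word" here is just a run of letters and apostrophes (an assumption).
_WORD = re.compile(r"[A-Za-z']+")


def apply_exceptions(text, exceptions=EXCEPTIONS):
    """Replace every whole word found in the dictionary; leave the rest alone."""
    def swap(match):
        word = match.group(0)
        return exceptions.get(word.lower(), word)

    return _WORD.sub(swap, text)


print(apply_exceptions("Worcester and Billerica are towns."))
```

The lookup is case-insensitive and only touches whole words, so punctuation and unlisted words pass through unchanged.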
As for words that aren't really words, "npwwtr" isn't a very good
example because it has no vowels, and I've never seen a synthesizer do
anything other than spelling out a word like that. What I'd tend to say
is "Sardoodledom," "Prosopopoeia," and "Humuhumunukunukuapua'a" are real
words, but "Fasljoijoiqn," "Vlivmoiroirw," and "Voamfweemfwoomfj" are
not. However, most synthesizers will treat all six as real words. I
don't think I'd like a synthesizer that spelled any word it wasn't
familiar with in some regard, simply because you're always going to run
into strange words from time to time.
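As a rough sketch of the behavior I'm describing, where only vowel-less strings get spelled and everything else is attempted as a word, something like the following would do. This is an assumed heuristic written for illustration, not the rule any real synthesizer is known to use; should_spell and render are hypothetical names, and counting "y" as a vowel is my own choice so that "Syzygy" still gets spoken.

```python
# Letters treated as vowels; including "y" is an assumption of this sketch.
VOWELS = set("aeiouy")


def should_spell(word):
    """Return True when the token contains letters but no vowel letters."""
    letters = [c for c in word.lower() if c.isalpha()]
    return bool(letters) and not any(c in VOWELS for c in letters)


def render(word):
    """Spell vowel-less tokens letter by letter; pass everything else through."""
    if should_spell(word):
        return " ".join(word.upper())
    return word


for sample in ["npwwtr", "Syzygy", "Fasljoijoiqn", "TV"]:
    print(sample, "->", render(sample))
```

With this rule "npwwtr" and "TV" are spelled, while "Syzygy" and "Fasljoijoiqn" are both handed to the normal pronunciation rules, which matches what I've seen most synthesizers do.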
Hope this helps,
Jayson
On 7/21/2019 4:06 PM, Don wrote:
> Ugh. Sorry, I guess what I've been thinking about when it comes to
> synthesizer evaluation criteria is a bit different. Let me try to
> approach the synthesis problem in finer details to explain the areas
> where the results can be somewhat less than we'd hope.
>
> TL;DR... what follows is a rough breakdown of the process of going
> from text to speech and the challenges that have to be addressed
> in each step. Each challenge represents an opportunity for a
> particular synthesizer to screw up. What I want to get a feel for
> is how significant those screwups can be for the listener.
>
> For example, I have a (human) friend who pronounces A R K A N S A S
> as "are-kansas". Every time she utters the word, my brain literally
> slips out of gear as I have to sort out what she's actually saying.
> Then, I have to play "catch up" to process what she's said while my
> brain was sorting out this blip. Think about each of the following
> issues and how your synthesizer might screw up in each of these
> areas and what that does to your comprehension and ease of use.
>
> If we assume the synthesizer is challenged by "input" that hasn't been
> created with a speech synthesizer in mind, the first problem lies in
> the synthesizer "understanding" what the text is meant to say.
>
> Written text contains lots of abbreviations that the synthesizer must
> recognize and then determine how to pronounce. Some will be expanded
> in full -- "etc." becoming "etcetera". Others may be rendered into a
> completely different form -- "et al." becoming "and associates",
> "e.g." becoming "for example", etc.
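The expansions listed above can be sketched as a simple table lookup. This is only an illustration in Python under stated assumptions: the table holds just the three abbreviations from the message, the names ABBREVIATIONS and expand_abbreviations are hypothetical, and no real synthesizer's front end is being described.

```python
import re

# Only the three examples from the message; a real table would be far larger.
ABBREVIATIONS = {
    "et al.": "and associates",
    "etc.": "etcetera",
    "e.g.": "for example",
}

# Try longer keys first, and refuse to match in the middle of a word.
_PATTERN = re.compile(
    r"(?<!\w)(?:"
    + "|".join(re.escape(k) for k in sorted(ABBREVIATIONS, key=len, reverse=True))
    + ")",
    re.IGNORECASE,
)


def expand_abbreviations(text):
    """Replace each known abbreviation with its spoken form."""
    return _PATTERN.sub(lambda m: ABBREVIATIONS[m.group(0).lower()], text)


print(expand_abbreviations("Smith et al. wrote about speech, e.g. prosody, etc."))
```

Note that the period belongs to the abbreviation, so a sentence-ending "etc." loses its full stop in this sketch; deciding whether that period also ends the sentence is one more place a synthesizer can go wrong.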
>
> Some abbreviations may omit the period(s) that we were taught were
> an essential part of an "abbreviation" (those of us old-timers) like
> ASAP, wrt, aka, am/pm, etc.
>
> And, some abbreviations have effectively become words in their own
> right -- like "TV".
>
> Others may be ambiguous: is "Dr." to mean "doctor" or "drive"? Does
> its interpretation change if the period is missing like "Dr"? Or, if
> it is capitalized like "DR"? Or, in lowercase like "dr"? I alluded
> to this in my example about doctor jones on jones drive...
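One naive way to approach the "Dr" ambiguity is to look at the neighboring word. The Python sketch below assumes a single rule, namely that "Dr" followed by a capitalized word is a title and anything else is a street suffix. The function name expand_dr is hypothetical and the rule is an assumption made for illustration, not a description of any shipping synthesizer.

```python
def expand_dr(text):
    """Expand "Dr" to "Doctor" or "Drive" using only the following word."""
    tokens = text.split()
    out = []
    for i, tok in enumerate(tokens):
        core = tok.rstrip(".,;:!?")
        if core.lower() == "dr":
            nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
            # Assumed rule: a capitalized word right after "Dr" means a name.
            out.append("Doctor" if nxt[:1].isupper() else "Drive")
        else:
            out.append(tok)
    return " ".join(out)


print(expand_dr("Dr. Jones lives on Jones Dr."))
```

The rule is case-insensitive, so "Dr", "DR" and "dr" are all treated alike, and it drops any punctuation attached to the abbreviation. It also fails as soon as "Dr." ends one sentence and a capitalized word starts the next, which is exactly the kind of screwup the question is about.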
>
> This introduces the problem of heterophones, where part-of-speech
> analysis is required to assign a pronunciation to a word:
> I read every day; I read two books yesterday.
> My project is to project our income for the next two years.
> I am going to the store at 9am.
>
> Are contractions treated as "special words" with their own
> pronunciation rules?
>
> Numbers can be rendered in a variety of ways. We likely expect
> "2019" to be read as "twenty nineteen" and not "two thousand and
> nineteen". Unless, of course, context suggests that it isn't! <grin>
>
> On the other hand, "2267" is likely "two thousand two hundred
> and sixty seven". When do you switch from the systematic approach
> we were taught as kids for pronouncing large numbers where "1234567"
> is "one million, two hundred thirty four thousand, five hundred and
> sixty seven" to simply rattling off individual digits like "one two
> three four five six seven"?
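The choice between reading a number as a year, as a cardinal, or digit by digit can be sketched in Python as below. The cutoffs are assumptions picked for illustration (a 1900 to 2099 "year window" and digit-by-digit reading above four digits), and speak_number and its helpers are hypothetical names; the wording of the cardinals follows the examples in the message, including the "and".

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def two_digits(n):
    """Words for 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])


def cardinal(n):
    """Words for 0..9999, with "and" before the final two-digit group."""
    parts = []
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    if thousands:
        parts.append(ONES[thousands] + " thousand")
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if rest or not parts:
        if parts:
            parts.append("and")
        parts.append(two_digits(rest))
    return " ".join(parts)


def speak_number(digits):
    """Pick a reading style for a string of decimal digits."""
    if len(digits) > 4:                       # assumed cutoff: rattle off digits
        return " ".join(ONES[int(d)] for d in digits)
    n = int(digits)
    high, low = divmod(n, 100)
    if 1900 <= n <= 2099 and low >= 10:       # assumed "looks like a year" window
        return two_digits(high) + " " + two_digits(low)
    return cardinal(n)


for sample in ["2019", "2267", "1234567"]:
    print(sample, "->", speak_number(sample))
```

Years such as 2005 or 1900 fall through to the cardinal reading here, and nothing in the sketch looks at context, so "2019 widgets" would still come out as "twenty nineteen".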
>
> How is punctuation to be treated? Do these marks just help
> establish "breath groups"? Do they need to be read to the user
> lest he not realize the presence of a mark that may be significant?
> (For those using screen readers, how annoying is it to hear all
> of the punctuation that I've been using be spoken to you?)
>
> What about "words" that really aren't words? When should a
> synthesizer abandon its notion of legitimate words in favor of
> resorting to a fallback position? "Syzygy" is a real word
> but "npwwtr" is not; the first should be spoken while the last
> should be spelled.
>
> Foreign language words? Proper names? These tend to follow
> different pronunciation rules than "most" words -- yet English is
> littered with them. How does one get "wooster" from "worcester"?
> Or "bill-rick-uh" from "billerica"?
>
> I've thus far ignored special domains that have their own
> specialized terminology -- like medicine, music, etc. Years
> ago, you'd never expect to encounter something like
> "192.168.1.10" in normal text. And, the presence of a colon
> between digits would almost be a guaranteed indicator of a time
> of day or duration. Yet, now you'll see "12:34:56:78:9A:BC"!
> Cases where a special domain's terminology has bled into normal
> usage.
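To illustrate how tokens from these special domains might be told apart before any pronunciation is chosen, here is a small Python sketch using regular expressions. The patterns and the classify function are assumptions for illustration; in particular, the IPv4 pattern does not check that each number is at most 255, and the time pattern does not check ranges either.

```python
import re

# Deliberately loose patterns; none of them validates numeric ranges.
IPV4 = re.compile(r"\d{1,3}(\.\d{1,3}){3}")
MAC = re.compile(r"[0-9A-Fa-f]{2}(:[0-9A-Fa-f]{2}){5}")
TIME = re.compile(r"\d{1,2}:\d{2}(:\d{2})?")


def classify(token):
    """Label a token as "mac", "ipv4", "time" or "other"."""
    if MAC.fullmatch(token):
        return "mac"
    if IPV4.fullmatch(token):
        return "ipv4"
    if TIME.fullmatch(token):
        return "time"
    return "other"


for sample in ["192.168.1.10", "12:34:56:78:9A:BC", "12:34:56", "9:30"]:
    print(sample, "->", classify(sample))
```

The MAC check runs first only for clarity; a six-group address cannot match the time pattern anyway, because fullmatch requires the whole token to fit.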
>
> And, how are misspellings to be handled? I'm reasonably sure
> I've typed "teh" somewhere along the way when I meant to type
> "the".
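A very small sketch of one way to catch a slip like "teh" is to try swapping each pair of adjacent letters and see whether a known word results. This Python example assumes a tiny hypothetical word list (VOCABULARY) and handles only adjacent transpositions; fix_transposition is a made-up name, not part of any synthesizer.

```python
# Hypothetical vocabulary; a real system would use a full lexicon.
VOCABULARY = {"the", "their", "there", "they're"}


def fix_transposition(word, vocabulary=VOCABULARY):
    """Return a known word reachable by swapping two adjacent letters."""
    if word in vocabulary:
        return word
    for i in range(len(word) - 1):
        candidate = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        if candidate in vocabulary:
            return candidate
    return word  # no repair found; leave the token as typed


print(fix_transposition("teh"))
```

Anything the swap cannot repair is returned untouched, so the later stages still have to decide whether to pronounce it or spell it.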
>
> What about homophones like "their", "they're" and "there"?
> While they are pronounced the same, they are likely to confuse
> an algorithm that is trying to understand sentence structure
> to help resolve other pronunciations or apply prosody.
>
> Once the synthesizer knows what it wants to say, it has to sort
> out how to actually pronounce it all. Is "the" pronounced "thuh"
> or "thee"? Is "mash" pronounced "maysh" or "mash"?
>
> Then, impose breath groups so it's not one long, continuous sequence
> of utterances.
>
> And, prosody to avoid a monotone.
>
> FINALLY, to apply some tonal structure to the utterances -- a "voice"
> to suit the hearing preferences of the listener.
>
> A synthesizer can economize on its implementation by taking shortcuts
> in many of these places. At an extreme, a synthesizer can just "spell"
> everything that is presented to it and let the listener reassemble the
> text in their meatware. This would be insanely unusable but entirely
> accurate and unambiguous! One could expend a great deal of effort
> making sure that the synthesizer says "see" very crisply when it
> encounters a 'C' in the input stream!
>
> <frown>
>
> Or, skip prosody for fear of applying it incorrectly.
>
> Or... <whatever>
>
> Which areas are most -- or least! -- important when it comes to
> speech synthesis? If a synthesizer can accurately normalize all
> text presented to it (abbreviations, parts of speech resolution,
> heterophone resolution, etc.) but speaks in a monotone, is that
> better -- or worse -- than a synthesizer with a natural sounding
> voice that "says the wrong words"?