[DECtalk] Criteria, second attempt (long)

Don Text_to_Speech at GMX.com
Tue Jul 23 12:59:35 EDT 2019


On 7/23/2019 6:33 AM, Jayson Smith wrote:
> Very interesting thoughts and questions here.
>
> I guess my main preference would be for the sweet spot of both getting things
> right and pronouncing as much as possible correctly.

<grin>  That's sort of a non-answer.   "How do you define the sweet spot
so that I know THIS is it, and not THAT?"

I know!  We'll codify some criteria and use those to rank a variety of
different implementations so we know which is BEST!  <grin>

Are you getting a better feel for the situation I'm facing?  <frown>

> Back in the 80's we had an
> Apple IIe then an Apple IIgs computer, using the Echo II speech synthesizer. If
> you're not familiar with the Echo, it can be emulated in MAME today. It had
> sixty-three speech pitches, and within each of these pitch settings, there were
> three pitches it used for all text. In other words, it sounded very robotic.
> Not exactly monotone, but not very pleasant either. What's more, it got a lot
> of pronunciations wrong, but we really didn't care because it was all we had.
> When better came along, we were certainly ready to get away from Echo.

Yes, I worked with Votrax synthesizers in the mid-late 1970's.  The voice
was HORRIBLE!  But, it was a VOICE -- instead of printed text!  <grin>

> I guess what I'm saying is, I want it to have a natural inflection. I don't
> think I could read a novel with Echo or the old Braille 'N Speak type
> synthesizers and enjoy it very much, but then, being early systems, they also
> fail miserably on many pronunciations as well.

Yes.  It's really hard to come up with a set of "rules" that let you
convert sequences of letters into sounds, reliably.  There are just
too many exceptions to deal with.  Proper names are a total crapshoot.

And, that's BEFORE dealing with things like heterophones or adding
inflection, prosody, etc.

This is why it's so much easier to just build a huge pronunciation dictionary
and "look things up" instead of trying to deduce proper pronunciation
from spelling.
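To make that concrete, here's a rough Python sketch of the dictionary-first idea: consult a stored lexicon and only guess from spelling when the word is missing. The phoneme strings (ARPAbet-ish) and the one-sound-per-letter fallback are invented for illustration, not from any real synthesizer:

```python
# Dictionary-first pronunciation: look the word up in a lexicon and
# only fall back on letter-to-sound guessing when it's missing.
# Phoneme strings are ARPAbet-style and invented for illustration.

LEXICON = {
    "hello": "HH AH L OW",
    "colonel": "K ER N AH L",   # a word no simple rule set survives
}

# Deliberately naive fallback: one sound per letter, no context.
FALLBACK = {
    "a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH", "f": "F",
    "g": "G", "h": "HH", "i": "IH", "j": "JH", "k": "K", "l": "L",
    "m": "M", "n": "N", "o": "AA", "p": "P", "q": "K", "r": "R",
    "s": "S", "t": "T", "u": "AH", "v": "V", "w": "W", "x": "K S",
    "y": "Y", "z": "Z",
}

def pronounce(word: str) -> str:
    """Prefer the lexicon; guess letter by letter only as a last resort."""
    w = word.lower()
    if w in LEXICON:
        return LEXICON[w]
    # This is where the mistakes creep in -- and why the lexicon wins.
    return " ".join(FALLBACK.get(ch, ch) for ch in w)
```

The trade, of course, is memory: the lexicon has to hold every word you care about.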

Early synthesizers had limited resources, so they had to resort to rules
instead of dictionaries.  One of the first popular rule sets was
published by Elovitz, et al.  It was very terse.  It was also very
poor at addressing the problem!
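For flavor, here's a toy sketch of how context-sensitive rules in that spirit work -- emphatically NOT the actual Elovitz rule set, just the shape of the idea. Each rule says: when a letter fragment occurs with a given left and right context, emit some phonemes; rules are tried in order and the first match wins. "#" marks a word boundary, "" matches any context:

```python
# Toy context-sensitive letter-to-sound rules (Elovitz-style in shape
# only).  Rule = (left context, fragment, right context, phonemes).
# "#" is a word boundary; "" matches anything.  First match wins.

RULES = [
    ("", "ch", "",  "CH"),
    ("", "ee", "",  "IY"),
    ("", "e",  "#", ""),    # silent final e
    ("", "e",  "",  "EH"),
    ("", "s",  "",  "S"),
    ("", "h",  "",  "HH"),
    ("", "c",  "",  "K"),
    ("", "a",  "",  "AE"),
    ("", "t",  "",  "T"),
]

def letters_to_sounds(word: str) -> str:
    """Scan the word left to right, applying the first matching rule."""
    text = "#" + word.lower() + "#"
    phones, i = [], 0
    while i < len(text):
        for left, frag, right, ph in RULES:
            if (text.startswith(frag, i)
                    and text[:i].endswith(left)
                    and text[i + len(frag):].startswith(right)):
                if ph:                    # silent rules emit nothing
                    phones.append(ph)
                i += len(frag)
                break
        else:
            i += 1   # no rule matched this letter; skip it
    return " ".join(phones)
```

A real rule set needs hundreds of these, and the exceptions STILL pile up faster than the rules.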

But, even then, it was hard to come up with good criteria to
evaluate the quality of the candidate rule sets.  One approach
might be to just throw every word at the rules and compare the
pronunciation (sequence of phonemes) that it computes against
the pronunciation of some established dictionary.  The score
could be the percentage of words pronounced correctly!
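That brute-force scorer is trivial to sketch. The tiny reference dictionary and the deliberately bad stand-in rule set below are made up for illustration:

```python
# Score a candidate rule set: run every headword of a reference
# dictionary through it and count exact matches.  REFERENCE and
# naive_rules are invented stand-ins for illustration.

REFERENCE = {
    "hello": "HH AH L OW",
    "cat":   "K AE T",
    "the":   "DH AH",
}

def naive_rules(word: str) -> str:
    """Stand-in rule set: one sound per letter, '?' when stumped."""
    sounds = {"c": "K", "a": "AE", "t": "T"}
    return " ".join(sounds.get(ch, "?") for ch in word)

def score(rules_fn, reference) -> float:
    """Fraction (0..1) of reference words pronounced exactly right."""
    correct = sum(rules_fn(w) == phones for w, phones in reference.items())
    return correct / len(reference)
```

Here `naive_rules` nails "cat" and botches the rest, so it scores 1/3.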

But, do you really care if the synthesizer pronounces syzygy
properly?  What if the rules that were added to make that
pronunciation correct come at the expense of screwing up
the pronunciation of "Hello"?  Should words have weights assigned
to them?  Perhaps based on how frequently the word is encountered
in "normal" speech?  But, what if the expected input to the
synthesizer doesn't represent "normal" speech?  A synthesizer
tasked with speaking the contents of a novel would place
different emphasis than one that speaks the contents of emails!
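If you take the weighting idea seriously, the scorer only changes by a few lines -- each word counts as often as it appears in the target material. The counts below are invented; in practice they'd come from whatever corpus (novels, emails, error messages) the synthesizer is expected to speak:

```python
# Frequency-weighted scoring: a word counts as often as it occurs in
# the target corpus.  REFERENCE pronunciations and FREQ counts are
# invented for illustration.

REFERENCE = {
    "hello":  "HH AH L OW",
    "cat":    "K AE T",
    "syzygy": "S IH Z IH JH IY",
}

FREQ = {"hello": 500, "cat": 120, "syzygy": 1}   # corpus occurrence counts

def weighted_score(rules_fn, reference, freq) -> float:
    """Accuracy with each word weighted by its corpus frequency."""
    total = sum(freq[w] for w in reference)
    right = sum(freq[w] for w, ph in reference.items() if rules_fn(w) == ph)
    return right / total
```

With these weights, a rule set that gets "syzygy" right but mangles "hello" scores terribly -- and swapping in a different FREQ table (a novel corpus vs. an email corpus) reorders which rule sets look "best".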

> Like I've said before, DECtalk
> and Eloquence are still my favorite synths because they have a much more
> natural inflection and do a pretty good job of getting most words right.
> Nothing is going to get absolutely everything correct because the English
> language is too nuanced, and too much depends on contextual clues which only a
> human can find.

Playing devil's advocate:  this suggests the safest bet is to simply
speak the individual graphemes (letters, symbols) to the user and let
him assemble the words in his own mind.  This avoids all of the
mistakes that a synthesizer is likely to make!  But, also makes it
incredibly tedious to use!!

> As for a particular word or set of words being mispronounced in a specific way,
> I'd probably add them to an exception dictionary. If an exception dictionary is
> not an option, what would probably happen is that I'd get confused by them for
> a while, then after some time I'd get used to it and start doing the mental
> translation without thinking about it.

When you have no other alternative, you can become accustomed to all
sorts of blemishes.

But, this assumes continued exposure to those blemishes in order
to "learn" them.  In the case of the "crippled" synthesizer that
I mentioned in another post, the user will typically only rarely
encounter it.  So, he isn't likely to remember -- or predict -- how
it will mispronounce something.

This will happen at a time when the user MOST needs clarity from
the synthesizer; things aren't working as expected and he doesn't
want to have to guess at what the synthesizer is trying to tell him.

> As for words that aren't really words, "npwwtr" isn't a very good example
> because it has no vowels, and I've never seen a synthesizer do anything other
> than spelling out a word like that. What I'd tend to say is "Sardoodledom,"
> "Prosopopoeia," and "Humuhumunukunukuapua'a" are real words, but
> "Fasljoijoiqn," "Vlivmoiroirw," and "Voamfweemfwoomfj" are not. However, most
> synthesizers will treat all six as real words. I don't think I'd like a
> synthesizer that spelled any word it wasn't familiar with in some regard,
> simply because you're always going to run into strange words from time to time.

Yes.  On the other hand, having to deal with unexpected pronunciations
can be just as distracting.  I'm SURE that I've typed "teh" at least once
in my emails.  Without seeing the misspelling, it becomes a guessing game
where you rely on context to try to deduce what was intended.

The KRM had dedicated buttons to allow the parse of the text to
be focused on individual words or phrases.  You could repeat a
word countless times -- tweaking the pitch and rate knobs (the only
real "voice" controls available) until you were confident that you
knew what was being said.  Or, direct it to spell the word for you
as you knew how it pronounced each letter of the alphabet!

But, this requires a lot of interaction with the synthesizer.
Lots of "controls".  Controls that you may not have used in a
very long time, in the case of the crippled synthesizer that I
mentioned!

Now, imagine you're at an advanced age and you're not as "quick-witted"
as you used to be.  Or, perhaps have a hearing impairment.  When do
the failures of the synthesizer become insurmountable?


