[DECtalk] Question About Formant vs Natural Speech

Don Text_to_Speech at GMX.com
Sun Jun 19 09:28:26 EDT 2022


On 6/18/2022 11:25 PM, Brandon Tyson wrote:
> From what I thought I remembered reading, I feel like it was talking about
> synths like Vocalizer as compared to something like Eloquence.

Vocalizer appears to be diphone (or similar) based -- "sampled real people".
Eloquence is formant based ("vocal tract model").

A "voice" has different implementations in the different types of synthesizers.
It must embody data that allows the appropriate "speech sounds" to be made.
It must know the rules of the language in which it will be used.
It must carry a sort of "speaker identity".
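
Loosely, then, a "voice" is a bundle of those three things.  A toy Python
sketch of that bundle (the field names are my own invention, not any
particular synthesizer's layout):

    from dataclasses import dataclass, field

    @dataclass
    class Voice:
        # the "speech sounds" data: recorded units or formant targets
        sound_inventory: dict = field(default_factory=dict)
        # language rules: letter-to-sound, abbreviation expansion, prosody
        language_rules: list = field(default_factory=list)
        # speaker identity: pitch range, speaking rate, timbre settings
        identity: dict = field(default_factory=dict)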

Diphone synthesizers rely on an "inventory" of real, human-generated sounds
from which to "piece-together" speech utterances.  The larger the inventory,
the more likely it is to contain appropriate sounds for the utterance at hand.
Intonation and inflection become significant issues, since getting samples
of every sound COMBINATION in every form from a human subject is difficult.
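
To make the "piece-together" idea concrete, here's a rough Python sketch.
The diphone names and unit contents below are stand-ins invented for
illustration -- this is not Vocalizer's (or anyone else's) actual
inventory format:

    import numpy as np

    FS = 16000
    # stand-in arrays for recorded units; a real inventory holds thousands
    inventory = {
        "h-eh": np.zeros(int(0.08 * FS)),
        "eh-l": np.zeros(int(0.10 * FS)),
        "l-ow": np.zeros(int(0.12 * FS)),
    }

    def piece_together(phones):
        """Concatenate the unit covering each adjacent phone pair; a missing
        pair means the inventory has no appropriate sound for the utterance."""
        units = [inventory[f"{a}-{b}"] for a, b in zip(phones, phones[1:])]
        return np.concatenate(units)

    audio = piece_together(["h", "eh", "l", "ow"])   # a crude "hello"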

Formant synthesizers rely more on having good rules for how to produce
those speech sounds -- because those rules can be used to algorithmically
modify the characteristics of the modeled vocal tract to achieve the
intended result.

(Modifying recorded speech snippets is possible but more costly)

So, with a formant-based synthesizer you have more leeway over the speech
PROCESS -- assuming you have a good understanding of it and have expressed
that adequately in your ruleset!  :>
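
For contrast, a minimal formant-style sketch: a pulse train excitation run
through a cascade of second-order resonators.  The formant frequencies and
bandwidths below are illustrative values only, not anything a shipping
synthesizer (Eloquence included) actually uses:

    import numpy as np

    FS = 16000  # sample rate, Hz

    def resonator(signal, freq, bandwidth, fs=FS):
        """Second-order IIR resonator centered on one formant frequency."""
        r = np.exp(-np.pi * bandwidth / fs)
        theta = 2.0 * np.pi * freq / fs
        a1, a2 = -2.0 * r * np.cos(theta), r * r
        gain = 1.0 - r                     # rough amplitude normalization
        out = np.zeros_like(signal)
        for n in range(len(signal)):
            out[n] = gain * signal[n] - a1 * out[n - 1] - a2 * out[n - 2]
        return out

    def synthesize_vowel(formants, f0=110, dur=0.3):
        """Drive the resonator cascade with a pulse train at pitch f0."""
        n = int(dur * FS)
        excitation = np.zeros(n)
        excitation[::int(FS / f0)] = 1.0   # one pulse per pitch period
        out = excitation
        for freq, bw in formants:          # the "rules" live in this table
            out = resonator(out, freq, bw)
        return out / np.max(np.abs(out))

    # illustrative formant frequencies/bandwidths (Hz) for an /a/-like vowel
    samples = synthesize_vowel([(730, 90), (1090, 110), (2440, 160)])

Note that swapping the formant table swaps the sound without touching any
recordings -- that's the leeway over the speech PROCESS mentioned above.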

> I was hoping I could find a page talking more about the listener fatigue and
> detail on how a natural voice can contribute to this for some people more so
> than the formant speech.

IMO, the problem with all comparisons (that I've seen) of synthesizers and
synthesizer technologies is there is no "baseline" possible.  You are
perpetually comparing apples to aardvarks.  Like trying on one model of
a left WALKING shoe and comparing to another model of a right HIKING shoe,
knowing that your left and right feet are not identical (and the shoes are
designed for entirely different uses!)

Talking to your *bank*, on the phone, usually has you interacting with a
delightfully intelligible voice -- despite the poor quality/limited bandwidth
of the communication medium.  Why aren't all speech synthesizers so pleasant?

OTOH, how would (should?) that delightful "bank employee" speak:
    Aardvarks sloshing red popcorn under Chicago
Or, the infamous:  ghoti

I think a lot of the fatigue issue comes from over-exposure to the voice.
Think about your evening newscast (TV).  When a single talking-head is
the presenter, the stories all seem to run together... the broadcaster
just drones on and on.

OTOH, when there are a pair of presenters -- ideally a male and female -- then
there is some deliberate variety in the presentation.  There's some "novelty"
for your brain to seize on as the speaker periodically changes.

OToOH, imagine those two presenters were alternating *words*!  The novelty
becomes excessive -- too much work for the brain to sort out what is
being said.  "People don't talk that way".

Similarly, consider sitting through a long lecture, church service, etc.
Being entirely passive lets the brain shut down due to boredom/fatigue.

By extension, any technology that makes it easier for you to "humanize"
the output can, potentially, be more palatable to a listener.

I've written 5 synthesizers, now, and have found that none are truly "best"
unless you can clearly identify the listener and the source material.
I've played with letter-to-sound rules, prosody, the underlying
sound technology, etc.

I've had users swear that #2 was "clearly the easiest to understand"
while others would claim it was the worst, giving top honors to #4.

In some applications, there are clear winners -- but, usually, because
they introduce some other important criteria that the remaining
implementations can't easily address.  (E.g., making voices for
folks who have lost their natural ability to speak)

<shrug>

> I also feel like how well the synth can perform at a high speaking rate was
> another thing I saw.

I think a lot, there, depends on the rules put in place, not the
"sound technology" being used.

I always laugh when I hear a synthesizer speak a web address... I want
to "hurry it up" as real people aren't that pedantic/deliberate in
their speech:  duh-bull-you duh-bull-you duh-bull-you dot ...  It's
almost like the synthesizer is trying on new LIPS!

(c'mon, do you know anyone who is THAT deliberate and monotonous in
their speaking?)
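
That pacing is decided by the front-end expansion rules long before any
"sound technology" gets involved.  A toy Python example (both rule sets
and their expansions are made up):

    PEDANTIC = {"www": "double-u double-u double-u", ".": "dot", "/": "slash"}
    CASUAL   = {"www": "dub dub dub",                ".": "dot", "/": "slash"}

    def expand_url(url, rules):
        """Tokenize an address and expand each token per the active rule set."""
        spaced = url.replace(".", " . ").replace("/", " / ")
        return " ".join(rules.get(tok, tok) for tok in spaced.split())

    print(expand_url("www.example.com", PEDANTIC))
    print(expand_url("www.example.com", CASUAL))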

It's possible to excise portions of prerecorded speech to speed up
the speaking rate without significantly impacting *pitch* (obviously,
you can make time run faster to speed things up but you then end up
with Mickey Mouse speech).  I'm not sure how extensively this is used
in diphone synthesis, though, as it requires "editing" the samples
from the unit database in real time.
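
Here's a bare-bones sketch of that cut-and-splice idea.  Real time-scale
modification (SOLA/PSOLA-style methods) aligns the splice points with the
pitch periods; this toy version just excises a chunk from each window and
cross-fades over the cut:

    import numpy as np

    def speed_up(samples, factor=1.25, window=480, fade=48):
        """Shorten `samples` by ~factor by excising part of every window."""
        keep = int(window / factor)        # how much of each window survives
        ramp = np.linspace(1.0, 0.0, fade)
        out = []
        for start in range(0, len(samples) - window, window):
            chunk = samples[start:start + keep].copy()
            nxt = samples[start + keep:start + keep + fade]
            # cross-fade over the cut so the splice doesn't click
            chunk[-fade:] = chunk[-fade:] * ramp + nxt * (1.0 - ramp)
            out.append(chunk)
        return np.concatenate(out)

    faster = speed_up(np.random.randn(16000))   # placeholder "audio", 1 s at 16 kHz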

[Time is another aspect of synthesis that has to be considered.
How do you handle the case where the synthesizer doesn't, yet,
have the full utterance at its disposal in order to perform
a complete analysis of intent, impose inflection, etc.?  E.g.,
you can infer a question from the initial "W" word -- Who,
What, Why, etc.  But, not always:  "Bob went home?" vs. "Bob
went home!"  This isn't as much of a problem for a screen reader
where the entire text is LIKELY available to the synthesizer.
But, not all speech apps are screen readers!  :> ]
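
A toy illustration of that incremental-prosody problem -- peeking at the
first word works for many questions, but it can't distinguish the last two
examples above until the final punctuation arrives:

    WH_WORDS = {"who", "what", "why", "where", "when", "how"}

    def guess_contour(words_so_far):
        """Guess the intonation before the full utterance is available."""
        if words_so_far and words_so_far[0].lower().strip(",") in WH_WORDS:
            return "question-like"
        return "unknown until the end punctuation arrives"

    print(guess_contour(["What", "time", "is"]))    # question-like
    print(guess_contour(["Bob", "went", "home"]))   # unknown until the end punctuation arrives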

How much overhead do you want to "lose" to speech synthesis?
(overhead translates to cost)

