[DECtalk] Question About Formant vs Natural Speech

joshknnd1982 at gmail.com
Sun Jun 19 09:52:51 EDT 2022


Actually, Vocalizer is unit-selection speech, and I think the new neural-TTS
voices like Alexa may be HMM-based speech. HMM (or h-m-m) means hidden Markov
model.


-----Original Message-----
From: Dectalk <dectalk-bounces at bluegrasspals.com> On Behalf Of Don
Sent: Sunday, June 19, 2022 9:28
To: dectalk at bluegrasspals.com
Subject: Re: [DECtalk] Question About Formant vs Natural Speech

On 6/18/2022 11:25 PM, Brandon Tyson wrote:
> From what I thought I remembered reading, I feel like it was talking 
> about synths like Vocalizer as compared to something like Eloquence.

Vocalizer appears to be diphone (or similar) based -- "sampled real people".
Eloquence is formant based ("vocal tract model").

A "voice" has different implementations in the different types of
synthesizers.
It must embody data that allows the appropriate "speech sounds" to be made.
It must know the rules of the language in which it will be used.
It must carry a sort of "speaker identity".

Diphone synthesizers rely on an "inventory" of real, human-generated sounds
from which to "piece-together" speech utterances.  The larger the inventory,
the more likely to find more appropriate sounds for the utterance at hand.
Intonation and inflection become significant issues as getting samples of
every sound COMBINATION in every form is difficult from a human subject.
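
To make the inventory problem concrete, here is a minimal sketch of how an
utterance breaks down into diphone units (adjacent phoneme pairs). This is not
how Vocalizer or any particular product does it. The phoneme symbols, the "_"
silence marker, the figure of 40 phonemes, and the function names are all
assumptions picked for illustration.

```python
def to_diphones(phonemes):
    """Return the diphone units (adjacent phoneme pairs) for an utterance.

    A silence symbol "_" is padded on both ends so the first and last
    phonemes also get a transition unit.
    """
    padded = ["_"] + list(phonemes) + ["_"]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


def inventory_upper_bound(num_phonemes):
    """Upper bound on distinct diphones: every ordered pair, silence included."""
    n = num_phonemes + 1  # +1 for the silence symbol
    return n * n


# Hypothetical phoneme string for "hello"
units = to_diphones(["h", "eh", "l", "ow"])
print(units)                       # ['_-h', 'h-eh', 'eh-l', 'l-ow', 'ow-_']
print(inventory_upper_bound(40))   # 1681
```

That bound is for ONE neutral recording of each pair. Multiply by each stress
level, pitch contour and speaking style you want covered and it is easy to see
why getting everything out of a human subject is hard.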

Formant synthesizers rely more on having good rules for how to make these
representations -- because those rules can be used to algorithmically modify
the characteristics of the modeled vocal tract to achieve the intended
result.

(Modifying recorded speech snippets is possible but more costly)

So, with a formant-based synthesizer you have more leeway over the speech
PROCESS -- assuming you have a good understanding of it and have expressed
that adequately in your ruleset!  :>
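
As a rough illustration of that leeway, here is a sketch of the second-order
digital resonator commonly used in Klatt-style formant synthesizers, driven by
a crude impulse-train "glottal source". It is NOT the DECtalk or Eloquence
code. The formant frequencies and bandwidths, the 10 kHz sample rate, and all
function names are assumptions chosen for the example.

```python
import math


def resonator_coeffs(freq_hz, bandwidth_hz, sample_rate):
    """Coefficients for y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = (2.0 * math.exp(-math.pi * bandwidth_hz * t)
         * math.cos(2.0 * math.pi * freq_hz * t))
    a = 1.0 - b - c  # normalizes the gain at 0 Hz to 1
    return a, b, c


def resonate(samples, freq_hz, bandwidth_hz, sample_rate):
    """Run samples through one formant resonator."""
    a, b, c = resonator_coeffs(freq_hz, bandwidth_hz, sample_rate)
    y1 = y2 = 0.0
    out = []
    for x in samples:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def vowel(formants, pitch_hz=100, sample_rate=10000, num_samples=2000):
    """Pass an impulse train through a cascade of formant resonators."""
    period = sample_rate // pitch_hz
    signal = [1.0 if i % period == 0 else 0.0 for i in range(num_samples)]
    for freq_hz, bandwidth_hz in formants:
        signal = resonate(signal, freq_hz, bandwidth_hz, sample_rate)
    return signal


# Illustrative (frequency, bandwidth) pairs only, not measured data.
ah = vowel([(700, 60), (1200, 90), (2600, 150)])
ee = vowel([(300, 60), (2300, 90), (3000, 150)])
```

Changing the vowel, the pitch or the "speaker" is just a matter of changing
numbers, and the rules decide which numbers to use and when.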

> I was hoping I could find a page talking more about the listener 
> fatigue and detail on how a natural voice can contribute to this for 
> some people more so than the formant speech.

IMO, the problem with all comparisons (that I've seen) of synthesizers and
synthesizer technologies is there is no "baseline" possible.  You are
perpetually comparing apples to aardvarks.  Like trying on one model of a
left WALKING shoe and comparing to another model of a right HIKING shoe,
knowing that your left and right feet are not identical (and the shoes are
designed for entirely different uses!)

Talking to your *bank*, on the phone, usually has you interacting with a
delightfully intelligible voice -- despite the poor quality/limited
bandwidth of the communication medium.  Why aren't all speech synthesizers
so pleasant?

OTOH, how would (should?) that delightful "bank employee" speak:
    Aardvarks sloshing red popcorn under Chicago
Or, the infamous:  ghoti

I think a lot of the fatigue issue comes from over-exposure to the voice.
Think about your evening newscast (TV).  When a single talking-head is the
presenter, the stories all seem to run together... the broadcaster just
drones on and on.

OTOH, when there are a pair of presenters -- ideally a male and female --
then there is some deliberate variety in the presentation.  There's some
"novelty" for your brain to seize on as the speaker periodically changes.

OToOH, imagine those two presenters were alternating *words*!  The novelty
becomes too excessive -- too much work for the brain to sort out what is
being said.  "People don't talk that way".

Similarly, consider sitting through a long lecture, church service, etc.
Being entirely passive lets the brain shut down due to boredom/fatigue.

By extension, any technology that makes it easier for you to "humanize"
the output can, potentially, be more palatable to a listener.

I've written 5 synthesizers, now, and have found that none are truly "best"
unless you can clearly identify the listener and the source material.
I've played with letter-to-sound rules, prosody, the underlying sound
technology, etc.

I've had users swear that #2 was "clearly the easiest to understand"
while others would claim it was the worst, giving top honors to #4.

In some applications, there are clear winners - but, usually, because they
introduce some other important criteria that the remaining implementations
can't easily address.  (E.g., making voices for folks who have lost their
natural ability to speak)

<shrug>

> I also feel like how well the synth can perform at a high speaking 
> rate was another thing I saw.

I think a lot, there, depends on the rules put in place, not the "sound
technology" being used.

I always laugh when I hear a synthesizer speak a web address... I want to
"hurry it up" as real people aren't that pedantic/deliberate in their
speech:  duh-bull-you duh-bull-you duh-bull-you dot ...  It's almost like
the synthesizer is trying on new LIPS!

(c'mon, do you know anyone who is THAT deliberate and monotonous in their
speaking?)

It's possible to excise portions of prerecorded speech to speed up the
speaking rate without significantly impacting *pitch* (obviously, you can
make time run faster to speed things up but you then end up with Mickey
Mouse speech).  I'm not sure how extensively this is used in diphone
synthesis, though, as it requires "editing" the samples from the unit
database in real time.
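
A naive sketch of the "excise portions" idea, assuming plain overlap-add:
frames are copied from the input at widely spaced positions, laid down closer
together in the output, and crossfaded at the joins. Nothing is resampled, so
the waveform inside each frame keeps its original pitch. The frame and overlap
sizes and the function name are assumptions. Real systems use refinements
such as WSOLA or PSOLA, which line the splices up with the pitch periods to
avoid the warble this simple version produces.

```python
import math


def speed_up(samples, rate, frame=400, overlap=100):
    """Shorten `samples` by roughly `rate` (>= 1.0) without resampling.

    Frames of `frame` samples are read from the input every
    (frame - overlap) * rate samples and written to the output every
    (frame - overlap) samples, crossfading over `overlap` samples.
    Input that falls between frames is simply discarded.
    """
    hop_out = frame - overlap
    hop_in = int(hop_out * rate)
    out = []
    pos = 0
    while pos + frame <= len(samples):
        chunk = samples[pos:pos + frame]
        if not out:
            out.extend(chunk)
        else:
            # Crossfade the tail of the output into the head of the new frame.
            for i in range(overlap):
                w = i / overlap
                out[-overlap + i] = out[-overlap + i] * (1.0 - w) + chunk[i] * w
            out.extend(chunk[overlap:])
        pos += hop_in
    return out


# One second of a 200 Hz tone at an assumed 8 kHz sample rate.
tone = [math.sin(2.0 * math.pi * 200 * n / 8000) for n in range(8000)]
fast = speed_up(tone, 2.0)
print(len(tone), len(fast))  # 8000 4000
```

Played back at the same sample rate, `fast` lasts half as long but each frame
is still a 200 Hz tone. Compare that with dropping every other sample, which
also halves the length but doubles the pitch (Mickey Mouse).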

[Time is another aspect of synthesis that has to be considered.
How do you handle the case where the synthesizer doesn't, yet, have the full
utterance at its disposal in order to perform a complete analysis of intent,
impose inflection, etc.  E.g., you can infer a question from the initial "W"
word -- Who, What, Why, etc.  But, not always:  "Bob went home?" vs. "Bob
went home!"  This isn't as much of a problem for a screen reader where the
entire text is LIKELY available to the synthesizer.
But, not all speech apps are screen readers!  :> ]
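
A toy sketch of that timing problem, comparing what can be guessed from the
first word alone against what is known once the whole utterance has arrived.
The word list (extended beyond Who/What/Why), the labels and the function
names are assumptions, and real prosody rules are far more involved.

```python
QUESTION_WORDS = {"who", "what", "when", "where", "why", "how", "which"}


def early_guess(first_word):
    """Guess intonation from the first word, all a streaming synth may have."""
    return "question" if first_word.lower() in QUESTION_WORDS else "statement"


def final_answer(utterance):
    """Decide once the terminal punctuation is known."""
    text = utterance.strip()
    if text.endswith("?"):
        return "question"
    if text.endswith("!"):
        return "exclamation"
    return "statement"


for text in ["What time is it?", "Bob went home?", "Bob went home!"]:
    first = text.split()[0]
    print(f"{text!r}: early={early_guess(first)}, final={final_answer(text)}")
```

The early guess only matches for the first sentence. For both "Bob went
home" variants it says "statement", so a synthesizer that has already started
speaking has committed to the wrong contour by the time the punctuation shows
up.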

How much overhead do you want to "lose" to speech synthesis?
(overhead translates to cost)
_______________________________________________
Dectalk mailing list
Dectalk at bluegrasspals.com
https://bluegrasspals.com/mailman/listinfo/dectalk


