[DECtalk] Question About Formant vs Natural Speech

Sun Jun 19 11:19:14 EDT 2022

Yes, well, just take for example all possible combinations of all 26 letters
of the English alphabet, or all possible combinations of dictionary words.
It's a number so large you have to express it in scientific notation using
exponents. It's just an indescribably huge number!

-----Original Message-----
From: Dectalk <dectalk-bounces at bluegrasspals.com> On Behalf Of Don
Sent: Sunday, June 19, 2022 10:55
To: dectalk at bluegrasspals.com
Subject: Re: [DECtalk] Question About Formant vs Natural Speech

On 6/19/2022 7:23 AM, joshknnd1982 at gmail.com wrote:
> The units are probably the same as they always have been, very small 
> recordings of human speech broken down into phonemes or maybe diphones.
> Except there is a whole lot of them probably taking hundreds of 
> megabytes but probably compressed somehow to make them smaller.

If you recorded every phoneme ("unit") and tried to paste them together to
form speech, you would gag at the resulting speech quality.  Just like
trying to record every WORD and paste them together to form speech.

[The image of a ransom note made from letters cut from magazines and pasted
together is appropriate]

Diphone synthesis acknowledges that much intelligibility is embodied in the
*transitions* between phonemes, not the phonemes themselves.

So, instead of emitting:
    A  B  O  V  E
(these aren't real phonemes, just symbolic representations for discussion)
you would emit:
    _a  ab  bo  ov  ve  e_
This is better sounding because the joins now fall in the middle of each
phoneme, where the waveform is more "stable" than at the START/END of a
phoneme.  How you (as a human) make a particular sound is influenced by the
sounds around it.
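
[The phoneme-to-diphone step above can be sketched in a few lines of Python.
This is only an illustration using the same symbolic letters, not real
phonemes; the function name to_diphones and the "_" silence marker are
assumptions made for the example, not part of any actual synthesizer.]

```python
def to_diphones(phonemes, silence="_"):
    """Turn a sequence of phoneme symbols into overlapping diphone names.

    The utterance is padded with a silence symbol on both ends so that the
    first and last phonemes also get a transition (e.g. "_a" and "e_").
    """
    padded = [silence] + list(phonemes) + [silence]
    # Pair each symbol with its successor: (_, a), (a, b), (b, o), ...
    return [left + right for left, right in zip(padded, padded[1:])]


print("  ".join(to_diphones("above")))
```

[Note that every phoneme ends up shared between two neighbouring diphones,
which is why the unit you look up depends on the sounds on either side.]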

The problem now is you need ~ N^2 diphones instead of N phonemes.
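
[A rough sketch of that growth, assuming a purely illustrative inventory of
40 phonemes; the real count depends on the language and on how the
synthesizer's designer defines the phoneme set. N^2 is an upper bound, since
not every pair of phonemes actually occurs in a given language, hence the
"~". The function name is hypothetical.]

```python
def max_diphone_count(n_phonemes, include_silence=True):
    """Upper bound on the number of diphones for a given phoneme inventory.

    Every ordered pair of symbols is a potential diphone, so the bound is
    the square of the symbol count. Silence is counted as one extra symbol
    so that utterance-initial and utterance-final transitions are covered.
    """
    symbols = n_phonemes + (1 if include_silence else 0)
    return symbols * symbols


n = 40  # illustrative assumption, not a figure from any real synthesizer
print(n, "phonemes ->", max_diphone_count(n, include_silence=False), "diphones at most")
print(n, "phonemes + silence ->", max_diphone_count(n), "diphones at most")
```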

And you need to "select" which diphones to mash together to create an
utterance.  E.g., you need the "ab" diphone and the "bo" diphone, not just
the "b" phoneme.

[There's also some massaging of the waveforms to make them "fit better".
And, pitch adjustment can either be done in post processing *or* by storing
additional "versions" of each diphone (which makes the unit database larger
and the selection process more involved)]

Creating this database is a key part of the synthesizer's design.
And, relies on a human speaker for the original waveforms.
_______________________________________________
Dectalk mailing list
Dectalk at bluegrasspals.com
https://bluegrasspals.com/mailman/listinfo/dectalk