[DECtalk] Question About Formant vs Natural Speech

Sun Jun 19 10:55:18 EDT 2022

On 6/19/2022 7:23 AM, joshknnd1982 at gmail.com wrote:
> The units are probably the same as they always have been, very small
> recordings of human speech broken down into phonemes or maybe diphones.
> Except there is a whole lot of them probably taking hundreds of megabytes
> but probably compressed somehow to make them smaller.

If you recorded every phoneme ("unit") and tried to paste them together to
form speech, you would gag at the resulting speech quality.  It's just like
trying to record every WORD and paste those together to form speech.

[The image of a ransom note made from letters cut from magazines and
pasted together is appropriate]

Diphone synthesis acknowledges that much intelligibility is embodied in
the *transitions* between phonemes, not the phonemes themselves.

So, instead of emitting:
    A  B  O  V  E
(these aren't real phonemes, just symbolic representations for discussion)
you would emit:
    _a  ab  bo  ov  ve  e_
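This sounds better because each diphone runs from the middle of one
phoneme to the middle of the next.  The joins then land in the middle
of a phoneme, where the waveform is more "stable", rather than at the
START/END of a phoneme, where it is changing.  How you (as a human)
make a particular sound is influenced by the sounds around it.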
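
A minimal sketch of that conversion, using the same symbolic letters as
above rather than real phonemes.  The function name `to_diphones` and the
"_" silence marker are just illustrative choices, not anything taken from
a real synthesizer:

```python
def to_diphones(symbols, silence="_"):
    """Turn a sequence of phoneme symbols into overlapping diphone names.

    Silence is added at both ends so the first and last phonemes also
    get a transition (into and out of the utterance).
    """
    padded = [silence] + list(symbols) + [silence]
    # Pair each symbol with the one that follows it.
    return [left + right for left, right in zip(padded, padded[1:])]


print(to_diphones("above"))
# ['_a', 'ab', 'bo', 'ov', 've', 'e_']
```

Note that N symbols always yield N+1 diphones once the leading and
trailing silence are counted.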

The problem now is you need ~ N^2 diphones instead of N phonemes.
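
To put a rough number on that, here is a back-of-the-envelope count.  The
figure of 40 phonemes is only an assumed, illustrative inventory size, and
the result is an upper bound: in practice not every pair of sounds occurs
in a given language, so a real inventory would likely be smaller.

```python
def diphone_upper_bound(n_phonemes, include_silence=True):
    """Upper bound on the number of distinct diphones.

    Every ordered pair of units is counted.  When silence is treated as
    an extra unit, the meaningless silence-to-silence pair is excluded.
    """
    n = n_phonemes + (1 if include_silence else 0)
    total = n * n
    if include_silence:
        total -= 1  # drop the silence+silence pair
    return total


print(diphone_upper_bound(40, include_silence=False))  # 1600
print(diphone_upper_bound(40))                         # 1680
```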

And you need to "select" which diphones to mash together to create an
utterance.  E.g., you need the "ab" diphone and the "bo" diphone, not
just the "b" phoneme.

[There's also some massaging of the waveforms to make them "fit
better".  And, pitch adjustment can be done either in post-
processing *or* by storing additional "versions" of each diphone
(which makes the unit database larger and the selection process
more involved)]
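
As a sketch of the "stored versions" approach: suppose each diphone is
recorded at a few pitches, and the selector picks the version closest to
the pitch you want.  Everything here is hypothetical (the database layout,
the unit names, the pitch values); a real selector would likely weigh more
than just pitch.

```python
def select_units(diphones, database, target_pitch_hz):
    """Pick, for each diphone, the stored version nearest the target pitch.

    `database` maps a diphone name to a list of (pitch_hz, unit_id)
    tuples.  A KeyError is raised if a needed diphone was never recorded.
    """
    chosen = []
    for name in diphones:
        versions = database[name]
        # Smallest absolute pitch difference wins.
        _, unit_id = min(versions, key=lambda v: abs(v[0] - target_pitch_hz))
        chosen.append(unit_id)
    return chosen


# Toy database: two stored versions per diphone (made-up values).
toy_db = {
    "ab": [(100, "ab@100"), (140, "ab@140")],
    "bo": [(100, "bo@100"), (140, "bo@140")],
}

print(select_units(["ab", "bo"], toy_db, target_pitch_hz=130))
# ['ab@140', 'bo@140']
```

Each extra version per diphone multiplies the size of the database, which
is the trade-off against doing the pitch adjustment afterwards.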

Creating this database is a key part of the synthesizer's design, and
it relies on a human speaker for the original waveforms.