[DECtalk] FW: decTalk sapi5 installer

Don Text_to_Speech at GMX.com
Fri Jul 1 15:54:51 EDT 2022


On 7/1/2022 7:55 AM, Karen Lewellen wrote:
> Apple does need a version of the dectalk so that those who use the vintage
> voices for their screen reader have choices.
> If they can add the dreadful eloquence, they can  provide quality voices like
> Denis or even Paul.

To a zeroth-order approximation, one can get a feel for the
underlying technology in a synthesizer by examining the installed
binaries.  Formant-based synthesizers (DECtalk et al.) have a relatively
small footprint.  The largest component is usually the pronunciation
dictionary (which makes it easier to sort out HOW to convert a
particular sequence of glyphs to sounds); if you DON'T include one
in the binary, the next largest component is the set of
letter-to-sound rules.  The actual synthesizer is relatively small,
as it is just a collection of math routines that are invoked
several times (once for each formant).
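To make "a collection of math routines, once for each formant" concrete, here is a minimal sketch (in Python, purely illustrative -- not DECtalk's actual code) of the classic two-pole digital resonator; a cascade of these, one per formant, is essentially the whole "synthesizer" part:

```python
import math

def resonator_coeffs(freq_hz, bw_hz, sample_rate=10000):
    """Coefficients for a two-pole digital resonator (one formant):
    y[n] = a*x[n] + b*y[n-1] + c*y[n-2]
    """
    c = -math.exp(-2 * math.pi * bw_hz / sample_rate)
    b = 2 * math.exp(-math.pi * bw_hz / sample_rate) \
        * math.cos(2 * math.pi * freq_hz / sample_rate)
    a = 1 - b - c
    return a, b, c

def resonate(signal, freq_hz, bw_hz, sample_rate=10000):
    """Run a source signal through one formant resonator."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, sample_rate)
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out
```

Feed a glottal-pulse source through `resonate(...)` once per formant (500 Hz, 1500 Hz, 2500 Hz for a neutral vowel, say) and you have the core of the vocal tract model in a few dozen lines.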

Examining a binary with the strings(1) command will often show a large
number of hits, in alphabetical order: each of the words
in the dictionary!

This isn't always the case, as smarter implementations won't
store entire words -- that would be more costly in terms of space
and execution time.  For example, "about" and "above" have three
initial letters in common -- if the software has matched A, B, and O
but the next letter is not U, why would it want to backtrack and
check the A, B, and O again in case the word is "above"?  Just look for the V!
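That prefix-sharing idea is exactly what a trie gives you. A minimal sketch (the words and pronunciations here are placeholders, not entries from any real dictionary):

```python
class TrieNode:
    def __init__(self):
        self.children = {}        # letter -> TrieNode
        self.pronunciation = None # set when a whole word ends here

def insert(root, word, pronunciation):
    """Store a word; shared prefixes reuse existing nodes."""
    node = root
    for letter in word:
        node = node.children.setdefault(letter, TrieNode())
    node.pronunciation = pronunciation

def lookup(root, word):
    """Walk letter by letter; no backtracking ever needed."""
    node = root
    for letter in word:
        node = node.children.get(letter)
        if node is None:
            return None
    return node.pronunciation
```

After inserting "about" and "above", the A-B-O path exists exactly once; the lookup only has to decide between the U branch and the V branch at the fourth letter.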

A "voice" in a formant-based synthesizer is a SMALL collection of
numbers that define the characteristics of the vocal tract model.
(This assumes we are dealing with a single language; a voice for
a different language will have different glyph-to-sound rules
which, as above, can be large).

By contrast, a diphone synthesizer (which usually produces more
natural speech) has a large "inventory" of actual speech snippets
that are mashed together and "smoothed" to form continuous sounds.
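A toy illustration of the "mash together and smooth" step -- a linear crossfade over the seam between two snippets. Real diphone synthesizers do this pitch-synchronously and far more carefully; this only shows the idea:

```python
def crossfade(a, b, overlap):
    """Join two snippets (lists of samples), blending the last
    `overlap` samples of `a` with the first `overlap` of `b`
    so there is no audible click at the seam."""
    out = a[:len(a) - overlap]
    for i in range(overlap):
        t = i / overlap               # ramps 0.0 -> 1.0
        out.append(a[len(a) - overlap + i] * (1 - t) + b[i] * t)
    out.extend(b[overlap:])
    return out
```

The space cost is in the inventory, not the code: every diphone of the language has to be recorded and stored, which is why these binaries dwarf a formant synthesizer's.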

Look at the controls that you have (even if they are hidden under the hood)
for the voice.  Volume is a no-brainer -- not even related to synthesis
(except WITHIN a spoken utterance -- like for emphasis).  Pitch can usually
be adjusted for every synthesizer type.  Likewise, speaking rate.  Even
the Votrax had these.

My MITtalk work-alike allows my users to decide if the voice is that
of a young child, woman, man, etc. by adjusting things like "head size"
(which affects the dimensions of the vocal tract), pitch, rate, etc.
The more malleable the voice, the greater the likelihood that it is
driven by a model -- like formants.
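As a sketch of just how few numbers such a voice needs (the parameter names and the neutral formant values below are illustrative, not any real synthesizer's):

```python
def make_voice(head_size=1.0, pitch_hz=120, rate_wpm=180):
    """A 'voice' as a small bundle of model parameters.

    A larger head (longer vocal tract) lowers all formant
    frequencies roughly in proportion, so scaling one number
    reshapes the whole voice.
    """
    neutral_formants = [500, 1500, 2500]  # schwa-like, in Hz
    return {
        "pitch_hz": pitch_hz,
        "rate_wpm": rate_wpm,
        "formants": [f / head_size for f in neutral_formants],
    }

# A child vs. a man: two parameter bundles, switchable per word.
child = make_voice(head_size=0.8, pitch_hz=260)
man = make_voice(head_size=1.1, pitch_hz=110)
```

Since a "person" is just such a bundle, keeping 100 of them around costs a few kilobytes -- which is why the switch can happen from one word to the next.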

I can make a boy sound like a man or a man sound like a woman from one
word to the next.  I can potentially have 100 different "people"
speaking in a given session.  Doing this with other technologies becomes
costly.

I, for example, use this to give status and error messages in a different
voice so the user is made aware that the normal information flow is
being interrupted.

> singing is a novelty, some of us use dec-talk for actual work.

Yes, I've never understood the fascination with that.
The controls that make singing possible are side-effects of
being able to define a voice.  Or, impose prosody, stress/emphasis,
etc.

OTOH, I can recall configuring multiple Reading Machines to recite
a "round" of "Row, Row, Row Your Boat" -- utterly dreadful with
the Votrax!

More information about the Dectalk mailing list