[DECtalk] DECtalk's development, and old speech synthesizer recordings

Don Text_to_Speech at GMX.com
Sat Nov 5 01:15:37 EDT 2022


[Long, as usual -- for me  :>  But, it's hard to give you a feel
for some of the implementation issues otherwise]

On 11/4/2022 9:24 AM, Aksel Leo Christoffersen wrote:
> As stated, I haven't read that literature yet, but as I understood it,
> Perfect Paul was supposed to be a model of Klatt's own voice, but yes, it
> could make sense that it was a combination of research including several
> source voices.

My point was that if he used his own voice to extract the parameters
of "human speech", then any voices created with those *findings* would
tend to sound like him.

How many syllables in YOUR pronunciation of "mayonnaise"?
Brits have "aluminium" cans.  Do you eat SALmon or SAHmon?

Sort of like if you used YOUR hand to create a model of what a human hand
looked like, that model would resemble YOUR hand more than those of others!
The lengths of your fingers likely differ from mine -- as do the *ratios*
of each of your fingers to the others!  And, that EXTRA FINGER on your right
hand would be a dead giveaway!  :>

> Regarding your comments about implementation, are you saying that the method
> of implementation had a great influence on the sound? I have heard audio
> samples of the DECtalk DTC-01, both from actual recordings and from a MAME
> emulator, and they sound more or less the same to me. I use a DECtalk
> Express daily, and its "voice" sounds very similar to the early software
> DECtalk versions, such as 4.3. The DECtalk Express uses version 4.2CD.
> It was because of this that I found it interesting that the recordings from
> 1986 sound more like the DECtalk Express, or DECtalk 4.3, than the DECtalk
> DTC-01 from 1984, running either version 1.8 or 2.0. How much of the
> implementation could have changed in those 2 years?

Implementation means many things.

Let's stick to assuming the *algorithm* -- the basic steps -- remained
unchanged, and look at how those steps were "made real" (reified).

I'm assuming you aren't commenting about how individual words are
pronounced (like salmon, above) but, rather, the "character" of the voice.

In this application, speech is a "real-time" problem.  The goal
isn't just to make a waveform that, when played, SOUNDS like spoken
words but, rather, to make that waveform in no more time than it
takes to *play* it.  So the device is actually "speaking" to you.

By contrast, movies employ lots of special effects that may
require HOURS of computer time just to create SECONDS of
video.  But, they aren't creating the video AS it is being
watched so that is not a "real-time" problem!
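To put rough numbers on it: at, say, 10,000 output samples per
second (a typical rate for formant synthesizers of that era -- I
don't recall the DTC-01's exact figure), each sample must be
finished in 100 microseconds.  On a 10MHz processor, that's about
1,000 clock cycles for EVERY resonator, filter, and bit of
bookkeeping, combined.  Miss that budget and the speech stutters.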

The resources that you, as a developer, have available are:
- memory (how much, how it is organized, accessed, etc.)
- computational ability (the sorts of "math" you can do)
- speed (how many operations can you perform per unit of time)

If you come up with a technique that requires more memory
than you have available, too bad!

If you come up with a technique that requires all sorts of
complex math, too bad!

If you come up with a technique that requires a computer
that is faster than the one available, too bad!

You could, for example, store a giant dictionary with EVERY
word and its pronunciation.  You could opt to store that
pronunciation not phonetically but, rather, as a genuine
audio sample!  (takes even MORE space).

Unless, of course, you can't afford all that memory!
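Rough numbers, just for scale: a 20,000 word vocabulary, with half
a second of 8-bit audio at 10,000 samples/second per word, is
20,000 * 0.5 * 10,000 = 100,000,000 bytes -- about 100MB, hundreds
of times more ROM than a mid-80's appliance could carry!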

You could come up with a technique that requires you to
"massage" audio snippets to build a continuous waveform.
But, if that requires lots of computation or very high
speed operations to achieve the "real-time" aspect, then
you're also screwed.

[MITalk was developed on a minicomputer]

In a minicomputer, you tend to have lots of storage
because you can treat the disk drive as additional
storage beyond what is available in "RAM".

In a minicomputer, you tend to have extensive computational
capabilities because these general purpose machines were
often used for astronomy, business, modeling, etc. and
had to address ALL of those needs.

In a minicomputer, you tend to have a fair bit of
"performance" (throughput) -- because the folks spending
all that money for the minicomputer want it to do REAL
WORK for them; it's not just a hobbyist's toy!  They
can't afford to have a computer dedicated to "doing weekly
payroll"; it has to have other uses as well!

The DTC-01 (the first DECtalk unit) had sixteen 16KB ROMs to
store the program and any constants, tables, etc.  That's
effectively a 256KB "file".  Additionally, it had ten 8KB RAMs
for "working memory"... variables, etc.  A total of 80KB.
So, if this was a PC, it would have 336KB of memory (no disk).

The (main) processor was a 10MHz 68000 -- a 16-bit processor.
The 68000 has no floating point support, so a second processor
was included -- a Digital Signal Processor -- to handle the
more complex math, "quickly".

[It seems likely that Klatt would have relied on floating point
calculations to make his job easier.]

I.e., you have roughly the hardware available in an IBM XT
*if* it had a floating point accelerator.  Completely dedicated
to JUST making speech!  But, you can't change the program
in the DECtalk because the ROMs can't be rewritten like RAM!

[Move to a modern processor with gigabytes of RAM and terabytes
of disk and the solution space has more options!]

Math in a computer is always approximate.  There is no way
to store 1/3 exactly, for example.  Note that this is also true
in decimal:  1/3 = 0.33333333333333333...
Yet, you and I both know that *we* can multiply by 1/3
without problems!  We can add 1/3 to 1/3 and get 2/3 just
as easily!  etc.

[There are ways around this but they are "expensive" in terms
of time and effort]

The point, here, is that nothing is really EXACT inside the computer.

There are three main ways of storing numbers in a computer:
- integer (e.g., write down 5 digits but no decimal point!)
- floating point (e.g., write down 5 digits and ONE decimal point)
- fixed point (write 5 digits and REMEMBER where the decimal should be!)

Integer operations are typically very fast.  But, they are coarse.
You can't express 1.5, 72.00234, 902.872 or anything else that
isn't a "whole number".

Floating point operations are slow.  Largely, because the computer
keeps track of where the decimal point is, at all times.  So, *it*
knows that 37000. divided by 74.000 is 500.00 -- note that the decimal
point in the result is not where it was in either of the first two
arguments!  Magic!

This is wonderful because you can concentrate on the operations that you
want to perform without having to worry about that silly little detail!

But, in addition to being slow, it is also fraught with hidden perils
for the naive developer.  E.g., 37000. + 0.1000 is still 37000.  Even
if I add a thousand 0.1000's to that, I still end up at 37000.  So,
if I want to add a thousand 0.1000's to it, I must first multiply
the 0.1000 by 1000 to get 100.00 which I can then add to the 37000
and end up with 37100.
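You can watch the same absorption happen with ordinary 32-bit
floats (roughly 7 significant digits instead of the 5 in my
example).  A little C sketch, nothing more:

#include <stdio.h>

int main(void)
{
    float big = 1.0e8f;   /* float32 keeps ~7 significant digits */
    float sum = big;
    long  i;

    /* 0.25 is far below the rounding step near 1e8 (which is 8.0),
       so every addition rounds straight back -- the increments vanish. */
    for (i = 0; i < 1000000; i++)
        sum += 0.25f;

    /* Scale first, add once: the total survives. */
    float scaled = big + 0.25f * 1000000.0f;

    printf("naive:  %.1f\n", sum);      /* prints 100000000.0 */
    printf("scaled: %.1f\n", scaled);   /* prints 100250000.0 */
    return 0;
}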

The third approach puts an even greater burden on the developer
by forcing him to keep track of the decimal point (instead of
letting the computer do it).  Here, the developer uses integer
operations (fast!) but *interprets* the results as if they had
a decimal point in a fixed position (that can vary during the
course of the operation).

So, I can treat the price of a gallon of gasoline as an integer
that expresses TENTHS of cents (in the US, fuel is priced this
way; a gallon of gasoline costs $3.489 even though you can never
pay for just one gallon because we don't have tenth-pennies!)

I can treat the amount (volume) of fuel as another integer
that indicates hundredths of gallons.  So, 1612 hundredth-gallons
to fill my tank.

1612 hundredth-gallons * 3489 tenth-pennies = 5624268 -- a count
of THOUSANDTH-pennies (hundredths times tenths) -- which I know to
*interpret* as 5624 pennies or $56.24.  The result is just as
accurate -- but I (the developer) have to make sure I've kept
track of the form of each "value" in the computation.
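In C, the whole computation is just integer arithmetic -- the
*comments* carry the knowledge of where the decimal point sits
(a sketch of the idea, not anyone's actual code):

#include <stdio.h>

int main(void)
{
    /* Everything is a plain integer; the COMMENTS carry the scale. */
    long price  = 3489;      /* tenth-pennies per gallon:  $3.489 */
    long volume = 1612;      /* hundredth-gallons:         16.12  */

    long raw = price * volume;   /* 5624268, in thousandth-pennies */

    long pennies = raw / 1000;   /* 5624 -- *I* knew to divide by 1000 */

    printf("$%ld.%02ld\n", pennies / 100, pennies % 100);  /* $56.24 */
    return 0;
}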

[When I designed my DECtalk-workalike, I knew ahead of
time what the cost for requiring floating point capabilities
would be in terms of limiting my implementation choices.
So, I opted for this third approach knowing that I only
had to track the decimal points (actually BINARY points)
ONCE, during the development of the algorithm.  Thereafter,
their positions would be fixed and known to the software
so it could exploit the increased speed of integer operations]
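To make the binary-point idea concrete, here's a toy "Q15" sketch
in C -- purely illustrative, NOT DECtalk's (or my) actual format:

#include <stdio.h>
#include <stdint.h>

/* "Q15": a 16-bit integer read as a fraction with the binary point
   after the top bit, so 16384 means 0.5.  A Q15 * Q15 product is
   Q30; shifting right 15 bits restores Q15.  The shift amount is
   decided ONCE, when the code is written -- that's the whole trick. */
typedef int16_t q15;

static q15 q15_mul(q15 a, q15 b)
{
    return (q15)(((int32_t)a * b) >> 15);
}

int main(void)
{
    q15 half = 16384;                    /* 0.5  */
    printf("%d\n", q15_mul(half, half)); /* 8192, i.e. 0.25 */
    return 0;
}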

Additionally, digital filters can be implemented in different
ways with subtle differences in their performance and "artifacts".

[The vocal tract is modeled as a bunch of resonators and
filters]

Resonators, in particular, run the same data through the "loop"
repeatedly in a form of positive feedback.  Like humming in
a shower stall -- certain notes come out sounding incredibly
loud because they "resonate" in the physical space available
and *reinforce* themselves.

Choosing one form or another may be inconsequential when you have
floating point support available.  But, if you have to deal with
fixed point representations, one form may be considerably less
effective than the other.
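For the curious: Klatt's 1980 paper gives the resonator as the
difference equation y[n] = A*x[n] + B*y[n-1] + C*y[n-2], with A, B,
and C derived from a formant's center frequency and bandwidth.  A
floating point sketch (my transcription -- check the paper before
trusting the constants):

#include <math.h>
#include <stdio.h>

#define PI 3.14159265358979323846

/* One two-pole resonator: coefficients set from center frequency F
   and bandwidth BW (both Hz) at sample rate FS. */
typedef struct {
    double A, B, C;     /* coefficients      */
    double y1, y2;      /* y[n-1] and y[n-2] */
} resonator;

static void res_set(resonator *r, double f, double bw, double fs)
{
    double t = 1.0 / fs;
    r->C  = -exp(-2.0 * PI * bw * t);
    r->B  =  2.0 * exp(-PI * bw * t) * cos(2.0 * PI * f * t);
    r->A  =  1.0 - r->B - r->C;      /* unity gain at DC */
    r->y1 = r->y2 = 0.0;
}

static double res_step(resonator *r, double x)
{
    /* The B and C feedback terms are where the same data circulates
       over and over -- and where fixed-point rounding errors would
       recirculate right along with it. */
    double y = r->A * x + r->B * r->y1 + r->C * r->y2;
    r->y2 = r->y1;
    r->y1 = y;
    return y;
}

int main(void)
{
    resonator r;
    int n;

    res_set(&r, 500.0, 50.0, 10000.0);   /* a 500Hz formant at 10kHz */

    /* Hit it with a single impulse and watch it "ring". */
    for (n = 0; n < 8; n++)
        printf("%f\n", res_step(&r, n == 0 ? 1.0 : 0.0));
    return 0;
}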

How much Klatt knew about these issues is unknown.  IME, folks
working on a thesis are just trying to "get it done".  They
aren't concerned about the more *practical* issues (like
building a device for mass production) or "software maintenance"
or even whether it would model a toddler's voice as effectively as
an adult male.

Or, whether the same implementation would "scale" to running
at 400 words per minute.

How much the DEC implementors knew about *speech* is equally
unknown.  I suspect they were more concerned with figuring out
WHAT the product should do, look like, how it should work,
etc.  E.g., I don't recall MITalk having provisions to alter the
voice WHILE speaking.  I'm sure the "programmer" could specify
a set of parameters for subsequent speech but embedding commands
in the input text to do that was likely added to make it a
commercial product.  Likewise, adding support for "type through"
operation where the serial data coming into the device could pass
through to a computer terminal (both ways) -- so customers could
just INSERT it into their existing wiring instead of having to
add new capabilities to their computer systems.

How the product evolved would likely be the result of each
side/party coming to understand the issues that the other
party already understood -- and coming to an effective
compromise in the implementation.

This would, of necessity, be an incremental process.
"Let's get it to work before we start trying to improve it!"

And, of course, developers/maintainers are always tweaking
things -- sometimes without truly understanding the full
consequences of their tweaks!  What happens when an *8*
bit character is encountered with the high bit set (instead
of pure 7-bit ASCII)?  When fixing problem 1, can you resist
tweaking something else that you noticed along the way?
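A concrete (and hypothetical -- I've never seen DECtalk's sources)
example of that 8-bit hazard, in C:

/* A 128-entry table indexed by 7-bit ASCII -- hypothetical, of course. */
static char phoneme_table[128];

static char lookup(char c)
{
    /* On machines where 'char' is signed, a byte like 0xE9 becomes a
       NEGATIVE index, reading memory *before* the table.  Masking to
       7 bits is one defensive fix (silently mis-reading the character
       is another "tweak" someone might not think through!): */
    return phoneme_table[(unsigned char)c & 0x7F];
}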

You can always undo a change (in a subsequent release)
as you already know how it worked.  Coming up with new
changes is always a crap shoot.

