[DECtalk] DECTalk At High Speaking Rates

Mon Oct 31 07:17:52 EDT 2022

Hi,

Thanks for your input on this. I'll try to answer as best as I can.

To make it easier to respond, I've pasted your comments below and added my replies after each one.

It may be that the "parameter update rate" is getting in the way.

Briefly, there are a few dozen parameters that are adjusted to
form the various sounds (phonemes) and the *transitions* between
them.  The overall "trajectory" of each parameter is coded into
the algorithms.  But, the actual values of the parameters are
only periodically updated; in the original MITalk implementation,
these updates occurred about 200 times per second.  Note the
default speaking rate was about 200 words per minute.

If the parameter updates remain at the 200 per second rate regardless
of the speaking rate, then the synthesizer's model of the vocal tract
may be too sluggish to provide high fidelity to the resulting speech
waveform.
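
To put rough numbers on this, here is a minimal sketch of how many parameter updates land within an average word at different speaking rates. It assumes the update rate stays fixed at the 200 per second quoted above for MITalk (whether current DECTalk behaves this way is unverified), and it treats the 480 figure mentioned later in this message as words per minute, which is also an assumption. The function name is purely illustrative.

```python
def updates_per_word(update_rate_hz: float, words_per_minute: float) -> float:
    """Average number of parameter updates that fall within one word.

    Assumes a fixed update rate that does not scale with speaking rate.
    """
    # updates per minute divided by words per minute
    return update_rate_hz * 60.0 / words_per_minute


for wpm in (200, 300, 480):
    print(wpm, "wpm ->", updates_per_word(200, wpm), "updates per word")
```

Under these assumptions the average word gets 60 updates at 200 words per minute but only 25 at 480, which is the "sluggish" effect described above.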

Note, also, that the original synthesizer generated 10,000 samples
per second which were filtered, electronically, to the 5,000 Hz
bandwidth of most (male) speech.

[I run my synthesizer considerably faster as I process speech in
much the same way as I process music -- which needs a sample rate
in the 40,000 range.]
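
The link between the sample rates and the bandwidth figures above is the Nyquist limit: a sampled signal can only represent frequencies up to half its sample rate. A small sketch, with a hypothetical helper name:

```python
def nyquist_bandwidth_hz(sample_rate_hz: float) -> float:
    """Highest frequency representable at a given sample rate (Nyquist limit)."""
    return sample_rate_hz / 2.0


# Sample rates taken from the figures quoted above.
for rate in (10_000, 40_000):
    print(rate, "samples/s ->", nyquist_bandwidth_hz(rate), "Hz usable bandwidth")
```

So 10,000 samples per second allows at most the 5,000 Hz bandwidth mentioned, while a rate in the 40,000 range allows roughly 20,000 Hz.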

You might suggest to the folks working on the DT sources that they
explore increasing the sample rate and parameter update rates to
see if this has a noticeable impact on the quality of the speech
at higher speaking rates.

I'd be happy to reach out about this. This seems like it could be a good starting point.

This is all conjecture as I've not examined the DECTalk sources but, rather,
am commenting from published literature regarding its ORIGINAL design.

If I use Eloquence at a high speed it's much easier for me to follow that.
I'm used to how both synthesizers sound and so I don't think it's due to not
having used DECTalk enough.

But, the Eloquence voice inherently sounds different.  It could be that you
are more attuned to the audio artifacts in its speech -- even if its
speech is of a lesser (whatever that means) quality.  Your brain likely
fills in a lot of detail that your ears miss.

If you really want to "objectively" test intelligibility, you need to
use material that is unpredictable -- so your brain doesn't fill in
words that your ears missed.

I use a modified rhyme test to really trip up listeners as all of the
words sound very similar -- and none are related to the others conceptually.
  fun, sun, bun, run, nun, gun
"Which of these words -- first, second, third, etc -- holds a hotdog?"
  jaw, thaw, law, raw, saw, paw
"Which of these words is used to cut wood?"

Running Eloquence fast, I was able to figure these out. I'll have to try it with DECTalk.
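
For anyone who wants to repeat this kind of test, here is a rough sketch of how the trials could be randomized and scored. It is only an illustration of the idea described above, not the actual procedure used. The function names are hypothetical, and the target words ("bun", "saw") are my reading of the two example questions.

```python
import random

# Word sets from the examples above; the targets are assumed answers.
TRIALS = [
    (["fun", "sun", "bun", "run", "nun", "gun"], "bun"),
    (["jaw", "thaw", "law", "raw", "saw", "paw"], "saw"),
]


def make_trial(words, target, rng):
    """Shuffle the words; return the spoken order and the target's 1-based position."""
    order = list(words)
    rng.shuffle(order)
    return order, order.index(target) + 1


def score(responses, answers):
    """Fraction of trials in which the listener named the correct position."""
    correct = sum(r == a for r, a in zip(responses, answers))
    return correct / len(answers)


rng = random.Random(0)  # fixed seed so a session can be reproduced
for words, target in TRIALS:
    order, answer = make_trial(words, target, rng)
    print(", ".join(order), "-> correct position:", answer)
```

Shuffling the order on every trial keeps the listener from memorizing positions, so the score reflects what was actually heard at a given speaking rate.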

For those who use DECTalk with a screen reader, do you use it fast, say,
above 470-480, or do you use it slower?
And generally speaking, for those who use DECTalk in general, such as with
the Speak windows, is fast speech simply not a concern?

I can't comment on high speaking rates as my usage is for short messages.
But, I do want high comprehension on a single pass as repeating a message
is a costly exercise for the listener.

I feel that if DECTalk were to be used widely in the AT space that it could
really do a better job at high speeds.
And I'm not saying Eloquence is perfect either--it definitely isn't, but
it's far more intelligible for me at high speeds than DECTalk.

Flip that comment around.  At *low* speeds, how would you rate its
intelligibility?

At low speaking rates, both DECTalk and Eloquence seem to be intelligible. Is there something I'm missing with this one?

How would one of your colleagues NOT accustomed
to listening to it answer that question?  I.e., how much have you
been trained vs. objectively evaluating the products in each
situation?

Several people not accustomed to speech synthesis seem to be able to follow at least Eloquence at lower speeds, but I haven't investigated this as much with DECTalk.

Does this answer your questions?

I'd be happy to clarify anything else that might come up too.

Thanks again for your input,

Brandon Tyson

On Oct 29, 2022, at 10:35 PM, Don <Text_to_Speech at gmx.com> wrote:

It may be that the "parameter update rate" is getting in the way.

Briefly, there are a few dozen parameters that are adjusted to
form the various sounds (phonemes) and the *transitions* between
them.  The overall "trajectory" of each parameter is coded into
the algorithms.  But, the actual values of the parameters are
only periodically updated; in the original MITalk implementation,
these updates occurred about 200 times per second.  Note the
default speaking rate was about 200 words per minute.

If the parameter updates remain at the 200 per second rate regardless
of the speaking rate, then the synthesizer's model of the vocal tract
may be too sluggish to provide high fidelity to the resulting speech
waveform.

Note, also, that the original synthesizer generated 10,000 samples
per second which were filtered, electronically, to the 5,000 Hz
bandwidth of most (male) speech.

[I run my synthesizer considerably faster as I process speech in
much the same way as I process music -- which needs a sample rate
in the 40,000 range.]

You might suggest to the folks working on the DT sources that they
explore increasing the sample rate and parameter update rates to
see if this has a noticeable impact on the quality of the speech
at higher speaking rates.

This is all conjecture as I've not examined the DECTalk sources but, rather,
am commenting from published literature regarding its ORIGINAL design.

If I use Eloquence at a high speed it's much easier for me to follow that.
I'm used to how both synthesizers sound and so I don't think it's due to not
having used DECTalk enough.

But, the Eloquence voice inherently sounds different.  It could be that you
are more attuned to the audio artifacts in its speech -- even if its
speech is of a lesser (whatever that means) quality.  Your brain likely
fills in a lot of detail that your ears miss.

If you really want to "objectively" test intelligibility, you need to
use material that is unpredictable -- so your brain doesn't fill in
words that your ears missed.

I use a modified rhyme test to really trip up listeners as all of the
words sound very similar -- and none are related to the others conceptually.
  fun, sun, bun, run, nun, gun
"Which of these words -- first, second, third, etc -- holds a hotdog?"
  jaw, thaw, law, raw, saw, paw
"Which of these words is used to cut wood?"

For those who use DECTalk with a screen reader, do you use it fast, say,
above 470-480, or do you use it slower?
And generally speaking, for those who use DECTalk in general, such as with
the Speak windows, is fast speech simply not a concern?

I can't comment on high speaking rates as my usage is for short messages.
But, I do want high comprehension on a single pass as repeating a message
is a costly exercise for the listener.

I feel that if DECTalk were to be used widely in the AT space that it could
really do a better job at high speeds.
And I'm not saying Eloquence is perfect either--it definitely isn't, but
it's far more intelligible for me at high speeds than DECTalk.

Flip that comment around.  At *low* speeds, how would you rate its
intelligibility?  How would one of your colleagues NOT accustomed
to listening to it answer that question?  I.e., how much have you
been trained vs. objectively evaluating the products in each
situation?