[DECtalk] Question About Formant vs Natural Speech

Don Text_to_Speech at GMX.com
Sun Jun 19 02:15:37 EDT 2022


On 6/18/2022 9:55 PM, Brandon Tyson wrote:
> Several years ago I thought I remember reading an article that talked
> about comparisons between natural text-to-speech voices and formant
> ones, and their advantages and disadvantages.

What do you mean by "natural TTS voices"?  Diphone synthesis?
Limited-vocabulary solutions?  Cut-and-paste solutions?

> If I remember right, I believe it covered aspects like the ability to
> understand it at high speeds and something else that I thought I read
> was called listener fatigue or something similar, suggesting that some
> people can have more difficulty listening to a concatenated voice for
> a prolonged period of time over a computer generated voice that uses
> no recordings of human speech.

AFAICT, all synthetic speech creates listener fatigue.  The question
is how long it can be tolerated and how much "lost comprehension"
comes along with that fatigue.

Cutting-and-pasting words/word fragments/diphones together can give
more natural SOUNDING speech, depending on the size of the unit database.

But cutting and pasting speech elements together isn't, by itself,
a panacea.  One problem with such solutions is that you still need
to impose a "naturalness" contour on the result -- prosody, cadence,
breath groups, etc.
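
To make that concrete, here is a rough Python sketch of the two separate
jobs involved: joining recorded units, and deciding what pitch each unit
*should* have across a breath group.  It is illustrative only and is not
how DECtalk or any particular product works.  The names `crossfade_concat`
and `declination_contour` are made up for this example, units are assumed
to be non-empty lists of float samples, and a straight-line pitch fall is
assumed as the simplest possible contour.

```python
def crossfade_concat(units, overlap):
    """Join sample lists, blending up to `overlap` samples at each seam.

    Assumes `units` is a non-empty list of lists of float samples.
    """
    out = list(units[0])
    for unit in units[1:]:
        # Never blend more samples than either side actually has.
        n = min(overlap, len(out), len(unit))
        base = len(out) - n
        for i in range(n):
            w = (i + 1) / (n + 1)  # weight of the incoming unit
            out[base + i] = (1 - w) * out[base + i] + w * unit[i]
        out.extend(unit[n:])
    return out


def declination_contour(n_units, start_hz, end_hz):
    """Target pitch per unit, falling linearly across one breath group.

    Assumption: a straight-line fall; real prosody models are richer.
    """
    if n_units == 1:
        return [start_hz]
    step = (end_hz - start_hz) / (n_units - 1)
    return [start_hz + i * step for i in range(n_units)]


if __name__ == "__main__":
    joined = crossfade_concat([[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]], overlap=2)
    print(len(joined))
    print(declination_contour(3, 120.0, 100.0))
```

The crossfade only hides the seams.  The contour is just a list of
targets; actually bending each recorded unit to hit its target pitch
and duration is the hard part, and is not shown here.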

And, of course, the input material (the "text") plays a role in just
how natural the result can be made in terms of listener acceptance.
E.g., text written with shorter sentences, phrase groups, etc. is
easier to consume than text with long, complex sentence structures.

> I was wondering if there are any articles or pages that anyone knows
> of that could point me in the right direction by chance?

What specifically are you looking to learn/conclude from your research?

> Thank you very much for your time and I look forward to any assistance
> you are willing to provide.


