
SimpleSpeech: Towards Simple and Efficient Speech Synthesis with Scalar Latent Transformer Diffusion Models

Dongchao Yang, Dingdong Wang, Haohan Guo, Xueyuan Chen, Xixin Wu, Helen Meng
CUHK
Accepted at Interspeech 2024
Code is open-sourced at https://github.com/yangdongchao/SimpleSpeech

Introduction

In this study, we propose a simple and efficient Non-Autoregressive (NAR) text-to-speech (TTS) system based on diffusion, named SimpleSpeech. Its simplicity shows in three aspects: (1) SimpleSpeech can be trained on a speech-only dataset, without any alignment information; (2) SimpleSpeech takes plain text as input directly and generates speech in an NAR way; (3) SimpleSpeech models speech in a finite and compact latent space, which alleviates the modeling difficulty of diffusion. More specifically, we propose a novel speech codec model (SQ-Codec) based on scalar quantization. SQ-Codec effectively maps the complex speech signal into a finite and compact latent space, which we name the scalar latent space. Benefiting from SQ-Codec, we apply a novel transformer diffusion model in the scalar latent space of SQ-Codec. We train SimpleSpeech on 4k hours of speech-only data, and it shows natural prosody and voice cloning ability. Compared with previous large-scale TTS models, e.g. VALL-EX and Pheme TTS, SimpleSpeech presents significant improvements in speech quality and generation speed while achieving similar voice cloning ability with less training data.
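As a rough illustration of the scalar quantization idea behind SQ-Codec, the sketch below squashes each latent value into [-1, 1] and rounds it to a small set of evenly spaced levels, using a straight-through estimator so gradients still reach the encoder. The level count and the tanh squashing are illustrative assumptions, not the exact SQ-Codec formulation.

    import torch

    def scalar_quantize(h: torch.Tensor, num_levels: int = 9) -> torch.Tensor:
        # Squash latents into [-1, 1], then snap each value to one of
        # `num_levels` evenly spaced scalar levels.
        s = (num_levels - 1) / 2
        h = torch.tanh(h)
        q = torch.round(h * s) / s
        # Straight-through estimator: the forward pass returns the quantized
        # values, while the backward pass treats the rounding as identity.
        return h + (q - h).detach()

Because every latent value is confined to a small, finite set, the diffusion model only has to cover a bounded space, which is the finite and compact property mentioned above.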

Overview

The overview of SimpleSpeech is shown in the following figure. Below, we present samples generated by our proposed method.

Zero-shot TTS.

In the following, we first show some cases from the LibriTTS test-clean set. We compare with VALL-EX, Pheme TTS, and XTTS.

Content (the transcription of the target audio) | Prompt | VALL-EX | Pheme TTS | XTTS | Ours | GT Speech
You see, sir, these sharks are badly designed.
I cannot believe such was the case.
She felt the force of the objections.
She can’t get it out of her head, even after fifty years.
And he had certainly neglected the Princess a little.
If we could only get him out of the way we might succeed better.
The sum required was offered with such delicacy that it could not be declined.
The ray from his lantern swung about the room for a moment, then he switched on the electric light.
I don’t suppose any one else can find hidden worms that way
But at that moment the voice of the stranger was heard from the window
But the young man was there in presence; and John’s will carried the day.

Compared with E3TTS.

Content (the transcription of the target audio) | Prompt | E3TTS | Ours
The small lamp in front of the icons was the only light left in the room.
Come up on the bank and learn to perch, as we birds do.
“I don’t think so,” replied Tom.
“It’s impossible,” he said, “it’s no good.
They were glancing about with eager eyes.

Compared with NaturalSpeech 2.

Content (the transcription of the target audio) | Prompt | NS2 | Ours
All the furniture belonged to other times
Heaven, a good place to be raised.
No thanks, I am glad to give you such easy happiness.
Whatever appeal to her sense of beauty was straightway transferred to paper or canvas.
John Wesley Kombash, Jacob Taylor, and Thomas Edward Skinner
You will allow me to suggest, said he, that is a matter of opinion
It is this that is of interest to theory of knowledge

Compared with AudioBox.

We use the official AudioBox demo (https://audiobox.metademolab.com/maker) to generate speech with its default speaker. Please ignore the speaker timbre and focus only on the speech quality.

Content (the transcription of the target audio) | AudioBox | Ours
In the following, I will give you a question and the corresponding emotion
You are requested to provide a comforting response based on the emotion.
with the goal of training models that help people solve problems that require real world interaction.

Compared with UniAudio.

Content (the transcription of the target audio) | Prompt | UniAudio | Ours
It is 16 years since John Berkson died.
That is one reason you are ojo the unlucky said the woman in sympathetic tone.

Compared with NaturalSpeech 3.

Content (the transcription of the target audio) | Prompt | NaturalSpeech 3 | Ours
It is this that is of interest to theory of knowledge.
For, like as not, they must have thought him a prince when they saw his fine cap.
They think you’re proud because you’ve been away to school or something.
For the past ten years, Conseil had gone with me wherever science beckoned.

Audio Codec Reconstruction comparison

Original Speech | DAC | HiFi-Codec | Encodec | Ours

Ablation study (U-Net vs. Transformer)

Content (the transcription of the target audio) | U-Net | Transformer
You see, sir, these sharks are badly designed.
I cannot believe such was the case.
She felt the force of the objections.
She can’t get it out of her head, even after fifty years.

Ablation study (Cross-attention conditioning vs. in-context conditioning)

Content (the transcription of the target audio) | Cross-attention | In-context
You see, sir, these sharks are badly designed.
I cannot believe such was the case.
She felt the force of the objections.
She can’t get it out of her head, even after fifty years.

Ablation study (VAE vs. SQ)

Content (the transcription of the target audio) | VAE | SQ
If we could only get him out of the way we might succeed better.
The sum required was offered with such delicacy that it could not be declined.
That is the landslide which I predicted
I want you to go out now, she said, I have no stamps.

Ablation study (400h vs. 4000h training data)

Content (the transcription of the target audio) | 400h | 4000h
You see, sir, these sharks are badly designed.
I cannot believe such was the case.
She felt the force of the objections.
She can’t get it out of her head, even after fifty years.

Conclusion

In this work, we try to answer the question of whether researchers from the academic community can rapidly develop a large TTS model with a few GPUs and limited data resources. Our study demonstrates the feasibility of constructing a simple and efficient TTS model. The proposed SQ-Codec is straightforward to train, requiring no special techniques. Similarly, the scalar latent diffusion model can be easily developed and trained, leveraging the LLM structure and the standard diffusion training strategy.
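For concreteness, the sketch below shows the kind of training step such a latent diffusion model uses: sample a timestep, noise the clean SQ-Codec latents, and train the transformer to predict the added noise. The epsilon-prediction objective, the noise schedule, and the conditioning interface are illustrative assumptions, not the exact SimpleSpeech recipe.

    import torch
    import torch.nn.functional as F

    def diffusion_training_step(model, z0, text_emb, alphas_cumprod):
        # z0: clean SQ-Codec latents of shape (B, T, D); text_emb: text condition.
        B = z0.size(0)
        t = torch.randint(0, alphas_cumprod.size(0), (B,), device=z0.device)
        a_bar = alphas_cumprod[t].view(B, 1, 1)
        noise = torch.randn_like(z0)
        z_t = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * noise  # forward noising
        eps_hat = model(z_t, t, text_emb)  # transformer denoiser predicts the noise
        return F.mse_loss(eps_hat, noise)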

Research into large TTS models began with Tortoise TTS and VALL-E last year. Within a year, numerous related works have demonstrated impressive performance. Our findings indicate that: (1) a high-quality tokenizer is fundamental for generating high-quality speech; (2) speaker timbre disentanglement is crucial for enhancing voice cloning capabilities, potentially improving similarity scores. Given that advanced model structures are easily accessible on GitHub, data quality and the number of GPUs emerge as the third critical factor for the success of large TTS models.

Limitations

Although SimpleSpeech demonstrates the ability to clone voices, we did not scale up the training data due to GPU resource limitations. Furthermore, we have not designed speaker disentanglement into the codec model to enhance voice cloning capabilities. Regarding limited comparisons: (1) we have not compared against all related large TTS models because not all of them are openly available. We respect all previous works, even though some are not mentioned in our paper due to space constraints. Additionally, we do not claim that SimpleSpeech is superior to any previous model. (2) The evaluation is not comprehensive; we only conducted Mean Opinion Score (MOS) and Similarity Mean Opinion Score (SMOS) evaluations. More extensive evaluations are necessary to assess prosody and naturalness accurately.

The prompt for ChatGPT

A simple example of using the in-context learning ability of GPT to estimate the duration of a text.

import openai  # legacy OpenAI SDK (openai < 1.0), which exposes ChatCompletion

caption = "The small lamp in front of the icons was the only light left in the room."  # text whose duration we estimate

prompt = f"""I want to predict the duration of speech based on the content, and you need to give me the result in the following format:
Question: Three members of this shift separately took this opportunity to visit the Cellar Coffee House.
Answer: It includes 15 words, it may cost 5 to 6 seconds.
Question: This Allotment Division will consider all of the recommendations submitted to it.
Answer: It includes 12 words, it may cost 4 to 5 seconds.
Question: has been far less than in any previous, comparable period.
Answer: It includes 10 words, it may cost 3 to 4 seconds.
Question: They were glancing about with eager eyes.
Answer: It includes 7 words, it may cost 2 to 3 seconds.
Question: I do not think so.
Answer: It includes 5 words, it may cost 1 to 2 seconds.
You should consider the pronunciation of each word. Some words may need more time to pronounce. In general, if a word includes more letters, it costs more time to read.
In summary, you should know how many words are in the sentence, then consider how long it will cost to read it.
Question: {caption}
Answer:"""

completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",  # the exact model is not specified here; any chat model works
    messages=[{"role": "user", "content": prompt}],
)
print(completion.choices[0].message.content)  # e.g. "It includes 16 words, it may cost 5 to 6 seconds."
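Note that the snippet above uses the legacy openai Python SDK (versions before 1.0); with openai >= 1.0, the equivalent call is client.chat.completions.create(...) on an openai.OpenAI() client. The estimated duration matters because a NAR model must fix the length of the output latent sequence before generation begins.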