Cross-lingual text-to-speech using multi-task learning and speaker classifier joint training

Authors: Jingzhou Yang and Lei He

Abstract:
In cross-lingual speech synthesis, speech in various languages can be synthesized for a monoglot speaker. Normally, only data from monoglot speakers are available for model training, so the speaker similarity between the synthesized cross-lingual speech and the native-language recordings is relatively low. Building on the multilingual transformer text-to-speech model, this paper studies a multi-task learning framework to improve the cross-lingual speaker similarity. To further improve the speaker similarity, joint training with a speaker classifier is proposed. However, introducing such joint training disrupts the parallel training mechanism of the transformer. To alleviate this problem, a scheme similar to parallel scheduled sampling is proposed to train the transformer model in parallel. With multi-task learning and speaker classifier joint training, the cross-lingual speaker similarity is consistently improved both for speakers seen in the training set and for unseen speakers. The experiments also show the feasibility of the x-vector cosine distance as an objective measure of speaker similarity.
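
As an illustration of the objective measure mentioned in the abstract, the sketch below computes a cosine distance between two speaker embeddings. It assumes the common definition (one minus cosine similarity) and uses plain Python lists as stand-ins for x-vectors; the function name `cosine_distance` and the toy vectors are hypothetical and are not taken from the paper, which may define or scale the measure differently.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors.

    Assumption: "cosine distance" means 1 - cos(a, b). A smaller value
    indicates that the two embeddings point in a more similar direction.
    """
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return 1.0 - dot / (norm_a * norm_b)


# Toy stand-ins for x-vectors (real x-vectors have hundreds of dimensions).
recording_embedding = [1.0, 0.0]
synthesized_embedding = [0.0, 1.0]
distance = cosine_distance(recording_embedding, synthesized_embedding)
```

In practice, the two inputs would be embeddings extracted by a speaker-verification model from a native-language recording and from a synthesized cross-lingual sample of the same speaker.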

Contents

The zh-CN speaker in the training set
The en-US speaker in the training set
The unseen speaker

The training corpus comprises around 700 hours of professional recordings from 14 language locales. Each locale has at least 3 speakers.

In the comparison experiments, "Baseline" is the multilingual transformer model. "+MTL" denotes the multi-task learning (MTL) system, which extends the baseline system with additional classification tasks. "+MTL+JointSpk" builds on the MTL system and introduces an additional x-vector system that is trained jointly with the MTL multilingual system.

The zh-CN speaker in the training set

The zh-CN recordings

Text: 杨子听着听着,心中忽然一惊,此人正是当朝尚书严嵩。	Text: 见杰克抬手急忙向左一闪,避开了这一枪。	Text: 快到海边了,锦龙就地一滚,滚出一条河道来。

The en-US cross-lingual samples

	Text: He deserves to die, and he deserves to die painfully.	Text: George Balanchine's charming ballet to orchestrations of Gershwin songs.	Text: There is no way of knowing how many of them have legitimate gripes.
Baseline
+MTL
+MTL+JointSpk

The de-DE cross-lingual samples

	Text: Erwartet uns also ein flotter Dreier im Stil von "Liebe Sünde"?	Text: Der Volkskongreß muß beide ratifizieren.	Text: Dies sei von der Regierungskoalition politisch nicht gewollt.
Baseline
+MTL
+MTL+JointSpk

The en-US speaker in the training set

The en-US recordings

Text: That attack killed three men: a Romanian, a Chinese and an Israeli.	Text: I've enjoyed her morphing into responsible, research girl.	Text: They reviewed the day's events.

The zh-CN cross-lingual samples

	Text: 那眼神中,满是羡慕,还有坚强。	Text: 饭店的具体位置已显示在下面的地图中。	Text: 我想,这正是爱国主义得以源远流长的内在动力。
Baseline
+MTL
+MTL+JointSpk

The unseen speaker

There are around 9 minutes of data from the new zh-CN speaker.

The zh-CN recordings

	Text: 好吧!我陪你玩。先吃饭再说。	Text: 妈妈,葡萄还没成熟呢,我们不用这么着急赶路啊!	Text: 你干吗发这么大的脾气?
Recordings

The en-US cross-lingual samples

	Text: Planning is a continuous process throughout the audit.	Text: Although background research is out of fashion these days, we boldly did a Nexis search.
Baseline
+MTL
+MTL+JointSpk