Cross-lingual text-to-speech using multi-task learning and speaker classifier joint training

Authors: Jingzhou Yang and Lei He

Cross-lingual speech synthesis aims to synthesize speech in various languages for a monoglot speaker. Normally, only data from monoglot speakers are available for model training, so the speaker similarity between the synthesized cross-lingual speech and the native-language recordings is relatively low. Building on a multilingual transformer text-to-speech model, this paper studies a multi-task learning framework to improve cross-lingual speaker similarity. To improve speaker similarity further, joint training with a speaker classifier is proposed. However, introducing such joint training disrupts the parallel training mechanism of the transformer. To alleviate this problem, a scheme similar to parallel scheduled sampling is proposed so that the transformer model can still be trained in parallel. With multi-task learning and speaker classifier joint training, cross-lingual speaker similarity is consistently improved both for speakers seen in the training set and for unseen speakers. The experiments also show the feasibility of x-vector cosine distance as an objective measure of speaker similarity.
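The objective measure mentioned in the abstract compares two speaker embeddings (x-vectors). The following is a minimal sketch of such a comparison, not the authors' evaluation code. It assumes the x-vectors have already been extracted by some external system, and it assumes the common convention that cosine distance is one minus cosine similarity; the paper may use a different convention. The function name `cosine_distance` and the toy vectors are hypothetical.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity between two embedding vectors.

    Assumption: `a` and `b` are x-vectors of equal dimension that were
    extracted elsewhere (for example, one from a synthesized utterance
    and one from a native recording of the same speaker).
    """
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return 1.0 - dot / (norm_a * norm_b)


# Toy 2-dimensional vectors, purely illustrative (real x-vectors are
# typically several hundred dimensions).
same_direction = cosine_distance([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])
```

Under this convention, a smaller distance means the two embeddings point in more similar directions, which is read as higher speaker similarity.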




The training corpus comprises around 700 hours of professional recordings from 14 language locales. Each locale has at least 3 speakers.

In the comparison experiments, "Baseline" is the multilingual transformer model. "+MTL" denotes the multi-task learning (MTL) system, which extends the baseline system with additional classification tasks. "+MTL+JointSpk" builds on the MTL system by adding an x-vector system that is trained jointly with the MTL multilingual system.
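As a rough illustration of how auxiliary classification tasks can be attached to a primary training objective, the sketch below combines a primary loss with weighted cross-entropy terms. It is a generic multi-task formulation, not the authors' implementation: the task name `"speaker"`, the weights, and the function names are all hypothetical, and this page does not state which classification tasks or weights the "+MTL" system uses.

```python
import math
from typing import Dict, Sequence


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Cross-entropy of one example, computed with a stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def multitask_loss(
    primary_loss: float,
    aux_logits: Dict[str, Sequence[float]],
    aux_targets: Dict[str, int],
    aux_weights: Dict[str, float],
) -> float:
    """Primary loss plus a weighted sum of auxiliary classification losses.

    Assumption: each auxiliary task is a single-label classification
    problem, and its weight is a hand-chosen hyperparameter.
    """
    total = primary_loss
    for name, logits in aux_logits.items():
        total += aux_weights[name] * cross_entropy(logits, aux_targets[name])
    return total


# Hypothetical example: one auxiliary task with two classes.
loss = multitask_loss(
    primary_loss=1.0,
    aux_logits={"speaker": [0.0, 0.0]},
    aux_targets={"speaker": 0},
    aux_weights={"speaker": 0.5},
)
```

In a real system the primary term would be the acoustic reconstruction loss of the text-to-speech model and the gradients of all terms would flow into shared layers; the scalar arithmetic here only shows how the terms are combined.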

The zh-CN speaker in the training set

The zh-CN recordings

Text: 杨子听着听着,心中忽然一惊,此人正是当朝尚书严嵩。
Text: 见杰克抬手急忙向左一闪,避开了这一枪。
Text: 快到海边了,锦龙就地一滚,滚出一条河道来。

The en-US cross-lingual samples

Text: He deserves to die, and he deserves to die painfully.
Text: George Balanchine's charming ballet to orchestrations of Gershwin songs.
Text: There is no way of knowing how many of them have legitimate gripes.

The de-DE cross-lingual samples

Text: Erwartet uns also ein flotter Dreier im Stil von "Liebe Sünde"?
Text: Der Volkskongreß muß beide ratifizieren.
Text: Dies sei von der Regierungskoalition politisch nicht gewollt.


The en-US speaker in the training set

The en-US recordings

Text: That attack killed three men: a Romanian, a Chinese and an Israeli.
Text: I've enjoyed her morphing into responsible, research girl.
Text: They reviewed the day's events.

The zh-CN cross-lingual samples

Text: 那眼神中,满是羡慕,还有坚强。
Text: 饭店的具体位置已显示在下面的地图中。
Text: 我想,这正是爱国主义得以源远流长的内在动力。


The unseen speaker

There are around 9 minutes of data from the new zh-CN speaker.

The zh-CN recordings

Text: 好吧!我陪你玩。先吃饭再说。
Text: 妈妈,葡萄还没成熟呢,我们不用这么着急赶路啊!
Text: 你干吗发这么大的脾气?

The en-US cross-lingual samples

Text: Planning is a continuous process throughout the audit.
Text: Although background research is out of fashion these days, we boldly did a Nexis search.