Cross-lingual text-to-speech using multi-task learning and speaker classifier joint training
Authors: Jingzhou Yang and Lei He
Abstract: In cross-lingual speech synthesis, speech in various languages can be synthesized for a monoglot speaker. Normally, only the data of monoglot speakers are available for model training, so the speaker similarity between the synthesized cross-lingual speech and the speaker's native-language recordings is relatively low. Based on the multilingual transformer text-to-speech model, this paper studies a multi-task learning framework to improve cross-lingual speaker similarity. To improve the speaker similarity further, joint training with a speaker classifier is proposed. However, introducing such joint training disrupts the parallel training mechanism of the transformer. To alleviate this problem, a scheme similar to parallel scheduled sampling is proposed so that the transformer can still be trained in parallel. With multi-task learning and speaker classifier joint training, cross-lingual speaker similarity is consistently improved for both speakers seen and unseen in the training set. The experiments also show the feasibility of the x-vector cosine distance as an objective measure of speaker similarity.
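The x-vector cosine distance mentioned above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name and the toy embeddings are assumptions, and in practice the x-vectors would come from a trained speaker-embedding network.

```python
import math


def xvector_cosine_distance(xvec_a, xvec_b):
    """Cosine distance (1 - cosine similarity) between two speaker
    embeddings; smaller values indicate more similar speaker identities.

    Illustrative sketch only -- real x-vectors are extracted from a
    trained speaker-embedding network, not hand-written lists.
    """
    dot = sum(a * b for a, b in zip(xvec_a, xvec_b))
    norm_a = math.sqrt(sum(a * a for a in xvec_a))
    norm_b = math.sqrt(sum(b * b for b in xvec_b))
    return 1.0 - dot / (norm_a * norm_b)


# Identical embeddings give distance 0; orthogonal embeddings give 1.
print(xvector_cosine_distance([0.2, 0.5, 0.1], [0.2, 0.5, 0.1]))  # → 0.0 (up to rounding)
```

Comparing the distance between an x-vector of synthesized cross-lingual speech and one of the speaker's native recordings gives an objective proxy for the subjective similarity score.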
The training corpus comprises around 700 hours of professional recordings from 14 language locales, with at least three speakers per locale.
In the comparison experiments, "Baseline" is the multilingual transformer model. "+MTL" denotes the multi-task learning (MTL) system, which extends the baseline with additional classification tasks. "+MTL+JointSpk" builds on the MTL system by introducing an x-vector speaker classifier that is trained jointly with the MTL multilingual system.
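The three systems' training objectives can be sketched as a progression of loss terms. This is a hedged illustration, not the paper's formulation: the function names, the specific auxiliary classification tasks, and the weight values are placeholders the paper does not specify.

```python
def baseline_loss(tts_loss):
    # "Baseline": the multilingual transformer TTS loss alone.
    return tts_loss


def mtl_loss(tts_loss, cls_losses, cls_weight=0.1):
    # "+MTL": baseline plus auxiliary classification losses
    # (cls_weight is an illustrative value, not the paper's).
    return tts_loss + cls_weight * sum(cls_losses)


def mtl_jointspk_loss(tts_loss, cls_losses, spk_cls_loss,
                      cls_weight=0.1, spk_weight=0.1):
    # "+MTL+JointSpk": MTL plus the loss of the jointly trained
    # x-vector speaker classifier.
    return mtl_loss(tts_loss, cls_losses, cls_weight) + spk_weight * spk_cls_loss


# Example with dummy loss values:
print(mtl_jointspk_loss(1.0, [0.5, 0.5], 2.0))  # → 1.0 + 0.1*1.0 + 0.1*2.0 = 1.3
```

Each successive system adds a term on top of the previous one, which matches how the comparison is structured: every refinement keeps the earlier objectives intact.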