A Transformer-based Model for Sentence-Level Chinese Mandarin Lipreading

Abstract

Lipreading converts silent video of a speaker into the corresponding speech content, a task with practical value in many scenarios. However, most existing lipreading research targets English, and sentence-level Chinese lipreading remains underexplored. We therefore propose an end-to-end lipreading network for Mandarin Chinese. Unlike existing works, we are the first to apply the Transformer architecture to Mandarin lipreading, integrating the self-attention mechanism and improving the performance of the language model. Exploiting the characteristics of Mandarin, pinyin is introduced as an intermediate representation to assist the prediction of Chinese characters. In addition, the pinyin dictionary is built from initials and finals rather than the 26 English letters, which better matches Mandarin pronunciation habits. Based on the above, we propose a Cascade-Transformer-based Chinese Lipreading Network (CTCH-LipNet) that maps a talking-face video to its speech content. Experimental results on a large-scale dataset demonstrate that the proposed approach achieves better recognition performance than the state-of-the-art approaches investigated.
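As a rough illustration of the initial/final decomposition mentioned above (not code from the paper, and the exact tokenization the authors use may differ), the sketch below splits a toneless pinyin syllable into its initial and final:

```python
# Minimal sketch, assuming a standard Mandarin initial inventory;
# two-letter initials are checked before their single-letter prefixes.
INITIALS = [
    "zh", "ch", "sh",
    "b", "p", "m", "f", "d", "t", "n", "l",
    "g", "k", "h", "j", "q", "x", "r", "z", "c", "s",
]

def split_syllable(syllable: str):
    """Split a toneless pinyin syllable into (initial, final).

    Zero-initial syllables (e.g. "ai", "er") return an empty initial.
    """
    for initial in INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):]
    return "", syllable  # zero-initial syllable

# Example: "zhong guo" -> [("zh", "ong"), ("g", "uo")]
print([split_syllable(s) for s in "zhong guo".split()])
```

Tokenizing pinyin this way keeps the label set small while aligning units with how Mandarin syllables are actually articulated, rather than spelling each syllable out letter by letter.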

In Proceedings of IEEE International Conference on Data Science in Cyberspace 2020
马诗慧 (Graduate)
王士林 (Professor)