Chinese Mandarin Lipreading using Cascaded Transformers with Multiple Intermediate Representations

Abstract

Automatic lipreading has attracted much research interest over the past few decades. Unlike English, Chinese is a tone-based language with a large character set, so the correspondence between Chinese characters and lip motions is more complex. Most existing methods employ an intermediate representation (usually Pinyin) and adopt a cascaded architecture for Chinese lipreading. However, such a cascaded structure can accumulate errors, and employing Pinyin as the only intermediate representation causes a loss of visual information. Moreover, these approaches do not generalize well to unseen speakers due to inter-speaker variability. In this paper, we propose a cascaded Transformer-based model with a new cross-level attention mechanism, which enriches the information transmitted between the cascaded stages and reduces error accumulation. Multiple intermediate representations, including Chinese Pinyin and visemes, are adopted to acquire multi-perspective visual and linguistic features and to improve generalization to unseen speakers. Evaluations on the public sentence-level Chinese lipreading database CMLR demonstrate the advantages of the proposed method over state-of-the-art approaches in both speaker-independent and multi-speaker scenarios.
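To make the architectural idea concrete, below is a minimal PyTorch sketch of a cascaded model with viseme- and Pinyin-level intermediate branches and a cross-level attention step in which the final character stage attends to both intermediate feature streams. All module names, layer counts, feature sizes, and vocabulary sizes are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal, illustrative sketch (not the authors' released code) of a cascaded
# Transformer lipreader with two intermediate representations (visemes, Pinyin)
# and a cross-level attention step for the character stage.
# All sizes and names below are assumptions for illustration only.
import torch
import torch.nn as nn


class CascadedLipreader(nn.Module):
    def __init__(self, d_model=256, n_heads=4,
                 viseme_vocab=64, pinyin_vocab=1300, char_vocab=4000):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Visual front-end is assumed to already produce d_model-dim frame
        # features (e.g. from a 3D-CNN, omitted here).
        self.visual_encoder = nn.TransformerEncoder(enc_layer, num_layers=3)

        # Two cascaded intermediate branches with auxiliary prediction heads.
        self.viseme_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.pinyin_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.viseme_head = nn.Linear(d_model, viseme_vocab)
        self.pinyin_head = nn.Linear(d_model, pinyin_vocab)

        # "Cross-level" attention: the character stage queries both the
        # viseme-level and Pinyin-level streams, not only the last stage.
        self.attn_viseme = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_pinyin = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.fuse = nn.Linear(2 * d_model, d_model)
        self.char_head = nn.Linear(d_model, char_vocab)

    def forward(self, frame_feats):
        # frame_feats: (batch, time, d_model) visual frame features
        v = self.visual_encoder(frame_feats)
        vis = self.viseme_encoder(v)            # viseme-level features
        pin = self.pinyin_encoder(vis)          # Pinyin-level features
        viseme_logits = self.viseme_head(vis)   # auxiliary viseme prediction
        pinyin_logits = self.pinyin_head(pin)   # auxiliary Pinyin prediction

        # Character stage attends to both intermediate levels, so errors in
        # one level can be partly compensated by information from the other.
        a_vis, _ = self.attn_viseme(pin, vis, vis)
        a_pin, _ = self.attn_pinyin(pin, pin, pin)
        char_feats = self.fuse(torch.cat([a_vis, a_pin], dim=-1))
        char_logits = self.char_head(char_feats)
        return viseme_logits, pinyin_logits, char_logits


# Usage with dummy inputs: batch of 2 clips, 75 frames, 256-dim features.
model = CascadedLipreader()
outputs = model(torch.randn(2, 75, 256))
print([o.shape for o in outputs])
```

In this sketch the auxiliary viseme and Pinyin heads would each receive their own supervision during training, while the fused cross-level features drive the character prediction; this is one plausible way to realize the multi-representation cascade described above.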

In Proceedings of the IEEE International Conference on Image Processing (ICIP), 2022
马兴华 (Graduate)
王士林 (Professor)