MuSE: Multi-modal target speaker extraction with visual cues

      Preprint


          Abstract

          A speaker extraction algorithm relies on a reference speech to focus its attention on a target speaker. The reference speech is typically pre-registered as a speaker embedding. We believe that temporal synchronization between speech and lip movement is a useful cue, and that the target speaker embedding is equally important. Motivated by this belief, we study a novel technique that uses visual cues as the reference to extract the target speaker embedding, without the need for pre-registered reference speech. We propose a multi-modal speaker extraction network, named MuSE, that is conditioned only on a lip image sequence for target speaker extraction. MuSE not only improves over the AV-ConvTasnet baseline in terms of SI-SDR and PESQ, but also shows superior robustness in cross-domain evaluations.
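
          The abstract only names the idea: condition an extraction network on a lip image sequence instead of a pre-registered speaker embedding. As a rough illustration, below is a minimal PyTorch sketch of that conditioning pattern. It is not the authors' implementation (MuSE is compared against AV-ConvTasnet, a time-domain model, while this sketch uses a simple GRU separator), and every module name, shape, and hyper-parameter here is a hypothetical stand-in.

          import torch
          import torch.nn as nn
          import torch.nn.functional as F

          class LipEncoder(nn.Module):
              """Encodes a lip image sequence into per-frame visual embeddings.
              Hypothetical front-end; MuSE's actual visual encoder differs."""
              def __init__(self, emb_dim=256):
                  super().__init__()
                  # Small per-frame CNN over 64x64 grayscale lip crops.
                  self.cnn = nn.Sequential(
                      nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
                      nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1),
                  )
                  self.proj = nn.Linear(64, emb_dim)

              def forward(self, lips):                      # (B, T, 1, 64, 64)
                  b, t = lips.shape[:2]
                  feats = self.cnn(lips.flatten(0, 1)).flatten(1)   # (B*T, 64)
                  return self.proj(feats).view(b, t, -1)            # (B, T, emb_dim)

          class VisualConditionedExtractor(nn.Module):
              """Masks the encoded mixture using visual cues only: the lip
              embeddings play the role of the reference speaker embedding."""
              def __init__(self, n_filters=256, emb_dim=256):
                  super().__init__()
                  self.encoder = nn.Conv1d(1, n_filters, kernel_size=16, stride=8)
                  self.lip_encoder = LipEncoder(emb_dim)
                  self.separator = nn.GRU(n_filters + emb_dim, n_filters,
                                          num_layers=2, batch_first=True)
                  self.mask = nn.Sequential(nn.Linear(n_filters, n_filters),
                                            nn.Sigmoid())
                  self.decoder = nn.ConvTranspose1d(n_filters, 1,
                                                    kernel_size=16, stride=8)

              def forward(self, mixture, lips):             # mixture: (B, 1, S)
                  mix = self.encoder(mixture)               # (B, F, T')
                  vis = self.lip_encoder(lips)              # (B, T, emb_dim)
                  # Upsample visual embeddings to the audio frame rate, then
                  # concatenate so the separator is conditioned on the lips.
                  vis = F.interpolate(vis.transpose(1, 2),
                                      size=mix.shape[-1]).transpose(1, 2)
                  h, _ = self.separator(
                      torch.cat([mix.transpose(1, 2), vis], dim=-1))
                  masked = mix * self.mask(h).transpose(1, 2)
                  return self.decoder(masked)               # (B, 1, S)

          # Illustrative usage with dummy tensors:
          model = VisualConditionedExtractor()
          mixture = torch.randn(2, 1, 16000)       # two 1 s mixtures at 16 kHz
          lips = torch.randn(2, 25, 1, 64, 64)     # 25 lip frames (~25 fps)
          estimate = model(mixture, lips)          # (2, 1, 16000) target speech

          The point of the sketch is the conditioning path: the visual stream is interpolated to the audio frame rate and fused frame-by-frame, which is what lets speech-lip synchrony inform the mask. MuSE additionally derives a speaker embedding from the visual stream; that part is omitted here.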


          Author and article information

          Date: 15 October 2020
          arXiv: 2010.07775
          License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
          arXiv categories: eess.AS, cs.MM, cs.SD, eess.IV
          Subjects: Graphics & Multimedia design, Electrical engineering
