MuSE: Multi-modal target speaker extraction with visual cues

      Preprint


          Abstract

          A speaker extraction algorithm relies on a reference speech to focus its attention on a target speaker. The reference speech is typically pre-registered as a speaker embedding. We believe that temporal synchronization between speech and lip movement is a useful cue, and that the target speaker embedding is equally important. Motivated by this belief, we study a novel technique that uses visual cues as the reference to extract the target speaker embedding, without the need for pre-registered reference speech. We propose a multi-modal speaker extraction network, named MuSE, that is conditioned only on a lip image sequence for target speaker extraction. MuSE not only improves over the AV-ConvTasnet baseline in terms of SI-SDR and PESQ, but also shows superior robustness in cross-domain evaluations.
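
          The abstract only names the idea: condition an extraction network on a lip image sequence instead of a pre-registered speaker embedding. As a rough illustration, below is a minimal PyTorch sketch of that conditioning pattern. It is not the authors' implementation (MuSE is compared against AV-ConvTasnet, a time-domain model, while this sketch uses a simple GRU separator), and every module name, shape, and hyper-parameter here is a hypothetical stand-in.

          import torch
          import torch.nn as nn
          import torch.nn.functional as F

          class LipEncoder(nn.Module):
              """Encodes a lip image sequence into per-frame visual embeddings.
              Hypothetical front-end; MuSE's actual visual encoder differs."""
              def __init__(self, emb_dim=256):
                  super().__init__()
                  # Small per-frame CNN over 64x64 grayscale lip crops.
                  self.cnn = nn.Sequential(
                      nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
                      nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1),
                  )
                  self.proj = nn.Linear(64, emb_dim)

              def forward(self, lips):                      # (B, T, 1, 64, 64)
                  b, t = lips.shape[:2]
                  feats = self.cnn(lips.flatten(0, 1)).flatten(1)   # (B*T, 64)
                  return self.proj(feats).view(b, t, -1)            # (B, T, emb_dim)

          class VisualConditionedExtractor(nn.Module):
              """Masks the encoded mixture using visual cues only: the lip
              embeddings play the role of the reference speaker embedding."""
              def __init__(self, n_filters=256, emb_dim=256):
                  super().__init__()
                  self.encoder = nn.Conv1d(1, n_filters, kernel_size=16, stride=8)
                  self.lip_encoder = LipEncoder(emb_dim)
                  self.separator = nn.GRU(n_filters + emb_dim, n_filters,
                                          num_layers=2, batch_first=True)
                  self.mask = nn.Sequential(nn.Linear(n_filters, n_filters),
                                            nn.Sigmoid())
                  self.decoder = nn.ConvTranspose1d(n_filters, 1,
                                                    kernel_size=16, stride=8)

              def forward(self, mixture, lips):             # mixture: (B, 1, S)
                  mix = self.encoder(mixture)               # (B, F, T')
                  vis = self.lip_encoder(lips)              # (B, T, emb_dim)
                  # Upsample visual embeddings to the audio frame rate, then
                  # concatenate so the separator is conditioned on the lips.
                  vis = F.interpolate(vis.transpose(1, 2),
                                      size=mix.shape[-1]).transpose(1, 2)
                  h, _ = self.separator(
                      torch.cat([mix.transpose(1, 2), vis], dim=-1))
                  masked = mix * self.mask(h).transpose(1, 2)
                  return self.decoder(masked)               # (B, 1, S)

          # Illustrative usage with dummy tensors:
          model = VisualConditionedExtractor()
          mixture = torch.randn(2, 1, 16000)       # two 1 s mixtures at 16 kHz
          lips = torch.randn(2, 25, 1, 64, 64)     # 25 lip frames (~25 fps)
          estimate = model(mixture, lips)          # (2, 1, 16000) target speech

          The point of the sketch is the conditioning path: the visual stream is interpolated to the audio frame rate and fused frame-by-frame, which is what lets speech-lip synchrony inform the mask. MuSE additionally derives a speaker embedding from the visual stream; that part is omitted here.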


          Author and article information

          Date: 15 October 2020
          arXiv: 2010.07775
          License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
          arXiv categories: eess.AS, cs.MM, cs.SD, eess.IV
          Subjects: Graphics & Multimedia design, Electrical engineering
