
      PSST! Prosodic Speech Segmentation with Transformers

      Preprint


          Abstract

          Self-attention mechanisms have enabled transformers to achieve superhuman performance on many speech-to-text (STT) tasks, yet automatic prosodic segmentation has remained unsolved. In this paper we finetune Whisper, a pretrained STT model, to annotate intonation unit (IU) boundaries by repurposing low-frequency tokens. Our approach achieves an accuracy of 95.8%, outperforming previous methods without the need for large-scale labeled data or enterprise-grade compute resources. We also diminish input signals by applying a series of filters, finding that low-pass filters with a 3.2 kHz cutoff improve segmentation performance in out-of-sample and out-of-distribution contexts. We release our model as both a transcription tool and a baseline for further improvements in prosodic segmentation.
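          The input-filtering step described in the abstract can be sketched with SciPy. Only the 3.2 kHz cutoff comes from the text; the function name, the filter order, and the choice of a zero-phase Butterworth design are illustrative assumptions, not the authors' exact pipeline.

          ```python
          import numpy as np
          from scipy.signal import butter, sosfiltfilt

          def low_pass(audio: np.ndarray, sr: int = 16_000,
                       cutoff_hz: float = 3200.0, order: int = 5) -> np.ndarray:
              """Apply a zero-phase Butterworth low-pass filter to a mono signal."""
              sos = butter(order, cutoff_hz, btype="low", output="sos", fs=sr)
              return sosfiltfilt(sos, audio)

          # Toy signal: a 200 Hz tone (inside the passband) plus a 6 kHz tone
          # (above the 3.2 kHz cutoff, so it is strongly attenuated).
          sr = 16_000
          t = np.arange(sr) / sr
          mixed = np.sin(2 * np.pi * 200 * t) + np.sin(2 * np.pi * 6000 * t)
          filtered = low_pass(mixed, sr)
          ```

          The zero-phase (forward-backward) filtering avoids shifting boundary timings, which matters if the filtered audio is then aligned against IU boundary annotations.
          
          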


          Author and article information

          Published: 03 February 2023 (preprint)
          arXiv: 2302.01984
          License: http://creativecommons.org/licenses/by/4.0/
          Comments: 5 pages, 3 figures. For associated repository, see https://github.com/Nathan-Roll1/psst
          arXiv categories: cs.CL, cs.SD, eess.AS
          Subject areas: Theoretical computer science, Electrical engineering, Graphics & Multimedia design
