Broadcast Language Identification & Subtitling System (BLISS)

Accessibility is an important area of Human Computer Interaction (HCI), and regulations in many countries mandate that broadcast media content be accessible to all. Currently, most subtitles for offline and live broadcasts are produced by people. However, subtitling methods that employ re-speaking with Automatic Speech Recognition (ASR) technology are increasingly replacing manual methods. We discuss here the subtitling component of BLISS (Broadcast Language Identification & Subtitling System), an ASR system for automated subtitling and broadcast monitoring built using the Kaldi ASR Toolkit. The BLISS Gaussian Mixture Model (GMM)/Hidden Markov Model (HMM) acoustic model has been trained with ~960 hours of read speech, and the language model with ~900k words combined with a pronunciation dictionary of 200k words from the LibriSpeech corpus. In tests with ~5 hours of unseen clean speech test data, with little background noise and seen accents, BLISS gives a recognition accuracy of 91.87% based on the Word Error Rate (WER) metric. For ~5 hours of unseen challenge speech test data, with higher-WER speakers, BLISS’s accuracy reduces to 75.91%. A BLISS Deep Learning Neural Network (DNN) acoustic model has also been trained with ~100 hours of read speech data. Its accuracy for ~5 hours of unseen clean and unseen challenge speech test data is 92.88% and 77.27% respectively, based on WER. Future work includes training the DNN model with ~960 hours of read speech data using CUDA GPUs and incorporating algorithms for background noise reduction. The BLISS core engine is also intended as a Language Identification system for broadcast monitoring (BLIS). This paper focuses on its Subtitling application (BLSS).
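The accuracy figures above are reported "based on WER". As a minimal sketch, assuming the conventional definition of WER (word-level substitutions, deletions and insertions divided by the number of reference words) and assuming that accuracy is reported as 100% minus WER, the calculation could look as follows. The function names `word_error_rate` and `accuracy_percent` are hypothetical and are not taken from BLISS or Kaldi; the abstract does not state how its scoring was implemented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: whitespace tokenisation and exact, case-sensitive matching.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words processed so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def accuracy_percent(wer: float) -> float:
    """Assumed convention: accuracy (%) = 100 * (1 - WER)."""
    return 100.0 * (1.0 - wer)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    hypothesis = "the cat sit on mat"  # one substitution, one deletion
    wer = word_error_rate(reference, hypothesis)
    print(f"WER = {wer:.2%}, accuracy = {accuracy_percent(wer):.2f}%")
```

Under this assumed convention, the reported 91.87% accuracy on clean test data would correspond to a WER of 8.13%, and 75.91% on challenge data to a WER of 24.09%.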

Author and article information

Contributors

Jinling Wang

Karla Muñoz Esquivel

James Connolly

Kevin Curran

Paul Mc Kevitt

Conference

Publication date: July 2018

Publication date (Print): July 2018

Pages: 1-6

Affiliations

[ ¹ ] Faculty of Computing, Engineering & Built Environment, Ulster University (Magee), Derry/Londonderry, BT48 7JL, Northern Ireland

[ ² ] Faculty of Computing, Engineering & Built Environment, Ulster University (Jordanstown), Newtownabbey, BT37 0QB

[ ³ ] Department of Computing, Letterkenny Institute of Technology (LYIT), Port Road, Letterkenny, IRL- F92 FC93, Co. Donegal, Ireland

[ ⁴ ] Faculty of Arts, Humanities & Social Sciences, Ulster University (Magee), Derry/Londonderry, BT48 7JL, Northern Ireland

Article

DOI: 10.14236/ewic/HCI2018.150

SO-VID: 649e0789-3917-4243-b134-11fc47c5b303

License:

This work is licensed under a Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/

Conference name: Proceedings of the 32nd International BCS Human Computer Interaction Conference

Conference acronym: HCI

Conference number: 32

Conference location: Belfast, UK

Conference date: 4 - 6 July 2018

Conference sponsor: Electronic Workshops in Computing (eWiC)

Conference theme: Human Computer Interaction Conference

Product

1477-9358 BCS Learning & Development

Self URI (article page): https://www.scienceopen.com/hosted-document?doi=10.14236/ewic/HCI2018.150

Self URI (journal page): https://ewic.bcs.org/
