
      Broadcast Language Identification & Subtitling System (BLISS)

      Proceedings article
      Proceedings of the 32nd International BCS Human Computer Interaction Conference (HCI)
      Human Computer Interaction Conference
      4 - 6 July 2018
      Automatic Speech Recognition (ASR), Accent, Automated Subtitling, Background Noise, BLISS, Human-Computer Interaction, Kaldi, LibriSpeech

            Abstract

            Accessibility is an important area of Human Computer Interaction (HCI), and regulations in many countries mandate that broadcast media content be accessible to all. Currently, most subtitles for offline and live broadcasts are produced by people. However, subtitling methods employing re-speaking with Automatic Speech Recognition (ASR) technology are increasingly replacing manual methods. We discuss here the subtitling component of BLISS (Broadcast Language Identification & Subtitling System), an ASR system for automated subtitling and broadcast monitoring built using the Kaldi ASR Toolkit. The BLISS Gaussian Mixture Model (GMM)/Hidden Markov Model (HMM) acoustic model has been trained with ~960 hours of read speech, and the language model with ~900k words combined with a pronunciation dictionary of 200k words from the LibriSpeech corpus. In tests with ~5 hours of unseen clean speech test data, with little background noise and seen accents, BLISS gives a recognition accuracy of 91.87% based on the WER (Word Error Rate) metric. For ~5 hours of unseen challenge speech test data, with higher-WER speakers, BLISS's accuracy falls to 75.91%. A BLISS Deep Neural Network (DNN) acoustic model has also been trained with ~100 hours of read speech data. Its accuracy for ~5 hours of unseen clean and unseen challenge speech test data is 92.88% and 77.27% respectively, based on WER. Future work includes training the DNN model with ~960 hours of read speech data using CUDA GPUs and incorporating algorithms for background noise reduction. The BLISS core engine is also intended as a Language Identification system for broadcast monitoring (BLIS). This paper focuses on its Subtitling application (BLSS).
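
            The accuracy figures quoted above follow the WER (Word Error Rate) metric, with accuracy presumably reported as 100% minus the WER. As a minimal sketch of how WER is conventionally computed (illustrative Python only, not the authors' or Kaldi's scoring code; the function name and example strings are hypothetical), the snippet below counts word-level substitutions, insertions and deletions via Levenshtein edit distance:

                def word_error_rate(reference: str, hypothesis: str) -> float:
                    """WER = (substitutions + deletions + insertions) / number of reference words."""
                    ref = reference.split()
                    hyp = hypothesis.split()
                    # Dynamic-programming edit distance over words (Levenshtein).
                    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
                    for i in range(len(ref) + 1):
                        dist[i][0] = i          # i deletions
                    for j in range(len(hyp) + 1):
                        dist[0][j] = j          # j insertions
                    for i in range(1, len(ref) + 1):
                        for j in range(1, len(hyp) + 1):
                            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                                             dist[i][j - 1] + 1,         # insertion
                                             dist[i - 1][j - 1] + cost)  # substitution or match
                    return dist[len(ref)][len(hyp)] / len(ref)

                # Hypothetical example: one substitution ("the" -> "a") in six reference words.
                ref = "the cat sat on the mat"
                hyp = "the cat sat on a mat"
                print(f"WER = {word_error_rate(ref, hyp):.2%}")  # prints WER = 16.67%

            Under the same convention, the 91.87% accuracy reported for the clean test set would correspond to a WER of 8.13%.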


            Author and article information

            Conference
            July 2018
            Pages: 1-6
            Affiliations
            [ 1 ] Faculty of Computing, Engineering & Built Environment, Ulster University (Magee), Derry/Londonderry, BT48 7JL, Northern Ireland
            [ 2 ] Faculty of Computing, Engineering & Built Environment, Ulster University (Jordanstown), Newtownabbey, BT37 0QB, Northern Ireland
            [ 3 ] Department of Computing, Letterkenny Institute of Technology (LYIT), Port Road, Letterkenny, Co. Donegal, F92 FC93, Ireland
            [ 4 ] Faculty of Arts, Humanities & Social Sciences, Ulster University (Magee), Derry/Londonderry, BT48 7JL, Northern Ireland
            Article
            DOI: 10.14236/ewic/HCI2018.150
            © Wang et al. Published by BCS Learning and Development Ltd. Proceedings of British HCI 2018. Belfast, UK.

            This work is licensed under a Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/

            Proceedings of the 32nd International BCS Human Computer Interaction Conference
            HCI 32
            Belfast, UK
            4 - 6 July 2018
            Electronic Workshops in Computing (eWiC)
            Human Computer Interaction Conference

            ISSN: 1477-9358. Publisher: BCS Learning & Development

            Self URI (article page): https://www.scienceopen.com/hosted-document?doi=10.14236/ewic/HCI2018.150
            Self URI (journal page): https://ewic.bcs.org/
            Categories
            Electronic Workshops in Computing

            Applied computer science, Computer science, Security & Cryptology, Graphics & Multimedia design, General computer science, Human-Computer Interaction
            Automated Subtitling, Automatic Speech Recognition (ASR), Accent, Background Noise, BLISS, Human-Computer Interaction, Kaldi, LibriSpeech

