Broadcast Language Identification & Subtitling System (BLISS)

1 Faculty of Computing, Engineering & Built Environment, Ulster University (Jordanstown), Newtownabbey, BT37 0QB, Northern Ireland
2 Ulster University (Magee), Derry/Londonderry, BT48 7JL, Northern Ireland
{j.wang, kc.munoz-esquivel, kj.curran}@ulster.ac.uk
3 Department of Computing, Letterkenny Institute of Technology (LYIT), Port Road, Letterkenny, F92 FC93, Co. Donegal, Ireland
James.Connolly@lyit.ie
4 Faculty of Arts, Humanities & Social Sciences, Ulster University (Magee), Derry/Londonderry, BT48 7JL, Northern Ireland
p.mckevitt@ulster.ac.uk


INTRODUCTION
An important aspect of Human Computer Interaction (HCI) is accessibility, involving production of text from speech (subtitles/captions) (Romero-Fresco, 2014) for those who cannot hear, and audio from video (audio description) (Fryer, 2016) for those who cannot see. Manual production of time-aligned transcriptions of audio-visual content requires considerable effort. It is prone to typing errors, slow for real-time delivery, and human subtitlers can be costly (Alvarez et al., 2016). For live subtitles, re-speaking techniques are combined with off-the-shelf Automatic Speech Recognition (ASR) engines to produce subtitles. With re-speaking, the audio content is re-spoken by a professional speaker. This results in speech with reduced accents and noise, which can be processed by ASR engines with acceptable accuracy for live subtitling. However, re-speaking causes delays in real-time subtitling tasks and requires the re-speaker to dictate the audio content in a speaker-independent manner. Improving the quality of subtitles for people with audio and visual impairments is an important focus for the UK's communications regulator, Ofcom (Ofcom, 2013). Broadcasters are required to measure and improve the quality of live broadcasts so that subtitles are synchronised with video and speech, in addition to achieving high accuracy rates. In terms of broadcast monitoring, broadcasters must also ensure that the correct language playout occurs in different geographic regions.
Advanced ASR technology can help solve the problems of subtitle delay during live broadcasts, in addition to improving the accuracy obtained. We discuss here a platform called BLISS (Broadcast Language Identification & Subtitling System) for performing language identification and subtitling. The core BLISS technology is based on advanced ASR, using Gaussian Mixture Models (GMMs), Hidden Markov Models (HMMs) and Deep Neural Networks (DNNs) implemented within the Kaldi Toolkit (Kaldi, 2018). BLISS is tailored to the traditional and online broadcast media and video production industries. The BLISS core engine is also intended as a language identification system for broadcast monitoring (BLIS) (Connolly et al., 2014), a software application which identifies the language of a broadcast, raises an alert, and can even correct playout if the wrong language is being broadcast in a particular region. In this paper we focus on its subtitling application (BLSS).
Key problems BLISS addresses include: a) latency, and hence speed of transcription from spoken word to text output; b) accuracy and compliance performance, to mitigate reputation damage and financial penalties; and c) reduction of costs through automation of human-based manual transcription.
In this paper, Section 2 gives a brief review of recently developed ASR technology. In Section 3, the design, architecture and implementation of BLISS are discussed. Section 4 discusses experimental results from testing BLISS and the impact of new unseen accents and noise. Section 5 concludes and discusses future work.

BACKGROUND & RELATED WORK
ASR is concerned with developing technologies that enable the recognition and translation of spoken language into text by computerised systems. ASR is also referred to as automatic Speech-To-Text (STT). In recent years, ASR technology has made remarkable progress, but the design of ASR systems still needs to pay careful attention to problems such as accents and noise.
Most ASR systems rely on phoneme recognition and word decoding. Classification algorithms (e.g., GMMs) are applied to highly specialised features such as Mel Frequency Cepstral Coefficients (MFCCs) or Perceptual Linear Predictive coefficients (PLPs) so that a distribution over possible phonemes for each frame can be obtained (Kaur et al., 2016; Karpagavalli & Chandra, 2016). An HMM with a pretrained language model is used to find the most likely sequence of phonemes that can be mapped to the output words during the decoding phase. HMMs are often utilised to handle the temporal variability of speech, and have been popular because they are flexible, versatile, and have a consistent statistical framework (Mohamed et al., 2009, 2012; Stevenson, 2016).
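The decoding step an HMM performs, finding the most likely state (phoneme) sequence for a sequence of observed frames, can be sketched with the Viterbi algorithm. The two-state "phoneme" inventory and all probabilities below are invented toy values for illustration, not a real acoustic model:

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the most likely HMM state sequence for an observation sequence."""
    # V[t][s]: best log-probability of any state path ending in state s at time t
    V = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: V[t - 1][p] + log_trans[p][s])
            V[t][s] = V[t - 1][prev] + log_trans[prev][s] + log_emit[s][obs[t]]
            back[t][s] = prev
    # Trace the best path backwards from the best final state
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Toy two-phoneme model: /k/ tends to emit 'K'-like frames, /ae/ 'A'-like frames
states = ["k", "ae"]
log_start = {"k": math.log(0.6), "ae": math.log(0.4)}
log_trans = {"k": {"k": math.log(0.5), "ae": math.log(0.5)},
             "ae": {"k": math.log(0.3), "ae": math.log(0.7)}}
log_emit = {"k": {"K": math.log(0.9), "A": math.log(0.1)},
            "ae": {"K": math.log(0.2), "A": math.log(0.8)}}
best = viterbi(["K", "K", "A"], states, log_start, log_trans, log_emit)  # ['k', 'k', 'ae']
```

Real decoders work over lattices of many thousands of states, but the dynamic-programming recursion is the same.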
An alternative to GMMs for this task is a feed-forward neural network, which takes several frames of coefficients as input and produces posterior probabilities over HMM states as output.

DNN-HMM hybrid systems
By taking advantage of the discriminative power of DNNs, several successful hybrid DNN-HMM ASR systems have been developed for phoneme recognition (Hinton et al., 2012; Pan et al., 2012). DNNs have been used for acoustic model likelihood computation. Here, DNNs outperformed traditional GMMs in predicting emission probabilities of HMM states representing phonemes in a hybrid model setup. DNNs have given promising results for large vocabulary continuous speech recognition (LVCSR) tasks, showing significant gains over GMM/HMM systems on a wide variety of small and large vocabulary tasks (Seide et al., 2011; Dahl et al., 2011, 2012, 2013; Li et al., 2013; Jaitly et al., 2012; Sainath et al., 2013; Zhang & Woodland, 2015).
The output of a trained DNN is not the end result of an ASR system; instead, the DNN supplies an HMM with the best acoustic modelling information to predict the target HMM states. The advantage of DNNs over GMMs in ASR is their ability to predict many thousands of tied triphone HMM states. This creates a large number of HMM classes and also inherently adds to the amount of training data and time needed to initialise a DNN-HMM system. For the 2015 Multi-Genre Broadcast (MGB) challenge, Woodland et al. (2015) outline a speech-to-text model containing a segmentation system based on DNNs. The model uses HTK 3.5 for building the DNN-based hybrid and tandem acoustic model in a joint decoding framework. The final system achieved the lowest Word Error Rate (WER) of 23.7% for speech-to-text transcription on the MGB evaluation data (Bell et al., 2015).
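In such a hybrid setup, the DNN's state posteriors p(s|x) are typically converted into scaled likelihoods p(x|s) ∝ p(s|x)/p(s) before the HMM decoder consumes them, by dividing out the state priors. A minimal sketch with invented numbers:

```python
import math

def scaled_log_likelihoods(log_posteriors, state_priors):
    """Convert DNN state posteriors p(s|x) into scaled likelihoods
    p(x|s) proportional to p(s|x) / p(s), the quantities an HMM decoder uses."""
    return [lp - math.log(prior)
            for lp, prior in zip(log_posteriors, state_priors)]

# One toy frame: DNN posteriors over three tied HMM states, and the state priors
posteriors = [0.7, 0.2, 0.1]
priors = [0.5, 0.3, 0.2]
log_liks = scaled_log_likelihoods([math.log(p) for p in posteriors], priors)
```

The priors are usually estimated from the state frequencies in the forced-aligned training data.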

CNN-HMM hybrid systems
Convolutional Neural Networks (CNNs) can be used to model correlation between spatial and temporal signals, and to reduce spectral variance in acoustic features for ASR. Hybrid ASR systems incorporating CNNs with HMMs/GMMs have achieved promising results on various benchmarks (Abdel-Hamid et al., 2013; Sainath et al., 2013). CNNs are a more effective model for speech compared to extensively used fully-connected acoustic DNN models. The number of convolutional layers, the optimal number of hidden units, the best pooling strategy, and the best input feature type for CNNs should all be considered. Comparing CNNs to DNNs and GMMs shows that CNNs can give a 13-30% improvement over GMMs, and a 4-12% improvement over DNNs, on a variety of LVCSR tasks such as the 400-hour Broadcast News and 300-hour Switchboard tasks (Sercu & Goel, 2016).

Figure 1: Architecture of BLISS

End-to-End systems
Hybrid systems have been developed using neural networks for phoneme recognition and HMMs for decoding. Recurrent neural networks (RNNs) can handle recognition and decoding simultaneously. Connectionist Temporal Classification (CTC) with RNNs (Zhang & Pezeshki, 2016) for labelling unsegmented sequences makes it feasible to train an 'end-to-end' ASR system instead of using hybrid settings. However, RNNs are computationally expensive and sometimes difficult to train. Inspired by the advantages of both CNNs and the CTC approach, an end-to-end ASR model was developed for sequence labelling by combining hierarchical CNNs with CTC directly, without recurrent connections. Evaluated on the TIMIT phoneme recognition task, this model is not only computationally efficient, but also competitive with existing baseline systems. Moreover, CNNs have the capability to model temporal correlations with appropriate context information. LVCSR systems perform differently in terms of accuracy depending on the ASR task.
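CTC maps per-frame network outputs to a label sequence by merging consecutive repeats and removing a special blank symbol. A minimal sketch of this greedy collapse (the frame labels below are invented):

```python
def ctc_collapse(frame_labels, blank="-"):
    """Greedy CTC decoding: merge consecutive repeated labels, then drop blanks."""
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev and lab != blank:  # new non-blank label: keep it
            out.append(lab)
        prev = lab
    return out

# Per-frame argmax labels for a toy utterance: collapses to the phones k-ae-t
frames = ["-", "k", "k", "-", "ae", "ae", "t", "-"]
phones = ctc_collapse(frames)  # ['k', 'ae', 't']
```

The blank symbol is what lets CTC distinguish a genuine repeated label (separated by a blank) from one label spread over several frames.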
Clean read speech data gives better results than broadcast speech data. There are several subtitling tools on the market that enable the conversion of audio into text, e.g., the SAVAS project's automatic live and batch subtitling application for several European languages such as Basque and Portuguese (Alvarez et al., 2016). However, we still do not have human-level ASR systems for the automated subtitling task (Maxwell, 2018).

BLISS DESIGN & IMPLEMENTATION
Here we discuss the Design and Implementation of BLISS in terms of requirements analysis, architecture, and implementation with the Kaldi Toolkit and LibriSpeech dataset for model training and testing.

Customer requirements analysis
We have conducted requirements analysis and customer conversations with 25 Vendors and End Users within the 2017 ICURe NI LIC Lean Launch programme hosted by the SETsquared Partnership, Ulster University and Queen's University Belfast. Benchmarking of subtitle quality requires at least 95% accuracy in terms of the WER or NER (Number of words, Edition errors, Recognition errors) metrics (Alvarez et al., 2016), with some customers requiring 100%. Vendors and End Users quote subtitler charging costs of e.g. ~£3.50/min., and sales pricing of e.g. ~£550/hr., with differences depending on the nature of the media content. Most Vendors access an external bank of outsourced subtitlers, with internal staff quality checking and packaging. Vendors are currently investigating solutions that use cutting-edge ASR technology to automatically transcribe speech and format it for subtitling and captioning purposes, e.g. Red Bee Media (Maxwell, 2018). Any ASR system that reduces the cost, time and penalties for subtitling (Ofcom, 2013) would be of huge benefit to the subtitling industry.

BLISS architecture design
Figure 1 illustrates the modules in BLISS. Raw speech is converted to sequences of feature vectors using classical signal processing methods, e.g. MFCCs. The data X is a sequence of frames of audio features x1, x2, …, xt. Y represents text sequences.
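The first step of this feature extraction, slicing the waveform into overlapping analysis frames before computing MFCCs, can be sketched as follows. The 25 ms window and 10 ms shift are typical values in the ASR literature, not BLISS-specific settings:

```python
def frame_signal(samples, sample_rate, win_ms=25, shift_ms=10):
    """Slice a waveform into overlapping analysis frames, the step that
    precedes windowing and MFCC computation in an ASR front end."""
    win = int(sample_rate * win_ms / 1000)      # frame length in samples
    shift = int(sample_rate * shift_ms / 1000)  # frame shift in samples
    return [samples[i:i + win]
            for i in range(0, len(samples) - win + 1, shift)]

# One second of a dummy 16 kHz signal -> 25 ms frames every 10 ms
signal = [0.0] * 16000
frames = frame_signal(signal, 16000)
```

Each frame then yields one feature vector xt, so one second of audio produces roughly one hundred frames of the sequence x1, x2, …, xt.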
Using the language models, a sequence of words is produced. The pronunciation dictionary contains, for each word, a pronunciation model for how that word is spoken. The pronunciation model, with associated probabilities, is written as a sequence of phonemes (or pronunciation tokens), which are the basic units of sound. These models are fed into an acoustic model which scores token sounds. Acoustic models are typically built using three-state, left-to-right GMM-HMMs which model frames of data.
Once the acoustic model is built, recognition can be performed by conducting inference on the data received. For example, when waveforms are received and their features (X) are extracted, the model infers the sequence Y that would produce this sequence of X with the highest probability. Traditionally, each of the components shown in Figure 1 was built with statistical methods (e.g., HMMs), but neural networks have recently proven superior.
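This inference step amounts to choosing the text sequence Y that maximises P(X|Y)P(Y); in log space, a decoder simply picks the hypothesis with the best combined acoustic and language model score. A toy sketch with invented hypotheses and scores:

```python
def best_hypothesis(hyps):
    """Pick the word sequence Y maximising log P(X|Y) + log P(Y),
    i.e. acoustic model score plus language model score."""
    return max(hyps, key=lambda h: h["log_acoustic"] + h["log_lm"])

# Two invented competing hypotheses for the same stretch of audio:
# the second sounds slightly closer, but the language model disfavours it
hyps = [
    {"words": "recognise speech", "log_acoustic": -12.1, "log_lm": -4.2},
    {"words": "wreck a nice beach", "log_acoustic": -11.8, "log_lm": -7.9},
]
best = best_hypothesis(hyps)  # "recognise speech" wins on combined score
```

Real decoders search this trade-off over a lattice of millions of hypotheses, often with a weight balancing the two scores, but the decision rule is the same.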

Implementation of BLISS
Here we discuss the implementation of BLISS in terms of LibriSpeech datasets for training and testing, Language Model (LM) and Acoustic Model.

Training and testing datasets
BLISS was trained and tested on the LibriSpeech dataset and, in addition, a USA Alabama broadcast news dataset was used for testing. Table 1 gives details on the LibriSpeech datasets. For example, the train_100c dataset includes ~100 hours of clean speech data: 28,539 utterances with 990,101 words, spoken by 125 female and 126 male speakers, each speaking for 25 minutes. Speakers in the LibriSpeech corpus were ranked according to transcript WER and divided roughly in the middle; the lower-WER speakers were designated 'clean' (c), and the higher-WER speakers 'challenge' (h). The test_c dataset includes ~5.4 hours of unseen clean test data: 2,620 utterances with 52,576 words, spoken by 20 female and 20 male speakers, each speaking for 8 minutes.
The test_h dataset includes ~5.1 hours of unseen challenge test data: 2,939 utterances with 52,343 words, spoken by 17 female and 16 male speakers, each speaking for 10 minutes.
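The WER metric used to rank these speakers (and to evaluate BLISS throughout this paper) is the word-level Levenshtein distance between reference and hypothesis, divided by the reference length. A minimal implementation:

```python
def wer(reference, hypothesis):
    """Word Error Rate: (substitutions + deletions + insertions) / reference
    length, computed via Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j]: edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[-1][-1] / len(ref)

score = wer("the cat sat on the mat", "the cat sat on mat")  # 1 deletion / 6 words
```

Note that WER can exceed 100% when the hypothesis contains many insertions.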

BLISS Language Model (LM)
The full, non-pruned 3-gram LM was trained using the most frequent 200k-word vocabulary from 14,500 public domain books, from which about 803 million tokens and 900k unique words were selected (Panayotov et al., 2015). The pronunciation lexicon includes a 206,510-entry pronunciation dictionary, as some words have more than one pronunciation.
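A 3-gram LM of this kind estimates P(w3 | w1, w2) from corpus counts. A minimal unsmoothed maximum-likelihood sketch (a production LM such as the one above additionally applies smoothing and, optionally, pruning):

```python
from collections import Counter

def trigram_prob(corpus_tokens, w1, w2, w3):
    """Unsmoothed maximum-likelihood trigram probability:
    P(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2)."""
    tri = Counter(zip(corpus_tokens, corpus_tokens[1:], corpus_tokens[2:]))
    bi = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    return tri[(w1, w2, w3)] / bi[(w1, w2)]

# Toy corpus: both occurrences of "the cat" are followed by "sat"
tokens = "the cat sat on the mat the cat sat down".split()
p = trigram_prob(tokens, "the", "cat", "sat")  # 2/2 = 1.0
```

Without smoothing, any trigram unseen in training receives probability zero, which is why real LMs back off to bigram and unigram estimates.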

EXPERIMENTAL RESULTS
The BLISS tri5b model was tested on both the unseen clean (test_c) and challenge (test_h) test datasets listed in Table 1, giving WER results of 8.13% and 24.09% respectively. The USA Alabama news data was used to test background noise and accent effects, e.g. AN2088: "there was that survey which said that the united states wasn't going to use nukes to help south korea", with all words in the pronunciation dictionary except "nukes".
Figure 2 shows the performance of tri5b in terms of WER results without and with unseen accent, noise and music on the unseen USA Alabama news audio utterance AN2088, using the full, non-pruned 3-gram LM.

CONCLUSION & FUTURE WORK
In this paper we discussed the design and implementation of a software platform called BLISS (Broadcast Language Identification & Subtitling System) for performing subtitling (BLSS) and language identification (BLIS) within the broadcast entertainment industry, with a focus on subtitling. BLISS is based on customer requirements analysis conducted with more than 25 Vendors and End Users. The LibriSpeech USA English audio book read speech and USA Alabama broadcast news datasets were used to build and test the BLISS acoustic models using the Kaldi ASR Toolkit. BLISS gives promising WER results of 8.13% and 24.09% respectively on ~5 hours of unseen clean (test_c) and unseen challenge (test_h) audio data subsets from the LibriSpeech corpus. BLISS performance on USA Alabama broadcast news data with new unseen accents gives a WER of 52.63%.
Our experiments demonstrate that speech with new unseen accents, background noise and music degrades BLISS model performance, with music degrading it more than white noise. The BLISS DNN model performs better than the BLISS GMM/HMM tri4b model, giving WER results of 7.12% (over 9.74%) with test_c, and 22.73% (over 32.93%) with test_h. Future work includes developing further BLISS DNN models and methods for noise removal.

Figure 2: Effect of unseen accents, white noise and background music on WER

Table 1: LibriSpeech dataset (Panayotov et al., 2015)

Results show the BLISS DNN model achieves better performance than the BLISS tri4b model using the same volume of acoustic training data (100 hours) with the full, non-pruned 3-gram LM. WER for the BLISS DNN model decreases to 7.12% from 9.74% with the BLISS tri4b model for test_c clean audio data, and to 22.73% from 32.93% for test_h challenge audio data.