Access Restriction

Author Liao, Hank ♦ McDermott, Erik
Source CiteSeerX
Content type Text
File Format PDF
Language English
Subject Domain (in DDC) Computer science, information & general works ♦ Data processing & computer science
Subject Keyword Audio Indexing ♦ YouTube Video ♦ Visited Video ♦ Article Describes Recent Improvement ♦ Useful Training Segment ♦ Word Error Rate ♦ Automatic Generation ♦ Indexing Purpose ♦ Deep Neural Network Acoustic Model ♦ Model Size ♦ Original System ♦ Excellent Application ♦ Additional Semi-supervised Training Data ♦ Automatic Speech Recognition ♦ Gaussian Mixture Model ♦ Owner-uploaded Video Transcript ♦ Vocabulary Speech Recognition ♦ Closed Caption ♦ Improving Accessibility ♦ Context Dependent State ♦ Deep Learning ♦ Index Term ♦ English Speech ♦ Different Language ♦ Acoustic Model ♦ Large State Inventory ♦ Deep Neural Network ♦ DNN Result ♦ Semi-supervised Training Data ♦ Automatic Speech Recognition System ♦ YouTube Video Transcription
Description YouTube is a highly visited video sharing website where over one billion people watch six billion hours of video every month. Improving accessibility to these videos for the hard of hearing and for search and indexing purposes is an excellent application of automatic speech recognition. However, YouTube videos are extremely challenging for automatic speech recognition systems. Standard adapted Gaussian Mixture Model (GMM) based acoustic models can have word error rates above 50%, making this one of the most difficult reported tasks. Since 2009 YouTube has provided automatic generation of closed captions for videos detected to have English speech; the service now supports ten different languages. This article describes recent improvements to the original system, in particular the use of owner-uploaded video transcripts to generate additional semi-supervised training data and deep neural network acoustic models with large state inventories. Applying an "island of confidence" filtering heuristic to select useful training segments, and increasing the model size by using 44,526 context-dependent states with a low-rank final layer weight matrix approximation, improved performance by about 13% relative compared to previously reported sequence-trained DNN results for this task. Index Terms: large vocabulary speech recognition, deep neural networks, deep learning, audio indexing.
Educational Role Student ♦ Teacher
Age Range above 22 years
Educational Use Research
Education Level UG and PG ♦ Career/Technical Study
Learning Resource Type Article
Publisher Date 2013-01-01
Publisher Institution in Workshop on Automatic Speech Recognition and Understanding (ASRU)