Before we answer the burning questions (why is caption accuracy so important, and what is WER?), we should first understand what closed captions are, or captions, as people usually call them.
Closed captions are the text version of the spoken part of a television program, movie, or computer presentation, and they can be turned on or off. Closed captions also enhance the viewing experience by conveying non-speech elements, and they are represented by the [CC] symbol. You can read more information about closed captions here.
Accurate captions are always desirable, for reasons that range from accessibility to compliance with federal law. The industry benchmark for closed captions is 99% accuracy. Let us take you through some of the reasons why it is essential to provide accurate captions on all videos –
The National Institute on Deafness and Other Communication Disorders (NIDCD) states that approximately 15% of American adults (37.5 million) aged 18 and over report some trouble hearing, and about 28.8 million U.S. adults could benefit from using hearing aids. With a growing population of deaf and hard-of-hearing individuals, adding closed captions to videos makes them accessible. Missing or inaccurate captions make the content inaccessible, resulting in a significant loss of viewers who are unable to follow the video.
Legal Requirement –
Title II of the 21st Century Communications and Video Accessibility Act (CVAA) requires that any program broadcast on TV with closed captions also be closed captioned when it is distributed on the internet. This, however, does not apply to programs shown only on the internet.
The FCC lays down the rules to maintain the quality of the captions generated. As per FCC rules, captions must be –
- Accurate: Captions must match the spoken words in the dialogue and convey background noises and other sounds to the fullest extent possible.
- Synchronous: Captions must coincide with their corresponding spoken words and sounds to the greatest extent possible and must be displayed on the screen at a speed that viewers can read.
- Complete: Captions must run from the beginning to the end of the program to the fullest extent possible.
- Correctly placed: Captions should not block other important visual content on the screen, overlap one another, or run off the edge of the video screen.
Read more about the captioning laws and guidelines here.
Boost SEO –
Do you know that captions can improve the search engine rankings of your video? In the same way that search engines scan a webpage for keywords and phrases to match what the user is looking for, they examine video captions. Hence, a video with closed captions tends to rank better than one without them. Without closed captions or video transcriptions, search engines can rely only on video descriptions and metadata. While that is good, it is no match for a video with closed captions.
Improved User Experience –
Imagine watching a video while commuting on the subway or in an Uber; you’d want to stay aware of the surrounding sounds while watching. That would be impossible without closed captions. Closed captions allow people to watch videos in sound-sensitive environments. This results in direct gains for broadcasters: closed captions increase the average watch time and keep users engaged with the content, because captions provide context to the viewer. Captions are also reported to be one of the more prominent factors when users decide to buy.
WER to measure captioning accuracy –
Word Error Rate (WER) is a measure of captioning accuracy. WER is used when Automatic Speech Recognition (ASR) technology generates captions. ASR technology uses software to identify and process spoken language to create captions. ASR is not limited to captions; it is a large part of our daily life. Through assistants such as Alexa and Siri, ASR technology has made many daily tasks easier, from checking on the weather to making calls to switching on our favorite show.
However, when it comes to closed captioning, using just Automatic Speech Recognition technology won’t suffice. Closed captions generated using only ASR technology tend to be low on accuracy because words often get left out or wrongly transcribed, and WER quantifies these inaccuracies.
In layman’s terms, Word Error Rate is the number of errors divided by the total number of words that were actually spoken. There are three categories of mistakes –
Substitution (S) – This error occurs when a word gets replaced. For example, “pull” is transcribed as “bull.”
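Insertion (I) – This error occurs when the transcription contains a word that was not spoken. For example, “IELTS” becomes “eye else,” where one spoken word turns into two transcribed words (typically scored as one substitution plus one insertion).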
Deletion (D) – This error occurs when a spoken word is left out of the transcription. For example, “Are we doing this?” becomes “We doing this?”
Thus, Word Error Rate would be calculated as –
WER = (S + I + D) / N (N = number of words in the reference transcript, that is, the words actually spoken)
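To make the formula concrete, here is a minimal Python sketch that aligns a transcription against a reference using word-level edit distance and counts substitutions, insertions, and deletions. The function names (`wer_counts`, `wer`) are hypothetical, and the normalization (lowercasing and splitting on whitespace, with no punctuation handling) is an assumption made for illustration; production scoring tools may normalize text differently and may break ties between equally good alignments in another way.

```python
from typing import List, Tuple


def wer_counts(reference: str, hypothesis: str) -> Tuple[int, int, int, int]:
    """Return (S, I, D, N) for a minimum-error word alignment.

    Assumption: words are compared after lowercasing and whitespace
    splitting only; punctuation is not stripped.
    """
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()

    # cost[i][j] = (total errors, S, I, D) for ref[:i] versus hyp[:j]
    cost = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, 0, i)  # hypothesis empty: every word deleted
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, 0, j, 0)  # reference empty: every word inserted

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                cost[i][j] = cost[i - 1][j - 1]  # words match, no new error
                continue
            t, s, ins, d = cost[i - 1][j - 1]
            substitution = (t + 1, s + 1, ins, d)
            t, s, ins, d = cost[i][j - 1]
            insertion = (t + 1, s, ins + 1, d)
            t, s, ins, d = cost[i - 1][j]
            deletion = (t + 1, s, ins, d + 1)
            # Tuples compare by total error count first
            cost[i][j] = min(substitution, insertion, deletion)

    _, s, ins, d = cost[len(ref)][len(hyp)]
    return s, ins, d, len(ref)


def wer(reference: str, hypothesis: str) -> float:
    """WER = (S + I + D) / N, with N the number of reference words."""
    s, ins, d, n = wer_counts(reference, hypothesis)
    if n == 0:
        raise ValueError("reference transcript must contain at least one word")
    return (s + ins + d) / n


print(wer_counts("take the ielts exam", "take the eye else exam"))  # (1, 1, 0, 4)
print(wer("take the ielts exam", "take the eye else exam"))         # 0.5
```

In the usage lines, the four-word reference picks up one substitution and one insertion, so the WER is 2 / 4 = 0.5. Note that N counts the reference words, not the transcribed ones, which is why WER can exceed 100% when a transcription inserts many extra words.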
Common Causes of a High Word Error Rate –
- Multiple speakers
- Overlapping speech
- Background noise
- Poor audio quality
- False starts
- Acoustic errors
Word Error Rate (WER) is the most common metric for the accuracy of Automatic Speech Recognition (ASR) technology, which is used to power Alexa and Siri. Creators and developers of ASR technology depend on WER when measuring the improvement of their software. Users of ASR consider WER when choosing a product that fits their business.
While WER is one of the most important tools for measuring captioning accuracy, one should not completely depend on it. Many factors influence the WER score, and some of them are –
- Recording quality
- Microphone quality
- Pronunciation of words
- Unique names of people and places or other proper nouns
- Technical terminology
Digital Nirvana combines state-of-the-art, machine-learning-based speech-to-text with experienced captioners to help deliver accurate and compliant captions to our customers. To know more about generating closed captions with industry-standard accuracy, write to us at firstname.lastname@example.org.