Captioning is making the world a more inclusive place

Matthew Johnston

Published: January 27, 2020

I only hear what I see

Last year, I became the first deaf person to serve on an English jury. When I received my jury summons, I never imagined I’d actually serve on a jury, let alone be selected as foreman. But after meeting with court officials, I convinced them that I could perform my civic duty through the use of subtitles. What they didn’t realise was that others in the courtroom would benefit from the subtitles, too.

Making public spaces more inclusive

I served as a juror over three trials, during which all courtroom dialogue was captured by stenographers (also known as ‘human captioners’). The subtitles appeared on both a laptop and my iPad, which I kept close at hand. Although I was the only deaf person on the jury, I noticed the other jurors, as well as spectators, reading the words on the screen. It didn’t matter if people couldn’t hear the occasional remark; with subtitles, everyone could follow the conversation.

There is something powerful about seeing the words that you hear. It grounds them in your mind, giving them a visual solidity that builds confidence in the communicator’s ability to express and in your ability to understand.

Subtitles are shown only sparingly in public, but they would make many public spaces more inclusive. Imagine if impromptu announcements on the London Tube appeared as text inside each carriage, or if you could follow the cricket commentary via subtitles in a noisy stadium. The same goes for theatres, museum tours, conferences and festivals.

Subtitles aren’t just for the deaf

A survey from Verizon Media and Publicis Media showed that 80% of viewers who watch videos with captions are not deaf or hard of hearing. In addition, it’s been found that 80% more people watch a video to completion when captions are included.

Captions are also useful in online learning. Video captions have been shown to improve literacy and reading comprehension, as well as enhance the viewing experience for those watching a programme in a non-native language. There are almost 10 million people living in the UK, for instance, whose native language isn’t English.

Human-captioning tech

In my previous post, I mentioned that I have been profoundly deaf since birth and use real-time subtitles for my meetings and conference calls. As you can imagine, subtitles need to be very accurate so that I can follow conversations and participate in timely decision-making and planning.

Human captioners have provided me with this service for many years. Using a special stenography (or ‘Palantype’) machine, stenographers can write up to 300 words per minute. The keyboard is phonetic rather than QWERTY, with words broken up into strokes and short forms. Each stroke or combination of strokes is translated against a dictionary that each stenographer builds for their own use. The text is translated instantly and appears live on a laptop or a big screen as subtitles. I use MyClearText, but other providers, including Ai-Media and White Coat Captioning, offer a similar service.
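For readers curious about how a stroke dictionary might work, here is a minimal sketch in Python. It is an illustration under stated assumptions, not how any real stenography software is implemented: the stroke spellings, the `PERSONAL_DICTIONARY` contents and the `translate` function are all hypothetical. It shows one plausible approach, where the longest matching combination of strokes is looked up first.

```python
# Hypothetical personal dictionary: each key is a tuple of strokes, each value
# is the text it produces. The stroke spellings are invented for illustration
# and are not real Palantype outlines.
PERSONAL_DICTIONARY = {
    ("KAP",): "cap",
    ("KAP", "SHUN"): "caption",
    ("JUR",): "jury",
    ("TH",): "the",
}


def translate(strokes, dictionary):
    """Turn a sequence of strokes into text using greedy longest-match lookup."""
    longest = max(len(key) for key in dictionary)
    words = []
    i = 0
    while i < len(strokes):
        # Try the longest combination of strokes first, then shorter ones.
        for size in range(min(longest, len(strokes) - i), 0, -1):
            chunk = tuple(strokes[i:i + size])
            if chunk in dictionary:
                words.append(dictionary[chunk])
                i += size
                break
        else:
            # No entry found: show the raw stroke in brackets and move on.
            words.append(f"[{strokes[i]}]")
            i += 1
    return " ".join(words)
```

With this toy dictionary, `translate(["TH", "KAP", "SHUN"], PERSONAL_DICTIONARY)` returns `"the caption"` rather than `"the cap [SHUN]"`, because the two-stroke entry is tried before the one-stroke entry.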

Stenographers have ‘human ears’, which help to determine the main speaker when several people are talking at once so that the correct conversations appear on the screen as subtitles. Human ears are also able to pick up quiet conversations which auto-ears may struggle with. Stenographers are also familiar with the nuances of accents so that subtitles make sense grammatically, enabling a pleasant reading experience.

The technology used by stenographers has been around for many years but good stenographers are a rare commodity.

Auto-captioning tech

Stenography isn’t the only option: increasingly, other tech-based solutions are coming to the fore. Speech-to-text technology uses voice recognition to provide auto-captioning. It has been around for some time, and its accuracy has improved along the way. In the early days, you could expect lots of grammatical and spelling mistakes, as well as missing words and sentences.

Tech companies including Google and Microsoft have invested heavily in this technology, developing their own Speech APIs. Using these Speech APIs, speech-to-text apps such as Google’s Live Transcribe and Meet and Microsoft’s Office 365 and Teams are available on desktops and mobile devices. I use these tools to follow subtitles in conference calls and presentations.

There are other similar applications available, including Nuance’s Dragon Speech Recognition, Otter.ai, Amazon’s Transcribe and Speechmatics.

Having auto-captioning in real time means I don’t have to worry about booking the stenographer. I can also spontaneously attend unscheduled meetings, giving me much more independence.

But my success with auto-captioning has been mixed. It struggles to capture words spoken with a strong accent or by someone with a speech impairment. Speech impairments can be associated with conditions including stroke, dysphagia, cerebral palsy, autism and stammering. (There are tech companies focusing on speech impairment, including Voiceitt and Google’s Euphonia.)

Auto-captioning only works well when one person at a time speaks clearly into a high-quality microphone (I use a Jabra Speak 710). It also struggles when several speakers talk one after another, since the speech-to-text often pauses for several seconds when a new speaker starts. During the pause, I miss out and am unable to participate proactively. I must then play catch-up on the subsequent conversation, leaving me further behind.

I can control this if I facilitate the meeting and dictate the pace of the conversation (it’s not always easy!), asking the current speaker to wait while the speech-to-text is ‘pausing’.

Bringing human- and auto-captioning tech together

This time last year, I couldn’t rely on auto-captioning for my meetings. That changed in the second half of 2019, and I now use it for internal meetings with colleagues, who understand how auto-captioning behaves and typically don’t mind when I dictate the speed of our conversations. When I’m attending public events, working with others, or in situations where low latency is needed, however, I continue to use human captioning for consistency.

In some scenarios, it helps to bring human- and auto-captioning together. When I served on a jury, public dialogue in the courtroom was captured via stenography, which auto-captioning would struggle to do. However, auto-captioning could be used during deliberations where stenographers aren’t allowed. Using a high-quality microphone to capture the comments of the fellow jurors around the table would work well. More importantly, conversations don’t have to be stored, which would meet legal obligations.

Embedded auto-captioning

Later last year, speech-to-text made a significant advance. Google’s Pixel 4 mobile phone came out with a special feature called ‘Live Caption’ (which has now rolled out to other Pixel models). Captions come on when there is audio from streamed content including videos, podcasts, live streaming and radio. Because the feature is embedded in the phone, the sound does not need to be turned on, enabling the content to be read in silence. In addition, the embedded nature of the feature means that caption quality isn’t disrupted by background noise. I bought a Pixel 4 and can now follow almost everything that is being said online. For the first time in my life.

Auto-captioning can be truly empowering — but the technology isn't yet perfect

In the hopefully not-too-distant future, the Live Caption feature will support phone calls. This will have a significant impact on the deaf community, giving us greater autonomy. I’ve never been able to have spontaneous phone conversations and this is about to change. I really can’t wait. The only thing more exciting would be the release of smart glasses that auto-caption everything people say… that would be good to see.

Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.
