Not listening to talkback radio: developing a speech analytics pipeline
Two main ways we get information in the modern world are written and spoken language. Written language spans blogs, emails, newspapers and Twitter, whereas spoken language includes podcasts, recorded lectures, and talkback radio.
A major drawback of audio as a medium is that it's difficult to extract information without listening to the entire piece, and that's assuming you can listen to the entire piece at all. Even if an hour-long recorded lecture or podcast is accessible to you, and even if you listen at 1.5x speed or more (you monster), that's still a significant time investment.
If you consume audio for leisure, that investment is fine. But when extracting information from audio is part of work or study, the time cost becomes frustrating and potentially unnecessary. Consider also how much easier it is to read back over a section than to rewind through audio and listen again.
We wanted to see if we could extract information and themes from speech in real time. We used livestreams of two Melbourne-based talkback radio stations, 3LO and 3AW, to test this.
How does it work?
While the meat of each step can be provided by existing systems, assembling them into a useful whole requires more thought. A significant point of focus was defining the interfaces that link each step of the pipeline. The speech-to-text step doesn't care where the audio came from or what happens to the text it outputs. Likewise, the text analysis needn't know how the text was transcribed, or how the sentiment is stored or visualised. This lets us update and compare different systems easily.
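The decoupling described above can be sketched with structural interfaces: each stage only sees the data shape it needs, never its neighbours. This is a minimal illustration, not the actual system; all class and function names here are hypothetical.

```python
# Sketch of decoupled pipeline interfaces. Names are illustrative only.
from dataclasses import dataclass
from typing import Protocol


@dataclass
class AudioBlock:
    """A discrete chunk of audio, tagged with its source and start time."""
    station: str
    start_seconds: float
    samples: bytes


class Transcriber(Protocol):
    """Any speech-to-text system: audio block in, plain text out."""
    def transcribe(self, block: AudioBlock) -> str: ...


class Analyser(Protocol):
    """Any text-analysis system: plain text in, findings out."""
    def analyse(self, text: str) -> dict: ...


def run_stage(block: AudioBlock, transcriber: Transcriber, analyser: Analyser) -> dict:
    # The transcriber never learns what happens to its text, and the
    # analyser never learns how the text was produced, so either one
    # can be swapped for a different open-source or enterprise tool.
    text = transcriber.transcribe(block)
    return analyser.analyse(text)
```

Because the interfaces are structural, swapping in a different transcriber or analyser requires no changes to the rest of the pipeline.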
Ensuring that the system can provide real-time insights was another major design element. With all the systems involved, there is no guarantee that a one-minute piece of audio will take less than a minute to process; if it doesn't, the incoming audio queues up forever. Breaking the audio into discrete blocks at the start of the pipeline means we can scale the underlying infrastructure and keep processing the audio in real time, regardless of how long any individual block takes.
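The chunking step itself is simple. This sketch assumes 16-bit mono PCM at a fixed sample rate; the block length and audio format are assumptions for illustration, not details of the real system.

```python
# Sketch of slicing a continuous audio buffer into fixed-length blocks.
# Assumes 16-bit mono PCM at 16 kHz; these parameters are illustrative.

SAMPLE_RATE = 16_000      # samples per second (assumed)
BYTES_PER_SAMPLE = 2      # 16-bit audio


def split_into_blocks(stream: bytes, block_seconds: int = 30) -> list[bytes]:
    """Slice an audio buffer into fixed-length blocks.

    Each block can be handed to an independent worker, so one slow
    transcription never stalls the blocks queued behind it.
    """
    block_bytes = block_seconds * SAMPLE_RATE * BYTES_PER_SAMPLE
    return [stream[i:i + block_bytes] for i in range(0, len(stream), block_bytes)]
```

With blocks as the unit of work, keeping up with the stream becomes a matter of adding workers rather than making any single component faster.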
Why is this useful?
Making a machine listen to talkback radio for you is either incredibly useful or planting the seeds of the robot uprising, depending on who you talk to. But the value of building this wasn’t to determine scientifically that these stations both mention “Australia” a lot. Rather, the value of this pipeline exists in the interfaces. Instead of having to perform each step separately, we’ve created a straightforward “plug and play” pipeline that can take in continuous audio and provide a real-time summary and findings.
There are a range of applications for this data flow, including:
- Tracking of subject mentions across radio & television — whether this is your brand, organisation, or area of interest. It’s also straightforward to add in alerts for when words of interest are mentioned.
- Decreasing the workload of anything that requires real-time monitoring, even when the format isn’t digital
- Automated summaries for content, especially live shows
- A live monitoring solution that inherently supports advanced analytics and data visualisation
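The alerting use case above can be as simple as matching a watchlist against each transcribed block. This is a rough sketch under that assumption; the word list, tokenisation, and function name are all illustrative.

```python
# Sketch of watchword alerting over transcribed text. Illustrative only.
import re


def find_mentions(transcript: str, watchwords: set[str]) -> list[str]:
    """Return the watchwords that appear in a transcript block.

    Matching is case-insensitive and whole-word, so "Australia" in the
    transcript matches the watchword "australia" but not "austral".
    """
    tokens = {t.lower() for t in re.findall(r"[a-z']+", transcript, re.IGNORECASE)}
    return sorted(w for w in watchwords if w.lower() in tokens)
```

A real deployment would likely want stemming or fuzzy matching to catch transcription errors, but even exact matching over clean transcripts covers the brand-tracking case.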
Right now, we’re working on adding a way to compare the performance of different components of the pipeline in real time. If we could take an hour of live-streamed audio and compare theme and sentiment analysis from a selection of open-source and enterprise tools, we could then better understand which toolset is the best value for a specific use case.
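Because the pipeline's stages share an interface, the comparison described above reduces to feeding the same block to several interchangeable transcribers and collecting their outputs side by side. A hypothetical sketch:

```python
# Sketch of side-by-side comparison of interchangeable transcribers.
# The transcriber objects and their names are hypothetical.

def compare_transcribers(block, transcribers: dict) -> dict:
    """Run one audio block through every named transcriber.

    Returns a mapping of transcriber name to transcript, ready for
    scoring against a reference or against each other.
    """
    return {name: t.transcribe(block) for name, t in transcribers.items()}
```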
Beyond that, developing a way to determine when words of interest are mentioned at the same time and with similar sentiment across different radio stations might provide a way to start identifying breaking news events — something that could be particularly valuable for local or special interest news items.
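One plausible shape for that cross-station signal is flagging a word when it is mentioned on two or more different stations within a short window. The window length and data shapes below are assumptions, not the real design, and sentiment is omitted for brevity.

```python
# Rough sketch of cross-station co-mention detection. Assumes a stream of
# (station, word, time_seconds) events; window and shapes are illustrative.
from collections import defaultdict


def co_mentions(events: list[tuple[str, str, float]], window: float = 300.0) -> set[str]:
    """Return words mentioned on two or more stations within `window` seconds."""
    by_word = defaultdict(list)
    for station, word, t in events:
        by_word[word].append((t, station))

    flagged = set()
    for word, hits in by_word.items():
        hits.sort()  # order by time so we can stop scanning past the window
        for i in range(len(hits)):
            for j in range(i + 1, len(hits)):
                if hits[j][0] - hits[i][0] > window:
                    break
                if hits[j][1] != hits[i][1]:
                    flagged.add(word)
    return flagged
```

Adding the sentiment condition from the text would mean attaching a score to each event and requiring the paired mentions to fall within some similarity threshold.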
We’re also interested in seeing how this could improve accessibility, so comparing against the performance of existing live captioning systems would also be meaningful.
Other developments really depend on user needs: does your stream contain voice all of the time or only some of the time? What specific trigger words should generate alerts? What should the user interface support and enable?
This pipeline has the potential to save hours of time. The most meaningful next step will be putting it to work in a scenario where that benefit can be realised. If you think that’s you, please get in touch!