Audio AI is hacking your brain

In previous posts, I have explored the behavioral dynamics of text-centric chatbots. Yet AI is also changing the way people relate to audio. The impact of audio AI on user experience is, in many ways, even more profound than that of textual AI.

Listening to audio is more passive than either reading text or watching videos. It can be less obvious how audio content influences user behavior, because clicks and eye movements aren’t involved.

AI audio generation uses LLMs in combination with text-to-speech and speech-to-speech technologies. It’s having a far-reaching impact on user experiences, influencing attention, emotions, and decision-making.

The rising consumption of personalized audio content

People are binging on digital audio content. Smartphone and computer users have become “pod people”, constantly listening to audio while plugged into headphones, lost in their own world.

Audio can distract people and detach them from their environment. We encounter individuals helmeted with headphones, walking in front of moving vehicles, or unaware that the barista is asking them a question. They’ve become withdrawn, absorbed in some content whose topic or purpose we can only speculate about.

The Washington Post recently published an article on how listening to digital audio is becoming addictive.

Much has been made of the addictive nature of screens and of screen time as a metric of technology overuse. But audio use, too, captures many people’s waking hours, yet it hasn’t been studied nearly as much. – Washington Post

People are listening more to audio and thinking less in the process.

The Post described a situation in which you “listen to so much on-demand audio that you don’t want to stop.”

One interviewee observed that they no longer have thinking time: “That mental processing time — I don’t really have that now, because I have always got a podcast on.”

Audio has become more personalized and inward-looking. Public audio, such as listening to the radio, can be pro-social: it can be heard by others and be a shared experience. Multiple people hear and can discuss the same content. Digital audio, by contrast, is consumed privately. People withdraw from the world by listening to their personal audio stream. The feed is more user-specific and reflective of individual habits. It’s anti-social: saying something to an individual wearing buds interrupts and annoys them.

Another person interviewed said that listening to audio “makes it a little less tolerable, I think, to be around actual people. The on-demand, no-expectations social engagement with these parasocial personalities — I think it’s just a little addicting.”

And the addition of AI to digital audio is accelerating this trend.

Generative speech synthesis: turn content into audio

Text-to-speech (TTS) synthesis has been around for decades, but in recent years it has improved in quality with more life-like voices and prosody. Its implementation has been turbocharged through its combination with LLMs.

Rather than simply reading text aloud, AI adapts written texts for audio by making the content more conversational in format and tone.

The paradigm shift occurred in 2024 when Google introduced a tool called Deep Dive that allowed anyone to convert a web article into podcast-like audio. Now incorporated as the “Audio Overview” in NotebookLM, the tool provides a chatty summary featuring two voices that seem to discuss the contents of an article.

Other vendors have followed Google’s lead. Examples include:

Huxe, built by developers of Google’s NotebookLM, embodied the boldest vision of how AI audio will change user behavior. The app suggested that a lack of time or expertise would no longer be a barrier to consuming content.

Huxe promised to revolutionize how users consumed content:

“We gave millions of people the ability to turn their documents into podcasts— to finally ‘read’ that 200-page report while walking their dog. Teachers, lawyers, students all said the same thing: ‘I’m finally keeping up.’”
“This is radio made for you.”
“Turn any curiosity into a personal podcast. Whether it’s ‘Why is everyone talking about this new social app?’ or ‘Tell me the history of that building I walk past every day,’ you get a personalized, clear, audio explanation.”

Spoiler: Huxe shut down abruptly as I was writing this post, less than a year after launch. The field of audio AI apps has already become crowded.

AI audio’s possibilities are vast. But before getting too excited by its potential, we should understand how AI audio is being adopted in practice. Unfortunately, it is being used indiscriminately. The combination of ease of creation and anticipated demand has prompted a slurry of AI-generated audio.

Bloomberg discusses the growing phenomenon of “podslop”. It describes “the modern era of podcasting in which thousands of new shows are released into the world every day, with a sizable portion likely being AI-generated.”

Interactive audio: Speech-to-speech synthesis

Interactive audio is made possible by speech-to-speech (STS) technology. Instead of simply reading existing text to the user, STS lets the user ask questions aloud and get spoken replies. STS makes audio conversational.

Conversing with early speech-to-speech bots, like Siri or Alexa, felt stilted. The conversation wasn’t open-ended the way a human conversation is. Earlier implementations required combining speech-to-text, dialog management, and text-to-speech, which introduced latency and limited the scope of what could be addressed. Generative AI makes the orchestration between user and bot more seamless.
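
To make the point about latency and scope concrete, here is a minimal Python sketch of a cascaded pipeline. It is a toy illustration, not how Siri or Alexa actually work: the stage functions, the scripted replies, and the latency figures are all invented for the example. It shows only that each stage adds its own delay and that anything outside the script falls through to a canned fallback.

```python
from dataclasses import dataclass


@dataclass
class StageResult:
    output: str      # what the stage hands to the next stage
    latency_ms: int  # hypothetical processing delay for this stage


# Hypothetical scripted intents; anything else is out of scope.
SCRIPTED_REPLIES = {
    "what time is it": "It is nine o'clock.",
    "set a timer": "Timer set.",
}


def speech_to_text(audio: str) -> StageResult:
    # Stub: treat the "audio" as if it were already its own transcript.
    return StageResult(audio.lower().strip(), 300)


def dialog_manager(transcript: str) -> StageResult:
    # Only exact scripted phrases get a real answer.
    reply = SCRIPTED_REPLIES.get(transcript, "Sorry, I can't help with that.")
    return StageResult(reply, 200)


def text_to_speech(text: str) -> StageResult:
    # Stub: wrap the reply text to stand in for synthesized audio.
    return StageResult(f"<audio:{text}>", 400)


def cascaded_reply(audio: str) -> StageResult:
    """Run the three stages in sequence and sum their delays."""
    payload, total_ms = audio, 0
    for stage in (speech_to_text, dialog_manager, text_to_speech):
        result = stage(payload)
        payload = result.output
        total_ms += result.latency_ms
    return StageResult(payload, total_ms)


in_scope = cascaded_reply("Set a timer")
out_of_scope = cascaded_reply("Tell me about that building")
print(in_scope.output, in_scope.latency_ms)
print(out_of_scope.output, out_of_scope.latency_ms)
```

The user waits for the sum of all three stages before hearing anything, and a question the script did not anticipate gets the same apology no matter how it is phrased.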

Recent speech-to-speech applications are changing how people interact with voice-based solutions.

Huxe in particular promised greater interactivity than traditional streams offer:

Jump in anytime — ask, react, or go deeper as you listen.
The audio “listens back” – Interrupt and say “wait, explain that differently” or “give me more technical detail” or “actually, what about this other thing?”

Huxe’s interactivity seemed appealing, putting the user in control. But to be successful, it needed to explain the content domain effectively and let users control how they receive information.

Delivering a truly seamless experience is challenging in practice. Voicebots and conversational design have always been limited by predefined scripts. Even generative AI relies on scripts: bots need to recognize the user’s intent accurately and rely on prototypical conversational patterns to do so. These can simulate a human conversation, but don’t supply the same range of freedom.

A more realistic guide to how STS will develop is how large enterprises implement it to support self-service.

Microsoft says its Azure AI Voice Live API is “ideal for scenarios where voice-driven interactions improve user experience,” such as:

Customer support, product catalog navigation, and self-service solutions
Voice-enabled learning companions and virtual tutors for interactive training
Administrative queries and public service information
Voice-enabled tools for employee support, career development, and training

Microsoft’s vision for the next-gen STS doesn’t seem all that different from the previous generation that lacked AI. Adding AI doesn’t redefine the customer’s relationship with the organization. It’s still about saving the organization money, and avoiding having customers talk to a human.

STS voicebots are likely to employ the same operating framework that enterprises use for other forms of customer experience: the funnel. The goal will be to get as many people as possible to a specific conversion event. Customers may be told they can have a conversation with the bot, but the direction of the bot’s conversation will be predetermined.

AI and attention

AI summaries let people read less. AI audio makes reading optional. And attention spans are short-circuiting.

Because we can listen at times we can’t read, listening is taking up a larger share of our time. Many people are binging on audio. This trend pre-dates AI, but AI is amplifying it manyfold.

AI exists to accelerate the pace of activity, to squeeze more activity into available time. It’s affecting how users pay attention.

Audio AI removes user decisions about content consumption. The stream feed decides what merits your attention. A reinforcing feedback loop tends to narrow choices. Agentic skills may force on users choices they might not otherwise make.

Audio AI lulls users into not paying close attention. Listening typically requires less effort than reading and can result in lower retention. Unlike audio, reading allows for previewing and reviewing text, broadening reflection on the content. Choosing the least-effort mode can result in less cognitive engagement.

But for important topics, audio AI can force users to maintain attention to an exhausting degree. If the content is critical to the user, audio is often the wrong medium. While “sit back and listen” sounds relaxing, it can be tiring over long stretches when it involves important material.

Audio doesn’t pause when the user encounters distractions. There’s no chance to re-read what was missed. AI audio may lack the informational redundancy associated with natural human conversations.

Audio AI crowds out free attention, leaving no time for reflection. The audio paces the listener. The audio’s autoplay makes it harder to pause to think about what’s being said. It’s unlike a classroom discussion where points are elaborated or discussed from various perspectives.

Audio has become a new intrusion. Previously, users might monitor their screen time. Now, plugged into pods and earbuds (and even smart glasses), they get real-time audio notifications.

AI audio is a voice whispering inside your head.

Sonification of emotions

Voicebots now have personalities. Their designers work hard to make sure they don’t sound robotic.

Voice has been a longstanding alternative mode of interaction with computers, one often dreaded by users. It’s tempting to view the incorporation of AI into audio as yet another stage in the evolution of accessibility. Such a view would overlook the emotional dimension of audio.

TTS is no longer a utilitarian technology. Early in my career, I volunteered at the Royal National Institute for the Blind and got my first exposure to how visually impaired people use screen readers. I observed the utilitarian character of the computer-synthesized voice. Users would zip through synthesized utterances at a blazing speed, attentively waiting to hear what they were seeking. To my ears, the utterances were too quick and high-pitched to understand easily. But those who relied on screen readers did not use voice at a conversational pace.

When TTS served as assistive technology in screen readers, its focus was on offering parity between interaction modes (visual scanning and audio scanning). Plain voices read functional text: menus, options, headings, text bodies. Accents existed to foster intelligibility, not likeability.

The screen reader experience has given way to voicebots designed as companions. The design goal now is for computers to have a conversation with you, and in so doing, persuade you.

I elevate your text into impactful speech with deep meaning. “People will forget your words, but they will always remember, how those forgotten words made them feel.” – AIPRM Voicebot

Voicebots are now as focused on conveying a feeling as they are on conveying information.

Few people like to feel they can be emotionally manipulated. Yet synthetic voice offers invisible pathways for brands to manipulate customers. Even if you dismiss the idea that AI voices can change how you feel, there are countless vendors who are betting otherwise.

How do synthetic voices in your head affect you?

The first dimension is relatability. The voice and tone of the content are more visceral in audio than in written text. Users often remark that voicebots have overly enthusiastic voices. These bots aim to foster trust. Synthetic actors can cultivate a false sense of intimacy. The avatar can spoof the personality of the content’s real author (if there is one).

The second dimension is behavior. Designers are just starting to explore how user behavior may be different when interacting with audio rather than a screen. Audio offers new opportunities for brands to nudge users. It becomes easier to prompt users to say or repeat a phrase to prime them toward taking a desired action.

Voicebots are seeking to optimize user trust. How do voicebots earn trust?

Voicebots prioritize enthusiasm over dispassion and objectivity. Voicebots are known for enthusiasm that borders on cultishness. Yet user defenses are lower when listening to bots than in other situations.

In real-life encounters, excessive enthusiasm accompanied by exaggerated body language may make users wary. The same is true with stilted TV commercials that try too hard to compress emotions into 30 seconds.

Voicebots rely only on voice, not visuals or gestures, to convey messages. Voice alone offers fewer telltale signs that a message might be inauthentic. Nor is the message confined to a short burst: voice content can run for a long session and be repeated, so users acclimate to the enthusiasm and accept it as normal.

Voicebots optimize their likability and resonance. Chatbots have become surrogate friends for many users. The more a voice sounds and talks like your alter ego, the more likely you are to trust what it says.

It doesn’t take much imagination to foresee the development of voicebots that are targeted to a user’s personality. How the bot sounds will reflect what a user likes. Yet the danger is that the user will trust the synthetic personality rather than assign trust based on the voicebot’s content.

Voicebots encourage users to surrender control by leading them. Voicebots position themselves as advisors and guides. The Microsoft Voice API, for example, promotes its ability to allow developers to build “learning companions.”

The promise of a coach is appealing. Language learning apps are an early example of how voicebot coaching can be both helpful and overbearing. Voicebots can provide directed feedback to users, correcting their mistakes. They can also become menacing nags, pestering users with passive-aggressive reminders when they fail to keep up with the bot’s schedule.

Unlike language learning, most voicebot content won’t involve drilling questions with right-or-wrong answers. Real-life topics will involve nuances where being told what to do might lead the user to the wrong choice. Audio AI lacks the contextual affordances of text, which allow users to explore background, dig into concepts not understood, compare different perspectives, and see alternative pathways.

Over-listening to the cult of productivity

Voicebots want to become advisors and emotional companions, ready to tell us what we need to hear, when we need it. Unlike text chatbots, voicebots can be used hands-free and without looking at a screen.

Voicebots seem like the answer to coping with busy lives. You can multitask, learning a new skill or doing an online chore while exercising, driving, walking your dog, or watching your kids at the playground.

Without question, many people have busy lives. But voicebots can perversely make people even busier, encouraging them to squeeze yet more into their daily routines.

As the Washington Post article noted, overstimulation, whether from screens or audio, takes a toll on our judgment and mental health. People need mental quiet: “When we’re not involved in some kind of cognitive task, the part of the brain called the default mode network takes over, [which] helps us regulate our emotions, helps us make sense. It’s how we form internal narratives about ourselves.”

Audio AI can supply ready-made narratives, but users need to form their own.

– Michael Andrews