Audio AI is hacking your brain

In previous posts, I have explored the behavioral dynamics of text-centric chatbots. Yet AI is also changing the way people relate to audio. The impact of audio AI on user experience is, in many ways, even more profound than textual AI.

Listening to audio is more passive than either reading text or watching videos. It can be less obvious how audio content influences user behavior, because clicks and eye movements aren’t involved. 

AI audio generation uses LLMs in combination with text-to-speech and speech-to-speech technologies. It’s having a far-reaching impact on user experiences, influencing attention, emotions, and decision-making. 

The rising consumption of personalized audio content

People are binging on digital audio content.  Smartphone and computer users have become “pod people”, constantly listening to audio while plugged into headphones, lost in their own world. 

Audio can distract people and detach them from their environment. We encounter individuals helmeted with headphones, walking in front of moving vehicles, or unaware that the barista is asking them a question. They’ve become withdrawn, absorbed in some content whose topic or purpose we can only speculate about.

The Washington Post recently published an article on how listening to digital audio is becoming addictive.

Screen cap: Washington Post

Much has been made of the addictive nature of screens and of screen time as a metric of technology overuse. But audio use, too, captures many people’s waking hours, yet it hasn’t been studied nearly as much. – Washington Post

People are listening more to audio and thinking less in the process. 

The Post considered what happens “if you listen to so much on-demand audio that you don’t want to stop.”

One interviewee observed that they no longer have thinking time: “That mental processing time — I don’t really have that now, because I have always got a podcast on.”

Audio has become more personalized and inward-looking. Public audio, such as listening to the radio, can be pro-social: it can be heard by others and be a shared experience. Multiple people hear and can discuss the same content.  Digital audio, by contrast, is consumed privately. People withdraw from the world by listening to their personal audio stream. The feed is more user-specific and reflective of individual habits.  It’s anti-social: saying something to an individual wearing buds interrupts them and annoys them.  

Another person interviewed said that listening to audio “makes it a little less tolerable, I think, to be around actual people. The on-demand, no-expectations social engagement with these parasocial personalities — I think it’s just a little addicting.”

And the addition of AI to digital audio is accelerating this trend.

Generative speech synthesis: turn content into audio

Text-to-speech (TTS) synthesis has been around for decades, but in recent years its quality has improved, with more lifelike voices and prosody. Its adoption has been turbocharged by combining it with LLMs.

Rather than simply reading text aloud, AI adapts written texts for audio by making the content more conversational in format and tone.

The paradigm shift occurred in 2024 when Google introduced a tool called Deep Dive that allowed anyone to convert a web article into podcast-like audio. Now incorporated as the “Audio Overview” in NotebookLM, the tool provides a chatty summary in which two voices seem to discuss the contents of an article.

Other vendors have followed Google’s lead.  Examples include:

Huxe, built by the developers of NotebookLM, embodied the boldest vision of how AI audio will change user behavior. The app suggested that a lack of time or expertise would no longer be a barrier to consuming content.

Screen cap: Huxe

Huxe promised to revolutionize how users consumed content:

  • “We gave millions of people the ability to turn their documents into podcasts— to finally ‘read’ that 200-page report while walking their dog. Teachers, lawyers, students all said the same thing: ‘I’m finally keeping up.’”
  • “This is radio made for you.”
  • “Turn any curiosity into a personal podcast. Whether it’s ‘Why is everyone talking about this new social app?’ or ‘Tell me the history of that building I walk past every day,’ you get a personalized, clear, audio explanation.”

Spoiler: Huxe shut down abruptly as I was writing this post, less than a year after launch. The field of audio AI apps had already become crowded.

AI audio’s possibilities are vast. But before getting too excited by its potential, we should understand how it’s being adopted in practice. Unfortunately, audio AI is being used indiscriminately. The combination of ease of creation and anticipated demand has prompted a slurry of AI-generated audio.

Bloomberg discusses the growing phenomenon of “podslop”.  It describes “the modern era of podcasting in which thousands of new shows are released into the world every day, with a sizable portion likely being AI-generated.”

Screen cap: Bloomberg

Interactive audio: Speech-to-speech synthesis

Interactive audio is made possible by speech-to-speech (STS) technology. Instead of simply reading existing text to the user, STS allows the user to speak to ask questions and get audio replies. STS makes audio conversational. 

Conversing with early speech-to-speech bots, like Siri or Alexa, felt stilted.  The conversation wasn’t open-ended, as with a human conversation.  Earlier implementations required combining speech-to-text, dialog management, and text-to-speech, which introduced latency and limited the scope of what could be addressed. Generative AI makes the orchestration between user and bot more seamless. 
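To make the latency point concrete, here is a minimal sketch of a cascaded voice pipeline. The stage names follow the description above, but the per-stage delays are hypothetical numbers chosen only for illustration, not measurements of any real product.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    latency_ms: int  # hypothetical figure, for illustration only


# Assumed delays; real systems vary widely.
CASCADED_PIPELINE = [
    Stage("speech-to-text", 300),
    Stage("dialog management", 200),
    Stage("text-to-speech", 250),
]


def total_latency_ms(stages):
    """Each stage waits for the previous one to finish, so delays add up."""
    return sum(stage.latency_ms for stage in stages)


print(total_latency_ms(CASCADED_PIPELINE))  # 750
```

Because the stages run one after another, the user hears nothing until the last stage completes, which is one reason older voice assistants could feel stilted.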

Recent speech-to-speech applications are changing how people interact with voice-based solutions.

Huxe in particular promised greater interactivity than offered by traditional streams:

  • Jump in anytime — ask, react, or go deeper as you listen.
  • The audio “listens back” – Interrupt and say “wait, explain that differently” or “give me more technical detail” or “actually, what about this other thing?”

Huxe’s interactivity seemed appealing, putting the user in control.  But to be successful, it needed to effectively explain the content domain and allow the user to control how they want information.  

Delivering a truly seamless experience is challenging in practice. Voicebots and conversational design have always been limited by predefined scripts.  Even generative AI relies on scripts: bots need to recognize the user’s intent accurately and rely on prototypical conversational patterns to do so. These can simulate a human conversation, but don’t supply the same range of freedom. 

How large enterprises implement STS to support self-service offers a more realistic guide to how the technology will develop.

Microsoft says its Azure AI Voice Live API is “ideal for scenarios where voice-driven interactions improve user experience,” such as:

  • Customer support, product catalog navigation, and self-service solutions
  • Voice-enabled learning companions and virtual tutors for interactive training
  • Administrative queries and public service information
  • Voice-enabled tools for employee support, career development, and training

Microsoft’s vision for the next-gen STS doesn’t seem all that different from the previous generation that lacked AI.  Adding AI doesn’t redefine the customer’s relationship with the organization.  It’s still about saving the organization money, and avoiding having customers talk to a human. 

STS voicebots are likely to employ the same operating framework that enterprises use for other forms of customer experience: the funnel. The goal will be to get as many people as possible to a specific conversion event. Customers may be told they can have a conversation with the bot, but the direction of the bot’s conversation will be predetermined. 

AI and attention

AI summaries let people read less.  AI audio makes reading optional. And attention spans are short-circuiting. 

Because we can listen at times we can’t read, listening is taking up a larger share of our time.  Many people are binging on audio. This trend pre-dates AI, but AI is amplifying it manyfold.

AI exists to accelerate the pace of activity, to squeeze more activity into available time. It’s affecting how users pay attention.  

Audio AI removes user decisions about content consumption. The stream feed decides what merits your attention. A reinforcing feedback loop tends to narrow choices.  Agentic skills may force choices on users they might not otherwise make.
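The narrowing effect of a reinforcing feedback loop can be sketched in a few lines. This is a toy model under stated assumptions: the topic names, starting weights, and the rule "play the top topic, then boost it" are all hypothetical and far simpler than any real recommendation system.

```python
def simulate_feed(weights, rounds, boost=1.0):
    """Play the top-weighted topic each round, then boost that topic's weight."""
    weights = dict(weights)  # copy so the caller's starting weights are untouched
    plays = []
    for _ in range(rounds):
        topic = max(weights, key=weights.get)
        plays.append(topic)
        weights[topic] += boost  # listening is treated as a signal of interest
    return plays, weights


# Hypothetical starting interests: nearly equal, with a slight lean to one topic.
start = {"news": 1.0, "history": 1.0, "true crime": 1.1}
plays, final = simulate_feed(start, rounds=10)

share_before = start["true crime"] / sum(start.values())
share_after = final["true crime"] / sum(final.values())
print(set(plays), round(share_before, 2), round(share_after, 2))
```

In this sketch a small initial lean is enough for one topic to fill every slot, because each play makes the same topic more likely to be chosen next.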

Audio AI lulls users into not paying close attention. Listening typically requires less effort than reading and can result in lower retention. Unlike audio, reading allows for previewing and reviewing text, broadening reflection on the content. Choosing the least-effort mode can result in less cognitive engagement.

But for important topics, audio AI can force users to maintain attention to an exhausting degree. If the content is critical to the user, audio is often the wrong medium. While “sit back and listen” sounds relaxing, it can be tiring over long stretches when it involves important material. Audio doesn’t pause when the user encounters distractions, and there is no chance to re-read what was missed. AI audio may also lack the informational redundancy associated with natural human conversations.

Audio AI crowds out free attention, leaving no time for reflection. The audio paces the listener. The audio’s autoplay makes it harder to pause to think about what’s being said. It’s unlike a classroom discussion where points are elaborated or discussed from various perspectives.

Audio has become a new intrusion.  Previously, users might monitor their screen time.  Now, plugged into pods and earbuds (and even smart glasses), they get real-time audio notifications.

AI audio is a voice whispering inside your head.

Sonification of emotions 

Voicebots now have personalities. Their designers work hard to make sure they don’t sound robotic.

Screen cap: Speechify

Voice has been a longstanding alternative mode of interaction with computers, one often dreaded by users. It’s tempting to view the incorporation of AI into audio as yet another stage in the evolution of accessibility. Such a view would overlook the emotional dimension of audio.

TTS is no longer a utilitarian technology. Early in my career, I volunteered with the Royal National Institute for the Blind and got my first exposure to how visually impaired people use screen readers.  I observed the utilitarian character of the computer-synthesized voice.  Users would zip through synthesized utterances at a blazing speed, attentively waiting to hear what they were seeking.  To my ears, the utterances were too quick and high-pitched to understand easily.  But those who relied on screen readers did not use voice at a conversational pace.

When the goal of TTS screen readers was assistive technology, it focused on offering parity between interaction modes (visual scanning and audio scanning). Plain voices read out functional text: menus, options, headings, text bodies. Accents existed to foster intelligibility, not likeability.

The screen reader experience has morphed into making voicebots companions.  The design goal now is for computers to have a conversation with you, and in so doing, persuade you.

I elevate your text into impactful speech with deep meaning. “People will forget your words, but they will always remember, how those forgotten words made them feel.” – AIPRM Voicebot

Voicebots are now as focused on conveying a feeling as they are on conveying information.

Few people like to feel they can be emotionally manipulated.  Yet synthetic voice offers invisible pathways for brands to manipulate customers.  Even if you dismiss the idea that AI voices can change how you feel, there are countless vendors who are betting otherwise.

How do synthetic voices in your head affect you?

The first dimension is relatability. The voice and tone of the content are more visceral in audio than in written text. Users often remark that voicebots have overly enthusiastic voices. These bots aim to foster trust. Synthetic actors can cultivate a false sense of intimacy. The avatar can spoof the personality of the content’s real author (if there is one).

The second dimension is behavior. Designers are just starting to explore how user behavior may be different when interacting with audio rather than a screen. Audio offers new opportunities for brands to nudge users. It becomes easier to prompt users to say or repeat a phrase to prime them toward taking a desired action. 

Voicebots are seeking to optimize user trust. How do voicebots earn trust?

Voicebots prioritize enthusiasm over dispassion and objectivity. Voicebots are known for enthusiasm that borders on cultishness. But compared with other user situations, user defenses are lower when listening to bots. 

In real-life encounters, excessive enthusiasm accompanied by exaggerated body language may make users wary. The same is true with stilted TV commercials that try too hard to compress emotions into 30 seconds.  

Voicebots rely only on voice, and not visuals or gestures, to convey messages. Voice alone offers fewer telltale signs that a message might be inauthentic. And the message isn’t necessarily delivered in a short burst. Voice content can run for a long session and be repeated, so users acclimate to the enthusiasm and accept it as normal.

Voicebots optimize their likability and resonance. Chatbots have become surrogate friends for many users. The more a voice sounds and talks like your alter ego, the more likely you are to trust what it says.

It doesn’t take much imagination to foresee the development of voicebots that are tailored to a user’s personality. How the bot sounds will reflect what a user likes. Yet the danger is that the user will trust the synthetic personality rather than assign trust based on the voicebot’s content.

Voicebots encourage users to surrender control by leading them. Voicebots position themselves as advisors and guides. The Microsoft Voice API, for example, promotes its ability to allow developers to build “learning companions.”

The promise of a coach is appealing. Language learning apps are an early example of how voicebot coaching can be both helpful and overbearing. Voicebots can provide directed feedback to users, correcting their mistakes.  They can also become menacing nags, pestering users with passive-aggressive reminders when they fail to keep up with the bot’s schedule. 

Unlike language learning, most voicebot content won’t involve drilling questions with right-or-wrong answers.  Real-life topics will entail nuances in which being told what to do might lead the user to the wrong choice. Audio AI lacks the contextual affordances of text, which allows users to explore background, dig into concepts not understood, compare different perspectives, and see alternative pathways. 

Overlistening to the cult of productivity

Voicebots want to become advisors and emotional companions, ready to tell us what we need to hear, when we need it. Unlike text chatbots, voicebots can be used hands-free and without looking at a screen.  

Voicebots seem like the answer to coping with busy lives. You can multitask, learning a new skill or doing an online chore while exercising, driving, walking your dog, or watching your kids at the playground.

Without question, many people have busy lives. But voicebots can perversely make people even busier, encouraging them to squeeze yet more into their daily routines. 

As the Washington Post article noted, overstimulation, whether from screens or audio, takes a toll on our judgment and mental health. People need mental quiet: “When we’re not involved in some kind of cognitive task, the part of the brain called the default mode network takes over, [which] helps us regulate our emotions, helps us make sense. It’s how we form internal narratives about ourselves.”  

Audio AI can supply ready-made narratives, but users need to form their own. 

– Michael Andrews

Knowledge isn’t communication

Our faith in the power of words and data to convey knowledge can lead us to believe that AI promotes understanding when it doesn’t.

Knowledge has always depended on words and data. But it’s easy to assume that words or data themselves encapsulate and transmit knowledge. That’s especially true as generative AI encourages users to approach knowledge evaluation and acquisition through the medium of natural language sentences.

Expertise, knowledge, and understanding are specific to the individual. Users want advice from credible experts. Yet such expertise can be a barrier to clear communication and interfere with user understanding.

Writers have long faced the challenge of bridging what they know with how they communicate that knowledge. When done poorly, it results in a chasm between the writer’s and the reader’s understanding of a topic. Experts are often lousy communicators.

Readers judge a source’s expertise by its mastery of factual details. If a person or system can answer detailed factual questions, readers assume it has expert-level knowledge and is therefore credible.

But knowledge isn’t a universal standard. The notion of a knowledge base can be deceptive because it implies a scope that is relevant to everyone. Knowledge is simply externalized understanding. We can talk about collective knowledge — the sum of available information contributed by various experts. But the understanding attained from this body of knowledge will depend on the individual.

Clear writing does not necessarily create consumable knowledge. For knowledge to be usable, it must be readable. But readability does not guarantee the information is useful to an individual.

Matthew Crawford, the author of Shop Class as Soulcraft, draws a distinction between thinking-as-doing and thinking-as-writing. What’s written too often is divorced from what the user needs to do. He argues:

The writers of modern technical manuals are neither mechanics nor engineers but rather technical writers. This is a profession that is institutionalized on the assumption that it has its own principles that can be mastered without the writer being immersed in any particular problem; it is universal rather than situated. Technical writers know that, but they don’t know how.

The problem, as Crawford sees it, is that the writer does not necessarily have a firsthand understanding of the topic they write about. The writer has thought about the topic abstractly, but without direct experience.

When confronted with explanations that don’t make sense, the user must:

render the gibberish coherent, and he can only do this by referring it to a model he had in his own mind of how the thing works.

The key to promoting understanding is the mental model that frames the discussion. It must be accurate and intelligible to the user.

Writing generates qualified knowledge. The writing process can help writers understand a topic. It can help them clarify the relationships between issues and encourage them to develop precise terminology to describe concepts.

Writing is, without question, valuable for developing an understanding of complex issues. But the writing will only express the knowledge of the writer. It won’t necessarily convey understanding that another reader can acquire. The knowledge articulated must be further revised to match the expectations and understanding of external readers.

The writer must decouple how they arrived at a conclusion from how they communicate that conclusion to others. Their goal should not be to show their own thinking process, but to address how the reader is likely to evaluate the topic.

The writer will normally have different background knowledge and motivations than the readers seeking to understand a topic. While in some fields (scientific research, for example), both the writer and the readers will follow a common, standard process for arriving at conclusions, this pattern is not the norm in most cases. Instead, the writer is presumed to have more expertise than the reader and will be concerned with factual and logical details that won’t be relevant to the non-expert.

With LLMs, we find that responses to user prompts often replicate the expert’s explanatory logic and terminology from the source content, even though these are inappropriate for the user.

Explanations or conclusions without rationales sound authoritarian. Users are exacting: they want expert opinions, yet don’t want to put up with expert-level explanations. At the same time, many users find explanations lacking rationales less credible. They question why they should believe what’s being said.

Public trust in professional experts has been eroding with the “democratization” of information. People don’t like being told what to do when they feel they aren’t permitted to make the decision themselves. Users feel empowered and entitled to form their own conclusions. Yet LLMs don’t always provide an extensive rationale for their responses. The tendency of chatbots to provide “instant answers” can undercut their credibility.

Knowledge sources (justifications) must be transparent and easy to trace. Users want to know where information comes from. AI solutions make tracing sources difficult.

The major generative AI platforms provide limited links to sources in their responses. A response might contain 5-8 links, often to obvious sources such as Wikipedia. At best, these links provide a shallow overview of available knowledge.

Most chatbot responses have limited source citations. Source: Generative Pulse, survey of 25 million bot links

Dense knowledge is not necessarily clear information. Knowledge graphs represent an alternative paradigm for answering user questions. They can provide better traceability, but at the expense of poorer usability. They don’t facilitate deeper understanding.

Because of their complexity, knowledge graph implementations increasingly rely on natural language chatbot user interfaces. When using a chatbot to interact with knowledge graph data, information traceability is lost. Google’s Knowledge Panels, which aggregate information from multiple sources, require users to click links to perform new searches to trace the origins of the data shown.

Google Knowledge Panel: most links send users to another search, which might yield another knowledge panel. The answers and links raise more questions than they answer. To improve ease of use, traceability is sacrificed.
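The trade-off between ease of use and traceability can be illustrated with a small sketch. Everything here is hypothetical: the facts, the `.test` source names, and the two answer functions are invented for illustration and do not represent how Google or any particular knowledge graph works.

```python
from typing import NamedTuple, Optional


class Fact(NamedTuple):
    subject: str
    predicate: str
    obj: str
    source: str  # where the statement came from


# Invented example data with placeholder sources.
FACTS = [
    Fact("Example Corp", "founded_in", "1999", "example-registry.test"),
    Fact("Example Corp", "headquartered_in", "Springfield", "example-news.test"),
]


def answer_plain(subject: str, predicate: str) -> Optional[str]:
    """Conversational-style answer: easy to read, but the source is dropped."""
    for fact in FACTS:
        if fact.subject == subject and fact.predicate == predicate:
            return fact.obj
    return None


def answer_with_source(subject: str, predicate: str) -> Optional[str]:
    """Same lookup, but the provenance travels with the answer."""
    for fact in FACTS:
        if fact.subject == subject and fact.predicate == predicate:
            return f"{fact.obj} (source: {fact.source})"
    return None


print(answer_plain("Example Corp", "founded_in"))
print(answer_with_source("Example Corp", "founded_in"))
```

The underlying data holds the source in both cases. Whether the user can trace it depends on what the interface chooses to pass along.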

Just because the writer can explain something doesn’t mean they understand it. It’s easy to make statements and to come up with reasons why readers should believe what you’ve said. It’s harder to justify statements with a sound rationale. Many assertions fail to withstand scrutiny.

Writers and AI platforms are both aware that credibility depends on providing a rationale. Writers have long used formulas such as “six reasons why….” to buttress their assertions. Professional writers might follow a template that outlines methods, procedures, and evidence before presenting conclusions. AI platforms reproduce these formulas in their responses.

It’s tempting for users to see formulaic explanations as valid because they match expectations. But in many cases, especially with LLM responses, the justification is developed after the conclusion is generated. Chatbots explain things without understanding them.

A widespread myth about LLMs is that they can reason. Vendors promote the “Thought-Action-Observation Loop” technique and ReAct (Reasoning/Acting) prompting, suggesting that LLMs are reasoning. These techniques build on a prompt-engineering approach called Chain-of-Thought (CoT) reasoning, which supposedly shows the thought process of the LLM. However, more detailed research indicates that these measures are largely performative. They slow down responses without yielding appreciable improvements in answer quality. On the contrary, the techniques can make answers less consistent.

The core problem is misplaced trust: CoTs can appear persuasive even when they do not faithfully reflect a model’s actual decision process

Chain-of-Thought Is Not Explainability
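For readers unfamiliar with the Thought-Action-Observation loop, a minimal sketch of its shape may help. This is not any vendor’s implementation: `scripted_model` is a hypothetical stand-in for an LLM, and the `lookup` tool and its data are invented for illustration.

```python
def scripted_model(history):
    """Stand-in for an LLM: returns a (thought, action, argument) triple."""
    if not any(kind == "observation" for kind, _ in history):
        return ("I need the population figure.", "lookup", "population")
    return ("I have what I need.", "finish", history[-1][1])


# Hypothetical tool with made-up data.
TOOLS = {"lookup": lambda key: {"population": "8 million"}.get(key, "unknown")}


def run_loop(model, question, max_steps=5):
    history = [("question", question)]
    for _ in range(max_steps):
        thought, action, arg = model(history)
        history.append(("thought", thought))  # recorded for display only
        if action == "finish":
            return arg, history
        observation = TOOLS[action](arg)  # only the action and argument drive this
        history.append(("observation", observation))
    return None, history


answer, trace = run_loop(scripted_model, "How many people live there?")
print(answer)
for kind, text in trace:
    print(f"{kind}: {text}")
```

Note that in this sketch the loop never uses the thought text to decide anything. The thought is a string emitted alongside the action, which is consistent with the concern quoted above that a displayed rationale need not reflect the actual decision process.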

Explanations can be myths. To enhance perceived credibility, communications aim to sound expert and appear to be based on a comprehensive review of available information.

But mimicking common arguments does not make the arguments valid. Few people apply critical reading online, parsing claims, reasons, and evidence.

One of the most reliable and relatable forms of communication is the story. Stories often rely on reasoning by analogy. Chatbots have access to countless examples of material that are vaguely similar to the user’s prompt. We shouldn’t be surprised if we start seeing chatbot explanations including sports or TV show analogies.

– Michael Andrews