
How everyday misinformation pervades online content

Online misinformation is a hidden problem. We tend to be on guard against disinformation: outright lies, such as fake reviews. Because disinformation is intended to mislead, we treat it as dangerous, and we learn to avoid its sources.

Misinformation is more subtle. The content wasn’t created to deceive, but it still misleads because of its ambiguity.

Anyone can post misleading information online, even without intending to. Misinformation is not a problem of bad actors. We can’t divide sources into misleading and non-misleading ones, because most sources will mislead someone at some point.

Content can be misleading without being deceptive.  Human communication is inherently ambiguous, and once statements are posted online, it’s hard to tighten up the meaning of what’s already been said.

Readers are expected to exercise judgment and care before taking any statement at face value. They ask themselves what details haven’t been mentioned, or might have changed. They weigh the content’s credibility and relevance from their own perspective.

We evaluate online information as a matter of habit. We sift through online content for material that seems relevant and reliable, and we skip over whatever seems off target.

People online are jaded: they scan answers with a skeptical eye and filter out misleading information. As a consequence, they typically won’t notice how much online content is potentially misleading. They become aware of it only when they get a “bum steer” from Google, telling them something is relevant when it isn’t, or when no online source seems to answer the question they have.

Red flags in online content

On a recent vacation, I saw firsthand how common misinformation is online. I was visiting Romania for the first time, and my guidebooks didn’t cover many of the details I needed to know.

Vacations are full of microdecisions: where to eat, which places to visit, and how to get from point A to point B. Each of these decisions depends on reliable information. The traveler’s primary goal is to find factual information to assess, rather than to sample subjective opinions.

Much online travel advice comes from crowd-sourced information, such as travelers’ forums or reviews. Other information is aggregated from a mix of sources, such as reservation consolidators. Unknown sources often supply the facts about schedules, options, or feasibility. Travelers need to check various sources to determine what information and advice is “best” for them. However, even hyper-vigilance and cross-checking do not ensure reliability.

Travelers start with a question, such as whether they can visit a venue tomorrow. On the surface, it is a simple question about the hours during which the venue is open. But in practice, the question rests on the assumption that opening hours are the only relevant fact. I discovered they may not be. Bots list venue opening hours, but include venues that are out of business or temporarily closed for renovation or relocation. Nor do they disclose hidden availability factors, such as access being restricted during unmentioned peak periods. Even accurate-looking opening hours can mislead.

Another situation I faced was figuring out how to get to a neighboring town. Google displayed bus routes that were no longer in operation. It apparently lifted the timetables from another aggregator website, which had not updated them. The information may have been accurate at some point, but it is misleading now.

I searched online for how to pay for city buses. While each town I visited had broadly similar buses, the payment process varied, involving some combination of paper tickets, passes, transit cards, credit cards, and apps. Yet online answers did not note any of these differences. Instead, Google generated generic answers that weren’t necessarily accurate for the town in question. Generic answers proved misleading.

Another case of wrong answers arose from restaurant searches. I was looking for vegetarian restaurant options, but Google kept giving me “bum steers”.

My searches often yielded false positives, where the information suggested something was available when it wasn’t. Both searches and chatbots rely on keywords that can act as false signals. An online reviewer might mention that a restaurant had no vegetarian options, yet chatbots present the restaurant as a vegetarian-friendly option.
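
To make the mechanism concrete, here is a minimal sketch in Python of how naive keyword matching produces exactly this kind of false positive. The restaurant names and review text are invented for illustration; the point is that the word “vegetarian” matches even when the sentence negates it.

```python
# A minimal sketch of keyword matching producing a false positive.
# Restaurant names and review text are invented for illustration.

reviews = {
    "Casa Veche": "Lovely terrace, but sadly no vegetarian options at all.",
    "Hanul Morii": "Hearty grilled meats; the pork stew is the specialty.",
}

def keyword_match(query: str, text: str) -> bool:
    """Return True if every query term appears somewhere in the text.

    This ignores negation ("no vegetarian options"), which is exactly
    why keyword signals can mislead.
    """
    return all(term in text.lower() for term in query.lower().split())

matches = [name for name, text in reviews.items() if keyword_match("vegetarian", text)]
print(matches)  # ['Casa Veche'] – a false positive; the review says the opposite
```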

Yet false negatives – misleading indications that nothing is available – are equally a problem. Restaurants with many vegetarian options can remain invisible simply because no reviewer has discussed them online. When crowd-sourced content is silent about something that exists in the real world, searches and chatbots behave as though it doesn’t exist.

Crowd-sourced content can also mislead because it contains outdated information or fails to incorporate recent developments. It does not keep pace with realities on the ground.

But why highlight online misinformation if it isn’t a new problem? Because human review is increasingly bypassed. Humans can spot misleading information, but AI bots can’t. Bots don’t understand the ambiguities that can make information misleading, and they don’t understand the context of the information they draw from.

AI agents promise to offload numerous decisions from customers, and travel is one domain they promise to simplify and streamline. No longer will travelers need to worry about pesky details – the bot will take care of them.

Yet rather than liberate us from the chore of chasing down information, AI bots impose the risk that the information they give us is incomplete or misleading. And when we can’t rely on that information, we face even more work verifying or augmenting what the bots tell us.

Online misinformation is a larger problem than is generally recognized. It presents a big risk to the performance of AI bots. 

 – Michael Andrews


First-party AI in the post-webpage era

My previous post on the demise of webpages and the need for AI-native content has elicited good feedback and questions. I wanted to elaborate more on how publishers will need to take greater ownership of AI applications as users visit webpages less and less.

Some questions concerned how consumers will access AI-native content. Many folks imagined that customers would access the content through a third-party AI platform such as ChatGPT, Google, Claude, Perplexity, X, or Microsoft Bing Copilot. That’s certainly possible, but it is not what I envision as the default.

The goal of AI-native content is for publishers to take ownership of their AI pipeline rather than delegate that responsibility to a third party. The result is first-party AI tools, where the process and outcomes are entirely under the control and supervision of the publisher.

In the current era, third parties such as Google scrape webpages, extract information, rewrite the content, and publish it themselves. Most of the web traffic stays with Google rather than reaching the publisher, which is why publishers’ traffic levels are down.

But numerous risks are associated with the third-party extraction of webpage content. The major one is that the third party won’t represent the content in the same way that the original publishers would. The third party is interpreting your content based on their bot’s internal (often opaque) criteria.

No one will care more about your content than you will. What’s good enough for a third party may be damaging for your organization in some cases. Consider how a third party might get their summary wrong, even if their technology is generally robust and popular with users:

  • Leaving out information you or your customers would consider essential
  • Using the wrong tone of voice
  • Substituting words that have specific meaning to your customers
  • Providing misleading information by drawing on similar products or different timeframes that aren’t relevant to the user’s needs

All these potential issues can be quality-checked, but only when the AI bot is overseen by the publisher, who understands these nuances.
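
As an illustration, a publisher-run pipeline could screen generated summaries with simple checks before release. This is a hypothetical sketch, not any real system; the rule names and example values are invented, but each check corresponds to one of the failure modes listed above.

```python
# Hypothetical publisher-side checks on a generated summary. Rule names
# and example values are invented; a real pipeline would load these from
# the publisher's own style and fact resources.

REQUIRED_FACTS = ["free returns within 30 days"]     # essential information
FORBIDDEN_SUBSTITUTES = {"warranty": "guarantee"}    # words with specific meanings
BANNED_TONE_MARKERS = ["cheap", "bargain-basement"]  # wrong tone of voice

def check_summary(summary: str) -> list[str]:
    """Return a list of problems found in a generated summary."""
    problems = []
    text = summary.lower()
    for fact in REQUIRED_FACTS:
        if fact not in text:
            problems.append(f"missing essential information: {fact!r}")
    for ours, theirs in FORBIDDEN_SUBSTITUTES.items():
        if theirs in text and ours not in text:
            problems.append(f"substituted {theirs!r} for our term {ours!r}")
    for marker in BANNED_TONE_MARKERS:
        if marker in text:
            problems.append(f"off-brand tone: {marker!r}")
    return problems

print(check_summary("A bargain-basement deal with a full guarantee."))
```

Note that the fourth failure mode – drawing on irrelevant products or timeframes – would require metadata about the source content, which is exactly the context a publisher has and a third party lacks.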

But today, even enterprises that are developing their own AI tools tend to rely on general-purpose third-party platforms with generic settings that purport to provide everything needed in a single platform. Results, unsurprisingly, have been disappointing.

Few publishers have yet invested in the foundations necessary for AI-native content:

  • a stable LLM controlled by the publisher that can be tuned if necessary;
  • organization of resources according to their role in content generation;
  • mappings to other resources the AI engine must access (for example, RAG and MCP connectors); or
  • libraries of repeatable prompts, output patterns, and rule engines.
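
To make these foundations tangible, here is a hypothetical sketch of how a publisher might declare them in one place. Every model name, path, and endpoint is invented for illustration; the point is that each foundation becomes an explicit, publisher-owned artifact rather than a setting buried inside a third-party platform.

```python
# Hypothetical declaration of the four foundations for AI-native content.
# All names, models, paths, and endpoints are invented for illustration.

from dataclasses import dataclass, field

@dataclass
class FirstPartyAIConfig:
    # 1. A stable LLM the publisher controls and can tune if necessary
    model: str = "publisher-tuned-llm-v3"

    # 2. Resources organized according to their role in content generation
    resources: dict[str, str] = field(default_factory=lambda: {
        "facts": "s3://content/facts/",           # source-of-truth statements
        "style": "s3://content/style-guide/",     # tone and terminology rules
        "patterns": "s3://content/output-patterns/",
    })

    # 3. Mappings to other resources the AI engine must access
    connectors: dict[str, str] = field(default_factory=lambda: {
        "rag_index": "https://internal.example.com/rag",
        "mcp_server": "https://internal.example.com/mcp",
    })

    # 4. Libraries of repeatable prompts, output patterns, and rules
    prompt_library: str = "git://example/prompts-approved"
    rule_engine: str = "git://example/factual-rules"

config = FirstPartyAIConfig()
print(config.model)  # the publisher-controlled model in use
```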

With AI-native content, there are no webpages for third parties to crawl and misinterpret. Third parties can’t mislead customers because they are starved of the source material on which to base their summaries.

Instead, customers will get content directly from the source organization using first-party AI tools.

First-party AI will be a radical shift from past decades, when Google supplied answers directly and was always the first, and sometimes only, port of call. In the post-webpage era, users will interact with many AI bots, both directly and indirectly.

If your enterprise is an airline or a major retailer that customers use regularly, customers will access your AI tools via an app. Infrequent or first-time customers may start with a traditional search, but instead of getting a full website, they will get a URL that is a portal to your AI tools.

It may also be possible for publishers to supply AI-native content directly to third parties, such as Google or ChatGPT, as a feed. What’s important is that publishers retain control over how AI-generated content is provided in this scenario. This is unlike the current wave of licensing deals, in which certain publishers grant permission for their content to be crawled by third parties in exchange for payment, and the third party assumes responsibility for generating the summary.
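
A minimal sketch of what such a feed entry might look like, assuming an invented format: the publisher generates the content with its own first-party pipeline and attaches terms that govern how the third-party platform may present it.

```python
# Hypothetical feed entry supplying first-party, AI-native content to a
# third-party platform. All field names and values are invented.

feed_entry = {
    "id": "faq-baggage-2025-001",
    "generated_by": "publisher-tuned-llm-v3",  # generated by first-party AI
    "content": "Carry-on bags up to 8 kg travel free on all fares.",
    "terms": {
        "display": "verbatim-only",  # the platform may not rewrite or re-summarize
        "expires": "2025-12-31",     # stale answers must be dropped
        "attribution": "required",   # the publisher is named as the source
    },
}

print(feed_entry["terms"]["display"])
```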

With first-party AI, publishers can gate access to content in terms of topics, details, and quantity.

Already, we see vendors such as Cloudflare offering “pay per crawl” tools that block AI platforms from using publishers’ content unless those platforms pay a license fee for access. This kind of contractual arrangement can easily be extended to AI-native content. And the growing availability of AI connection protocols will support controlling access in much the same way APIs do.
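
Here is a rough sketch, with invented tier names and limits, of how such gating could work: each third-party client gets a policy that bounds the topics, level of detail, and daily volume it may retrieve, much as API keys bound access today.

```python
# Hypothetical access gating for AI-native content, modeled on how API
# keys and rate limits work today. Tier names and limits are invented.

POLICIES = {
    "licensed-platform": {
        "topics": {"product-overviews", "store-hours"},  # lower-value content only
        "max_detail": "summary",
        "daily_quota": 1000,
    },
    "unlicensed-crawler": {
        "topics": set(),   # no access without a license fee
        "max_detail": None,
        "daily_quota": 0,
    },
}

def authorize(client: str, topic: str, requests_today: int) -> bool:
    """Decide whether a client may retrieve content on a given topic."""
    policy = POLICIES.get(client, POLICIES["unlicensed-crawler"])
    return topic in policy["topics"] and requests_today < policy["daily_quota"]

print(authorize("licensed-platform", "product-overviews", 12))   # True
print(authorize("unlicensed-crawler", "product-overviews", 0))   # False
```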

For high-value content and interactions, firms will want to steer customers directly to their AI tools, and they will limit third parties from intermediating these interactions.

But for lower-value content and interactions, firms may allow AI platforms limited direct access to their AI-native content. The publishers retain control over how the content is offered but gain wider exposure through the third-party platform’s reach.

For content that is entirely promotional in nature, firms may supply AI-native content to third-party platforms on a fee basis, paying the platforms to show this content in generic queries, similar to how search ads work today. Despite the reliance on the platform for visibility, the publisher retains control over how messages appear, instead of allowing third parties to decide for themselves.

AI-native content enables publishers to provide first-party AI experiences. Publishers can control many parameters to ensure that generated content aligns with their goals.

In my previous post, I mentioned the need for a new kind of schema to support AI-native content. This schema will be richer than a traditional content model or data model. It will allow the mixing of structured data within semistructured narrative (text, video, audio). It will describe recurring word patterns that should appear in an exact way, while allowing for adaptable text that must merely conform in a general way to style or other governance guidelines. It will allow defined content variables to be referenced by prompts or agents. And it may include factual rules against which generated statements are checked.
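
No such schema standard exists yet, so the following fragment is purely speculative, with invented field names. Each part of it maps to one of the capabilities described above.

```python
# Purely speculative fragment of an AI-native content schema, expressed
# as a Python dict for illustration. All field names are invented.

product_content_schema = {
    # Structured data mixed within semistructured narrative
    "structured_fields": {
        "price": {"type": "currency", "source": "pricing-db"},
        "release_date": {"type": "date", "source": "catalog"},
    },
    # Recurring word patterns that must appear in an exact way
    "fixed_phrases": ["Lifetime warranty included."],
    # Adaptable text that must merely conform to governance guidelines
    "adaptive_text": {
        "description": {"style_guide": "brand-voice-v2", "max_words": 120},
    },
    # Content variables that prompts or agents can reference
    "variables": ["customer_region", "current_promotion"],
    # Factual rules against which generated statements are checked
    "factual_rules": [
        "stated price must equal structured_fields.price",
        "no availability claims beyond catalog data",
    ],
}
```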

While we are still in the earliest days of this transition, I am impressed by how quickly language models have become commoditized and open-sourced, and by how widespread RAG and MCP tools have become. Medium- and large-sized firms now have the opportunity to build first-party AI tools without outsourcing their customers’ AI experience to third parties.

— Michael Andrews