AI Archives - Page 3 of 6

In this series, I’ve examined how misinformation can creep into chatbot responses. It can be hard to trust these answers because the information’s provenance is unclear and potentially unreliable.

What sources should AI bots rely on? One suggestion is that bots should use canonical content. If it’s canonical, it should be reliable. As elegant as that solution may sound, the concept of canonical content is waffly.

This post will define canonical content more precisely, helping us determine when and whether it can guide chatbots’ use of scraped online content.

SEO concepts have lost their authority

Search engine optimization practices, which dominated the website era, developed a series of axiomatic platitudes about the value of content: authoritative content was trustworthy, and trustworthy content was canonical. Lofty words without much meaning.

In SEO practice, a canonical tag is no more than a self-declaration to bots that a given piece of content was the primary version. Such self-declarations are question-begging: compared to what, exactly? Search canonicalization applies only to one’s own content. Telling a search engine which page on your website to prioritize didn’t imply that your page was more important than those on another website. This kind of implementation won’t help chatbots decide which sources to draw upon for answers.

It’s time to retire SEO notions of canonical content and instead develop a new approach for the AI era.

Deciding what belongs in the canon involves distinguishing the genuine from the fake or flawed.

Historically, canons are sacred books or genuine works of the highest quality. Scholars debate whether writings belong in Shakespeare’s canon or in the canon of the greatest works of poetry. Theologians debate canon law. IT folks talk about canonical data – the sources of record that systems can rely on. What is common in all these domains is that interpretation is involved in deciding what belongs in the canon.

Both humans and bots need a clear definition of what canonical means and clarity about who makes the decision. Canonical can’t simply be a matter of individual belief. For the concept to work in practice, various people and machines need a common understanding of what content is canonical.

Who said that? The importance of the role of information sources

A large portion of online content is crowd-sourced. Bots crawl this online content indiscriminately and can’t distinguish different roles and who is responsible for decisions or information accuracy. Few chatbot users realize the Wild West rodeo that’s corralling the information fed into the answers they see.

Scraped online content is often of mysterious provenance. As I have been writing this series, the New York Times reported on the legal actions that the crowd-source platform Reddit is taking against AI platforms such as OpenAI, Anthropic, Perplexity, and others. The reality of today’s AI ecosystems is that AI platforms are scraping other online platforms for content written by unidentified individuals who may not even have direct knowledge of what they are posting.

AI platforms can’t interpret the roles and responsibilities of the sources they crawl. A chatbot might decide that Wikipedia is a more trusted source than the social media platform X (formerly Twitter), but it can’t say why.

Both Wikipedia and X contain statements by people who convey information. The difference is that Wikipedia articles should never be written by someone directly involved, whereas an X post can be. Wikipedia is always third-hand information, while X posts sometimes are first-hand. A statement on X by a famous person about a decision they made or action they took will be more authoritative than a Wikipedia article that footnotes a news article that cites the original post.

The X versus Wikipedia example highlights the differences in the role of sources. It’s important to drill down into the details of the different roles and their relationships to information.

We can categorize sources into three tiers: first-party, second-party, and third-party. Each tier involves different types of information creators, who have varying degrees of direct knowledge of what they write about.

1st party content

All content and data developed and published by the party responsible for deciding policies and specifications (prices, rules, service availability, performance, etc.)

2nd party content

Statements, postings, and advice offered by partners, distributors, and paid influencers

3rd party content

Crowd-sourced information
User-generated content
Republishers of information
Summarizers of 1st party information

First parties offer content that figuratively comes “straight from the horse’s mouth.” First-party sources are far fewer than third-party sources, which cover a wide range of online content. Second parties sometimes look like first parties, but aren’t really.

Only first-party information can be canonical

Only first parties have a direct relationship to the knowledge they write about. Consequently, only first-party content can be canonical. That means any source that writes about what others are doing can’t be canonical.

For example, only a manufacturer of a product is the canonical source of information about that product. Others writing about the product – whether resellers, customers, reviewers, or news reporters – can’t be considered canonical sources of information about it. They may have valuable insights about the product, but nothing they say will be a definitive statement. Unless their role as a source is clearly identified in chatbot answers, people may believe that these second and third-party views are definitive.

Role	Relationship to knowledge	Examples of content and parties
1st party	The party responsible for deciding policies and specifications	All content and data developed and published by the deciding party (prices, rules, service availability, performance, etc.); Government department; Manufacturer; Insurer
2nd party	A party financially or organizationally affiliated with a deciding party, but not responsible for decisions about policies or specifications	Statements, postings, and advice offered by partners, distributors, guest posters, paid or incentivized influencers
3rd party	Unaffiliated party that is not financially dependent on the 1st party, such as a user, customer, competitor, news organization, or automated platform	Crowd-sourced information (aggregated from multiple sources); User-generated content (statements by users, customers, citizens, or non-affiliated contributors, which may be posted to a single platform or to distributed platforms); Republishers of information (unaffiliated curators and aggregators of articles and data from various sources); Summarizers of 1st party information

It’s important to differentiate first-party information from the concept of trusted information. Despite considerable overlap, these are distinct concepts, and it’s important to keep them separate.

A product manufacturer is the first-party source of information about its products. A trusted review publication such as Consumer Reports isn’t.

The first party provides the baseline information that others will evaluate. Most often, the first-party information is accurate as far as it goes, though it may be incomplete. That’s one reason second and third-party sources are valuable.

First-party information is generally accurate, since the organization is legally responsible for the policies and specifications in the content. Readers presume the primary source knows best. Even so, first-party information might contain errors, omissions, out-of-date facts, or even willful distortions. But since they are responsible for legally binding claims, they are considered the authoritative source. Only they can correct the record.

But manufacturers are not always first-party information sources. Manufacturers may compare their products to competitors’ products. The information they offer about their competitors’ products is third-party. The notion of a first party is linked to a source and its role. The source alone won’t tell us whether the information is from a first-party. We need to know what it is writing about, and its relationship to that topic.

Third-party information is broad and diverse. It includes user-generated posts such as product reviews and help questions, as well as news reporting and machine-generated content.

Only some first-party statements are canonical

Not all statements by first parties are canonical. Even when a party is writing about itself, what it says is not necessarily definitive. Whether the statement is canonical depends on whether it is declarative or interpretive.

Declarative statements refer to factual assertions
Interpretive statements relate to what something means to the writer or for the reader; they are not legally binding claims

Declarative statements include product specifications, pricing, customer warranties, and so on. They represent promises about what the first party will provide (or not do, because it is the customer or someone else’s responsibility).

First-party interpretative statements aren’t canonical because they aren’t factual or legally binding. Instead, they are “official” endorsements and advice on how or why others should do something. Most marketing and non-contractual customer care content is interpretative rather than declarative. These statements aren’t absolute directives that would void a warranty if not followed, but rather recommendations that customers are responsible for interpreting and following. Because the importance of these instructions is unclear, it is common for second and third-party advisors to offer their own advice on the same topics.

The table below shows the kinds of declarative and interpretive statements offered by primary sources (first parties), surrogates (second parties), and outsiders (third parties).

(scope)	1st party (the primary source of information)	2nd party (an affiliated party) contributing what they believe	3rd party (an unaffiliated party) contributing what they know or think
Declarative (what is said)	Canonical statements from the decider of specifications or policies	Surrogates Restatements and rewordings	Outsider Understandings (what it means to them)
Interpretive (what it means)	Primary source Justifications: How the decider conveys their decisions (why)	Surrogates Perspectives (what’s best for most)	Outsider Opinions (what’s best for them)

First parties use surrogates as message multipliers to extend coverage and reach. If a customer asks how to fix a problem not addressed on the customer help, or advice on making a choice not covered by marketing, a second party might volunteer their own advice independently of the first party.

Because of their affiliation with the first party, surrogates are often perceived as more trustworthy than unaffiliated outsiders. However, second-party information is seldom approved or verified by the first party and often addresses edge cases that the first party hasn’t covered. Second-party statements are never canonical, even when they address factual information.

Both surrogates and outsiders sometimes restate the first party’s factual statements. For example, a tax preparer might restate an IRS rule in layman’s language that is easier to understand. But such a restatement, despite its factual nature, will not be canonical because it isn’t issued by the IRS, which is the decision authority on the rule.

Accurate information depends on clear provenance

Indicating where information comes from is not just a matter of supplying a link to a source, since that source itself may be a compilation of sources.

As soon as the chain of attribution gets complicated, the provenance of the information becomes murky.

Both people and bots need a simple yet robust framework for evaluating how the source of information influences its expected accuracy. If it’s confusing for people, it’s likely to be confusing for bots, too.

Two factors influence the likely accuracy of the content: the reliability of the information source and its timeliness. People and bots need these dimensions to be traceable and clear. If evaluating these dimensions gets complex, then people and bots will tend to ignore them altogether.

Does the information come from the original source that would have decided the information, or is it a pastiche of assertions from random people? Is the information fresh, or was it cobbled together at different times?

The central question becomes: who owns the information and takes responsibility for its accuracy? AI platforms that spider the entire web don’t do that. In some cases, they don’t want to know the origin of the information because it may expose them to potential legal liability for copyright infringement.

While canonical information provides the benchmark for reliability, online users can’t rely solely on published canonical information. There are too many questions that canonical sources do not answer online. Outside sources can fill these gaps, though they must be scrutinized for accuracy. For example, the New York Times does not normally make the news (acting as a canonical source for stories about itself), but it is often a good source for reporting news that newsmakers don’t publish online themselves.

Even when the information is not canonical, it’s still possible to evaluate its accuracy, provided it comes from an identifiable source at an identifiable time. We can assess how intimate and complete the source’s knowledge is, and whether events occurred before, during, or after the content was published.

How, then, can one evaluate the accuracy of crowd-sourced information? Much online information consists of posts from individuals who add facts and observations about topics that otherwise don’t get much coverage.

Crowd-sourced information tends to be most accurate when everyone reports the same thing at the same time. When various people report different things, we need to know if these differences are correlated with differing timeframes. We don’t know whether everyone’s circumstances changed, or whether different people were in different circumstances either at the same time, or at different times.

What’s wickedly challenging to evaluate is the accuracy of information from a mix of sources developed at different times. It’s not easy to untangle this information, and, as with Grisham’s law, bad information can drive away trust in good information.

Crowd-sourced content will contain misleading information. Not only is the information not from a clearly identifiable single source that can be traced, but it tends to be composed of contributions made at different times, making it unclear which parts are current. This caution isn’t to imply that crowd-sourced content doesn’t contain valuable information. But finding, evaluating, and contextualizing that information requires sustained attention. A cursory reading or bot crawl won’t be able to separate the wheat from the chaff.

Accountability in content is essential for AI applications

AI platforms have been happy to crawl crowd-sourced information, with little concern about its provenance. This represents the biggest vulnerability of chatbots to misinformation.

As bots, rather than people, become key readers of crowd-sourced content, we must jettison the nostalgic belief in the “wisdom of the crowd” and the hope that user-generated content is self-correcting because users will spot others’ errors and correct them. In practice, that’s not the case routinely.

Even Wikipedia, the gold standard for crowd-contributed content, where edits are debated and revised for accuracy, can be bedeviled by misinformation that persists for considerable time before it is corrected – if it ever is. Unlike most user-generated content, Wikipedia has an established editorial review process, but like all other forms of user-generated content, it relies on the goodwill and time of volunteer contributors who are stretched too thinly to correct more than the most high-profile errors. Unfortunately, these systems have been under severe strain in recent years, and the fabled reliability of Wikipedia may not be something to take for granted in the coming years.

Past confidence in the democratization of information has eroded alongside changes in online user behavior, as people shifted from active information seekers to passive receivers. They’ve disengaged, developing shorter attention spans and reducing their interest in reading. They’ve decided that swiping left or right is the most effort they are willing to expend.

Bots look like the answer to lazy interaction. Indeed, bots can correct simple errors – even Wikipedia relies on bots for basic content maintenance. But bots can’t replace active editorial oversight. Bots excel at learning patterns but don’t make critical judgments, despite claims to the contrary.

AI platforms promise convenience. But as bots increasingly substitute for people online, the solutions create their own problems – an example of iatrogenic progress. Once platforms began aggregating reviews and making each review less informative in the process, bots began writing reviews themselves, hiding within the crowd that platforms summarize. Now, users face a second-order set of problems, where answers might be based on bots harvesting reviews written by other bots.

AI platforms won’t earn credibility until they cultivate and support the sources they use to supply answers. Yet AI platforms seem to be moving in the opposite direction. Elon Musk is promoting an AI-generated encyclopedia called Grokipedia to replace Wikipedia. The sources of information get more opaque, and their quality more dubious.

While the risk of misinformation is growing on third-party AI platforms, chatbots can provide accurate answers when implemented sensibly. The most reliable chatbots will be those that draw on clear and traceable information. The most direct way to do that is for publishers to develop their own AI platforms, rather than rely on third-party ones.

– Michael Andrews