
Trust, advice, and sourcing: what information is canonical?

In this series, I’ve examined how misinformation can creep into chatbot responses. It can be hard to trust these answers because the information’s provenance is unclear and potentially unreliable. 

What sources should AI bots rely on?  One suggestion is that bots should use canonical content.  If it’s canonical, it should be reliable. As elegant as that solution may sound, the concept of canonical content is waffly.  

This post will define canonical content more precisely, helping us determine when and whether it can guide chatbots’ use of scraped online content.

SEO concepts have lost their authority

Search engine optimization practices, which dominated the website era, developed a series of axiomatic platitudes about the value of content: authoritative content was trustworthy, and trustworthy content was canonical. Lofty words without much meaning. 

In SEO practice, a canonical tag is no more than a self-declaration to bots that a given piece of content is the primary version. Such self-declarations are question-begging: primary compared to what, exactly?  Search canonicalization applies only to one’s own content. Telling a search engine which page on your website to prioritize doesn’t imply that your page is more important than pages on another website. This kind of implementation won’t help chatbots decide which sources to draw upon for answers.
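To make the mechanism concrete, here is a minimal sketch (Python, standard library only) of all that a canonical tag tells a crawler. The page and URL are hypothetical; the point is that the declaration is purely self-referential:

```python
# A minimal sketch: SEO canonicalization is a page asserting which of its
# own URLs is the "primary" version. Nothing here ranks it against other sites.
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Collects the href of a <link rel="canonical"> tag, if present."""
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and attrs.get("rel") == "canonical":
            self.canonical = attrs.get("href")

page = '<head><link rel="canonical" href="https://example.com/widgets"></head>'
finder = CanonicalFinder()
finder.feed(page)
print(finder.canonical)  # https://example.com/widgets -- a self-declaration, nothing more
```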

It’s time to retire SEO notions of canonical content and instead develop a new approach for the AI era.

Deciding what belongs in the canon involves distinguishing the genuine from the fake or flawed. 

Historically, canons are sacred books or genuine works of the highest quality. Scholars debate whether writings belong in Shakespeare’s canon or in the canon of the greatest works of poetry. Theologians debate canon law.  IT folks talk about canonical data – the sources of record that systems can rely on. What is common in all these domains is that interpretation is involved in deciding what belongs in the canon.  

Both humans and bots need a clear definition of what canonical means and clarity about who makes the decision. Canonical can’t simply be a matter of individual belief.  For the concept to work in practice, various people and machines need a common understanding of what content is canonical. 

Who said that? The importance of the role of information sources

A large portion of online content is crowd-sourced. Bots crawl this content indiscriminately and can’t distinguish the roles of contributors or who is responsible for decisions and for the accuracy of information.  Few chatbot users realize the Wild West behind the information fed into the answers they see.

Scraped online content is often of mysterious provenance.  As I have been writing this series, the New York Times reported on the legal actions that the crowd-sourced platform Reddit is taking against AI companies such as OpenAI, Anthropic, and Perplexity. The reality of today’s AI ecosystems is that AI platforms scrape other online platforms for content written by unidentified individuals who may not even have direct knowledge of what they are posting.

AI platforms can’t interpret the roles and responsibilities of the sources they crawl.  A chatbot might decide that Wikipedia is a more trusted source than the social media platform X (formerly Twitter), but it can’t say why.  

Both Wikipedia and X contain statements by people who convey information. The difference is that Wikipedia articles should never be written by someone directly involved, whereas an X post can be. Wikipedia is always third-hand information, while X posts sometimes are first-hand. A statement on X by a famous person about a decision they made or action they took will be more authoritative than a Wikipedia article that footnotes a news article that cites the original post.  

The X versus Wikipedia example highlights the differences in the role of sources.  It’s important to drill down into the details of the different roles and their relationships to information.

We can categorize sources into three tiers: first-party, second-party, and third-party. Each tier involves different types of information creators, who have varying degrees of direct knowledge of what they write about.  

1st party content

  • All content and data developed and published by the party responsible for deciding policies and specifications (prices, rules, service availability, performance, etc.)

2nd party content

  • Statements, postings, and advice offered by partners, distributors, and paid influencers

3rd party content

  • Crowd-sourced information 
  • User-generated content 
  • Republishers of information 
  • Summarizers of 1st party information

First parties offer content that figuratively comes “straight from the horse’s mouth.”  First-party sources are far fewer than third-party sources, which cover a wide range of online content.  Second parties sometimes look like first parties, but aren’t really.

Only first-party information can be canonical

Only first parties have a direct relationship to the knowledge they write about. Consequently, only first-party content can be canonical. That means any source that writes about what others are doing can’t be canonical.  

For example, only a manufacturer of a product is the canonical source of information about that product.  Others writing about the product – whether resellers, customers, reviewers, or news reporters – can’t be considered canonical sources of information about it.  They may have valuable insights about the product, but nothing they say will be a definitive statement.  Unless their role as a source is clearly identified in chatbot answers, people may believe that these second and third-party views are definitive. 

1st party
  • Relationship to knowledge: The party responsible for deciding policies and specifications
  • Examples of content and parties: All content and data developed and published by the deciding party (prices, rules, service availability, performance, etc.) – government departments, manufacturers, insurers

2nd party
  • Relationship to knowledge: A party financially or organizationally affiliated with a deciding party, but not responsible for decisions about policies or specifications
  • Examples of content and parties: Statements, postings, and advice offered by partners, distributors, guest posters, and paid or incentivized influencers

3rd party
  • Relationship to knowledge: An unaffiliated party that is not financially dependent on the 1st party, such as a user, customer, competitor, news organization, or automated platform
  • Examples of content and parties: Crowd-sourced information (aggregated from multiple sources); user-generated content (statements by users, customers, citizens, or non-affiliated contributors, posted to a single platform or across distributed platforms); republishers of information (unaffiliated curators and aggregators of articles and data from various sources); summarizers of 1st party information
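To see how role depends on the relationship between a source and a topic rather than on identity alone, here is a sketch of the taxonomy in code. The class names and the classify() heuristic are illustrative assumptions, not a standard:

```python
# A sketch of the source-role taxonomy. Names and fields are illustrative.
from dataclasses import dataclass
from enum import Enum

class SourceRole(Enum):
    FIRST_PARTY = 1   # decides the policies/specifications it describes
    SECOND_PARTY = 2  # affiliated with the decider, but not the decider
    THIRD_PARTY = 3   # unaffiliated commentator, republisher, or aggregator

@dataclass
class Statement:
    author: str
    topic: str
    decider_of_topic: str          # who sets the facts being described
    affiliated_with_decider: bool  # e.g., partner, distributor, paid influencer

def classify(stmt: Statement) -> SourceRole:
    """Role follows the author's relationship to the topic, not their identity."""
    if stmt.author == stmt.decider_of_topic:
        return SourceRole.FIRST_PARTY
    if stmt.affiliated_with_decider:
        return SourceRole.SECOND_PARTY
    return SourceRole.THIRD_PARTY

# The same author occupies different roles on different topics:
own_specs = Statement("Acme", "Acme widget specs", "Acme", False)
rival_specs = Statement("Acme", "Rival widget specs", "Rival", False)
print(classify(own_specs).name, classify(rival_specs).name)  # FIRST_PARTY THIRD_PARTY
```

Note what the last two lines anticipate: a manufacturer describing a competitor’s product is a third party on that topic, a point developed below.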

It’s important to differentiate first-party information from trusted information. The two concepts overlap considerably, but they are distinct and worth keeping separate.

A product manufacturer is the first-party source of information about its products.  A trusted review publication such as Consumer Reports isn’t.  

The first party provides the baseline information that others will evaluate.  Most often, the first-party information is accurate as far as it goes, though it may be incomplete. That’s one reason second and third-party sources are valuable. 

First-party information is generally accurate, since the organization is legally responsible for the policies and specifications in the content. Readers presume the primary source knows best.  Even so, first-party information might contain errors, omissions, out-of-date facts, or even willful distortions. But because the first party is responsible for legally binding claims, it is considered the authoritative source. Only it can correct the record.

But manufacturers are not always first-party information sources. Manufacturers may compare their products to competitors’ products; the information they offer about those competing products is third-party. The notion of a first party is linked to a source and its role. The source alone won’t tell us whether the information is first-party. We need to know what the source is writing about and its relationship to that topic.

Third-party information is broad and diverse.  It includes user-generated posts such as product reviews and help questions, as well as news reporting and machine-generated content.

Only some first-party statements are canonical

Not all statements by first parties are canonical. Even when a party is writing about itself, what it says is not necessarily definitive. Whether the statement is canonical depends on whether it is declarative or interpretive.

  • Declarative statements refer to factual assertions
  • Interpretive statements relate to what something means to the writer or for the reader; they are not legally binding claims

Declarative statements include product specifications, pricing, customer warranties, and so on.  They represent promises about what the first party will provide (or will not provide, because it is the customer’s or someone else’s responsibility).

First-party interpretive statements aren’t canonical because they aren’t factual or legally binding.  Instead, they are “official” endorsements and advice on how or why others should do something. Most marketing and non-contractual customer care content is interpretive rather than declarative. These statements aren’t absolute directives that would void a warranty if not followed, but rather recommendations that customers are responsible for interpreting and following. Because the importance of these instructions is unclear, it is common for second- and third-party advisors to offer their own advice on the same topics.

The breakdown below shows the kinds of declarative and interpretive statements offered by primary sources (first parties), surrogates (second parties), and outsiders (third parties).

Declarative statements (what is said)
  • 1st party (the primary source of information): Canonical statements from the decider of specifications or policies
  • 2nd party (an affiliated party, contributing what they believe): Surrogate restatements and rewordings
  • 3rd party (an unaffiliated party, contributing what they know or think): Outsider understandings (what it means to them)

Interpretive statements (what it means)
  • 1st party: Primary source justifications – how the decider conveys their decisions (why)
  • 2nd party: Surrogate perspectives (what’s best for most)
  • 3rd party: Outsider opinions (what’s best for them)
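Read as a rule, the breakdown reduces to a single check. Continuing the earlier sketch (the rule is my paraphrase of the argument, not a formal standard):

```python
# Continues the earlier sketch (SourceRole defined there). Only first-party
# *declarative* statements qualify as canonical under this article's argument.
from enum import Enum

class StatementKind(Enum):
    DECLARATIVE = "what is said"    # factual, potentially binding assertions
    INTERPRETIVE = "what it means"  # advice, justification, or opinion

def is_canonical(role: SourceRole, kind: StatementKind) -> bool:
    return role is SourceRole.FIRST_PARTY and kind is StatementKind.DECLARATIVE

print(is_canonical(SourceRole.FIRST_PARTY, StatementKind.DECLARATIVE))   # True
print(is_canonical(SourceRole.SECOND_PARTY, StatementKind.DECLARATIVE))  # False: surrogate restatement
print(is_canonical(SourceRole.FIRST_PARTY, StatementKind.INTERPRETIVE))  # False: official advice, not a fact
```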

First parties use surrogates as message multipliers to extend coverage and reach.  If a customer asks how to fix a problem not addressed in the customer help content, or asks for advice on a choice not covered by marketing materials, a second party might volunteer their own advice independently of the first party.

Because of their affiliation with the first party, surrogates are often perceived as more trustworthy than unaffiliated outsiders.  However, second-party information is seldom approved or verified by the first party and often addresses edge cases that the first party hasn’t covered.  Second-party statements are never canonical, even when they address factual information.

Both surrogates and outsiders sometimes restate the first party’s factual statements.  For example, a tax preparer might restate an IRS rule in layman’s language that is easier to understand.  But such a restatement, despite its factual nature, will not be canonical because it isn’t issued by the IRS, which is the decision authority on the rule.

Accurate information depends on clear provenance 

Indicating where information comes from is not just a matter of supplying a link to a source, since that source itself may be a compilation of sources.

As soon as the chain of attribution gets complicated, the provenance of the information becomes murky.

Both people and bots need a simple yet robust framework for evaluating how the source of information influences its expected accuracy.  If it’s confusing for people, it’s likely to be confusing for bots, too.

Two factors influence the likely accuracy of the content: the reliability of the information source and its timeliness.  People and bots need these dimensions to be traceable and clear. If evaluating these dimensions gets complex, then people and bots will tend to ignore them altogether. 

Does the information come from the original source that would have decided the information, or is it a pastiche of assertions from random people?  Is the information fresh, or was it cobbled together at different times? 
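One way to keep both dimensions traceable is to refuse to blend in statements whose source or date is missing, and to flag them instead. A toy sketch, assuming illustrative field names and a one-year staleness threshold:

```python
# A toy provenance check: flag untraceable or stale claims rather than
# silently mixing them into an answer. Field names and threshold are illustrative.
from dataclasses import dataclass
from datetime import date

@dataclass
class SourcedClaim:
    text: str
    source: str | None        # identifiable origin, or None if untraceable
    published: date | None    # when the claim was made, if known

def provenance_flags(claim: SourcedClaim, stale_after_days: int = 365) -> list[str]:
    flags = []
    if claim.source is None:
        flags.append("untraceable source")
    if claim.published is None:
        flags.append("unknown date")
    elif (date.today() - claim.published).days > stale_after_days:
        flags.append("possibly stale")
    return flags

claim = SourcedClaim("Deduction X is allowed", source=None, published=date(2019, 4, 1))
print(provenance_flags(claim))  # ['untraceable source', 'possibly stale']
```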

The central question becomes: who owns the information and takes responsibility for its accuracy?  AI platforms that spider the entire web take no such responsibility. In some cases, they don’t want to know the origin of the information because it may expose them to potential legal liability for copyright infringement.

While canonical information provides the benchmark for reliability, online users can’t rely solely on published canonical information. There are too many questions that canonical sources do not answer online. Outside sources can fill these gaps, though they must be scrutinized for accuracy. For example, the New York Times does not normally make the news (acting as a canonical source for stories about itself), but it is often a good source for reporting news that newsmakers don’t publish online themselves. 

Even when the information is not canonical, it’s still possible to evaluate its accuracy, provided it comes from an identifiable source at an identifiable time.  We can assess how intimate and complete the source’s knowledge is, and whether events occurred before, during, or after the content was published.

How, then, can one evaluate the accuracy of crowd-sourced information? Much online information consists of posts from individuals who add facts and observations about topics that otherwise don’t get much coverage.

Crowd-sourced information tends to be most accurate when everyone reports the same thing at the same time. When various people report different things, we need to know whether those differences correlate with different timeframes: did everyone’s circumstances change, or were different people in different circumstances, either at the same time or at different times?

What’s wickedly challenging to evaluate is the accuracy of information from a mix of sources developed at different times.  It’s not easy to untangle this information, and, as with Gresham’s law, bad information can drive out good.

Crowd-sourced content will contain misleading information. Not only is the information not from a clearly identifiable single source that can be traced, but it tends to be composed of contributions made at different times, making it unclear which parts are current. This caution isn’t to imply that crowd-sourced content doesn’t contain valuable information. But finding, evaluating, and contextualizing that information requires sustained attention.  A cursory reading or bot crawl won’t be able to separate the wheat from the chaff.

Accountability in content is essential for AI applications

AI platforms have been happy to crawl crowd-sourced information, with little concern about its provenance. This represents the biggest vulnerability of chatbots to misinformation.

As bots, rather than people, become key readers of crowd-sourced content, we must jettison the nostalgic belief in the “wisdom of the crowd” and the hope that user-generated content is self-correcting because users will spot others’ errors and correct them. In practice, such correction is rarely routine.

Even Wikipedia, the gold standard for crowd-contributed content, where edits are debated and revised for accuracy, can be bedeviled by misinformation that persists for a considerable time before it is corrected – if it ever is.  Unlike most user-generated content, Wikipedia has an established editorial review process, but like all other forms of user-generated content, it relies on the goodwill and time of volunteer contributors who are stretched too thin to correct more than the most high-profile errors. Unfortunately, these processes have been under severe strain in recent years, and the fabled reliability of Wikipedia may not be something to take for granted in the coming years.

Past confidence in the democratization of information has eroded alongside changes in online user behavior, as people shifted from active information seekers to passive receivers. They’ve disengaged, developing shorter attention spans and reducing their interest in reading. They’ve decided that swiping left or right is the most effort they are willing to expend. 

Bots look like the answer to lazy interaction. Indeed, bots can correct simple errors – even Wikipedia relies on bots for basic content maintenance. But bots can’t replace active editorial oversight. Bots excel at learning patterns but don’t make critical judgments, despite claims to the contrary. 

AI platforms promise convenience.  But as bots increasingly substitute for people online, the solutions create their own problems – an example of iatrogenic progress.  Once platforms began aggregating reviews and making each review less informative in the process, bots began writing reviews themselves, hiding within the crowd that platforms summarize.  Now, users face a second-order set of problems, where answers might be based on bots harvesting reviews written by other bots.

AI platforms won’t earn credibility until they cultivate and support the sources they use to supply answers. Yet AI platforms seem to be moving in the opposite direction.  Elon Musk is promoting an AI-generated encyclopedia called Grokipedia to replace Wikipedia. The sources of information get more opaque, and their quality more dubious.

While the risk of misinformation is growing on third-party AI platforms, chatbots can provide accurate answers when implemented sensibly. The most reliable chatbots will be those that draw on clear and traceable information. The most direct way to do that is for publishers to develop their own AI platforms, rather than rely on third-party ones.  

– Michael Andrews


Chatbots must consider the role of sources, but don’t

Many chatbot problems stem from their inability to understand context. In earlier posts in this series, I’ve discussed how aggregation flattens nuances in individual statements and how scraped content can disregard the timeframes for which the original source statements applied.  

This post explores how user context affects the statements LLMs use to generate answers. It argues that essential context is routinely omitted from statements crawled by AI platforms and, as a result, is not included in chatbot responses. Notably, chatbots don’t consider the point of view of sources expressed in statements they draw upon.

AI platforms harvest online information that’s been stripped of its original context. Bots omit essential context by ignoring the role of the source posting the information.

Information accuracy is often highly contingent on its circumstances. While most online information was reasonably accurate at some point, it may be accurate only in specific circumstances. It can be described as “yes, it is (or was) true, but only if or when a particular circumstance is true.” These qualifications extend to who is making an assertion and what their role is. Although people do lie online, the bigger problem is that they misunderstand and miscommunicate. Bots struggle even more than humans with these ambiguities.   

When assessing the credibility of information, readers must consider the circumstances of the person providing the information. They are interested not only in who said something, but also in their role.

We are accustomed to distinguishing between primary and secondary sources from years of schooling. We separate direct statements by people from indirect ones, where they are quoted or summarized.  We focus on who said something. 

Google recommends users search for information about sources they find online.

It’s important to look beyond the naive idea that sources have either a good or a bad reputation.  Many platforms make simplistic assumptions about whether a source is trustworthy, without regard to the scope or domain of the topic.  Contrary to SEO folklore, authority online isn’t an attribute of a website; it is intrinsically related to the topic of the content itself.

People and platforms should look more broadly at how information originates.  

First-party and third-party information are similar to primary and secondary sources in that both concepts distinguish different categories of sources. But the concepts are slightly different. Instead of focusing only on who said something (the source), we also consider their authority to speak about what is said (the information).  

In online forums, that rich source of advice, reviews and updates, first-person observations can be third-party information – someone’s interpretation.  For example, John might post in an online forum that the IRS doesn’t allow a certain deduction because he wasn’t able to take it himself. But John doesn’t work for the IRS (which isn’t noted for posting helpful advice in online forums). He is only conveying his personal experience. The issue is not necessarily John’s credibility or knowledge – he’s candid about what he knows, as far as he knows it.  And read carefully, John’s post may offer useful information for understanding how some taxpayers are able to take deductions or not.  But John’s post can’t be taken as the universal truth. 

First-hand statements are not first-party information unless they are made by someone who works for the organization that decides the information. An individual’s views can be first-hand and appear credible but not authoritative, as they involve interpretations, opinions, or experiences. Statements can be true as they relate to the individual’s circumstances, but not be correct if taken as global statements that apply to all situations. 

Information provenance leads to an important qualification: eyewitness accounts are not the absolute truth. 

This skepticism challenges the widely cherished idea that first-hand experiences provide the unvarnished truth.  In reality, experiences expressed online offer at best a limited truth, constrained by when, where, and by whom they were expressed.

Chatbots can’t discern the context of the information they crawl. Even Google’s Gemini chatbot doesn’t follow Google’s guidelines for humans to investigate “why it’s sharing that info.”  Gemini offers a blanket disclaimer, “AI responses may include mistakes.”  It’s up to the human to figure out if the chatbot made mistakes and what those mistakes might be. 

Chatbots have trouble distinguishing between third-hand and first-hand information. I’ll return to an example I raised in an earlier post in this series about finding a vegetarian restaurant while on vacation. Platforms scrape reviews, which can be misleading when someone mentions the word “vegetarian” in passing, even if it’s just a general comment.  That’s an example of the unreliability of third-party information. The restaurant never made this claim. 

Every time third-party information is used, someone else’s assumptions are being applied.

If platforms were scraping restaurants’ menus and could decipher which dishes were vegetarian, they would be relying on first-party information.  If, however, the platform decided whether a dish was vegetarian based solely on its name, we would be back to third-party information: the bot’s own interpretation of menu names. But many vegetable dishes contain bacon or chicken stock, which won’t be apparent from the name of the dish. So even with first-party information, the full context may be missing.
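A toy contrast makes the gap visible. Assume (hypothetically) that the menu lists ingredients – a first-party signal – while a scraper can only guess from the dish’s name:

```python
# First-party signal (published ingredients) vs. third-party inference (the name).
# The dish, keyword lists, and heuristics are all illustrative assumptions.
NON_VEGETARIAN = {"bacon", "chicken stock", "anchovy", "lard"}

def vegetarian_from_ingredients(ingredients: list[str]) -> bool:
    """First-party data: the menu's publisher states what is in the dish."""
    return not NON_VEGETARIAN.intersection(i.lower() for i in ingredients)

def vegetarian_from_name(dish_name: str) -> bool:
    """Third-party guess: infer from the name alone, as a scraper would."""
    return any(word in dish_name.lower() for word in ("vegetable", "veggie", "garden"))

dish = {"name": "Garden greens soup", "ingredients": ["kale", "chicken stock"]}
print(vegetarian_from_name(dish["name"]))                # True  -- a plausible but wrong guess
print(vegetarian_from_ingredients(dish["ingredients"]))  # False -- the dish isn't vegetarian
```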

Textual declarations seldom explicitly qualify the limitations of a statement – the reader is expected to infer any limitations from the context in which the declaration is made.  Bots, however, tend to decontextualize statements and make them into universal ones.  Bot-generated statements derived from crowd-contributed content are often misleading. 

Your experience may vary

The source’s identity will reflect their role: what matters to them and what they know about a situation. Various people can make statements that are inconsistent but nonetheless valid for them individually.

Online forums are where people share stories about themselves. A person will write in a forum about “what I did, and what worked for me”, with little initial consideration of how readers might be in different circumstances. Such egocentricity reflects the incentives and motivations of crowd-contributed forums.  People enjoy talking about themselves and believe they are influencing others to emulate them. They enjoy getting praise and recognition when they post something deemed notable that hasn’t been seen before.   

The individual posts that bots crawl contain sampling biases (the advice in each post is a sample of one).  People write about what they did – what they considered and tried. Rarely do they write about having tried all possibilities and evaluated them. The information is selective. 

When all parties view communication as a point-to-point exchange, each party strips out the context they deem unnecessary. They emphasize what they want to know rather than spending much time discussing what others may know. The information tends to be personal.

The writer of advice and the seeker of advice can have different preference profiles.  The “best way” to do something depends heavily on the situation and individual preferences. For many tasks, determining the best approach can be challenging without understanding who, when, and why someone wants to undertake the task.  

The challenges of human communication are magnified online, where distance in time and space makes clarification and qualification of statements much harder.

Even with these challenges, many forum participants want to help and may clarify statements in subsequent threads, especially when questions arise.

But bots crawl online forums with a more acquisitive agenda.  They are indifferent to the discussion’s context.  They simply want to harvest statements made.  Whereas humans may engage in a close reading of the discussion, bots engage in a distant reading of it.

The problem is that much of the context shaping what’s said online is never explicitly stated, and when it is revealed, it may surface only later in the discussion.

Where context is omitted, gaps in understanding emerge. The writer’s context may not be transparent (even to the writer).  The reader’s context – their preferences and circumstances – may be unknown to the writer.  The bot, driven by its mission to scrape the discussion, is indifferent to the context. 

The phantom of contextual AI 

The omission of context in crawled online content poses a formidable challenge to the growth and development of AI.  

The latest wave of AI development focuses on agents that use the Model Context Protocol. Context is essential for AI, but chatbots can’t supply the kind of context that’s needed: the circumstances behind the statements they draw on.

There’s no simple fix for the omission of context in online information.

Content professionals often champion the importance of context in supplying relevant information.  Many argue that contextual metadata should be added to source statements to enable bots to provide high-quality answers. Approaches such as GraphRAG are having a moment. Although commendable in principle, applying context to online content after it’s been written is difficult in practice. 

Online content, particularly forum discussions, is not written for machines. People are writing for each other – in some cases, telling stories to themselves. The writer may be blissfully unaware of the limitations of their pronouncements and how those pronouncements reflect their personal biases.  

Bots can’t detect the possibility that the facts of the matter may be specific to what the individual experienced in a given context.  Omitted context can’t be auto-magically restored.  

Yes, some context can be applied after the fact with automated tags.  Yet, realistically, much of the context of online content requires close human reading to infer.  Bots process text superficially, relying on relatively crude tools such as keyword and entity recognition, which are no match for the inherent ambiguity of most online discussions. 
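A small sketch of that asymmetry, with hypothetical tag names: automated tagging can recover a post’s surface features, but the qualifications that matter most either require close human reading or were never stated at all:

```python
# What after-the-fact tagging can and cannot recover from a forum post.
# The tag names and the example post are hypothetical.
post = "I couldn't take the deduction, so the IRS must not allow it."

machine_recoverable = {
    "entities": ["IRS", "deduction"],  # entity recognition handles this
    "voice": "first person",           # grammar makes this detectable
}
needs_close_reading = {
    "scope": "one filer's experience, not the general rule",  # a human inference
    "tax_year": None,  # never stated; no tool can restore it
}
print(machine_recoverable | needs_close_reading)
```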

– Michael Andrews