Chatbots effortlessly answer unremarkable questions. But we can’t trust them to answer unexpected ones.
I’ve been exploring the complex roots of misinformation in AI. So far in this series, I have noted that online content is full of crowd-sourced information, and that AI platforms depend on crawling that content. This dependency creates a deep vulnerability to misinformation for AI platforms.
This post looks at how AI is changing platforms, and why these changes make crowd-sourced information less useful and informative. AI platforms are perversely making essential online information less intelligent.
The evolution of platforms
Platforms emerged to solve the problem of how to access information written by different parties. Platforms aspire to be a one-stop source of information. To deliver that promise, they host information from many sources.
No single organization can publish comprehensive information that covers every contingency its customers or stakeholders might face. Even when an organization should ideally be the authoritative source on a topic, it faces the realities of resource constraints. It can only publish about the issues that are most frequently sought or have the highest business impact.
Outside parties, such as partners or customers, contribute advice and information that authoritative sources don’t have the capacity or inclination to provide. This long tail information may cover lesser-known details, considerations involved in resolving issues, edge cases, and sometimes issues companies would prefer not to publicize prominently, such as known defects. Platforms recognize that users seek such information and aggregate it to make it more easily available.
Platforms emerged that specialized in offering access to a range of online content. Search platforms collect and rank relevant links to any web page from any website. Ratings platforms like Rotten Tomatoes or Angie’s List collect and host comments from any commentator. Marketplace platforms such as eBay or Amazon collect customer reviews of products and vendors. GitHub emerged as a platform for hosting code and the discussion around it: bug reports, feature requests, and proposed solutions.
Platforms live on content contributed by outside sources. Crowd-sourced content is often characterized as “user-generated”, implying customers write it. Yet platforms also aggregate content from other parties that contribute online content, such as partners, distributors, journalists, critics, and influencers. Some platforms aggregate or syndicate machine-generated data (such as prices, inventories, or schedules) from different sources.
Platforms aggregate details that no single contributor could develop. The platform assembles a mosaic from many individual pieces. Sometimes the mosaic is complete, though often it’s not.
Platforms have taken advantage of – and benefited from – the web’s open contribution model. Anyone can post their views online, and it’s up to readers to decide what’s useful. Readers vote their preferences by clicking links, those clicks signal the value of the content, and algorithms rank content accordingly. Ranking is never a perfect process, but at least individual readers play a role in shaping it.
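The click-signal feedback loop described above can be sketched as a toy ranking function. All names, data, and the scoring rule here are invented for illustration; no real platform’s algorithm is this simple.

```python
from collections import Counter

# Toy model of reader-driven ranking: each click on a result acts as a
# vote for that page, and pages are re-ranked by accumulated clicks.
# (Illustrative sketch only; real ranking algorithms weigh many signals.)

clicks = Counter()

def record_click(url: str) -> None:
    """A reader clicking a link signals that the content was useful."""
    clicks[url] += 1

def rank(urls: list[str]) -> list[str]:
    """Order candidate pages by the value readers have signaled."""
    return sorted(urls, key=lambda u: clicks[u], reverse=True)

pages = ["vendor-docs", "forum-thread", "blog-review"]
for _ in range(5):
    record_click("forum-thread")
record_click("blog-review")

print(rank(pages))  # the heavily clicked forum thread rises to the top
```

The point of the sketch is that readers, in aggregate, shape what surfaces: the ranking is imperfect, but it is steered by human judgments.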
AI platforms alter the utility of crowd-sourced information
AI platforms such as ChatGPT and Claude are the latest stage in the evolution of platforms. Like their predecessors, they pull together information originally contributed by diverse sources and present themselves as a one-stop destination for answers. But they change what value readers get from the sources.
Readers value crowd-sourced information according to whether it’s efficient and informative for them.
When there are many contributions, it can be inefficient to read them all.
But in many situations, each contribution provides an extra perspective that makes reading additional contributions informative. For example, it’s often informative to compare different sources of information about a topic, such as a vendor and its competitor. Rarely can readers rely confidently on a single source.
Sifting through numerous postings is an inefficient way of determining undisputed truths, because there’s a lot of redundancy in them. Yet the collective voice of the crowd is informative for complex situations where distinct perspectives contribute to a fuller picture, though the process is still inefficient.
AI platforms make the process of assessing crowd-sourced content more efficient. But in doing so, they make the information less informative.
When aggregated, individual insights can be flattened into anodyne statements. For example, we learn from an AI summary of customer reviews of a bookstore chain branch that the store offers a variety of books – an obvious observation. But AI summaries won’t tell us if the store has many books about philosophy or learning an instrument. We expect computers to produce “intelligence” but find it missing.

AI platforms damage the quality of crowd-sourced information
Before LLMs, platforms encouraged users to view the original posts. The platform’s role was to act as a clearinghouse that indexes contributed content.
Now, clearinghouse-oriented platforms are morphing into AI platforms. Forums like Stack Exchange are being displaced by tools such as Copilot and ChatGPT.
The AI platform transforms the content developed by others, a role I refer to as third-party AI. The AI platform does not originate the source information nor does it take responsibility for its accuracy. It operates on the assumption that relevant and accurate information exists within the corpus of content it has crawled.
AI platforms utilize open web content that is “freely available” (not blocked by paywalls) and repurposable (easily scraped and tokenized). Bots harvest online content and transform it enough to avoid copyright infringement. For AI platforms, online content is a cost-free resource on which to build services.
But source content can only be bent so far before it deforms.
Long tail information – highly specific information that’s not common knowledge – is most likely to be crowd-sourced. It is also least likely to be fact-checked, qualified, or maintained. Crowd-sourced information is incomplete in both its coverage of issues and the scope it addresses for each issue. An answer you seek may never have been written about.
Imagine you are troubleshooting a software glitch, which could be caused by many factors, such as your hardware, other software you run, the version of software you are using, and so on. The software vendor doesn’t offer clear information about solving your specific problem, so you turn to an online forum for answers. Others have posted similar problems and offered a range of diverging solutions. Some solutions don’t seem to make sense in your situation, while others don’t work. As far as you can tell, none of the suggestions relates to the exact setup or circumstances you have.
With crowd-sourced information, it can be challenging to figure out which answers address a situation similar enough to your own. Some problems are perennial, and some are novel. Solutions can be routine or idiosyncratic. Rebooting your computer or clearing your browser cache is common advice that sometimes helps, but often doesn’t.
These examples highlight the challenges of matching queries with information in long tail scenarios. Until recently, people needed to vet all the answers one by one to decide which were useful. Now, LLMs promise to do this.
The folly of the crowd in AI platforms
Crowd-sourced content provides vital information not otherwise available, though it is not reliable. Individual contributions can be informative, though they are rarely definitive. When summarized collectively, they become both unspecific and prone to collective biases.
LLMs are reliable when summarizing ubiquitous, stable knowledge that enjoys a high degree of consensus. A chatbot will confidently tell us the year of US independence from Britain because there’s little controversy about it.
When everyone knows the same facts or has identical experiences, all crawled text says the same things. There is little need to consult many sources. After all, if everyone has the same opinion or says the same thing, each person’s view adds no new information.
When bots crawl content and encounter the same information repeated in multiple sources, they infer that the information is likely accurate. Yet the ubiquity of a statement is not a reliable proxy for its accuracy.
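A minimal sketch shows why repetition is a weak proxy for truth. Imagine an aggregator that simply returns the most frequently stated answer from crawled text; the data and the aggregation rule are invented for illustration, not how any real LLM works.

```python
from collections import Counter

# Crawled statements about a long-tail troubleshooting question.
# The widely repeated folk answer outnumbers the correct but obscure
# one. (Invented data for illustration.)
crawled_statements = [
    "clear the cache",    # folk advice, repeated across many forums
    "clear the cache",
    "clear the cache",
    "clear the cache",
    "update the driver",  # the actual fix, mentioned only once
]

def most_repeated_answer(statements: list[str]) -> str:
    """Treat ubiquity as a proxy for accuracy: return the modal answer."""
    return Counter(statements).most_common(1)[0][0]

print(most_repeated_answer(crawled_statements))
# The aggregator surfaces "clear the cache", burying the rare correct fix.
```

A frequency-based aggregator can only echo what the crowd already says; the rarely stated correct answer never surfaces.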
Instead of leveraging the “wisdom of the crowd,” bots can fall prey to the “tragedy of the commons”: collective ignorance embedded in past online content.
Bot answers are anchored in eclectic and unvetted sources that are blended together into a vast corpus. Bots have trouble surfacing information that is not widely known, especially if it is at variance with more common explanations.
Bot behavior can perpetuate a bias toward legacy content and ideas. Much of the content that bots crawl may contain dated information or unreliable folk knowledge that is widely repeated but misleading.
Bots misuse content from online forums. Readers value forums as places of discovery, not as archives of past discussion. Forums are often where new issues first surface. A freshly discovered problem, or an alternative viewpoint, starts as a weak signal that could grow into a more significant piece of information. But until new issues are widely discussed (and noticed by bot crawlers), they aren’t likely to show up in bot answers.
The rationale for online platforms is being upended. In the pre-bot era, platforms offered the convenience of gathering different, sometimes diverging perspectives in a single place. Readers could scan for the most relevant or recent information. Contributors had an incentive to post if they felt their statements would be read and noticed.
Now, bots become the audience and the judge of the value of contributions. Bots read posts and decide whether and how to summarize them for human readers. They are hungry for any content they can access. And they can’t afford to be caught not knowing an answer.
Yet chatbots have limited powers of discrimination. They rely on vast quantities of legacy content that may no longer be current.
AI platforms depend on crowd-sourced content to generate answers, but make crowd-sourced content less informative.
– Michael Andrews