Categories
Content Engineering

Paradata: where analytics meets governance

Organizations aspire to make data-informed decisions. But can they confidently rely on their data? What does that data really tell them, and how was it derived? Paradata, a specialized form of metadata, can provide answers.

Many disciplines use paradata

You won’t find the word paradata in a household dictionary, and the concept is largely unknown in the content profession. Yet paradata is highly relevant to content work. It provides context showing how the activities of writers, designers, and readers can influence each other.

Paradata provides a unique and missing perspective. A forthcoming book on paradata defines it as “data on the making and processing of data.” Paradata extends beyond basic metadata — “data about data.” It introduces the dimensions of time and events. It considers the how (process) and the what (analytics).

Think of content as a special kind of data that has a purpose and a human audience. Content paradata can be defined as data on the making and processing of content.

Paradata can answer:

  • Where did this content come from?
  • How has it changed?
  • How is it being used?

Paradata differs from other kinds of metadata in its focus on the interaction of actors (people and software) with information. It provides context that helps planners, designers, and developers interpret how content is working.

Paradata traces activity during various phases of the content lifecycle: how it was assembled, interacted with, and subsequently used. It can explain content from different perspectives:

  • Retrospectively 
  • Contemporaneously
  • Predictively

Paradata provides insights into processes by highlighting the transformation of resources in a pipeline or workflow. By recording the changes, it becomes possible to reproduce those changes. Paradata can provide the basis for generalizing the development of a single work into a reusable workflow for similar works.
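One way to picture this is as an append-only log of transformations. The sketch below is a minimal illustration, not a standard: the class names, field names, and sample actors are all invented for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ParadataEvent:
    """One recorded step in a content pipeline: who did what, to what, when."""
    actor: str       # person or software agent
    action: str      # e.g. "drafted", "edited", "approved"
    resource: str    # identifier of the content item
    timestamp: str
    detail: dict = field(default_factory=dict)

class ParadataLog:
    """Append-only record of transformations, enough to replay a workflow."""
    def __init__(self):
        self.events = []

    def record(self, actor, action, resource, **detail):
        self.events.append(ParadataEvent(
            actor, action, resource,
            datetime.now(timezone.utc).isoformat(), detail))

    def history(self, resource):
        """Retrospective view: every action applied to one item, in order."""
        return [e for e in self.events if e.resource == resource]

log = ParadataLog()
log.record("maria", "drafted", "guide-42")
log.record("style-bot", "edited", "guide-42", rule="plain-language")
log.record("sam", "approved", "guide-42")
print([e.action for e in log.history("guide-42")])  # → ['drafted', 'edited', 'approved']
```

Replaying the ordered history of one item is exactly what turns a single work's development into a reusable workflow for similar works.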

Some discussions of paradata refer to it as “processual meta-level information on processes” (processual here refers to the process of developing processes). Knowing how activities happen provides the foundation for sound governance.

Contextual information facilitates reuse. Paradata can enable the cross-use and reuse of digital resources. A key challenge for reusing any content created by others is understanding its origins and purpose. It’s especially challenging when wanting to encourage collaborative reuse across job roles or disciplines. One study of the benefits of paradata notes: “Meticulous documentation and communication of contextual information are exceedingly critical when (re)users come from diverse disciplinary backgrounds and lack a shared tacit understanding of the priorities and usual practices of obtaining and processing data.”

While paradata isn’t currently utilized in mainstream content work, a number of content-adjacent fields use paradata, pointing to potential opportunities for content developers. 

Content professionals can learn from how paradata is used in:

  • Survey and research data
  • Learning resources
  • AI
  • API-delivered software

Each discipline looks at paradata through different lenses and emphasizes distinct phases of the content or data lifecycle. Some emphasize content assembly, while others emphasize content usage. Some emphasize both, building a feedback loop.

Conceptualizing paradata
Different perspectives of paradata. Source: Isto Huvila

Content professionals should learn from other disciplines, but they should not expect others to talk about paradata in the same way.  Paradata concepts are sometimes discussed using other terms, such as software observability. 

Paradata for surveys and research data

Paradata is most closely associated with developing research data, especially statistical data from surveys. Survey researchers pioneered the field of paradata several decades ago, aware of the sensitivity of survey results to the conditions under which they are administered.

The National Institute of Statistical Sciences describes paradata as “data about the process of survey production” and as “formalized data on methodologies, processes and quality associated with the production and assembly of statistical data.”  

Researchers realize that how information is assembled can influence what can be concluded from it. In a survey, confounding factors could include a glitch in a form or a leading question that prompts people to answer disproportionately in a given way.

The US Census Bureau, which conducts a range of surveys of individuals and businesses, explains: “Paradata is a term used to describe data generated as a by-product of the data collection process. Types of paradata vary from contact attempt history records for interviewer-assisted operations, to form tracing using tracking numbers in mail surveys, to keystroke or mouse-click history for internet self-response surveys.”  For example, the Census Bureau uses paradata to understand and adjust for non-responses to surveys. 

Paradata for surveys
Source: NDDI 

As computers become more prominent in the administration of surveys, they become actors influencing the process. Computers can record an array of interactions between people and software.

Why should content professionals care about survey processes?

Think about surveys as a structured approach to assembling information about a topic of interest. Paradata can indicate whether users could submit survey answers and under what conditions people were most likely to respond. Researchers use paradata to measure user burden. Paradata helps illuminate the work required to provide information, a topic relevant to content professionals interested in the authoring experience of structured content.

Paradata supports research of all kinds, including UX research. It’s used in archaeology and archives to describe the process of acquiring and preserving assets and changes that may happen to them through their handling. It’s also used in experimental data in the life sciences.

Paradata supports reuse. It provides information about the context in which information was developed, improving its quality, utility, and reusability.

Researchers in many fields are embracing what is known as the FAIR principles: making data Findable, Accessible, Interoperable, and Reusable. Scientists want the ability to reproduce the results of previous research and build new knowledge upon them. Paradata supports the goals of FAIR data. As one study notes, “understanding and documentation of the contexts of creation, curation and use of research data…make it useful and usable for researchers and other potential users in the future.”

Content developers similarly should aspire to make their content findable, accessible, interoperable, and reusable for the benefit of others. 

Paradata for learning resources

Learning resources are specialized content that needs to adapt to different learners and goals. How resources are used and changed influences the outcomes they achieve. Some education researchers have described paradata as “learning resource analytics.”

Paradata for instructional resources is linked to learning goals. “Paradata is generated through user processes of searching for content, identifying interest for subsequent use, correlating resources to specific learning goals or standards, and integrating content into educational practices,” notes a Wikipedia article. 

Data about usage isn’t represented in traditional metadata. A document prepared for the US Department of Education notes: “Say you want to share the fact that some people clicked on a link on my website that leads to a page describing the book. A verb for that is ‘click.’ You may want to indicate that some people bookmarked a video for a class on literature classics. A verb for that is ‘bookmark.’ In the prior example, a teacher presented resources to a class. The verb used for that is ‘taught.’ Traditional metadata has no mechanism for communicating these kinds of things.”

“Paradata may include individual or aggregate user interactions such as viewing, downloading, sharing to other users, favoriting, and embedding reusable content into derivative works, as well as contextualizing activities such as aligning content to educational standards, adding tags, and incorporating resources into curriculum.” 

Usage data can inform content development.  One article expresses the desire to “establish return feedback loops of data created by the activities of communities around that content—a type of data we have defined as paradata, adapting the term from its application in the social sciences.”

Unlike traditional web analytics, which focuses on web pages or user sessions and doesn’t consider the user context, paradata focuses on the user’s interactions in a content ecosystem over time. The data is linked to content assets to understand their use. It resembles social media metadata that tracks the propagation of events as a graph.

“Paradata provides a mechanism to openly exchange information about how resources are discovered, assessed for utility, and integrated into the processes of designing learning experiences. Each of the individual and collective actions that are the hallmarks of today’s workflow around digital content—favoriting, foldering, rating, sharing, remixing, embedding, and embellishing—are points of paradata that can serve as indicators about resource utility and emerging practices.”

Paradata for learning resources utilizes Activity Streams JSON, which can track interactions between actors and objects using predefined verbs (an “Activity Schema”) so that they can be measured. The approach can be applied to any kind of content.
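A record in this style pairs an actor, an activity type, and an object. The sketch below is modeled loosely on the W3C Activity Streams 2.0 vocabulary (“Announce” is an AS2 activity type that roughly covers sharing); the teacher, resource name, and URL are invented for illustration.

```python
import json

# Hypothetical learning-resource paradata record in Activity Streams 2.0 style.
activity = {
    "@context": "https://www.w3.org/ns/activitystreams",
    "type": "Announce",  # AS2 activity type, loosely "shared with a group"
    "actor": {"type": "Person", "name": "A. Teacher"},
    "object": {
        "type": "Document",
        "name": "Literature Classics Reading List",
        "url": "https://example.org/resources/lit-classics",  # placeholder
    },
    "published": "2024-05-01T10:00:00Z",
}

print(json.dumps(activity, indent=2))
```

Because each record names a verb explicitly, aggregating records by type yields the “bookmark,” “taught,” and “shared” counts that traditional metadata cannot express.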

Paradata for AI

AI has a growing influence over content development and distribution. Paradata is emerging as a strategy for producing “explainable AI” (XAI).  “Explainability, in the context of decision-making in software systems, refers to the ability to provide clear and understandable reasons behind the decisions, recommendations, and predictions made by the software.”

The Association for Intelligent Information Management (AIIM) has suggested that a “cohesive package of paradata may be used to document and explain AI applications employed by an individual or organization.” 

Paradata provides a manifest of the AI training data. AIIM identifies two kinds of paradata: technical and organizational.

Technical paradata includes:

  • The model’s training dataset
  • Versioning information
  • Evaluation and performance metrics
  • Logs generated
  • Existing documentation provided by a vendor

Organizational paradata includes:

  • Design, procurement, or implementation processes
  • Relevant AI policy
  • Ethical reviews conducted

Paradata for AI
Source: Patricia C. Franks

The provenance of AI models and their training has become a governance issue as more organizations use machine learning models and LLMs to develop and deliver content. AI models tend to be “black boxes” that users are unable to untangle and understand.

How AI models are constructed has governance implications, given their potential to be biased or contain unlicensed copyrighted or other proprietary data. Developing paradata for AI models will be essential if these models are to achieve wide adoption.

Paradata and document observability

Observing how behavior unfolds helps to debug problems and make systems more resilient.

Fabrizio Ferri-Benedetti, whom I met some years ago in Barcelona at a Confab conference, recently wrote about a concept he calls “document observability” that has parallels to paradata.

Content practices can borrow from software practices. As software becomes more API-focused, firms are monitoring API logs and metrics to understand how various routines interact, a field called observability. The goal is to identify and understand unanticipated occurrences. “Debugging with observability is about preserving as much of the context around any given request as possible, so that you can reconstruct the environment and circumstances that triggered the bug.”

Observability utilizes a profile called MELT: Metrics, Events, Logs, and Traces. MELT is essentially paradata for APIs.

Software observability pattern
Software observability pattern.  Source: Karumuri, Solleza, Zdonik, and Tatbul
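Applied to API-delivered content, the MELT profile can be pictured as four complementary views of the same system. The field names and values below are assumptions made for illustration, not part of any observability standard.

```python
# Illustrative mapping of the MELT profile (Metrics, Events, Logs, Traces)
# to content delivered through an API. All names are invented examples.
melt_record = {
    "metrics": {"requests": 1342, "p95_latency_ms": 180},       # aggregate numbers
    "events": [{"type": "content_updated", "item": "faq-7"}],   # discrete happenings
    "logs": ["GET /content/faq-7 200"],                         # timestamped messages
    "traces": [{"trace_id": "abc123", "spans": [                # one request's journey
        {"name": "fetch-fragment", "duration_ms": 12},
        {"name": "assemble-page", "duration_ms": 34},
    ]}],
}

# A trace preserves the context of a single request end to end, so the
# environment that produced an anomaly can be reconstructed afterward.
total_span_time = sum(
    span["duration_ms"]
    for trace in melt_record["traces"]
    for span in trace["spans"])
print(total_span_time)  # → 46
```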

Content, like software, is becoming more API-enabled. Content can be tapped from different sources and fetched interactively. The interaction of content pieces in a dynamic context showcases the content’s temporal properties.

When things behave unexpectedly, systems designers need the ability to reverse engineer behavior. An article in IEEE Software states: “One of the principles for tackling a complex system, such as a biochemical reaction system, is to obtain observability. Observability means the ability to reconstruct a system’s internal state from its outputs.”  

Ferri-Benedetti notes, “Software observability, or o11y, has many different definitions, but they all emphasize collecting data about the internal states of software components to troubleshoot issues with little prior knowledge.”  

Because documentation is essential to the software’s operation, Ferri-Benedetti  advocates treating “the docs as if they were a technical feature of the product,” where the content is “linked to the product by means of deep linking, session tracking, tracking codes, or similar mechanisms.”

He describes document observability (“do11y”) as “a frame of mind that informs the way you’ll approach the design of content and connected systems, and how you’ll measure success.”

In contrast to observability, which relies on incident-based indexing, paradata is generally defined by a formal schema. A schema allows stakeholders to manage and change the system instead of merely reacting to it and fixing its bugs. 

Applications of paradata to content operations and strategy

Why introduce a new concept that most people have never heard of? Because content professionals must expand their toolkit.

Content is becoming more complex. It touches many actors: employees in various roles, customers with multiple needs, and IT systems with different responsibilities. Stakeholders need to understand the content’s intended purpose and use in practice and if those orientations diverge. Do people need to adapt content because the original does not meet their needs? Should people be adapting existing content, or should that content be easier to reuse in its original form?

Content continuously evolves and changes shape, acquiring emergent properties. People and AI customize, repurpose, and transform content, making it more challenging to know how these variations affect outcomes. Content decisions involve more people over extended time frames. 

Content professionals need better tools and metrics to understand how content behaves as a system. 

Paradata provides contextual data about the content’s trajectory. It builds on two kinds of metadata that connect content to user action:

  • Administrative metadata capturing the actions of the content creators or authors, intended audiences, approvers, versions, and when last updated
  • Usage metadata capturing the intended and actual uses of the content, both internal (asset role, rights, where item or assets are used) and external (number of views, average user rating)

Paradata also incorporates newer forms of semantic and blockchain-based metadata that address change over time:

  • Provenance metadata
  • Actions schema types

Provenance metadata has become essential for image content, which can be edited and transformed in multiple ways that change what it represents. Organizations need to know the source of the original and what edits have been made to it, especially with the rise of synthetic media. Metadata can indicate on what an image was based or derived from, who made changes, or what software generated changes. Two corporate initiatives focused on provenance metadata are the Content Authenticity Initiative and the Coalition for Content Provenance and Authenticity.

Actions are an established — but underutilized — dimension of metadata. The widely adopted schema.org vocabulary has a class of actions that address both software interactions and physical world actions. The schema.org actions build on the W3C Activity Streams standard, which was upgraded in version 2.0 to semantic standards based on JSON-LD types.
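A schema.org Action can be expressed directly as JSON-LD. In the sketch below, ReadAction, agent, object, and startTime are all defined in the published schema.org vocabulary; the reader and article names are invented for the example.

```python
import json

# A schema.org Action instance expressed as JSON-LD: a person read an article.
read_action = {
    "@context": "https://schema.org",
    "@type": "ReadAction",                                  # schema.org action type
    "agent": {"@type": "Person", "name": "Example Reader"}, # hypothetical actor
    "object": {"@type": "Article", "name": "Getting Started Guide"},
    "startTime": "2024-05-01T09:00:00Z",
}

print(json.dumps(read_action, indent=2))
```

Embedding records like this alongside conventional descriptive metadata is one way existing standards can carry the actor-and-event information that paradata requires.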

Content paradata can clarify common issues such as:

  • How can content pieces be reused?
  • What was the process for creating the content, and can one reuse that process to create something similar?
  • When and how was this content modified?

Paradata can help overcome operational challenges such as:

  • Content inventories where it is difficult to distinguish similar items or versions
  • Content workflows where it is difficult to model how distinct content types should be managed
  • Content analytics, where the performance of content items is bound up with channel-specific measurement tools

Implementing content paradata must be guided by a vision. The most mature application of paradata, survey research, has evolved over several decades, prompted by the need to improve survey accuracy. Other research fields are adopting paradata practices as research funders insist that data be “FAIR.” Change is possible, but it doesn’t happen overnight. It requires having a clear objective.

It may seem unlikely that content publishing will embrace paradata anytime soon. However, the explosive growth of AI-generated content may provide the catalyst for introducing paradata elements into content practices. The unmanaged generation of content will be a problem too big to ignore.

The good news is that online content publishing can take advantage of existing metadata standards and frameworks that provide paradata. What’s needed is to incorporate these elements into content models that manage internal systems and external platforms.

Online publishers should introduce paradata into systems they directly manage, such as their digital asset management system or customer portals and apps. Because paradata can encompass a wide range of actions and behaviors, it is best to prioritize tracking actions that are difficult to discern but likely to have long-term consequences. 

Paradata can provide robust signals to reveal how content modifications impact an organization’s employees and customers.  

– Michael Andrews

Categories
Big Content

Time to end Google’s domination of schema.org

Few companies enjoy being the object of public scrutiny.  But Google, one of the world’s most recognized brands, seems especially averse.  Last year, when Sundar Pichai, Google’s chief executive, was asked to testify before Congress about antitrust concerns, he refused.  They held the hearing without his presence.  His name card was there, in front of an empty chair.

Last month Congress held another hearing on antitrust.  This time, Pichai was in his chair in front of the cameras, remotely if reluctantly.   During the hearings, Google’s fairness was a focal issue.  According to a summary of the testimony on the ProMarket blog of the University of Chicago Business School’s Stigler Center: “If the content provider complained about its treatment [by Google], it could be disappeared from search. Pichai didn’t deny the allegation.”

One of the major ways that content providers gain (or lose) visibility on Google — the first option that most people choose to find information — is through their use of a metadata standard known as schema.org.  And the hearing revealed that publishers are alleging that Google engages in bullying tactics relating to how their information is presented on the Google platform.  How might these issues be related?  Who sits in the chair that decides the stakes?

Metadata and antitrust may seem arcane and wonky topics, especially when looked at together. Each requires some basic knowledge to understand, so it is rare that the interaction between the two is discussed. Yet it’s never been more important to remove the obscurity surrounding how the most widely used standard for web metadata, schema.org, influences the fortunes of Google, one of the most valuable companies in the world. 

Why schema.org is important to Google’s dominant market position

Google controls 90% of searches in the US, and its Android operating system powers 9 of 10 smartphones globally.  Both these products depend on schema.org metadata (or “structured data”) to induce consumers to use Google products.  

A recent investigation by The Markup noted that “Google devoted 41 percent of the first page of search results on mobile devices to its own properties and what it calls ‘direct answers.’”  Many of these direct answers are populated by schema.org metadata that publishers provide in hopes of driving traffic to their websites.  But Google has a financial incentive to stop traffic from leaving its websites.  The Markup notes that “Google makes five times as much revenue through advertising on its own properties as it does selling ad space on third-party websites.”  Tens of billions of dollars of Google revenues depend in some way on the schema.org metadata.  In addition to web search results, many Google smartphone apps including Gmail capture schema.org metadata that can support other ad-related revenues.  

During the recent antitrust hearings, Congressman Cicilline told Sundar Pichai that “Google evolved from a turnstile to the rest of the web to a walled garden.”  The walled garden problem is at the core of Google’s monopoly position.   Perversely, Google has been able to manipulate the use of public standards to create a walled garden for its products.  Google reaps a disproportionate benefit from the standard by preventing broader uses of the standards that could result in competitive threats to Google.

There’s a deep irony in the fact that a W3C-sanctioned metadata standard, schema.org, has been captured by a tech giant not just to promote its unique interests but to limit the interests of others.  Schema.org was supposed to popularize the semantic web and help citizens gain unprecedented access to the world’s information. Yet Google has managed to monopolize this public asset. 

How schema.org became a walled garden

How Google came to dominate a W3C-affiliated standard requires a little history.  The short history is that Google has always been schema.org’s chief patron.  It created schema.org and promoted it in the W3C.  Since then, it has consolidated its hold on it. 

The semantic web — the inspiration for schema.org — has deep roots in the W3C.  Tim Berners-Lee, the inventor of the World Wide Web, coined the concept and has been its major champion.  The commercialization of the approach has been long in the making. Metaweb was the first venture-funded company to commercialize the semantic web with its product Freebase.  The New York Times noted at the time: “In its ambitions, Freebase has some similarities to Google — which has asserted that its mission is to organize the world’s information and make it universally accessible and useful. But its approach sets it apart.”  Google bought Metaweb and its Freebase database in 2010, buying and removing a potential competitor.  The following year (2011), Google launched the schema.org initiative, bringing along Bing and Yahoo, the other search engines that competed with Google.  While the market share of Bing and Yahoo was small compared to Google’s, the launch raised hopes that more options would be available for search.  Google noted: “With schema.org, site owners can improve how their sites appear in search results not only on Google, but on Bing, Yahoo! and potentially other search engines as well in the future.”  Nearly a decade later, there is even less competition in search than there was when schema.org was created.

In 2015 a Google employee proposed that schema.org become a W3C community group.  He soon became the chair of the group once it was formed.  

By making schema.org a W3C community, the Google-driven initiative gained credibility through its W3C endorsement as a community-driven standard. Previously, only Google and its initiative partners (Microsoft’s Bing, Yahoo, and later Russia’s Yandex) had any say over the decisions that webmasters and other individuals involved with publishing web content needed to follow, a situation which could have triggered antitrust alarms relating to collusion.   Google also faced the challenge of encouraging webmasters to adopt the schema.org standard.  Webmasters had been slow to embrace the standard and assume the work involved with using it.  Making schema.org an open community-driven standard solved multiple problems for Google at once.  

In normal circumstances — untinged by a massive and domineering tech platform — an open standard should have encouraged webmasters to participate in the standards-making process and express their goals and needs. Ideally, a community-driven standard would be the driver of innovation. It could finally open up the semantic web for the benefit of web users.  But the tight control Google has exercised over the schema.org community has prevented that from happening.

The murky ownership of the schema.org standard

From the beginning of schema.org, Google’s participation has been more active than anyone else’s, and Google’s guidance about schema.org has been more detailed than even the official schema.org website.  This has created a great deal of confusion among webmasters about what schema.org requires for compliance with the standard, as opposed to what Google requires for compliance for its search results and ranking.  It’s common for an SEO specialist to ask a question about Google’s search results in a schema.org forum.  Even people with a limited knowledge of schema.org’s mandate assume — correctly — that it exists primarily for the benefit of Google.  

In theory, Google is just one of numerous organizations that implements a standard that is created by a third party.  In practice, Google is both the biggest user of the schema.org standard — and also its primary author.  Google is overwhelmingly the biggest consumer of schema.org structured data.  It also is by far the most active contributor to the standard.  Most other participants are along for the ride: trying to keep up with what Google is deciding internally about how it will use schema.org in its products, and what it is announcing externally about changes Google wants to make to the standard.

In many cases, if you want to understand the schema.org standard, you need to rely on Google’s documentation.  Webmasters routinely complain about the quality of schema.org’s documentation: its ambiguities, or the lack of examples.  Parts of the standard that are not priorities for Google are not well documented anywhere.  If they are priorities for Google, however, Google itself provides excellent documentation about how information should be specified in schema.org so that Google can use it.   Because schema.org’s documentation is poor, the focus of attention stays on Google.

The reliance that nearly everyone has on Google to ascertain compliance with schema.org requirements was highlighted last month by Google’s decision to discontinue its Structured Data Testing Tool, which is widely used by webmasters to check that their schema.org metadata is correct — at least as far as Google is concerned.  Because the concrete implementation requirements of schema.org are often murky, many rely on this Google tool to verify the correctness of the data independently of how the data would be used.  Google is replacing this developer-focused tool with a website that checks whether the metadata will display correctly in Google’s “rich results.”  The new “Rich Results Test Tool” acknowledges finally what’s been an open secret: Google’s promotion of schema.org is primarily about populating its walled garden with content.  

Google’s domination of the schema.org community

The purpose of a W3C group should be to serve everyone, not just a single company. In the case of schema.org, a W3C community has been dominated from the start by a single company: Google.

Google has chaired the schema.org community continuously since its inception in 2015.   Microsoft (Bing) and Yahoo (now Verizon), who are minor players in the search business, participate nominally but are not very active considering they were founding members of schema.org.  Google, in contrast, has multiple employees active in community discussions, steering the direction of conversations.  These employees shape the core decisions, together with a few independent consultants who have longstanding relationships with Google.  It’s hard to imagine any decision happening without Google’s consent.  Google has effective veto power over decisions.

Google’s domination of the schema.org community is possible because the community has no resources of its own.  Google conveniently volunteers the labor of its employees to perform duties related to community business, but these activities will naturally reflect the interests of the employer, Google.  Since other firms don’t have the same financial incentives that Google has through its market dominance of search and smartphones in the outcomes of schema.org decisions, they don’t allocate their employees to spend time on schema.org issues.  Google corners the discussion while appearing to be the most generous contributor.

The absence of governance in the schema.org community

The schema.org community essentially has zero governance — a situation Google is happy with.  There are no formal rules, no formal process for proposals and decisions, no way to appeal a decision, and no formal roles apart from the chair, who ultimately can decide everything. There’s no process of recusal.  Google holds sway in part because the community has no permanent and independent staff.  And there’s no independent board of oversight reviewing how business is conducted.

It’s tempting to see the absence of governance as an example of a group of developers who have a disdain for bureaucracy — that’s the view Google encourages.  But the commercial and social significance of these community decisions is enormous and shouldn’t be cloaked in capricious informality.  Moreover, the more mundane problems of a lack of process are also apparent.  Many people who attempt to make suggestions feel frozen out and unwelcome. Suggestions may be challenged by core insiders who have deep relationships with one another.  The standards-making process itself lacks standardization.  

 In the absence of governance, the possibilities of a conflict of interest are substantial.  First, there’s the problem of self-dealing: Google using its position as the chair of a public forum to prioritize its own commercial interests ahead of others.  Second, there’s the possibility that non-Google proposals will be stopped because they are seen as costly to Google, if only because they create extra work for the largest single user of schema.org structured data.  

As a public company, Google is obligated to its shareholders — not to larger community interests.  A salaried Google employee can’t simultaneously promote his company’s commercial interests and promote interests that could weaken his company’s competitive position.  

Community bias in decisions

Few people want an open W3C community to exhibit biases in their decisions.  But owing to Google’s outsized participation and the absence of governance, decision making that’s biased toward Google’s priorities is common.

Whatever Google wants is fast-tracked — sometimes happening within a matter of days.  If a change to schema.org is needed to support a Google product that needs to ship, nothing will slow down that from happening.

Suggestions from people not affiliated with Google face a tougher journey.  If a suggestion does not match Google’s priorities, it is slow-walked: challenged as to its necessity or practicality, then left to languish as an open issue on GitHub, where it will go unnoticed unless it generates an active discussion.  Eventually, the chair will cull proposals that have been long buried in the interest of closing out open issues.

While individuals and groups can propose suggestions of their own, successful ones tend to be incremental in nature, already aligned with Google’s agenda.  More disruptive or innovative ones are less likely to be adopted.

In the absence of a defined process, the ratification of proposals tends to happen through an informal virtual acclamation.  Various Google employees will conduct a public online discussion agreeing with one another on the merits of adopting a proposal or change.  With “community” sentiment demonstrated, the change is pushed ahead.  

Consumer harm from Google’s capture of schema.org

Google’s domination of schema.org is an essential part of its business model.  Schema.org structured data drives traffic to Google properties, and Google has leveraged it so that it can present fewer links that would drive traffic elsewhere.  The more time consumers spend on Google properties, the more their information decisions are limited to the ads that Google sells.  Consumers need to work harder to find “organic” links (objectively determined by their query and involving no payment to Google) to information sources they seek.

A technical standard should be a public good that benefits all.  In principle, publishers that use schema.org metadata should be able to expand the reach of their information, so that apps from many firms take advantage of it, and consumers have more choices about how and where they get their information.  The motivating idea behind semantic structured data such as schema.org is that information becomes independent of platforms.  But ironically, for consumers to enjoy the value of structured data, they mostly need to use Google products.  This is a significant market failure, and it hasn’t happened by accident.

The original premise of the semantic web was based on openness.  Publishers freely offered information, and consumers could freely access it.  But the commercial version, driven by Google, has changed this dynamic.  The commercial semantic web isn’t truly open; it is asymmetrically open.  It involves open publishing but closed access.  Web publishers are free to publish their data using the schema.org standard and are actively encouraged to do so by Google. The barriers to creating structured data are minimal, though the barriers to retrieving it aren’t.  

Right now, only a firm with the scale of Google is in a position to access this data and normalize it into something useful for consumers.  Google’s formidable ad revenues allow it to crawl the web and harvest the data for its private gain.  A few other firms are also harvesting this data to build private knowledge graphs that similarly provide gated access.  The goal of open consumer access to this data remains elusive.  A small company may invest time or money to create structured data, but it lacks the means to use structured data for its own purposes.  But it doesn’t have to be this way.

Has Google’s domination of schema.org stifled innovation?

When considering how big tech has influenced innovation, it is necessary to pose a counterfactual question: What might have been possible if the heavy hand of a big tech platform hadn’t been meddling?

Google’s routine challenge to suggestions for additions to the schema.org vocabulary is to question whether the new idea will be used.  “What consuming application is going to use this?” is the common screening question.  If Google isn’t interested in using it, why is it worthwhile doing?  Unless the individual making the suggestion is associated with a huge organization that will build significant infrastructure around the new proposal, the proposal is considered unviable.  

The word choice of “consuming applications” is an example of how Google avoids referring to itself and its market dominance.  The Markup recently revealed how Google coaches its employees to avoid phrases that could get it into additional antitrust trouble.  Within the schema.org community group, Google employees strive to make discussions appear objective, as if Google were disinterested in the decision.

One area where Google has discouraged alternative development is the linking of schema.org data with data described using other metadata vocabularies (standards).  This is significant for multiple reasons.  The schema.org vocabulary is limited in scope, mostly focusing on commercial entities.  Because Google is not interested in covering non-commercial entities, publishers need to rely on other vocabularies.  But Google doesn’t want to look at other vocabularies, claiming that it is too taxing to crawl data described by them.  In this, Google is making a commercial decision that goes against the principles of linked data (a cornerstone of the semantic web), which explicitly encourages the mixing of vocabularies.  Publishers are forced to obey Google’s diktats: why supply metadata that Google, the biggest consumer of schema.org metadata, says it will ignore?  With a few select exceptions, Google mandates that only schema.org metadata be used in web content and no other semantic vocabularies.  Google sets the vision of what schema.org is, and what it does.
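
The vocabulary mixing that linked data encourages is straightforward in practice.  Here is a minimal sketch in Python, printing a JSON-LD block that combines schema.org terms with Dublin Core terms declared under a `dct:` prefix (the property values are invented for illustration):

```python
import json

# A sketch of vocabulary mixing in JSON-LD: schema.org terms alongside
# Dublin Core terms, whose "dct:" prefix is declared in @context.
# The property values are made-up examples.
doc = {
    "@context": {
        "@vocab": "https://schema.org/",
        "dct": "http://purl.org/dc/terms/"
    },
    "@type": "Article",
    "headline": "Restoration of a 12th-century fresco",  # schema.org property
    "dct:provenance": "Municipal museum archive"         # Dublin Core property
}

print(json.dumps(doc, indent=2))
```

The `@context` declaration is what lets one document draw on multiple vocabularies without ambiguity — exactly the linked-data principle that a schema.org-only mandate forecloses.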

To break this cycle, the public should be asking: How might consumers access and utilize information from the information commons without relying on Google?

There are several paths possible.  One might involve opening up the web crawl to wider use by firms of all sizes.  Another would be to expand the role of the schema.org vocabularies in APIs to support consumer apps.  Whatever path is pursued, it needs to be attractive to small firms and startups to bring greater diversity to consumers and spark innovation.

Possibilities for reform: Getting Google out of the way

If schema.org is to continue as a W3C community, associated with the trust conferred by that designation, then it will require serious reform.  It needs governance — and independence from Google.  It may need to transform into something far more formal than a community group.

In its current incarnation, it’s difficult to imagine this level of oversight.  The community is resource-starved, and relies on Google to function. But if schema.org isn’t viable without Google’s outsized involvement, then why does it exist at all?  Whose community is it?

There’s no rationale to justify the W3C lending its endorsement to a community that is dominated by a single company.  One solution is for schema.org to cease being part of the W3C umbrella and return to its prior status of being a Google-sponsored initiative.  That would be the honest solution, barring more sweeping changes.

Another option would be to create a rival W3C standard that isn’t tied to Google and therefore couldn’t be dominated by it, but a standard Google couldn’t afford to ignore.  That would be a more radical option, involving significant reprioritization by publishers.  It would be disruptive in the short term, but might ultimately result in greater innovation.  A starting point for this option would be to explore how to popularize Wikidata as a general-purpose vocabulary that could be used instead of schema.org.

A final option would be for Google to step up in order to step down.  It could acknowledge that it has benefited enormously from the thousands of webmasters and others who contribute structured data, and that it owes a debt to them.  It could offer to pay back in kind.  Google could draw on the recent example of Facebook’s funding of an independent body that will provide oversight of that company.  Google could fund a truly independent body to oversee schema.org, and financially guarantee the creation of a new organizational structure.  Such an organization would leave no questions about how decisions are made and would dispel industry concerns that Google is gaining unfair advantages.  Given the heightening regulatory scrutiny of Google, this option is not as extravagant as it may first sound.

On a pragmatic level, I would like to see schema.org realize its full potential.  This issue is important enough to merit broader discussion, not just in the narrow community of people who work on web metadata, but also among those involved with regulating technology and antitrust.  Google spends considerable sums, often furtively, hiring academic experts and others to dismiss concerns about its market dominance.  The role of metadata should be to make information more transparent.  That’s why this matters in many ways.

— Michael Andrews

Clarification: schema.org’s status as a north star and as a standard (August 12)

The welcome page of schema.org notes it is “developed by an open community process, using the public-schemaorg@w3.org mailing list.”  When I first published this post I referred to schema.org as an “open W3C metadata standard.”  Dan Brickley of Google tweeted to me and others stating that I made a “simple factual error” in doing so.  He is technically correct that my characterization of the W3C’s role is not precise, so I have changed the wording to say “a W3C-sanctioned metadata standard” instead (sanctioned = permitted), which is the most accurate wording I can manage, given the intentionally confusing nature of schema.org’s mandate.  This may seem like mincing words, but the implications are important, and I want to elaborate on what those are.

It is true that schema.org is not an official W3C standard in the sense that HTML5 is, which had a cast of thousands involved in its development.  For a specification to become an official W3C standard, it needs to go through a long process of community vetting, moving through stages such as first being a recommendation.  Even a recommendation is not yet an official standard, though it is widely followed.  But just because technical guidelines aren’t official W3C standards, or aren’t even referred to as standards, doesn’t mean they lack the effect of a standard that others are expected to follow in order to gain market acceptance.  Standards vary in the degree to which they are voluntary — schema.org has always been a voluntary standard.  And there are different levels of standards maturity within the W3C’s standards-making framework, with the most mature reflecting the most stringent levels of compliance.  W3C community group discussions around standards proposals are the least rigorous, normally associated with the least developed stage of standards activity.  They are typically associated with new ideas for standards, rather than well-formed standards already in wide use by thousands of companies.

A key difference with the schema.org community group is that it hosts discussions about a fully formed standard.  This standard was fully formed before there was ever a community group to discuss it.  In other words, there was never any community input on the basic foundation of schema.org.  Google decided this together with its partners in the schema.org initiative.

So I agree that schema.org fails to satisfy the expectations of a W3C standard.  The W3C has a well-established process for standards, and schema.org’s governance doesn’t remotely align with how a W3C standard is developed.  

The problem is that by having a fully formed standard discussed in a W3C forum, it appears as if schema.org is a W3C standard of some sort.  Appearances do matter.  Webmasters on the W3C mailing list can reasonably assume the W3C endorses schema.org.  And by hosting a community group on schema.org, the W3C has lent support to it.  To outsiders, the W3C appears to be sponsoring schema.org’s development and, one would presume, to be interested in open participation in decision making about it.  The terms of service for schema.org treat “the schemas published by Schema.org as if the schemas were W3C Recommendations.”  The optics of schema.org imply it is W3C-ish.

Dan Brickley refers to schema.org as an “independent project” and not a “W3C thing.”  I’m not reassured by that characterization, which is the first time I’ve heard Google draw explicit distance from W3C affiliation.  He seems to be rejecting the notion that the W3C should provide any oversight over the schema.org process.  The W3C is merely providing a free mailing list.  The four corporate “sponsors” of schema.org set the binding conditions of the terms of service.  Nothing schema.org is working on is intended to become an official W3C standard and hence subject to W3C governance.

Even though 10 million websites use schema.org metadata and are affected by its decisions, schema.org’s decision making is tightly held.   Ultimate decision making authority rests with a Steering Committee (also chaired by Google) that is invitation-only and not open to public participation.  Supposedly, a W3C representative is allowed to sit on this committee, though the details about this, like much else in schema.org’s documentation, are unclear.   

It may seem reassuring to imagine that schema.org belongs to a nebulous entity called the “community,” but that glosses over how much of the community activities and decisions are Google-driven. Google does draw on the expertise and ideas of others, so that schema.org is more than one company’s creation.  But in the end, Google keeps tight control over the process so that schema.org reflects its priorities.  It would be simpler to call this the Google Structured Data schema.  

Schema.org appears to be public and open, while in practice it is controlled by a small group of competitors and one firm in particular.  Google is having its cake and eating it too.  If schema.org does not want W3C oversight, then the W3C should disavow any connection with it, and help reduce at least some of the confusion about who is in control of schema.org.

Categories
Content Engineering

Tailless Content Management

There’s an approach to content management that is being used but doesn’t seem to have a name.  Because it lacks a name, it doesn’t get much attention.  I’m calling this approach tailless content management — in contrast to headless content management.  The tailless approach and the headless approach are trying to solve different problems.

What Headless Doesn’t Do

Discussion of content management these days is dominated by headless CMSs.   A crop of new companies offer headless solutions, and legacy CMS vendors are also singing the praises of headless.  Sitecore says: “Headless CMSs mean marketers and developers can build amazing content today, and—importantly—future-proof their content operation to deliver consistently great content everywhere.”  

In simple terms, a headless CMS strips away functionality relating to how web pages are presented and delivered to audiences.  It’s supposed to let publishers focus on what the content says, rather than what it looks like when delivered.  Headless CMS is one of several trends to unbundle functionality customarily associated with CMSs.  Another trend is moving the authoring and workflow functionality into a separate application that is friendlier to use.  CMS vendors have long touted that their products can do everything needed to manage the publication of content.  But increasingly, content authors and designers are deciding that vendor choices are restrictive, rather than helpful.  CMSs have been too greedy in making decisions about how content gets managed.

“Future-proof” headless CMSs may seem like the final chapter in the evolution of the CMS.  But even headless CMSs can still be very rigid in how they handle content elements.  Many are based on the same technology stack (LAMP) that’s obliquely been causing problems for publishers over the past two decades.   In nearly every CMS, all audience-facing factual information needs to be described as a field that’s attached to a specific content type.  The CMS may allow some degree of content structuring, and the ability to mix different fragments of content in different ways.  But they don’t solve important problems that complex publishers face: the ability to select and optimize alternative content-variables, to use data-variables across different content, and to create dynamic content-variables incorporating data-variables.   To my mind, those three dimensions are the foundation for what a general-purpose approach to content engineering must offer.  Headless solutions relegate the CMS to being an administrative interface for the content.  The CMS is a destination to enter text.  But it often does a poor job supporting editorial decisions, and giving publishers true flexibility.   The CMS design imposes restrictions on how content is constructed.  

Since the CMS no longer worries about the “head”, headless solutions help publishers focus on the body.  But the solution doesn’t help publishers deal with a neglected aspect: the content’s tail.

Content’s ‘Tail’

Humans are one of the few animals without tails.  Perhaps that’s why we don’t tend to talk about the tail as it relates to content.  We sometimes talk about the “long tail” of information people are looking for.  That’s about as close as most discussions get to considering the granular details that appear within content. The long tail is a statistical metaphor, not a zoological one.  

Let’s think about content management as having three aspects: the head at the top (and which is top of mind for most content creators), the body in the middle (which has received more attention lately), and the tail at the end, which few people think much about. 

The head/body distinction in content is well-established.  The metaphor needs to be extended to include the notion of a tail.  Let’s break down the metaphor:

  • The head — the face of the content, as presented to audiences.
  • The body — the organs (components) of the content.  Like the components of the human body (heart, lungs, stomach, etc.), each component within the body of content has a particular function to play.
  • The tail — the details in the content (mnemonic: deTails).  The tail provides stability, keeping the body in balance.

In animals, tails play an important role in negotiating with their surroundings.  Tails offer balance.  They swat flies.  They grab branches for stability.  Tails help the body adjust to the environment.  To do this, tails need to be flexible.

Details can be the most important part of content, just as the tails of some animals are the main event.  In a park a kilometer from my home in central India, I can watch dozens of peacocks, India’s national bird.  Peacocks show us that tails are not minor details.

When the tail is treated as a secondary aspect of the body, its role gets diminished.  Publishers need to treat data as being just as important as content in the body.  Content management needs to consider both customer-facing data and narrative content as distinct but equally important dimensions.  Data should not be a mere appendage to content. Data has value in its own right.  

With tailless content management, customer-facing data is stored separately from the content using the data.  

The Body and the Details

The distinction between content and data, and between the body and the details, can be hard to grasp.  The architecture of most CMSs doesn’t make this distinction, so the difference doesn’t seem to exist.

CMSs typically structure content around database fields.   Each field has a label and an associated value.  Everything that the CMS application needs to know gets stored in this database.  This model emerged when developers realized that HTML pages had regular features and structures, such as having titles and so on. Databases made managing repetitive elements much easier compared to creating each HTML page individually.

The problem is that a single database is trying to do many different things at once.  It can be:

  • Holding long “rich” texts that are in the body of an article
  • Holding many internally-used administrative details relating to articles, such as who last revised an article
  • Holding certain audience-facing data, such as the membership services contact telephone number and dates for events

These fields have different roles, and look and behave differently.  Throwing them together in a single database creates complexity.  Because of the complexity, developers are reluctant to add additional structure to how content is managed.  Authors and publishers are told they need to be flexible about what they want, because the central relational database can’t be flexible.  What the CMS offers should be good enough for most people.  After all, all CMSs look and behave the same, so it’s inevitable that content management works this way.
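
A toy sketch of this kitchen-sink pattern, using Python’s built-in sqlite3.  The table and column names are illustrative, not any real CMS’s schema:

```python
import sqlite3

# Sketch of the "kitchen sink" pattern described above: one table mixes
# narrative text, internal administrative data, and audience-facing facts.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE article (
        id INTEGER PRIMARY KEY,
        body TEXT,              -- long narrative "rich text"
        last_revised_by TEXT,   -- internal administrative detail
        contact_phone TEXT      -- audience-facing data, trapped in one content type
    )""")
conn.execute(
    "INSERT INTO article (body, last_revised_by, contact_phone) VALUES (?, ?, ?)",
    ("<p>Call member services for help.</p>", "editor_a", "+1-555-0100"))

# The phone number is reachable only by querying this one content type.
row = conn.execute("SELECT contact_phone FROM article").fetchone()
print(row[0])
```

The audience-facing fact (the phone number) has no existence outside the `article` content type — which is exactly the rigidity the paragraphs above describe.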

Something perverse happens in this arrangement.  Instead of the publisher structuring the content so it will meet the publisher’s needs, the CMS’s design ends up making decisions about if and how content can be structured.

Most CMSs are attached to a relational database such as MySQL.  These databases are a “kitchen sink” holding any material that the CMS may need to perform its tasks.

To a CMS, everything is a field.  It doesn’t distinguish between long text fields containing paragraphs of narrative content that has limited reuse (such as a teaser or the article body) and data fields with simple values that are relevant across different content items, and even outside of the content.  CMSs mix narrative content, administrative data, and editorial data all together.

A CMS database holds administrative profile information related to each content item (IDs, creation dates, topic tags, etc.).  The same database also stores other non-customer-facing information that’s more generally administrative, such as roles and permissions.  In addition to the narrative content and the administrative profile information, the CMS stores customer-facing data that’s not necessarily linked to specific content items.  This is information about entities such as products, addresses of offices, event schedules, and other details that can be used in many different content items.  Even though entity-focused data can be useful for many kinds of content, these details are often fields of specific content types.

The design of CMSs reflects various assumptions and priorities.  While everything is a field, some fields are more important than others.  CMSs are optimized to store text, not to store data.  The backend uses a relational database, but it mostly serves as a content repository. 

Everyday Problems with the Status Quo

Content discusses entities.  Those entities involve facts, which are data.  These facts should be described with metadata, though they frequently are not.

A longstanding problem publishers face is that important facts are trapped within paragraphs of content that they create and publish.  When the facts change, they are forced to manually revise all the writing that mentions these facts.  Structuring content into chunks does not solve the problem of making changes within sentences.  Often, factual information is mentioned within unique texts written by various authors, rather than within a single module that is centrally managed.  

Most CMSs don’t support the ability to change information about an entity in one place so that all paragraphs mentioning that information update automatically.

Let’s consider an example of a scenario that can be anticipated ahead of time.  A number of paragraphs in different content items mention an application deadline date.  The procedure for applying stays the same every year, but the exact date by which someone must apply changes each year.  The application deadline is mentioned by different writers in different kinds of content: various announcement pages, blog posts, reminder emails, etc.  In most CMSs today, the author will need to update each unique paragraph where the deadline is mentioned.  They don’t have the ability to update every mention of the application date from one place.

Other facts can change, even if not predictably.  Your community organization has for years staged important events in the Jubilee Auditorium at your headquarters.  Lots of content talks about the Jubilee Auditorium.  But suddenly a rich donor has decided to give your organization some money.  To honor the donation, your organization decides to rename Jubilee Auditorium to the Ronald L Plutocrat Auditorium.  After the excitement dies down, you realize that more than the auditorium plaque needs to change.  All kinds of mentions of the auditorium are scattered throughout your online content.  

These examples are inspired by real-life publishing situations.   

Separating Concerns: Data and Content

Contrary to the view of some developers, I believe that content and data are different things, and need to be separated.

Content is more like computer code than it’s like data.  Like computer code, content is about language and expression.  Data is easy to compare and aggregate.  Its values are tidy and predictable.  Content is difficult to compare: it must be diff’d.  Content can’t easily be aggregated, since most items of content are unique.
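
This difference can be seen concretely: comparing two data values is a single equality test, while comparing two versions of narrative content requires a diff.  A small sketch using Python’s standard difflib:

```python
import difflib

# Data: two values compare with a single equality test.
price_old, price_new = 19.99, 21.99
assert price_old != price_new

# Content: two revisions of a sentence must be diff'd to see what changed.
old = "Membership costs $19.99 and includes the newsletter."
new = "Membership costs $21.99 and includes the newsletter and forum access."
diff = list(difflib.unified_diff([old], [new], lineterm=""))
print("\n".join(diff))
```

The diff shows removed and added lines, but there is no tidy, aggregatable "value" to extract — the change is buried in prose.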

Each chunk of content is code that will be read by a browser.  The body must indicate what text gets emphasis, what text has links, and what text is a list.  Content is not like data generally stored in databases. It is unpredictable. It doesn’t evaluate to standard data types. Within a database, content can look like a messy glob that happens to have a field name attached to it.

The scripts that a CMS uses must manipulate this messy glob by evaluating each letter character-by-character.  All kinds of meaning are embedded within a content chunk, and some of it is hard to access.

The notion that content is just another form of data that can be stored and managed in a relational database with other data is the original sin of content management.  

It’s considered good practice for developers to separate their data from their code.  Developers, though, have a habit of co-mingling the two, which is why new software releases can be difficult to upgrade, and why moving between software applications is hard to do.

The inventor of the World Wide Web, Tim Berners-Lee, has lately been talking about the importance of separating data from code, “turning the way the web works upside-down.”  He says: “It’s about separating the apps from the data.”

In a similar vein, content management needs to separate data from content.

Data Needs Independence

We need to fix the problem with the design of most CMSs, where the tail of data is fused together to the spine of the body.  This makes the tail inflexible.  The tail is dragged along with the body, instead of wagging on its own.  

Data needs to become independent of specific content, so that it can be used flexibly.  Customer-facing data needs to be stored separately from the content that customers view.  There are many reasons why this is a good practice.   And the good news is it’s been done already.

Separating factual data from content is not a new concept.  Many large e-commerce websites have a separate database with all their product details that populates templates that are handled by a CMS.  But this kind of use of specialized backend databases is limited in what it seeks to achieve.  The external database may serve a single purpose: to populate tables within templates.  Because most publishers don’t see themselves as data-driven publishers the way big ecommerce platforms are, they may not see the value of having a separate dedicated backend database.  

Fortunately there’s a newer paradigm for storing data that is much more valuable.  What’s different in the new vision is that data is defined as entity-based information, described with metadata standards.  

The most familiar example of how an independent data store works with content is Wikipedia.  The content we view on Wikipedia is updated by data stored in a separate repository called Wikidata.  The relationship between Wikipedia and Wikidata is bidirectional.  Articles mention factual information, which gets included in Wikidata.  Other articles that mention the same information can draw on the Wikidata to populate the information within articles.

In Wikidata, entities are identified with a QID.  The identifier Q95 represents Google.  Google is a data variable.  Depending on the context, Google can be referred to as Google Inc. (as a joint-stock company, until 2017) or Google LLC (as a limited liability company, beginning in 2017).  As a data value, the company name can adjust over time.  Editors can also change the value when appropriate.  Google became a subsidiary of Alphabet Inc. (Q20800404) in 2015.  Some content, such as content relating to financial performance, will address that entity starting in 2015.  Like many entities, companies change names and statuses over time.
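
The idea of a time-qualified data value can be sketched in a few lines.  This is a loose, illustrative model of how such a record might work — not Wikidata’s actual data structure — and the date ranges are simplified:

```python
from datetime import date

# Illustrative entity record with date-qualified names, loosely modeled on
# how Wikidata qualifies statements with start/end dates.
ENTITIES = {
    "Q95": {  # Google
        "names": [
            (date(1998, 9, 4), date(2017, 9, 1), "Google Inc."),
            (date(2017, 9, 1), date.max, "Google LLC"),
        ]
    }
}

def name_at(qid: str, when: date) -> str:
    """Return the entity's name as of a given date."""
    for start, end, label in ENTITIES[qid]["names"]:
        if start <= when < end:
            return label
    raise KeyError(f"no name for {qid} on {when}")

print(name_at("Q95", date(2015, 6, 1)))  # Google Inc.
print(name_at("Q95", date(2020, 6, 1)))  # Google LLC
```

Content that references the entity by its identifier, rather than by a hard-coded string, always renders the name appropriate to its context.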

How Wikipedia accesses Wikidata. Source: Wikidata

As an independent store of data, Wikidata supports a wide variety of articles, not just one content type.  But its value extends beyond its support for Wikipedia articles.  Wikidata is used by many other third party platforms to supply information.  These include Google, Amazon’s Alexa, and the websites of various museums.

While few publishers operate at the scale of Wikipedia, the benefits of separating data from content can be realized on a small scale as well.  An example is offered by the popular static website generator Jekyll, which is used by GitHub, Shopify, and other publishers.  A plug-in for Jekyll lets publishers store their data in the RDF format — a standard that offers significant flexibility.  The data can be inserted into web content, but is in a format where it is also available for access by other platforms.

Making the Tail Flexible

Data needs to be used within different types of content, and across different channels — including channels not directly controlled by the publisher.

The CMS-centric approach, tethered to a relational database, tries to solve these issues by using APIs.  Unfortunately, headless CMS vendors have interpreted the mantra of “create once, publish everywhere” to mean “enter all your digital information in our system, and the world will come to you, because we offer an API.”  

Audiences need to know simple facts, such as the telephone number for member services, in the case of a membership organization.  They may need to see that information within an article discussing a topic, or they may want to ask Google to tell them while they are making online payments.  Such data doesn’t fit comfortably into a specific structured content type.  It’s too granular.  One could put it into a larger contact-details content type, but that would include lots of other information that’s not immediately relevant.  Chunks of content, unlike data, are difficult to reuse in different scenarios.  Content types, by design, are aligned with specific kinds of scenarios.  But the defined content structures used to build content types are clumsy at supporting general-purpose queries or cross-functional uses.  And it wouldn’t help much to make the phone number into an API request.  No ordinary publisher can expect the many third-party platforms to read through their API documentation in the event that someone asks a voice bot service about a telephone number.

The only scalable and flexible way to make data available is to use metadata standards that third-party platforms understand.  When using metadata standards, a special API isn’t necessary.
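
For the member-services phone number discussed above, schema.org’s ContactPoint type expressed as JSON-LD is the kind of standards-based markup any consuming platform can read without publisher-specific API documentation.  A sketch — the organization name and number are made up:

```python
import json

# Sketch: an audience-facing fact expressed with schema.org's ContactPoint
# type as JSON-LD, readable by any platform that understands the standard.
markup = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Example Membership Organization",
    "contactPoint": {
        "@type": "ContactPoint",
        "contactType": "member services",
        "telephone": "+1-555-0123"
    }
}
print(json.dumps(markup, indent=2))
```

Embedded in a page, this same block serves a human-readable article, a search engine, and a voice assistant — one fact, many consumers, no bespoke API.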

An independent data store (unlike a tethered database) offers two distinct advantages:

1. The data is multi-use, for both published content and to support other platforms (Google, voice bots, etc.)

2.  The data is multi-source, coming from authors who create/add new data, from other IT systems, and even from outside sources

The ability of the data store to accept new data is also important.  Publishers should grow their data so that they can offer factual information accurately and immediately, wherever it is needed.  When authors mention new facts relating to entities, this information can be added to the database.   In some cases authors will note what’s new and important to include, much like webmasters can note metadata relating to content using Google’s Data Highlighter tool.  In other cases, tools using natural language processing can spot entities, and automatically add metadata.  Metadata provides the mechanism by which data gets connected to content. 

Metadata makes it easier to revise information that’s subject to change, especially information such as prices, dates, and availability.  The latest data is stored in the database, and gets updated there.  Content that mentions such information can indicate the variable abstractly, instead of using a changeable value.  For example: “You must apply by {application date}.”  As a general rule, CMSs don’t make using data variables an easy thing to do.
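
The substitution itself is trivial once the data lives in one place.  A minimal sketch of the {application date} pattern above, using an invented fact store and Python string formatting:

```python
# Sketch: the value lives in one data store and is substituted into every
# paragraph that mentions it. Update the store once; all mentions follow.
facts = {"application_date": "15 January 2022"}  # the single source of truth

paragraphs = [
    "You must apply by {application_date}.",
    "Reminder: applications close on {application_date}.",
]
rendered = [p.format(**facts) for p in paragraphs]
print(rendered[0])
print(rendered[1])
```

When next year’s deadline is set, only the `facts` entry changes; every paragraph that names the variable picks up the new value automatically.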

A separate data store makes it simpler to pull in data coming from other sources.  The data store describes information using metadata standards, making it easy to upload information from different sources.  With many CMSs, it is cumbersome to pull in information from outside parties.  The CMS is like a bubble.  Everything may work fine as long as you never want to leave the bubble.  That’s true for simple CMSs such as WordPress, and even for complex component CMSs (CCMSs) that support DITA.  These systems are self-contained.  They don’t readily accept information from outside sources.  The information needs to be entered in their special format, using their specific conventions.  The information is not independent of the CMS.  The CMS ends up defining the information, rather than simply using it.

A growing number of companies are developing enterprise knowledge graphs — their own sort of Wikidata. These are databases of the key facts that a company needs to refer to.  Companies can use knowledge graphs to enhance the content they publish.  This innovation is possible because these companies don’t rely on their CMS to manage their data.

— Michael Andrews