
Coping with copies: Strategic dimensions of reuse and duplication

When are copies of content appropriate, and how should you manage copies? Should content ever be repetitive?  Is duplicative content always bad?

Answers to these questions are typically provided by specialists: CMS implementers (developers skilled in PHP or another CMS programming language), SEO experts, or webmasters. Specialists tend to focus on technical effort or performance—the technical penalties—rather than strategic issues of how people interact with messages and information—the users’ goals. Discussions become overly narrow, with important issues taken off the table. 

But if we only consider the technical dimensions, we can lose sight of the human factors at play. Content exists to be read. Authors and readers continually judge content according to whether it seems familiar or different. People often need to see things more than once. They even choose to re-read some content. 

Though technology is important, it’s always in flux. Technology doesn’t impose fixed rules and shouldn’t dictate strategy. 

Acknowledging the repetitiveness of content

A good amount of content repeats itself—and always has. Repetition allows content to be disseminated more widely.  Humans have copied text as long as they’ve been writing. Text reuse is part of the human condition.

Scholars analyze “different types of text reuse, such as jokes, adverts, boilerplates, speeches, or religious texts, but also short stories and reprints of book segments. Each of them is tied to a different logic and motivation.”

As one researcher studying the historical development of news stories notes, “Articles emerge through a process of creative re-use and re-appropriation. Whole fragments, sentences and quotations are often transferred to novel contexts. In this sense, newspaper content emerges through a process of what could be called bricolage, in which content is soldered together from existing fragments and textual patterns. In other words, newspaper content is often harvested from a wide range of available textual material.”

Source: Romanello and Hengchen

Such research can help us to understand consequential issues such as:

  • The virality and spread of narratives 
  • The prevalence of quotations from a particular source
  • The reliance of a publication on external sources

Content propagation in the real world is messy. It happens organically through numerous small decisions made on a decentralized basis.  Some decisions are opportunistic (such as plagiarism or repeating rumors), while others are motivated by a desire to spread credible information.  No solution can be viable if it ignores the complex motivations of people conveying information.

Content professionals are generally wary of repeated content. They caution organizations to “avoid duplication” because “it’s bad.” Their goal is to prevent duplication and remediate it when it occurs.

The content professional’s alternative to duplication is content reuse. Unlike duplication, content reuse is considered virtuous. Duplication and reuse are distinct approaches to repeating text, but they share similarities. They are not exact opposites. It doesn’t follow that one is absolutely bad while the other is always good. 

Before we can consider the merits and behaviors of reuse, it’s important to first understand the various manifestations of duplication, some of which overlap with content reuse.  

Good and bad reasons for duplicate content

Duplicate web pages on a website are almost always bad. A web page should live in only one place on a website. When the same page exists in several places, it’s fairly easy for software to locate the copies. Numerous tools can scan your website for duplicate pages using checksums, a mathematical fingerprint of a page’s contents.
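
To make this concrete, here is a minimal sketch of a checksum-based duplicate scan in Python (the site directory, file pattern, and whitespace normalization are illustrative assumptions, not how any particular tool works):

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def checksum(text: str) -> str:
    # Normalize whitespace so trivial formatting differences don't mask duplicates
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def find_exact_duplicates(site_root: str) -> dict[str, list[Path]]:
    # Assumes a static export of the site; a real tool would crawl rendered pages
    groups: dict[str, list[Path]] = defaultdict(list)
    for page in Path(site_root).rglob("*.html"):
        groups[checksum(page.read_text(encoding="utf-8"))].append(page)
    # Keep only checksums shared by two or more pages
    return {digest: pages for digest, pages in groups.items() if len(pages) > 1}

for digest, pages in find_exact_duplicates("./site").items():
    print(digest[:12], [str(p) for p in pages])
```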

When the same page exists across distinct web domains, the advisability of having the same content appear in multiple places gets more complicated. Sometimes, such behavior indicates a poorly governed publishing process, where a page is copied to various domains without either tracking this copying or asking if it is necessary.  But not all situations are problems. There are legitimate use cases for publishing the same content on distinct pages on different websites.  Content may be repeated across localized web domains or domains for subbrands of an organization.  

Content syndication allows the same page to be republished on multiple domains so audiences can find it where they are already looking, rather than hunting for it on an unfamiliar website. Organizations syndicate content throughout their own web properties or make it available to third parties.

The audience’s needs should determine whether the content should be placed on multiple websites. 

When identical web pages appear on multiple websites, this can be implemented in several ways.  The pages can be shared either through RSS or an API that other websites can access. But often the original page is copied to a new website. The existence of multiple copies that are independent of one another introduces many content management inefficiencies and risks. 

The copying of webpages is often a consequence of the way CMSs are designed. Traditional CMSs support a single website, relying on folders and sitemaps to organize pages. Each additional website that needs the page must have the page copied into that site’s page organization. While CMSs that support multiple websites have emerged recently, some still don’t allow the original content to be organized independently of where on a website it will appear.  

Duplicated content results from both human decisions and automated ones.  

  • Collateral duplication on a website can happen when pages are autogenerated and are expected to “belong” in multiple places as part of different collections.  
  • Web aggregators duplicate content by republishing some or all of the content items from multiple sources. Aggregators are common for news, customer reviews, hotels, food delivery, and other topics.
  • Website mirroring, copying an entire website to another URL, may be set up to ensure the availability of content. Mirrors can enable faster access for users or preserve content that might otherwise be blocked or taken down.

When organizations intend to duplicate content, they can do so for either good or bad faith motives. 

Good faith motivations reflect users’ interests by making content available where they are looking for that content. Republishing of content is allowed and encouraged. The US Department of Health and Human Services encourages the syndication of its content: “Content syndication allows you to place content from HHS websites onto your own site. It allows you to offer high-quality HHS content in the look and feel of your site. The syndicated content is automatically updated in real-time, requiring no effort from your staff to keep the pages up to date.”

Bad faith motivations include the intention to spam users by blanketing them everywhere they might be. “‘Copypasta’ (a reference to copy-and-paste functionality to duplicate content) is an Internet slang term that refers to an attempt by multiple individuals to duplicate content from an original source and share it widely across social platforms or forums,” noted a well-known social media platform that subsequently changed its ownership and name. Of course, people alone aren’t responsible for copypasta; nowadays, bots do most of the work.

In other cases, duplication involves efforts to conceal who the author is or to disguise the organization that is publishing the content. Bad actors can steal content and republish it through adversarial proxy mirroring (the wholesale copying of a website that is rebranded) and web scraping (lifting published content and republishing it elsewhere without permission). Such copy-theft is illegal but technically easy to perform.

Near-duplicates: a pervasive phenomenon

While identical duplicate web pages are not uncommon, an even more pervasive situation is “near dupes”: items that duplicate some content but also contain unique content.

Near-duplicate content can be planned or incidental. Similarity in content items signals thematic repetition across multiple items. Near-duplicate content often represents variations on a core set of messages or information.

Templates in e-commerce sites generate many pages of near duplicate content. They combine data feeds of product descriptions with boilerplate copy. Each product page has some identical wording it shares with other pages. 

Unlike checks for exact duplicates, auditing for near-duplicates involves noting both what’s the same and what’s unique. The audit needs to determine where items are dissimilar and whether that is intentional. Sometimes, copies of items are updated unevenly so that there are different versions of what should be identical text. Any variations within a set of near-duplicates should convey distinct information or messages.
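
A minimal sketch of such an audit, using Python’s standard difflib to surface shared and divergent passages (the similarity threshold and minimum passage length are arbitrary assumptions):

```python
import difflib

def near_duplicate_report(text_a: str, text_b: str, threshold: float = 0.8) -> dict:
    words_a, words_b = text_a.split(), text_b.split()
    matcher = difflib.SequenceMatcher(None, words_a, words_b)
    ratio = matcher.ratio()  # 0.0 = nothing shared, 1.0 = identical
    shared = [" ".join(words_a[block.a:block.a + block.size])
              for block in matcher.get_matching_blocks() if block.size > 3]
    return {
        "similarity": round(ratio, 2),
        "near_duplicate": ratio >= threshold,
        "shared_passages": shared,  # what auditors inspect for uneven updates
    }

a = "Our widget ships in three sizes and includes a two-year warranty."
b = "Our gadget ships in three sizes and includes a two-year warranty and a stand."
print(near_duplicate_report(a, b))
```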

Also, note that near-duplicates aren’t necessarily the repetition of exact prose. They may be summarizations or extensions. “A near-duplicate is, in some cases, a mere paraphrasing of a previous article; in other cases, it contains corrections or added content as a follow-up.” Both publishers and readers can find value in extending what’s been previously said.

Related content: the repetition of fragments

Related content may duplicate strings or passages of text but doesn’t replicate enough of the body of the content to appear as a near-duplicate. It emerges in various situations.

Recurring phrases can signal that content items belong to a common content type.  Content style guides may specify patterns for writing headlines, calls-to-action, and other strings.  A recurring pattern might signify that the content item is a help topic or a hero.

Related content is also the product of repeating segments of content across items to support continuity in the user’s content experience. Content chunks might be repeated to provide “signposts,” such as a preview or a takeaway. 

Repeating fragments of content support continuity across content items over time and through a customer journey.

More content management tools are focusing on repeatable content components. An example of this trend is the ubiquitous WordPress platform. WordPress’ updated authoring interface, Gutenberg, manages content chunks it calls “blocks.”  The interface allows authors to “duplicate” or “share” blocks in one item for use in another item.  Shared blocks can be edited in any item where they are used, which will change them everywhere, though users report this behavior can be confusing and result in unanticipated changes. Because the blocks have no independent identity, their messages can be strongly influenced by the context in which they are edited.  

Looking at duplication from internal and external perspectives

Duplicated content can trigger a range of problems and consequences. Duplicated published content may or may not be a problem. Duplicated unpublished content is almost always problematic.

Let’s start by looking at the internal consequences of duplicative content. Multiple versions of the same item are confusing to authors, editors, and content managers. No one can be sure which is the “right” version. Ironically, the latest version may not be the right one if someone creates a new copy and starts editing it without completing a full review. Abandoned drafts can also obscure which version is the active one. An unapproved version could be delivered to customers.

The simple guideline to follow is that you shouldn’t have exact copies of items in your content repository.  Any near duplicates in your content inventory should be managed as content variants.  (For a discussion of the distinction between versions and variants, see my post on content history.)

Now, let’s consider the situation of published content that’s been duplicated. Is it bad for audiences?  It can be, but won’t necessarily be.  

A wrong assumption often made about duplicated published content is that audiences will encounter it all at once. Many organizations rely on web crawls to simulate how audiences encounter their content.  Web crawls often turn up duplicate pages.  It doesn’t follow that an individual will necessarily encounter these duplicates. Ironically, “duplicated pages can even be introduced by the crawler itself, when different links point to the same page.”

An old myth in the SEO industry proclaimed that Google penalized duplicate content. But Google acknowledges that duplicate content, while potentially confusing to users, does not present a problem for Google’s search indexing: “Some duplicate content on a site is normal and it’s not a violation of Google’s spam policies. However, having the same content accessible through many different URLs can be a bad user experience (for example, people might wonder which is the right page and whether there’s a difference between the two), and it may make it harder for you to track how your content performs in search results.”
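
Google’s usual remedy is to declare a preferred URL with a rel="canonical" link element. As a minimal sketch (the URLs are placeholders, and the regex parsing is a deliberate simplification of real HTML handling), an audit could check which URL each duplicate declares as canonical:

```python
import re
import urllib.request

# Simplification: assumes rel appears before href inside the link element
CANONICAL_RE = re.compile(
    r'<link[^>]+rel=["\']canonical["\'][^>]*href=["\']([^"\']+)["\']',
    re.IGNORECASE)

def declared_canonical(url: str) -> str | None:
    with urllib.request.urlopen(url) as response:  # no error handling in this sketch
        html = response.read().decode("utf-8", errors="replace")
    match = CANONICAL_RE.search(html)
    return match.group(1) if match else None

# Hypothetical URLs that serve the same content
for url in ["https://example.com/page", "https://example.com/page?ref=promo"]:
    print(url, "->", declared_canonical(url))
```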

Duplicate content is often a symptom of other user experience issues, such as poor journey mapping or content labeling. No reader wants multiple links that all lead to the same item. When titles or links look similar, readers can’t be sure whether equivalent options are identical and equally useful or are really different content items. For example, users frequently choose the wrong product support link because they are unable to understand and define distinctions between product variants. 

Reuse: How different is it from duplication?

Content reuse is widely advocated but sometimes loosely defined. It’s often not clear whether it refers to the internal reuse of content prior to publication or the external republication of content. Without making that distinction, it isn’t clear when or whether duplication of content occurs. How does one apply the famous “DRY” (Don’t Repeat Yourself) adage to content practice? Should content not be repeated externally or only internally?

People may advocate reuse for a range of reasons: 

  1. Reuse for message and information consistency
  2. Reuse for internal sharing and joint collaboration
  3. Reuse to save content development effort
  4. Reuse to promote messages and information more widely externally

Content reuse implies that one copy of a content item can appear many times in various guises. The reality behind the scenes is more complicated, and it is perhaps more accurate to think about content reuse as managed duplication.

Reuse implies one original content item will serve as the basis for published content that’s delivered in various contexts. When implemented in publishing toolchains, there will likely be more than one copy. If you care about business continuity, your repository will likely have a mirror and backup, and it’s possible an item will be cached in other systems involved in the publishing and delivery process. But while copies may exist, there’ll only be one original. 

The original copy is sometimes referred to as the canonical one. Any changes are made only to the original; the other copies are read-only. Importantly, all changes are reversible since the copies are dependent on the original or are stored temporarily. When duplicated copies are unmanaged, by contrast, separate instances each require updating, which often doesn’t happen.
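
A minimal sketch of this “managed duplication” pattern (the store and field names are illustrative assumptions, not any CMS’s actual API): published items hold references to a canonical chunk rather than copies, so one edit propagates everywhere.

```python
from dataclasses import dataclass, field

@dataclass
class CanonicalChunk:
    chunk_id: str
    body: str

@dataclass
class PublishedPage:
    url: str
    chunk_refs: list[str] = field(default_factory=list)  # references, not copies

store: dict[str, CanonicalChunk] = {}

def render(page: PublishedPage) -> str:
    # Copies are resolved read-only at delivery time; only the original is editable
    return "\n".join(store[ref].body for ref in page.chunk_refs)

store["warranty"] = CanonicalChunk("warranty", "All products include a two-year warranty.")
home = PublishedPage("/home", ["warranty"])
support = PublishedPage("/support", ["warranty"])

store["warranty"].body = "All products include a three-year warranty."  # one edit
print(render(home))     # both pages now deliver the updated text
print(render(support))
```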

It’s useful to distinguish delivery reuse (one item delivered to many places) from assembly reuse (one item incorporated into many other items). Most rationales for content reuse focus on internal content management requirements rather than external customer access benefits, but both are valid goals.

A wider perspective on reuse considers its role in contextualizing information and messages. Reused content can change the temporal and topical context.

Sometimes, reused content consists of standalone items: information or messages that need to be repeated in diverse scenarios. Such reuse allows targeted messages to be delivered at the right moment.

Other times, reused content is inserted into a larger item. When incorporated into larger content items, reuse can generate near-duplicates. Templated content, for example, repeats wording on multiple pages, making it hard for users to distinguish various items. From an external user’s perspective, reused content can be indistinguishable from duplicated content.

Reuse can support content customization. Organizations are expected to generate many variations of core content.  Reuse has its roots in document management, the assembling of long-form documents that are built from both repeated text and customized text.  But as online content moves away from long-form documents like product manuals and becomes more granular and on-demand, content customization is changing. Reuse in content assembly is still important, but more content is now reused directly by delivering standalone snippets or chunks.

The value of de-duplicating content

Detecting duplicate content has become a mini-industry.  Numerous technical approaches can identify duplicated content, and a range of vendors offer de-duplication solutions.  

One vendor focuses on monitoring repetition in what’s published online, asserting, “There’s a wide variety of use cases for duplicate detection in the field of media monitoring, ranging from virality analyses and content distribution tracking to plagiarism detection and web crawling.”

Content aggregators need to filter duplicates. Another vendor sells a “content deduplication/travel content mapping solution” that gives customers “the opportunity to create your own hotel database and write original material.” 

When organizations create content, they need to avoid making redundant content. One firm offers a tool to prevent writers from creating duplicate content on intranets. The problem is not trivial: how do writers know what’s already been created? They may create a new item that doesn’t share the exact wording of an existing one but has a nearly identical focus.

Governance based on well-defined content types (indicating a clear purpose for the content) and accurate, descriptive metadata (indicating the content’s scope) is essential to preventing redundant content.  Authors should be prompted to answer what the content is about before starting to create it.  The inventory can check to see what existing content might be similar.

Since near-duplicates are more difficult to identify than exact ones, tools need to do “fuzzy” searches to find overlapping items. Techniques include “shingling,” which chops text into overlapping chunks, and “MinHash,” which efficiently estimates how similar two sets of shingles are so they can be measured against a similarity threshold.
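
A minimal sketch of shingling with a plain Jaccard comparison (a production system would use MinHash signatures to approximate the same ratio at scale; the shingle size and threshold here are arbitrary assumptions):

```python
def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    # Chop the text into overlapping runs of k words ("shingles")
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    # MinHash approximates exactly this ratio using compact signatures
    return len(a & b) / len(a | b) if a | b else 0.0

doc1 = "The hotel offers free parking and a rooftop pool for all guests."
doc2 = "The hotel offers free parking and a rooftop pool, plus an on-site spa."
similarity = jaccard(shingles(doc1), shingles(doc2))
print(f"similarity: {similarity:.2f}", "near-duplicate" if similarity > 0.5 else "distinct")
```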

While readers don’t want to wade through duplicate items or have to disambiguate them, the same is true for machines – only at a larger scale. Software programs can behave oddly if the inventory of content emphasizes certain items too much.  Duplication can introduce bias in software algorithms because programs are more inclined to select from duplicated information when performing searches or generating answers. Duplication of content has emerged as a concern in large language models.  

Recent research by Amazon suggests that duplication can interfere with the relevance of answers provided by LLMs.

If many similar items exist, which one should be canonical? In some cases, no one item will be a “best” representative. LLMs can generate a cross-item summarization of the near-duplicates, providing a composite of multiple items that are similar but not identical.

Deduplication is emerging as an important requirement for the internal governance of content.

– Michael Andrews


Paradata: where analytics meets governance

Organizations aspire to make data-informed decisions. But can they confidently rely on their data? What does that data really tell them, and how was it derived? Paradata, a specialized form of metadata, can provide answers.

Many disciplines use paradata

You won’t find the word paradata in a household dictionary, and the concept is unknown in the content profession. Yet paradata is highly relevant to content work. It provides context showing how the activities of writers, designers, and readers can influence each other.

Paradata provides a unique and missing perspective. A forthcoming book on paradata defines it as “data on the making and processing of data.” Paradata extends beyond basic metadata — “data about data.” It introduces the dimensions of time and events. It considers the how (process) and the what (analytics).

Think of content as a special kind of data that has a purpose and a human audience. Content paradata can be defined as data on the making and processing of content.

Paradata can answer:

  • Where did this content come from?
  • How has it changed?
  • How is it being used?

Paradata differs from other kinds of metadata in its focus on the interaction of actors (people and software) with information. It provides context that helps planners, designers, and developers interpret how content is working.

Paradata traces activity during various phases of the content lifecycle: how it was assembled, interacted with, and subsequently used. It can explain content from different perspectives:

  • Retrospectively 
  • Contemporaneously
  • Predictively

Paradata provides insights into processes by highlighting the transformation of resources in a pipeline or workflow. By recording the changes, it becomes possible to reproduce those changes. Paradata can provide the basis for generalizing the development of a single work into a reusable workflow for similar works.

Some discussions of paradata refer to it as “processual meta-level information on processes” (processual here refers to the process of developing processes). Knowing how activities happen provides the foundation for sound governance.

Contextual information facilitates reuse. Paradata can enable the cross-use and reuse of digital resources. A key challenge for reusing any content created by others is understanding its origins and purpose. It’s especially challenging when wanting to encourage collaborative reuse across job roles or disciplines. One study of the benefits of paradata notes: “Meticulous documentation and communication of contextual information are exceedingly critical when (re)users come from diverse disciplinary backgrounds and lack a shared tacit understanding of the priorities and usual practices of obtaining and processing data.”

While paradata isn’t currently utilized in mainstream content work, a number of content-adjacent fields use paradata, pointing to potential opportunities for content developers. 

Content professionals can learn from how paradata is used in:

  • Survey and research data
  • Learning resources
  • AI
  • API-delivered software

Each discipline looks at paradata through different lenses and emphasizes distinct phases of the content or data lifecycle. Some emphasize content assembly, while others emphasize content usage. Some emphasize both, building a feedback loop.

Conceptualizing paradata
Different perspectives of paradata. Source: Isto Huvila

Content professionals should learn from other disciplines, but they should not expect others to talk about paradata in the same way.  Paradata concepts are sometimes discussed using other terms, such as software observability. 

Paradata for surveys and research data

Paradata is most closely associated with developing research data, especially statistical data from surveys. Survey researchers pioneered the field of paradata several decades ago, aware of the sensitivity of survey results to the conditions under which they are administered.

The National Institute of Statistical Sciences describes paradata as “data about the process of survey production” and as “formalized data on methodologies, processes and quality associated with the production and assembly of statistical data.”  

Researchers realize that how information is assembled can influence what can be concluded from it. In a survey, confounding factors could be a glitch in a form or a leading question that prompts people to answer in a given way disproportionately.

The US Census Bureau, which conducts a range of surveys of individuals and businesses, explains: “Paradata is a term used to describe data generated as a by-product of the data collection process. Types of paradata vary from contact attempt history records for interviewer-assisted operations, to form tracing using tracking numbers in mail surveys, to keystroke or mouse-click history for internet self-response surveys.”  For example, the Census Bureau uses paradata to understand and adjust for non-responses to surveys. 

Paradata for surveys
Source: NDDI 

As computers become more prominent in the administration of surveys, they become actors influencing the process. Computers can record an array of interactions between people and software.

Why should content professionals care about survey processes?

Think about surveys as a structured approach to assembling information about a topic of interest. Paradata can indicate whether users could submit survey answers and under what conditions people were most likely to respond. Researchers use paradata to measure user burden. Paradata helps illuminate the work required to provide information – a topic relevant to content professionals interested in the authoring experience of structured content.

Paradata supports research of all kinds, including UX research. It’s used in archaeology and archives to describe the process of acquiring and preserving assets and changes that may happen to them through their handling. It’s also used in experimental data in the life sciences.

Paradata supports reuse. It provides information about the context in which information was developed, improving its quality, utility, and reusability.

Researchers in many fields are embracing what is known as the FAIR principles: making data Findable, Accessible, Interoperable, and Reusable. Scientists want the ability to reproduce the results of previous research and build upon new knowledge. Paradata supports the goals of FAIR data.  As one study notes, “understanding and documentation of the contexts of creation, curation and use of research data…make it useful and usable for researchers and other potential users in the future.”

Content developers similarly should aspire to make their content findable, accessible, interoperable, and reusable for the benefit of others. 

Paradata for learning resources

Learning resources are specialized content that needs to adapt to different learners and goals. How resources are used and changed influences the outcomes they achieve. Some education researchers have described paradata as “learning resource analytics.”

Paradata for instructional resources is linked to learning goals. “Paradata is generated through user processes of searching for content, identifying interest for subsequent use, correlating resources to specific learning goals or standards, and integrating content into educational practices,” notes a Wikipedia article. 

Data about usage isn’t represented in traditional metadata. A document prepared for the US Department of Education notes: “Say you want to share the fact that some people clicked on a link on my website that leads to a page describing the book. A verb for that is ‘click.’ You may want to indicate that some people bookmarked a video for a class on literature classics. A verb for that is ‘bookmark.’ In the prior example, a teacher presented resources to a class. The verb used for that is ‘taught.’ Traditional metadata has no mechanism for communicating these kinds of things.”

“Paradata may include individual or aggregate user interactions such as viewing, downloading, sharing to other users, favoriting, and embedding reusable content into derivative works, as well as contextualizing activities such as aligning content to educational standards, adding tags, and incorporating resources into curriculum.” 

Usage data can inform content development.  One article expresses the desire to “establish return feedback loops of data created by the activities of communities around that content—a type of data we have defined as paradata, adapting the term from its application in the social sciences.”

Unlike traditional web analytics, which focuses on web pages or user sessions and doesn’t consider the user context, paradata focuses on the user’s interactions in a content ecosystem over time. The data is linked to content assets to understand their use. It resembles social media metadata that tracks the propagation of events as a graph.

“Paradata provides a mechanism to openly exchange information about how resources are discovered, assessed for utility, and integrated into the processes of designing learning experiences. Each of the individual and collective actions that are the hallmarks of today’s workflow around digital content—favoriting, foldering, rating, sharing, remixing, embedding, and embellishing—are points of paradata that can serve as indicators about resource utility and emerging practices.”

Paradata for learning resources utilizes Activity Streams JSON, which can track the interactions between actors and objects according to predefined verbs (an “Activity Schema”) that can be measured. The approach can be applied to any kind of content.
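
As a minimal sketch of what one such record might look like (the actor, verb, and object values are illustrative placeholders, loosely following the actor-verb-object shape of Activity Streams JSON):

```python
import json
from datetime import datetime, timezone

# One paradata event: a teacher used a resource in class ("taught")
activity = {
    "actor": {"objectType": "person", "id": "teacher:42"},
    "verb": "taught",
    "object": {"objectType": "resource",
               "id": "https://example.org/lessons/fractions-intro"},
    "published": datetime.now(timezone.utc).isoformat(),
}
print(json.dumps(activity, indent=2))
```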

Paradata for AI

AI has a growing influence over content development and distribution. Paradata is emerging as a strategy for producing “explainable AI” (XAI).  “Explainability, in the context of decision-making in software systems, refers to the ability to provide clear and understandable reasons behind the decisions, recommendations, and predictions made by the software.”

The Association for Intelligent Information Management (AIIM) has suggested that a “cohesive package of paradata may be used to document and explain AI applications employed by an individual or organization.” 

Paradata provides a manifest of the AI training data. AIIM identifies two kinds of paradata: technical and organizational.

Technical paradata includes:

  • The model’s training dataset
  • Versioning information
  • Evaluation and performance metrics
  • Logs generated
  • Existing documentation provided by a vendor

Organizational paradata includes:

  • Design, procurement, or implementation processes
  • Relevant AI policy
  • Ethical reviews conducted

Paradata for AI
Source: Patricia C. Franks

The provenance of AI models and their training has become a governance issue as more organizations use machine learning models and LLMs to develop and deliver content. AI models tend to be “black boxes” that users are unable to untangle and understand.

How AI models are constructed has governance implications, given their potential to be biased or contain unlicensed copyrighted or other proprietary data. Developing paradata for AI models will be essential if models are to achieve wide adoption.

Paradata and document observability

Observing the unfolding of behavior helps to debug problems to make systems more resilient.

Fabrizio Ferri-Benedetti, whom I met some years ago in Barcelona at a Confab conference, recently wrote about a concept he calls “document observability” that has parallels to paradata.

Content practices can borrow from software practices. As software becomes more API-focused, firms are monitoring API logs and metrics to understand how various routines interact, a field called observability. The goal is to identify and understand unanticipated occurrences. “Debugging with observability is about preserving as much of the context around any given request as possible, so that you can reconstruct the environment and circumstances that triggered the bug.”

Observability utilizes a profile called MELT: Metrics, Events, Logs, and Traces. MELT is essentially paradata for APIs.

Software observability pattern
Software observability pattern.  Source: Karumuri, Solleza, Zdonik, and Tatbul
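
As a sketch of how MELT-style paradata might be captured for a content API (the event name and fields are illustrative assumptions, not a specific monitoring product’s format):

```python
import json, logging, time, uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")

def serve_content(item_id: str) -> None:
    trace_id = str(uuid.uuid4())            # Trace: ties related events together
    start = time.perf_counter()
    # ... fetch and assemble the content item here ...
    elapsed_ms = (time.perf_counter() - start) * 1000   # Metric: latency
    logging.info(json.dumps({               # Log + Event: a structured, queryable record
        "event": "content.delivered",
        "item_id": item_id,
        "trace_id": trace_id,
        "elapsed_ms": round(elapsed_ms, 2),
    }))

serve_content("faq-returns")
```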

Content, like software, is becoming more API-enabled. Content can be tapped from different sources and fetched interactively. The interaction of content pieces in a dynamic context showcases the content’s temporal properties.

When things behave unexpectedly, systems designers need the ability to reverse engineer behavior. An article in IEEE Software states: “One of the principles for tackling a complex system, such as a biochemical reaction system, is to obtain observability. Observability means the ability to reconstruct a system’s internal state from its outputs.”

Ferri-Benedetti notes, “Software observability, or o11y, has many different definitions, but they all emphasize collecting data about the internal states of software components to troubleshoot issues with little prior knowledge.”  

Because documentation is essential to the software’s operation, Ferri-Benedetti  advocates treating “the docs as if they were a technical feature of the product,” where the content is “linked to the product by means of deep linking, session tracking, tracking codes, or similar mechanisms.”

He describes document observability (“do11y”) as “a frame of mind that informs the way you’ll approach the design of content and connected systems, and how you’ll measure success.”

In contrast to observability, which relies on incident-based indexing, paradata is generally defined by a formal schema. A schema allows stakeholders to manage and change the system instead of merely reacting to it and fixing its bugs. 

Applications of paradata to content operations and strategy

Why introduce a new concept that most people have never heard of? Because content professionals must expand their toolkit.

Content is becoming more complex. It touches many actors: employees in various roles, customers with multiple needs, and IT systems with different responsibilities. Stakeholders need to understand the content’s intended purpose and use in practice and if those orientations diverge. Do people need to adapt content because the original does not meet their needs? Should people be adapting existing content, or should that content be easier to reuse in its original form?

Content continuously evolves and changes shape, acquiring emergent properties. People and AI customize, repurpose, and transform content, making it more challenging to know how these variations affect outcomes. Content decisions involve more people over extended time frames. 

Content professionals need better tools and metrics to understand how content behaves as a system. 

Paradata provides contextual data about the content’s trajectory. It builds on two kinds of metadata that connect content to user action:

  • Administrative metadata capturing the actions of the content creators or authors, intended audiences, approvers, versions, and when last updated
  • Usage metadata capturing the intended and actual uses of the content, both internal (asset role, rights, where item or assets are used) and external (number of views, average user rating)

Paradata also incorporates newer forms of semantic and blockchain-based metadata that address change over time:

  • Provenance metadata
  • Actions schema types

Provenance metadata has become essential for image content, which can be edited and transformed in multiple ways that change what it represents. Organizations need to know the source of the original and what edits have been made to it, especially with the rise of synthetic media. Metadata can indicate on what an image was based or derived from, who made changes, or what software generated changes. Two corporate initiatives focused on provenance metadata are the Content Authenticity Initiative and the Coalition for Content Provenance and Authenticity.

Actions are an established — but underutilized — dimension of metadata. The widely adopted schema.org vocabulary has a class of actions that address both software interactions and physical world actions. The schema.org actions build on the W3C Activity Streams standard, which was upgraded in version 2.0 to semantic standards based on JSON-LD types.
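
A minimal sketch of an action record expressed with the schema.org vocabulary as JSON-LD (the agent and object identifiers are placeholders):

```python
import json

# A ReadAction: schema.org's way of recording that a person read an article
read_event = {
    "@context": "https://schema.org",
    "@type": "ReadAction",
    "agent": {"@type": "Person", "identifier": "user-1138"},
    "object": {"@type": "Article",
               "url": "https://example.com/articles/paradata-primer"},
    "startTime": "2024-05-01T09:30:00Z",
}
print(json.dumps(read_event, indent=2))
```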

Content paradata can clarify common issues such as:

  • How can content pieces be reused?
  • What was the process for creating the content, and can one reuse that process to create something similar?
  • When and how was this content modified?

Paradata can help overcome operational challenges such as:

  • Content inventories where it is difficult to distinguish similar items or versions
  • Content workflows where it is difficult to model how distinct content types should be managed
  • Content analytics, where the performance of content items is bound up with channel-specific measurement tools

Implementing content paradata must be guided by a vision. The most mature application of paradata – for survey research – has evolved over several decades, prompted by the need to improve survey accuracy. Other research fields are adopting paradata practices as research funders insist that data be “FAIR.” Change is possible, but it doesn’t happen overnight. It requires having a clear objective.

It may seem unlikely that content publishing will embrace paradata anytime soon. However, the explosive growth of AI-generated content may provide the catalyst for introducing paradata elements into content practices. The unmanaged generation of content will be a problem too big to ignore.

The good news is that online content publishing can take advantage of existing metadata standards and frameworks that provide paradata. What’s needed is to incorporate these elements into content models that manage internal systems and external platforms.

Online publishers should introduce paradata into systems they directly manage, such as their digital asset management system or customer portals and apps. Because paradata can encompass a wide range of actions and behaviors, it is best to prioritize tracking actions that are difficult to discern but likely to have long-term consequences. 

Paradata can provide robust signals to reveal how content modifications impact an organization’s employees and customers.  

– Michael Andrews