Author: Michael Andrews

Coping with copies: Strategic dimensions of reuse and duplication

Post author By Michael Andrews
Post date June 25, 2024

When are copies of content appropriate, and how should you manage copies? Should content ever be repetitive? Is duplicative content always bad?

Answers to these questions are typically provided by specialists: CMS implementers (developers skilled in PHP or another CMS programming language), SEO experts, or webmasters. Specialists tend to focus on technical effort or performance—the technical penalties—rather than strategic issues of how people interact with messages and information—the users’ goals. Discussions become overly narrow, with important issues taken off the table.

But if we only consider the technical dimensions, we can lose sight of the human factors at play. Content exists to be read. Authors and readers continually judge content according to whether it seems familiar or different. People often need to see things more than once. They even choose to re-read some content.

Though technology is important, it’s always in flux. Technology doesn’t impose fixed rules and shouldn’t dictate strategy.

Acknowledging the repetitiveness of content

A good amount of content repeats itself—and always has. Repetition allows content to be disseminated more widely. Humans have copied text as long as they’ve been writing. Text reuse is part of the human condition.

Scholars analyze “different types of text reuse, such as jokes, adverts, boilerplates, speeches, or religious texts, but also short stories and reprints of book segments. Each of them is tied to a different logic and motivation.”

As one researcher studying the historical development of news stories notes, “Articles emerge through a process of creative re-use and re-appropriation. Whole fragments, sentences and quotations are often transferred to novel contexts. In this sense, newspaper content emerges through a process of what could be called bricolage, in which content is soldered together from existing fragments and textual patterns. In other words, newspaper content is often harvested from a wide range of available textual material.”

Such research can help us to understand consequential issues such as:

The virality and spread of narratives
The prevalence of quotations from a particular source
The reliance of a publication on external sources

Content propagation in the real world is messy. It happens organically through numerous small decisions made on a decentralized basis. Some decisions are opportunistic (such as plagiarism or repeating rumors), while others are motivated by a desire to spread credible information. No solution can be viable if it ignores the complex motivations of people conveying information.

Content professionals are generally wary of repeated content. They caution organizations to “avoid duplication” because “it’s bad.” Their goal is to prevent duplication and remediate it when it occurs.

The content professional’s alternative to duplication is content reuse. Unlike duplication, content reuse is considered virtuous. Duplication and reuse are distinct approaches to repeating text, but they share similarities. They are not exact opposites. It doesn’t follow that one is absolutely bad while the other is always good.

Before we can consider the merits and behaviors of reuse, it’s important to first understand the various manifestations of duplication, some of which overlap with content reuse.

Good and bad reasons for duplicate content

Duplicate web pages on a website are almost always bad. A web page should live in only one place on a website. When the same page exists in several places on a website, it’s fairly easy for software to locate such pages. Numerous tools can scan your website for duplicate pages by computing a checksum for each page and comparing the results: identical pages produce identical checksums.
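To make the checksum idea concrete, here is a minimal sketch in Python. It assumes the page text has already been extracted into a dictionary keyed by URL; the function name, the input shape, and the sample URLs are hypothetical, and real auditing tools may normalize pages differently.

```python
import hashlib
from collections import defaultdict


def find_exact_duplicates(pages):
    """Group URLs whose body text has the same SHA-256 checksum.

    `pages` maps a URL to the page's body text (a hypothetical input shape).
    """
    urls_by_checksum = defaultdict(list)
    for url, body in pages.items():
        # Collapse whitespace so trivial formatting differences don't hide a match.
        normalized = " ".join(body.split())
        checksum = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        urls_by_checksum[checksum].append(url)
    # Only checksums shared by more than one URL indicate duplicates.
    return [sorted(urls) for urls in urls_by_checksum.values() if len(urls) > 1]


# Illustrative data: the first two pages differ only in spacing.
pages = {
    "/products/widget": "The widget ships in two sizes.",
    "/archive/widget": "The widget   ships in two sizes.",
    "/about": "We make widgets.",
}
duplicates = find_exact_duplicates(pages)
```

A checksum only catches exact matches after normalization. Change a single word and the checksum changes completely, which is why near-duplicates need a different approach.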

When the same page exists across distinct web domains, the advisability of having the same content appear in multiple places gets more complicated. Sometimes, such behavior indicates a poorly governed publishing process, where a page is copied to various domains without either tracking this copying or asking if it is necessary. But not all situations are problems. There are legitimate use cases for publishing the same content on distinct pages on different websites. Content may be repeated across localized web domains or domains for subbrands of an organization.

Content syndication republishes the same page on multiple domains so that audiences can find it where they are already looking, rather than having to hunt for it on an unfamiliar website. Organizations syndicate content throughout their own web properties or make it available to third parties.

The audience’s needs should determine whether the content should be placed on multiple websites.

When identical web pages appear on multiple websites, this can be implemented in several ways. The pages can be shared either through RSS or an API that other websites can access. But often the original page is copied to a new website. The existence of multiple copies that are independent of one another introduces many content management inefficiencies and risks.

The copying of webpages is often a consequence of the way CMSs are designed. Traditional CMSs support a single website, relying on folders and sitemaps to organize pages. Each additional website that needs the page must have the page copied into that site’s page organization. While CMSs that support multiple websites have emerged recently, some still don’t allow the original content to be organized independently of where on a website it will appear.

Duplicated content results from both human decisions and automated ones.

Collateral duplication on a website can happen when pages are autogenerated and are expected to “belong” in multiple places as part of different collections.
Web aggregators duplicate content by republishing some or all of the content items from multiple sources. Aggregators are common for news, customer reviews, hotels, food delivery, and other topics.
Website mirroring, copying an entire website to another URL, may be set up to ensure the availability of content. Mirrors can enable faster access for users or preserve content that might otherwise be blocked or taken down.

When organizations intend to duplicate content, they can do so for either good or bad faith motives.

Good faith motivations reflect users’ interests by making content available where they are looking for that content. Republishing of content is allowed and encouraged. The US Department of Health and Human Services encourages the syndication of its content: “Content syndication allows you to place content from HHS websites onto your own site. It allows you to offer high-quality HHS content in the look and feel of your site. The syndicated content is automatically updated in real-time, requiring no effort from your staff to keep the pages up to date.”

Bad faith motivations include the intention to spam users by blanketing every place they might be. “‘Copypasta’ (a reference to copy-and-paste functionality to duplicate content) is an Internet slang term that refers to an attempt by multiple individuals to duplicate content from an original source and share it widely across social platforms or forums,” noted a well-known social media platform that subsequently changed its ownership and name. Of course, people alone aren’t responsible for copypasta; nowadays, bots do most of the work.

In other cases, duplication involves efforts to misrepresent who the author is or to disguise the organization that is publishing the content. Bad actors can steal content and republish it through adversarial proxy mirroring (the wholesale copying of a website that is rebranded) and web scraping (lifting published content and republishing it elsewhere without permission). Such copy-theft is illegal but technically easy to perform.

Near-duplicates: a pervasive phenomenon

While identical duplicate web pages are not uncommon, an even more pervasive situation is “near dupes” or items that duplicate some content but also contain unique content.

Near-duplicate content can be planned or incidental. Similarity in content items signals thematic repetition across multiple items. Near-duplicate content often represents variations on a core set of messages or information.

Templates in e-commerce sites generate many pages of near duplicate content. They combine data feeds of product descriptions with boilerplate copy. Each product page has some identical wording it shares with other pages.

Unlike checks for exact duplicates, auditing for near-duplicates involves noting both what’s the same and what’s unique. The audit needs to determine where items are dissimilar and whether that is intentional. Sometimes, copies of items are updated unevenly so that there are different versions of what should be identical text. Any variations within a set of near-duplicates should convey distinct information or messages.

Also, note that near-duplicates aren’t necessarily the repetition of exact prose. They may be summarizations or extensions. “A near-duplicate is, in some cases, a mere paraphrasing of a previous article; in other cases, it contains corrections or added content as a follow-up.” Both publishers and readers can find value in extending what’s been previously said.

Looking at duplication from internal and external perspectives

Duplicated content can trigger a range of problems and consequences. Duplicated published content may or may not be a problem. Duplicated unpublished content is almost always problematic.

Let’s start by looking at the internal consequences of duplicative content. Multiple versions of the same item are confusing to authors, editors, and content managers. No one can be sure which is the “right” version. Ironically, the latest version may not be the right one if someone creates a new copy and starts editing it without completing a full review. Abandoned drafts can also cloud which one is the active one. An unapproved version could be delivered to customers.

The simple guideline to follow is that you shouldn’t have exact copies of items in your content repository. Any near duplicates in your content inventory should be managed as content variants. (For a discussion of the distinction between versions and variants, see my post on content history.)

Now, let’s consider the situation of published content that’s been duplicated. Is it bad for audiences? It can be, but won’t necessarily be.

A wrong assumption often made about duplicated published content is that audiences will encounter it all at once. Many organizations rely on web crawls to simulate how audiences encounter their content. Web crawls often turn up duplicate pages. It doesn’t follow that an individual will necessarily encounter these duplicates. Ironically, “duplicated pages can even be introduced by the crawler itself, when different links point to the same page.”
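One way to reduce crawler-introduced duplicates is to normalize URLs before comparing pages, so that different links to the same page collapse into one entry. The sketch below is illustrative only: the list of tracking parameters is an assumption, as is the choice to treat a trailing slash and a fragment as insignificant. Some sites serve different content at such URLs, so the rules would need checking against the site being audited.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of query parameters that never change the page content.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign"}


def normalize_url(url):
    """Return a comparison key so that equivalent links map to one URL."""
    parts = urlsplit(url)
    # Drop tracking parameters and sort the rest so parameter order doesn't matter.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query)
        if key not in TRACKING_PARAMS
    )
    # Treat "/guide/" and "/guide" as the same path (an assumption).
    path = parts.path.rstrip("/") or "/"
    # Discard the fragment: it points within a page, not to a different page.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


links = [
    "https://Example.com/guide/?utm_source=news",
    "https://example.com/guide#top",
    "https://example.com/guide",
]
unique_pages = {normalize_url(link) for link in links}
```

In this example, three links that look different to a crawler reduce to a single page.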

An old myth in the SEO industry proclaimed that Google penalized duplicate content. But Google acknowledges that duplicate content, while potentially confusing to users, does not present a problem for Google’s search indexing: “Some duplicate content on a site is normal and it’s not a violation of Google’s spam policies. However, having the same content accessible through many different URLs can be a bad user experience (for example, people might wonder which is the right page and whether there’s a difference between the two), and it may make it harder for you to track how your content performs in search results.”

Duplicate content is often a symptom of other user experience issues, such as poor journey mapping or content labeling. No reader wants multiple links that all lead to the same item. When titles or links look similar, readers can’t be sure whether equivalent options are identical and equally useful or are really different content items. For example, users frequently choose the wrong product support link because they are unable to understand and define distinctions between product variants.

Reuse: How different is it from duplication?

Content reuse is widely advocated but sometimes loosely defined. It’s often not clear whether it refers to the internal reuse of content prior to publication or the external republication of content. Without making that distinction, it isn’t clear when or whether duplication of content occurs. How does one apply the famous adage in content practice to be “DRY” (Don’t Repeat Yourself)? Should content not be repeated externally or only internally?

People may advocate reuse for a range of reasons:

Reuse for message and information consistency
Reuse for internal sharing and joint collaboration
Reuse to save content development effort
Reuse to promote messages and information more widely externally

Content reuse implies that one copy of a content item can appear many times in various guises. The reality behind the scenes is more complicated, and it is perhaps more accurate to think about content reuse as managed duplication.

Reuse implies one original content item will serve as the basis for published content that’s delivered in various contexts. When implemented in publishing toolchains, there will likely be more than one copy. If you care about business continuity, your repository will likely have a mirror and backup, and it’s possible an item will be cached in other systems involved in the publishing and delivery process. But while copies may exist, there’ll only be one original.

The original copy is sometimes referred to as the canonical one. Any changes are made only to the original; the other copies are read-only. Importantly, all changes are reversible since the copies are dependent on the original or are stored temporarily. When duplicated copies are unmanaged, by contrast, each separate instance requires its own update, which often doesn’t happen.

It’s useful to distinguish delivery reuse (one item delivered to many places) from assembly reuse (one item incorporated into many other items). Most rationales for content reuse focus on internal content management requirements rather than external customer access benefits, but both are valid goals.

A wider perspective on reuse considers its role in contextualizing information and messages. Reused content can change the temporal and topical context.

Sometimes, reused content consists of standalone items: information or messages that need to be repeated in diverse scenarios. Such reuse allows targeted messages to be delivered at the right moment.

Other times, reused content is inserted into a larger item. But when reused content is incorporated into larger content items, content reuse can generate near-duplicates. Templated content, for example, repeats wording on multiple pages, making it hard for users to distinguish various items. From an external user’s perspective, reused content can be indistinguishable from duplicated content.

Reuse can support content customization. Organizations are expected to generate many variations of core content. Reuse has its roots in document management, the assembling of long-form documents that are built from both repeated text and customized text. But as online content moves away from long-form documents like product manuals and becomes more granular and on-demand, content customization is changing. Reuse in content assembly is still important, but more content is now reused directly by delivering standalone snippets or chunks.

The value of de-duplicating content

Detecting duplicate content has become a mini-industry. Numerous technical approaches can identify duplicated content, and a range of vendors offer de-duplication solutions.

One vendor focuses on monitoring repetition in what’s published online, asserting, “There’s a wide variety of use cases for duplicate detection in the field of media monitoring, ranging from virality analyses and content distribution tracking to plagiarism detection and web crawling.”

Content aggregators need to filter duplicates. Another vendor sells a “content deduplication/travel content mapping solution” that gives customers “the opportunity to create your own hotel database and write original material.”

When organizations create content, they need to preclude making redundant content. One firm offers a tool to prevent writers from creating duplicate content on intranets. The problem is not trivial: how do writers know what’s already been created? They may create a new item that doesn’t have the exact wording of an existing one, but with a focus that’s nearly identical.

Governance based on well-defined content types (indicating a clear purpose for the content) and accurate, descriptive metadata (indicating the content’s scope) is essential to preventing redundant content. Authors should be prompted to answer what the content is about before starting to create it. The inventory can then be checked to see what existing content might be similar.

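Since near-duplicates are more difficult to identify than exact ones, tools need to do “fuzzy” searches to find overlapping items. Techniques include “shingling,” which chops text into overlapping sequences of words or characters, and “MinHash,” which compares compact hash signatures of those sequences to estimate how similar two items are. Items whose similarity exceeds a chosen threshold are flagged as near-duplicates.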
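As a rough illustration of how shingling and MinHash fit together, here is a small Python sketch. It is not how any particular vendor’s tool works. The three-word shingle size, the 64 hash functions, and the use of salted SHA-256 as the hash family are all assumptions chosen to keep the example short.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word sequences) in text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity: shared shingles divided by all distinct shingles."""
    return len(a & b) / len(a | b)


def minhash_signature(shingle_set, num_hashes=64):
    """For each salted hash function, keep the smallest hash value seen."""
    signature = []
    for seed in range(num_hashes):
        salt = str(seed).encode("utf-8")
        signature.append(min(
            int.from_bytes(hashlib.sha256(salt + s.encode("utf-8")).digest()[:8], "big")
            for s in shingle_set
        ))
    return signature


def estimated_similarity(sig_a, sig_b):
    """The fraction of matching signature slots approximates Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


first = shingles("the quick brown fox jumps over the lazy dog")
second = shingles("the quick brown fox leaps over the lazy dog")

exact = jaccard(first, second)
estimate = estimated_similarity(minhash_signature(first), minhash_signature(second))
```

The exact Jaccard score requires comparing full shingle sets. The MinHash estimate compares only short signatures, which is what makes the approach practical across a large inventory. The estimate is approximate and becomes more reliable as the number of hash functions grows.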

While readers don’t want to wade through duplicate items or have to disambiguate them, the same is true for machines – only at a larger scale. Software programs can behave oddly if the inventory of content emphasizes certain items too much. Duplication can introduce bias in software algorithms because programs are more inclined to select from duplicated information when performing searches or generating answers. Duplication of content has emerged as a concern in large language models.

Recent research by Amazon suggests that duplication can interfere with the relevancy of answers provided by LLMs.

If many similar items exist, which one should be canonical? In some cases, no one item will be a “best” representative. LLMs can generate a cross-item summarization of the near duplicates, providing a composite of multiple items that are similar but not identical.

Deduplication is emerging as an important requirement for the internal governance of content.

– Michael Andrews

Tags content reuse

Content Operations

Tracking content’s history: the key to better content maintenance

Post author By Michael Andrews
Post date May 7, 2024

To control how content changes, teams must be able to track the content’s history. A complete profile of changes in the content’s maintenance and usage can guide how and when to intervene.

Content maintenance isn’t about maintaining the status quo. Maintaining content requires change management.

Maintenance has always been a vexing dimension of content operations. Some forms of content resist change, while others change organically in a messy ad hoc manner.

Previously, I examined the digital transformation of content workflows to improve the accuracy of content as it is created. I also looked at opportunities to develop content paradata to determine, among other things, how content has changed. This post continues the discussion of how to track content changes to improve content maintenance.

The constant of change

The famous 20th-century economist John Maynard Keynes purportedly replied to someone who questioned the consistency of his views: “When the facts change, I change my mind. What do you do, sir?”

Does our content adjust to reflect how we’ve changed our perspectives, or is it frozen at the time it was published? Does it adapt when the facts change?

Change involves both a recognition that circumstances have shifted and a willingness to reconsider a prior position. From a process perspective, that involves two distinct decisions:

1. Determining that the content is not current

2. Deciding to change the content

A body of content items resembles the proverbial forest of trees. If a tree falls without anyone noticing, will anyone know or care to clear the tree trunk blocking a pathway? Often, people notice content is outdated long after it has become so. The lag that has elapsed can influence the perceived urgency to change the content. Outdated content that’s noticed quickly is often more likely to be changed.

Content change management requires awareness of all the changes in circumstances that influence the relevance of content and the ability to prioritize, invest, and execute in making appropriate content changes.

Despite the strong emphasis on delivering consistent content, content is rarely static and will likely change. The challenge is to manage change in a consistent way.

How content changes

Must be discernible
Should be based on defined rules
Will shape what insights and actions are available

Content consistency requires internal consistency, not immutability. While it’s relatively easy to change a single webpage, managing changes at scale is challenging because the triggers and scope of changes are diverse.

Content maintenance gets short shrift in Content Lifecycle Management

It makes little sense to talk about the lifecycle of content without reference to its lifespan. Ephemeral content tends to be deleted quickly. Lifecycle management often presumes the content will be short-lived and consequently focuses most attention on the content development process.

Content Lifecycle Management (CLM) discussions often lack specifics about what happens to content after publication. They typically suggest that content should be maintained and then retired when it’s no longer needed, advice that is too general to be readily implemented. The advice doesn’t tell us what should be done with published content under what circumstances at what point in time.

Advice about content lifecycle management is often vague. One of Google’s top links related to content lifecycles is this AI-generated LinkedIn stub, which has no human-contributed content.

Consider the basic existential question of whether out-of-date content should be maintained or retired. The question prompts further ones: How valuable would an updated version of the content be? How much effort would be involved to make the content up-to-date, especially if it hasn’t been updated in a while?

Often, the guiding goal of keeping content up-to-date overshadows the practicalities of doing so. Should content have distinct versions or only one version? Should the content only reflect present circumstances, or does it need to state what it has presented previously?

The status or state of content needs specificity

CMSs generally distinguish content items by whether they are in draft or published. While that distinction is essential, it doesn’t tell editors much about what has happened to content in the past.

Even draft content can have a backstory. A surprising amount of content never leaves the draft state. Abandoned drafts are sometimes never deleted. Pre-publication content requires maintenance too.

Conversely, some published content never goes through a draft stage. Autogenerated content (including some AI-generated text) can be automatically published. Even though this content was never human-reviewed prior to publication, it’s possible it will need maintenance after it’s been published if the automation generates errors or the material becomes dated.

Maintenance is a general phase rather than a specific state. Maintenance can have many expressions:

Revision
Updating
Correction
Unpublishing because the item is not currently relevant
Archiving to freeze an older topic no longer current
Deleting superfluous or dated content that doesn’t deserve revision

How does content change?

Despite the importance of content maintenance, few people say they will maintain an item or group of items. Content maintenance is not well-defined or operationalized. Instead, staff talk about changes in generic terms, such as editing items or getting rid of them. They talk about making revisions or updates without distinguishing these concepts.

Content changes involve a range of distinct activities. The following table enumerates distinct states for content items, describing changes.

Status	Description and behavior
Published	Lists publication date. May indicate “new” if recent and not previously published. If content has been reviewed since publication but not changed, it may indicate a “last reviewed” date.
Revised	Stylistic revisions (wording or imagery changes) are not typically announced publicly when they don’t impact the core information in the content. Each revision, however, will generate a new version.
Updated	Updates refer to content changes that add, delete, or change factual information within the content. They can be announced and indicated with an update date that’s separate from the original publication date. Some publishers overwrite the original publication date, which can be confusing if it provides the impression that the content is new.
Corrected	Correction notices state what was previously published that was wrong and provide the correct information. Corrections commonly relate to spellings, attributions of people or dates, and factual statements. They are used when there’s a likelihood that readers will become confused by seeing conflicting statements appearing in an article at different times.
Republished	Republished content sometimes indicates that the item was originally published on a certain date or website.
Published archive	Legacy content that needs to remain publicly accessible even though it is not maintained is published as an archive edition. Such content commonly includes a conspicuous banner announcing that it is out-of-date or that the information has not been updated as of a specific date. It also sometimes includes a redirect link if there’s a more current version available.
Scheduled	While scheduled is commonly an internal status, sometimes websites indicate that content is scheduled to appear by stating, “Coming on X date at Y time.” This is most common for announcements, product releases, or sales promotions.
Offline temporarily	When published content is offline to address a bug or problem, it may be noted with a message announcing, “We are working on fixing issues.”
Previously live	Used for recordings of live-streamed content, especially video.
Deleted	When content is deleted and no longer available, many publishers simply provide a generic redirect. But when users expect to find the content item by searching for it specifically, it may be necessary to provide a page announcing the page is no longer available and provide a specific redirect link to the most relevant available content addressing the topic.
Unpublished	Unpublished content is available internally for republishing but externally will resemble deleted content.
Read-only	While most digital content is editable, some will be read only on publication and not human editable. Examples are templated pages of financial data or robot-written stories about weather forecasts. While options for media editing are growing, much media, such as video, is difficult to edit after its publication.

Different content states

After content is published, many changes are possible. Sometimes, corrections are needed.

Updates indicate a date of review and potentially the name of the reviewer.

Detailed update history. Screenshot: Healthline

Retiring old content involves decisions. Sometimes, entire websites are archived but still accessible.

An archived website showing an archived webpage about web archiving. Source: EPA

When canonical content changes, such as standards, it is important to retain copies of prior versions that users may have relied upon.

Content items can transition between various statuses. The diagram below shows the different states or statuses content items can be in. The dashed lines indicate some of the significant ways that content can change its state.

Content states and transitions (dotted lines).

The content’s state reflects the action taken on an item. The current state can influence what future actions are allowed. For example, when published content is taken offline, it is unpublished, though it remains in the repository. An unpublished item can be republished.
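The idea that the current state constrains future actions can be sketched as a small state machine. The Python below is a simplified illustration: it covers only a handful of the states in the table, and the specific transition rules are assumptions made for the example rather than a transcription of the diagram.

```python
from enum import Enum


class State(Enum):
    DRAFT = "draft"
    PUBLISHED = "published"
    UNPUBLISHED = "unpublished"
    ARCHIVED = "published archive"
    DELETED = "deleted"


# Illustrative transition rules (an assumption, not the full diagram).
ALLOWED = {
    State.DRAFT: {State.PUBLISHED, State.DELETED},
    State.PUBLISHED: {State.UNPUBLISHED, State.ARCHIVED, State.DELETED},
    State.UNPUBLISHED: {State.PUBLISHED, State.DELETED},
    State.ARCHIVED: {State.DELETED},
    State.DELETED: set(),
}


def transition(current, target):
    """Return the new state, or raise if the current state forbids the move."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


state = State.PUBLISHED
state = transition(state, State.UNPUBLISHED)  # taken offline, kept in the repository
state = transition(state, State.PUBLISHED)    # republished
```

Writing the rules down as data makes them easy to review and change. A team could decide, for instance, whether archived content may ever return to a maintained state.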

Most states are effective immediately, but a few are pending, where the system expects and announces changed content is forthcoming. Some will indicate the date of changes, but other states don’t indicate that publicly.

Maintained content is subject to change

The biggest factor shaping a content item’s status is whether or not it is maintained. Only in a few circumstances will content not require maintenance.

If the organization has opted to publish content and keep it published, it has implicitly decided to maintain it by continuing to make it available. Of course, the publishing organization may do a poor job of maintaining that content. Maintenance should always be intentional, not an unplanned consequence of random choices to change or neglect items. But never confuse poor maintenance with no maintenance: they are separate statuses.

A maintained item can potentially change. Its details are subject to change because the content addresses issues that might change; the item is in a maintained phase whether or not it has been changed recently, or ever. Some people mistakenly believe that items that haven’t been updated or otherwise changed recently are unmaintained and thus no longer relevant. But unless there is a cause to change the content, there’s no reason to assume the content has lost relevance. Sometimes, the recency of changes will predict current relevance, but not always.

Some published content, such as read-only or published archival content, will not be subject to change. What such content describes or relates to is no longer active. But no-maintenance content is rare.

Content will no longer be subject to change when it has been frozen or removed. Only then will the content be no longer maintained. Depending on the value of such legacy content, it can either remain published for a defined time period or be deleted immediately once it is no longer maintained. Like software and other products, content needs an “end-of-life” process.

Why does content change?

When content managers discover content that needs to be changed, they create a task to fix the problem. Content maintenance often involves a backlog of tasks that are managed through routine prioritization.

Content managers would benefit from more visibility into why content items require changes so they can estimate the effort involved with different types of changes. They need a root-cause analysis of their content bugs.

Some changes are planned, but even unplanned changes can be anticipated to some degree. Changes also vary in their urgency and timescale. Some require immediate attention but are quick to fix. Others are more involved but may be less urgent. Unfortunately, in many cases, changes that are not considered urgent are deemed unimportant. By understanding the drivers of change, content managers can estimate the need and effort involved with various content changes and plan accordingly.

Content change triggers.

Planned changes include those related to product and business announcements, scheduled projects involving content, new initiatives, and substitutions based on current relevance.

Internal errors and external surprises can prompt unplanned changes.

Events, whether planned or unplanned, generate a gap between the existing content and what is needed. Details may now be:

Missing
Inaccurate
Mismatched with user expectations
No longer conformant with organizational guidelines
Confusing
Obsolete

Changes in items can cascade. More than one cycle of changes may be needed. For example, updating items may introduce new errors. Errors such as misspellings, wrong capitalization and punctuation, and inadvertent deletions are as likely to arise when editing as when drafting. Changes in certain content items may cause the details in other related items to fall out of sync, so those items need to change as well.

While content maintenance centers on changing content, it also involves preserving the intent of the content. Maintenance can preserve two critical dimensions:

The item’s traceability
Its value

Poorly managed content is difficult to trace. Many changes happen stealthily – someone fixes a problem in the content after spotting an error without logging this change anywhere. Maybe the author hopes no one else noticed the mistake and decides that it’s no longer a concern because it’s fixed. But suppose a customer took a screenshot of the content before the fix and perhaps shared it on social media. Can the organization trace how the content appeared then? Versioning is essential for content traceability over time, because it provides a timestamped snapshot of content. Autogenerated versions announce that changes have occurred.

Content changes are essential for maintaining the value of published content. Consider so-called evergreen content, which has enduring value and will stay published for an extended time. Despite its name, evergreen content requires maintenance. The lifespan of such content is determined by its traction: whether it is relevant and current. The utility of the content depends on more than whether or not the content needs to be updated. Up-to-date content may no longer be relevant to audiences or the business. Goals age, as does content. If the content no longer supports current goals because those goals have morphed, then the content may need to be unpublished and deleted.

Content variants and ‘content drift’

A shift in the goals for the original content can produce a different kind of change: a pivot in the content’s focus.

How far can the content change before its identity changes so much that it is no longer what was originally published? At what point do revisions and updates result in the content talking about something different from what was initially published?

It’s important to distinguish between content versions and variants. They have different intents and need to be tracked separately.

Versions refer to changes to content items over time that do not change the focus of the content. An item is tracked according to its version.

Variants refer to changes that introduce a pivot in the emphasis of the content by changing its focus or making it more specific. A variant does not simply change wording or images but essentially reconfigures the original content. A variant creates a new draft that is tracked separately.

Unlike versions, which happen serially, several variants can exist simultaneously. Only one version can be current at a given time, but many variants can be current at once.

Variants arise when organizations need to address a different need or change the initial message. Writers often refer to this process as “repurposing” content. With the adoption of GenAI, repurposing existing content has become easy.

However, the unmanaged publication of repurposed content can generate a range of challenges. Content managers can have trouble keeping “derivative content” current when it is unclear what that content is based on.

When pivots happen gradually, content changes are hard to notice. Various writers and editors continually change the item, subtly altering the content’s purpose and goals. The changes behave like revisions, where only one version is current. But they also resemble variants, where the emphasis of the content shifts to the point that it has assumed a separate identity from its initial one. Such single-item fluidity is known as “content drift.”

A recent study by Harvard Law School (“The Paper of Record Meets an Ephemeral Web”) examined the “problem of content drift, or the often-unannounced changes––retractions, additions, replacement––to the content at a particular URL.” The URL is a persistent identifier of the content item, but the details associated with that URL have substantively changed without visitors knowing the changes occurred.

Examining sources cited by the New York Times, the Harvard team “noted two distinct types of drift, each with different implications. First, a number of sites had drifted because the domain containing the linked material had changed hands and been repurposed….More common and less immediately obvious, however, were web pages that had been significantly updated since they were originally included in the article. Such updates are a useful practice for those visiting most web sites – easy access to of-the-moment information is one of the Web’s key offerings. Left entirely static, many web pages would become useless in short order. However, in the context of a news article’s link to a page, updates often erase important evidence and context.”

Watch out for the ever-morphing page. Various authors can change content items over months or years. As old references are deleted and new buzzwords are introduced, the changes produce the illusion that the content is current. But the original message of the content, motivated by a specific purpose at a particular time, is compromised in the process.

The phenomenon of content drift highlights the importance of precisely tracking content changes. Many organizations maintain zombie pages that continually change because the URL is considered more valuable than the content. A better practice is to create new items when the focus shifts.

Practices that content management can learn from data management

Even though content involves many distinct nuances, its maintenance shares challenges facing other digital resources such as data and software code. Content management can learn from data management practices.

Diff checking versions and variants

Diff checking is a common utility for comparing file contents. Although it is most widely used to compare lines of text, it can also compare blocks of text and even images.

While diff checking is most associated with tracking changes in software code, it is also well established for checking content changes. Some common diff checking use cases include detecting:

Plagiarism
Alteration of legal text
Omissions
Duplication of text in different files

The primary use of diff checking in content management is to compare two versions of the same content item. The process is easiest to see when presenting two versions side-by-side, clearly showing additions and deletions between the original and subsequent versions.

This is an example of using diff checking to compare different versions of content, which highlights both word changes and the deletion of the second section (“aufgehoben”). Screenshot: Diffcheck.net
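As a rough sketch of how such a version comparison works, the following uses the difflib module from Python’s standard library to list the words removed and added between two versions of an item. The sample sentences and the function name word_changes are invented for illustration and don’t come from any particular CMS.

```python
import difflib

def word_changes(old_text, new_text):
    """Return (removed, added) word lists between two versions of an item."""
    old_words = old_text.split()
    new_words = new_text.split()
    removed, added = [], []
    # ndiff prefixes each token with "- " (only in the old version),
    # "+ " (only in the new version), "  " (in both), or "? " (hint lines).
    for token in difflib.ndiff(old_words, new_words):
        if token.startswith("- "):
            removed.append(token[2:])
        elif token.startswith("+ "):
            added.append(token[2:])
    return removed, added

# Hypothetical versions of the same content item.
version_1 = "The store opens at 9 am on weekdays"
version_2 = "The store opens at 10 am on weekdays and Saturdays"

removed, added = word_changes(version_1, version_2)
print("Removed:", removed)
print("Added:", added)
```

Comparing words rather than whole lines is a design choice: prose is often stored as long paragraphs, so a line-based diff would flag the entire paragraph for a one-word change.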

Organizations can use diff checking to compare different content items. Cross-item comparisons can help teams identify what parts of content variants should be consistent and which should be unique.

Screenshot of copyscape app comparing two variants of weather reports. Different wording is grey, while common wording is in blue.

Cross-item diff checking can identify:

Duplication
Points of differentiation
The presence of non-standard language in one of the items
Evidence of content provenance, to support forensic investigation

Unfortunately, cross-item comparison is not a standard functionality in CMSs. Yet it is an essential capability for managing the maintenance of content variants. It can determine the degree of similarity between items.
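One simple way to estimate the degree of similarity between items is to score their shared wording. The sketch below uses SequenceMatcher from Python’s difflib. The item IDs, sample texts, and the 0.6 threshold are assumptions chosen for illustration, and a word-overlap score like this will not catch rephrased text.

```python
from difflib import SequenceMatcher

def similarity(item_a, item_b):
    """Return a score from 0.0 to 1.0 for how much wording two items share."""
    words_a = item_a.lower().split()
    words_b = item_b.lower().split()
    return SequenceMatcher(None, words_a, words_b).ratio()

def likely_variants(items, threshold=0.6):
    """Yield (id, id, score) for item pairs at or above the threshold."""
    ids = sorted(items)
    for i, first in enumerate(ids):
        for second in ids[i + 1:]:
            score = similarity(items[first], items[second])
            if score >= threshold:
                yield first, second, round(score, 2)

# Hypothetical content items keyed by ID.
items = {
    "forecast-north": "Expect light rain in the morning with highs near 15 degrees",
    "forecast-south": "Expect heavy rain in the morning with highs near 18 degrees",
    "returns-policy": "Items can be returned within 30 days of purchase",
}

for first, second, score in likely_variants(items):
    print(first, second, score)
```

Pairs that score highly are candidates for review as variants: the shared wording shows what should stay consistent, and the remainder shows the points of differentiation.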

Comparison tools are no longer limited to checking for identical wording. Newer capabilities incorporating AI can identify image differences and spot rephrasing in text. They can compare not only known variants but also locate hidden variants that arose from the copying and rewriting of existing items.

Understanding the pace of changes

Content managers sometimes describe content as either static or dynamic. These concepts help to define the user experience and delivery of the content. Can the content be cached where it is instantly available, or will it need to fetch updates from a server, which takes longer?

The static/dynamic dichotomy alludes to a broader issue. Updates impact not only the technical delivery of the content but also the behavior of content developers and users.

Data managers classify data according to its “temperature”—how actively it is used. They do this to decide how to store the data. Frequently changing data needs to be accessed more quickly, which is more expensive.

Content managers can borrow and adapt the concept of temperature to classify how frequently content is updated or otherwise changed. Update frequency doesn’t necessarily influence how content is stored, but it does influence operational processes.

Update frequency will shape how content is accessed internally and externally. The demand for content updates is related to the frequency of updating. Publishers push content to users when updating it; the act of updating generates audience demand. Users pull content that has changed. They seek content that offers information or perspectives that are more useful than were available before the change.

We can understand the pace of changes to content by classifying content changes into temperature tiers.

Temperature	Content relevance
Hot	The most “dynamic” content in terms of changes. Includes transactional data (product prices and availability), customer submission of reviews and comments, streaming, and liveblogging. Also covers “fresh” (newly published) content and possibly the most requested content, since these items are the least stable and are often revised.
Warm	Content that changes irregularly, such as active recent (rather than just-published) content. Sometimes only a subset of the item is subject to change.
Cold	Content that is infrequently accessed and updated, and is nearly static or archival. It may be kept for legal and compliance reasons.

Content temperature tiers
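To make the tiers concrete, here is a minimal sketch that assigns a tier from an item’s change dates. The cutoffs (four changes within 30 days for hot, any change within a year for warm) are placeholder assumptions, not recommended values, and the function name is hypothetical.

```python
from datetime import date

def temperature_tier(change_dates, today, hot_window=30,
                     hot_min_changes=4, warm_window=365):
    """Classify an item as hot, warm, or cold from how often it changes."""
    ages = [(today - d).days for d in change_dates]
    # Hot: changed repeatedly within the recent window.
    if sum(0 <= age <= hot_window for age in ages) >= hot_min_changes:
        return "hot"
    # Warm: changed at least once within the longer window.
    if any(0 <= age <= warm_window for age in ages):
        return "warm"
    # Cold: no recent changes at all.
    return "cold"

today = date(2024, 6, 25)
price_page = [date(2024, 6, 3), date(2024, 6, 10),
              date(2024, 6, 17), date(2024, 6, 24)]
guide = [date(2024, 1, 15)]
archive = [date(2021, 3, 1)]

print(temperature_tier(price_page, today))
print(temperature_tier(guide, today))
print(temperature_tier(archive, today))
```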

More ephemeral “hot” content will be “post and forget” and won’t require maintenance until it is purged. Other hot content will require vigilant review in the form of updates, corrections, or moderation. What all hot content shares is that it is top of mind and likely easily accessed.

“Warm” content is less top of mind and is sometimes neglected as a result. Given the prioritization of publishing over maintenance, warm content is changed when problems arise, often unexpectedly. The timing and nature of changes are more difficult to predict. Maintenance happens on an ad hoc basis.

“Cold” content is often forgotten. Because it isn’t active, it is often old and may not have an identifiable owner. Managing such content still requires decisions, yet organizations generally have poor processes for making them.

Versioning strategies for ‘Slowly Changing Dimensions’

Warm content corresponds to what data managers call slowly changing dimensions (SCD), another concept that can help content managers think about the versioning process.

Wikipedia notes: “a slowly changing dimension (SCD) in data management and data warehousing is a dimension which contains relatively static data which can change slowly but unpredictably, rather than according to a regular schedule.”

While software engineers developed SCD to manage the rows and columns of tabular data, content managers can adapt the concept to address their needs. We can translate the tiering to describe how to manage content changes. Rows are akin to content items, while columns broadly correspond to content elements within an item.

SCD Type	Equivalent content tracking process
Type 0	Static single version. Always retain the original content as is. Never overwrite the original version. When information differs from existing content, create a new content item.
Type 1	Changeable single version. Used for items when there’s only one source of truth that is mutable, for example, the current weather forecast. What’s been stated in the past is no longer relevant, either internally or externally.
Type 2	Create distinct versions. Each change, whether a revision, update, or correction, generates a new version that has a unique version number. Changes overwrite prior content, but status can be rolled back to an earlier version.
Type 3	Version changes within an item. Rather than generating versions of the item overall, the versioning occurs at the component level. The content item will contain a patchwork of new and old, so that authors can see what’s most recently changed.
Type 4	Create a change log that’s independent of the content item. It lists status changes, the scope of impact, and when the change occurred.

Slowly changing dimensions of content — management tiers

Types 0 and 1 don’t involve change tracking, but the higher tiers illustrate alternative approaches to tracking and managing content versions.
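The difference between a Type 1 and a Type 2 approach can be sketched in a few lines. The class names below are hypothetical, and treating a rollback as a new version that republishes earlier wording is one possible interpretation rather than how any specific CMS behaves.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Type1Item:
    """Changeable single version: each edit overwrites the body, no history."""
    body: str

    def update(self, new_body):
        self.body = new_body

@dataclass
class Type2Item:
    """Distinct versions: each edit appends a numbered, timestamped version."""
    versions: list = field(default_factory=list)

    def update(self, new_body):
        number = len(self.versions) + 1
        self.versions.append((number, datetime.now(timezone.utc), new_body))
        return number

    @property
    def current(self):
        """Body of the most recent version."""
        return self.versions[-1][2]

    def roll_back(self, number):
        """Republish an earlier version's body as a new version."""
        return self.update(self.versions[number - 1][2])

forecast = Type1Item("Sunny")
forecast.update("Rain")          # "Sunny" is gone for good

page = Type2Item()
page.update("Price: $10")
page.update("Price: $12")
page.roll_back(1)                # version 3 repeats version 1's wording
print(page.current, len(page.versions))
```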

CMSs use varied implementations of version comparison.

Kontent.ai illustrates an example of Type 2 version comparison. Their CMS allows an editor to compare any two versions within a single view. It distinguishes added text, removed text, and text with format changes.

An example of comparing two versions (indicated with check marks) of an item in Kontent.ai. The example also shows an older abandoned draft

Optimizely has a feature supporting a Type 3 version comparison. Their CMS has a limited ability to compare properties between versions.

The Wikipedia platform provides content management functionality. Wikipedia’s page history is an example of a table of changes associated with a Type 4 approach. Some of these are automatic edit summaries.

An even more complete summary would go beyond a change log that provides a basic timeline and become a complete change history that lists:

When the content was changed, and how the timing relates to other events (publication event, corporate event, product development event, marketing campaign event)
Why it was changed (the reason)
What was changed (the delta)
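A change history entry of this kind could be modeled as a simple record kept apart from the content item. This is only a sketch: the field names, the sample entries, and the history_for helper are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class ChangeRecord:
    item_id: str
    changed_on: date                     # when the change took effect
    reason: str                          # why it was changed
    delta: str                           # what was changed
    related_event: Optional[str] = None  # e.g. a launch or campaign

def history_for(log, item_id):
    """Return one item's change records, oldest first."""
    records = [r for r in log if r.item_id == item_id]
    return sorted(records, key=lambda r: r.changed_on)

# Hypothetical log entries, stored independently of the items themselves.
log = [
    ChangeRecord("pricing", date(2024, 5, 2), "correction",
                 "fixed currency symbol"),
    ChangeRecord("about-us", date(2024, 4, 1), "revision",
                 "tightened intro"),
    ChangeRecord("pricing", date(2024, 3, 15), "update",
                 "new plan added", related_event="product launch"),
]

for record in history_for(log, "pricing"):
    print(record.changed_on, record.reason, record.delta)
```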

Tracking content’s current and prior states

CMSs are largely indifferent about changes to published content. By default, they only track whether a content item is drafted, published, or archived. From the system’s perspective, this is all they need to know: where to put the content.

Like many CMSs, Drupal tracks content according to whether it is in draft, published, or archived.

The CMS won’t remember what’s specifically happened. It doesn’t store the nature of changes to published items or reference them in subsequent actions. Its focus is on the content’s current high-level status. The CMS only knows that the content is published, not that the most recent version was an update.

The cycle of draft-published-archive is known as state transition management. CMSs manage states in a rudimentary way that doesn’t capture important distinctions.

From a human perspective, content transitions are important to making decisions. The current state suggests potential transitions, but previous states can reveal more details about the history of the item and can inform what might be useful to do next.

To help teams make better decisions, the CMS should be more “stateful”: recording the distinctions among different versions instead of only recording that a new version was published on a certain date. Such an approach would allow editors to revert to the last updated version or find items that haven’t been updated since a certain date, for example.

A substantive change, such as an update or correction, and a non-substantive change, such as a minor wording revision, can trigger different workflows. For example, minor copyedits shouldn’t trigger a review workflow if the content’s substance doesn’t change and has already been reviewed.

The CMS should know about the prior life of content items. Yet CMSs can treat changes to published content as new drafts that have no workflow history, potentially triggering redundant reviews.

Because simple states don’t capture past actions, the provenance of content items can be murky. For example, how does a writer or editor know that one item is derived from another? Many CMSs prompt writers to create a new draft from an old one, but it isn’t always clear to the writer whether the new draft replaces the old one (generating a new version) or creates a new item (generating a new variant). Whenever a new item is created based on an old one, the maintenance burden grows.

Like many CMSs, Drupal allows a draft to be created from published content.

Content transitions are neither strictly linear nor entirely cyclical. Content doesn’t necessarily revert to a previous state. An unpublished item is not the same as a draft. What happened to published items previously can be of interest to editorial teams.

CMSs would benefit from having a nested state mechanism that distinguishes various states within the offline state (draft, unpublished, deleted) from those in the online state (published original [editable], revised, updated, corrected). In addition, the mechanism should recognize that more than one state can apply to an item. Old content can be unpublished and deleted, which may happen simultaneously or at different times. Existing content similarly can be revised for wording and updated for facts at the same or different times.

State transitions must be linked to version dates. The effective dates of changes are essential to understanding both the history of content items and their future disposition. For example, if a previously editable item is converted to read-only (a published archival version), it is helpful to know when that occurred. It is unlikely that an item, once archived, would be edited again.

Even though most CMSs only manage simple states and transitions, IT standards support more complex behaviors.

Statecharts, which the W3C has standardized as State Chart XML (SCXML), can address behaviors such as:

Parallel states, where different transitions are happening concurrently
Compound or nested states, where more specific states exist within broader ones
History states capturing a “stored state configuration” to remember prior actions and statuses

These standards allow for more granular and enduring tracking of content changes. Instead of each edit regressing back to a draft, the content can maintain a history of what actions have happened to it previously. A history state knows the point at which it was last left so that processes don’t need to start over from the beginning.
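The ideas of nested states and history can be illustrated without a statechart engine. The following plain Python sketch is not an implementation of the W3C standard. It only shows an item that knows its broader state (online or offline) and remembers the substates it held before. The class and method names are hypothetical, and the state names follow the ones suggested earlier in this post.

```python
OFFLINE = {"draft", "unpublished", "deleted"}
ONLINE = {"published", "revised", "updated", "corrected"}

class ContentState:
    """Tracks an item's current substate plus the substates it has left."""

    def __init__(self):
        self.current = "draft"
        # (substate, date it was left) pairs, oldest first
        self.history = []

    @property
    def parent(self):
        """The broader state that contains the current substate."""
        return "online" if self.current in ONLINE else "offline"

    def transition(self, new_state, effective_date):
        if new_state not in OFFLINE | ONLINE:
            raise ValueError(f"unknown state: {new_state}")
        self.history.append((self.current, effective_date))
        self.current = new_state

    def last_online_state(self):
        """Most recent online substate the item has held, if any."""
        if self.current in ONLINE:
            return self.current
        for state, _ in reversed(self.history):
            if state in ONLINE:
                return state
        return None

item = ContentState()
item.transition("published", "2024-01-10")
item.transition("corrected", "2024-02-01")
item.transition("unpublished", "2024-06-01")
print(item.parent, item.last_online_state())
```

Because the history survives the move offline, a workflow could see that this item was last published as a corrected version rather than treating it as a fresh draft.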

A ‘Data Historian’ for content

Writers, editors, and content managers have trouble assessing the history of changes to content items, especially for items they didn’t create. CMSs don’t provide an overview of historical changes to items.

Wikipedia, which is collectively written and edited, provides an at-a-glance dashboard showing the history of content items. It shows an overview of edits to a page, even distinguishing minor edits that don’t require review, such as changes in spelling, grammar, or formatting.

Wikipedia page history dashboard (partial view)

Like Wikipedia, software code is collectively developed and changed. Software engineers can see an “activity overview” that summarizes the frequency and type of changes to software code.

It’s a mistake to believe that the history of changes isn’t important just because systems and people routinely and quickly change digital resources.

The value of recording status transitions goes beyond indicating whether the content is current. The history of status transitions can help content managers understand how issues arose so they can be prevented or addressed earlier.

Data managers don’t dismiss the value of history – they learn from it. They talk about the concept of historicizing data or “tracking data changes over time.” Data history is the basis of predictive analytics.

Some software hosts a “data historian.” Data historians are most common in industrial operations, which, like content operations, involve many processes and actions happening across teams and systems at various times.

One vendor describes the role of the historian as follows: “A data historian is a software program that records the data of processes running in a computer system….The data that goes into a data historian is time-stamped and cataloged in an organized, machine-readable format. The data is analyzed to compare such things as day vs. night shifts, different work crews, production runs, material lots, and seasons. Organizations use data from data historians to answer many performance and efficiency-related questions. Organizations can gain additional insights through visual presentations of the data analysis called data visualization.”

If automated industrial processes can benefit from having a data historian, then human-driven content processes can as well. History is derived from the same word as story (the Latin historia); history is storytelling. Data historians can support data storytelling. They can communicate the actions that teams have taken.

Toward intelligent change management

Numerous variables can trigger content changes, and a single content item can undergo multiple changes during its lifespan. Editors are expected to use their judgment to make changes. But without well-defined rules, each editor will make different choices.

How far can rules be developed to govern changes?

A widely cited example of archiving rules is the US Department of Health and Human Services archive schedule, which keeps content published for “two full years” unless subject to other rules.

HHS archiving timeline (partial table). Some rules are time- or event-driven, but others require an individual’s judgment. Source: HHS

Even mature frameworks such as HHS still rely on guesswork when the archiving criteria are “outdated and/or no longer relevant.”

It’s useful to distinguish fixed rules from variable ones. Fixed rules have the appeal of being simple and unambiguous. A fixed rule may state: After x months or years following publication, an item will be auto-archived or automatically deleted. But that’s a blunt rule which may not be prudent in all cases. So, the fixed rule becomes a guideline that requires human review on a case-by-case basis, which doesn’t scale, can be inconsistently followed, and limits the capacity to maintain content.

Content teams need variable rules that can cover more nuances yet provide consistency in decisions. Large-scale content operations entail diversity and require rules that can address complex scenarios.

What can teams learn if content changes become easier to track, and how can they use that information to automate tasks?

Data management practices again suggest possibilities. The concept of change data capture (CDC) is “used to determine and track the data that has changed (the “deltas”) so that action can be taken using the changed data.” If a certain change has occurred, what actions should happen? A mechanism like CDC can help automate the process of reviewing and changing content.

Basic version comparison tools are limited in their ability to distinguish stylistic changes from substantive ones. A misplaced comma or misspelled word is treated as equivalent to a retraction or significant update. Many diff checking utilities simply crunch files without awareness of what they contain.
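A comparison step that is slightly more aware of what it is comparing might look like the sketch below. It is a crude heuristic under stated assumptions: a change in any number counts as substantive, a change only in case, punctuation, or spacing counts as stylistic, and everything else is treated as a wording change. The category names and the mapping to follow-up actions are hypothetical.

```python
import re
import string

NUMBER = re.compile(r"\d+(?:\.\d+)?")

def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    stripped = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(stripped.split())

def classify_change(old_text, new_text):
    """Label the delta between two versions of a passage."""
    if old_text == new_text:
        return "none"
    # Check numbers first, on the raw text, so decimals stay intact.
    if NUMBER.findall(old_text) != NUMBER.findall(new_text):
        return "substantive"
    if normalize(old_text) == normalize(new_text):
        return "stylistic"
    return "wording"

# Hypothetical follow-up actions keyed by the kind of change detected.
ACTIONS = {
    "none": "no action",
    "stylistic": "publish without review",
    "wording": "editor review",
    "substantive": "subject-matter review",
}

old = "Returns accepted within 30 days."
for new in ("Returns accepted within 30 days",
            "We accept returns within 30 days.",
            "Returns accepted within 14 days."):
    kind = classify_change(old, new)
    print(kind, "->", ACTIONS[kind])
```

The heuristic will misjudge some cases, such as a number reformatted from 1,000 to 1000, so its labels are best treated as routing hints rather than verdicts.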

Ways to automate changes at scale

Terminology and phrasing can be modified at scale using customized style-checking tools, especially ones trained on internal documents that incorporate custom word lists, phrase lists, and rules.

Organizations can use various strategies to improve oversight of substantive statements:

Templated wording, enforced through style guidelines and text models, directs the focus of changes on substance rather than style.
Structured writing can separate factual material from generic descriptions that are used for many facts.
Named entity recognition (NER) tools can identify product names, locations, people, prices, quantities, and dates, to detect if these have been altered between versions or items.

Substantive changes can be tracked by looking at named entities. Suppose the below paragraph was updated to include data from the 2018 Consumer Reports. A NER scan could determine the date used in the ranking cited in the text without requiring someone to read the text.

displaCy Named Entity Visualizer on Wikipedia text

NER can also be used to track brand and product names and determine if content incorporates current usage.
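The sketch below shows the general idea of comparing extracted facts between two versions. Note that it is not NER: it substitutes two simple regular expressions (for four-digit years and dollar prices) so the example stays self-contained, whereas a real setup would rely on a trained entity recognizer. The sample sentences and function names are invented for illustration.

```python
import re

# Stand-in patterns; a trained NER model would replace these in practice.
PATTERNS = {
    "year": re.compile(r"\b(?:19|20)\d{2}\b"),
    "price": re.compile(r"\$\d+(?:\.\d{2})?"),
}

def extract_facts(text):
    """Return the set of matches found for each fact type."""
    return {label: set(p.findall(text)) for label, p in PATTERNS.items()}

def changed_facts(old_text, new_text):
    """Map each fact type that differs to (dropped, introduced) lists."""
    old, new = extract_facts(old_text), extract_facts(new_text)
    return {
        label: (sorted(old[label] - new[label]),
                sorted(new[label] - old[label]))
        for label in PATTERNS
        if old[label] != new[label]
    }

# Made-up before and after versions of one sentence.
before = "Ranked first in the 2017 Consumer Reports survey; plans start at $20."
after = "Ranked first in the 2018 Consumer Reports survey; plans start at $20."

print(changed_facts(before, after))
```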

Bots can perform many routine content maintenance operations to fix problems that degrade the quality and utility of content. The experience of Wikipedia shows that bots can be used for a range of remediation:

Copyediting
Adding generic boilerplate
Removing unwanted additions
Adding missing metadata

Ways to decide when content changes are needed

We’ve looked at some intelligent ways to track and change content. But how can teams use intelligence to know when change is needed, particularly in situations that don’t involve predictable events or timelines?

What situation has changed and who now needs to be involved?
What needs to change in the content as a result?

Let’s return to the content change trigger diagram shown earlier. We can identify a range of triggers that aren’t planned and are harder to anticipate. Many of these changes involve shifts in relevance. Some are gradual shifts, while others are sudden and unexpected.

Teams need to connect the changes that need to be done to the changes that are already happening. They must be able to anticipate changes in content relevance.

First, teams need to be able to see the relationships between items that are connected thematically. In my recent post on content workflows, I advocated for adopting semantics that can connect related content items. A less formal option is to adopt the approach used by Wikipedia to provide “page watchers” functionality that allows authors to be notified of changes to pages of interest (which is somewhat similar to pull requests in software). Downstream content owners want to notice when changes occur to the content they incorporate, link to, or reference.

Second, teams need content usage data to inform the prioritization and scheduling of content changes.

Teams must decide whether updating a content item is worthwhile. This decision is difficult because teams lack data to inform it. They don’t know whether the content was neglected because it was deemed no longer useful or whether the content hasn’t been effective because it was neglected. They need to cross-reference data on the internal history of the content with external usage, using content paradata to make decisions.

Content paradata supporting decisions relating to content changes

Maintenance decisions depend on two kinds of insights:

The cadence of changes to the content over time, such as whether the content has received sustained attention, erratic attention, or no attention at all
The trends in the content’s usage, such as whether usage has flatlined, declined, grown, or been consistently trivial

Historical data clarifies whether problems emerged at some point after the organization published the item or if they have been present from the beginning. It distinguishes poor maintenance due to lapsed oversight from cases where items were never reviewed or changed. It differentiates persistent poor engagement (content attracting no views or conversions at all) from faltering engagement, where views or conversions have declined.
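As an illustration of how usage history can separate these cases, the sketch below labels a series of monthly view counts by comparing the earlier half with the later half. The thresholds (fewer than 10 views counts as trivial, and a 20 percent swing counts as growth or decline) are arbitrary assumptions, and the function name is hypothetical.

```python
def usage_trend(monthly_views, trivial=10, swing=0.2):
    """Label a series of monthly view counts, ordered oldest first."""
    if len(monthly_views) < 2:
        raise ValueError("need at least two months of data")
    # Persistent poor engagement: never attracted meaningful views.
    if max(monthly_views) < trivial:
        return "consistently trivial"
    half = len(monthly_views) // 2
    earlier = sum(monthly_views[:half]) / half
    later = sum(monthly_views[half:]) / (len(monthly_views) - half)
    # Faltering engagement: views were once higher than they are now.
    if later < earlier * (1 - swing):
        return "declined"
    if later > earlier * (1 + swing):
        return "grown"
    return "flat"

print(usage_trend([500, 480, 450, 200, 150, 100]))
print(usage_trend([3, 2, 0, 1]))
print(usage_trend([100, 110, 105, 100]))
```

Pairing a label like this with the item’s change history would show, for example, whether a decline began after maintenance lapsed or the item never attracted interest in the first place.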

Knowing the origin of problems is critical to fixing them. Did the content ever spark an ember of interest? Perhaps the original idea wasn’t quite right, but it was near enough to attract some interest. Should an alternative variant be tried? If an item once enjoyed robust engagement but suffers from declining views now, should it be revived? When is it best to cut losses?

Decisions about fixing long-term issues can’t be automated. Yet better paradata can help staff to make more informed and consistent decisions.