Tracking content’s history: the key to better content maintenance

To control how content changes, teams must be able to track the content’s history. A complete profile of changes in the content’s maintenance and usage can guide how and when to intervene. 

Content maintenance isn’t about maintaining the status quo.  Maintaining content requires change management. 

Maintenance has always been a vexing dimension of content operations. Some forms of content resist change, while others change organically in a messy ad hoc manner. 

Previously, I examined the digital transformation of content workflows to improve the accuracy of content as it is created.  I also looked at opportunities to develop content paradata to determine, among other things, how content has changed.  This post continues the discussion of how to track content changes to improve content maintenance.

The constant of change

The famous 20th-century economist John Maynard Keynes purportedly replied to someone who questioned the consistency of his views: “When the facts change, I change my mind. What do you do, sir?”

Does our content adjust to reflect how we’ve changed our perspectives, or is it frozen at the time it was published? Does it adapt when the facts change? 

Change involves both a recognition that circumstances have shifted and a willingness to reconsider a prior position. From a process perspective, that involves two distinct decisions: 

1. Determining that the content is not current 

2. Deciding to change the content

A body of content items resembles the proverbial forest of trees. If a tree falls without anyone noticing, will anyone know or care to clear the tree trunk blocking a pathway? Often, people notice content is outdated long after it has become so. The lag that has elapsed can influence the perceived urgency to change the content. Outdated content that’s noticed quickly is often more likely to be changed. 

Content change management requires awareness of all the changes in circumstances that influence the relevance of content and the ability to prioritize, invest, and execute in making appropriate content changes. 

Despite the strong emphasis on delivering consistent content, content is rarely static and will likely change. The challenge is to manage change in a consistent way.  

How content changes 

  • Must be discernible
  • Should be based on defined rules
  • Will shape what insights and actions are available

Content consistency requires internal consistency, not immutability. While it’s relatively easy to change a single webpage, managing changes at scale is challenging because the triggers and scope of changes are diverse. 

Content maintenance gets short shrift in Content Lifecycle Management

It makes little sense to talk about the lifecycle of content without reference to its lifespan. Ephemeral content tends to be deleted quickly. Lifecycle management often presumes the content will be short-lived and consequently focuses most attention on the content development process.

Content Lifecycle Management (CLM) discussions often lack specifics about what happens to content after publication.  They typically suggest that content should be maintained and then retired when it’s no longer needed, advice that is too general to be readily implemented. The advice doesn’t tell us what should be done with published content under what circumstances at what point in time.

Advice about content lifecycle management is often vague. One of Google’s top links related to content lifecycles is this AI-generated LinkedIn stub, which has no human-contributed content. 

Consider the basic existential question of whether out-of-date content should be maintained or retired. The question prompts further ones: How valuable would an updated version of the content be? How much effort would be involved to make the content up-to-date, especially if it hasn’t been updated in a while? 

Often, the guiding goal of keeping content up-to-date overshadows the practicalities of doing so. Should content have distinct versions or only one version? Should the content only reflect present circumstances, or does it need to state what it has presented previously?

The status or state of content needs specificity 

CMSs generally distinguish content items by whether they are in draft or published. While that distinction is essential, it doesn’t tell editors much about what has happened to content in the past. 

Even draft content can have a backstory. A surprising amount of content never leaves the draft state. Abandoned drafts are sometimes never deleted. Pre-publication content requires maintenance too.

Conversely, some published content never goes through a draft stage.  Autogenerated content (including some AI-generated text) can be automatically published. Even though this content was never human-reviewed prior to publication, it’s possible it will need maintenance after it’s been published if the automation generates errors or the material becomes dated.

Maintenance is a general phase rather than a specific state.  Maintenance can have many expressions:

  • Revision
  • Updating
  • Correction
  • Unpublishing because the item is not currently relevant
  • Archiving to freeze an older topic no longer current
  • Deleting superfluous or dated content that doesn’t deserve revision

How does content change?

Despite the importance of content maintenance, few people say they will maintain an item or group of items.  Content maintenance is not well-defined or operationalized. Instead, staff talk about changes in generic terms, such as editing items or getting rid of them.  They talk about making revisions or updates without distinguishing these concepts. 

Content changes involve a range of distinct activities. The following table enumerates distinct states for content items, describing changes.  

Status | Description and behavior
Published | Lists publication date.  May indicate “new” if recent and not previously published.  If content has been reviewed since publication but not changed, it may indicate a “last reviewed” date.
Revised | Stylistic revisions (wording or imagery changes) are not typically announced publicly when they don’t impact the core information in the content.  Each revision, however, will generate a new version.
Updated | Updates refer to content changes that add, delete, or change factual information within the content. They can be announced and indicated with an update date that’s separate from the original publication date. Some publishers overwrite the original publication date, which can be confusing if it provides the impression that the content is new.
Corrected | Correction notices state what was previously published that was wrong and provide the correct information. Corrections commonly relate to spellings, attributions of people or dates, and factual statements.  They are used when there’s a likelihood that readers will become confused by seeing conflicting statements appearing in an article at different times.
Republished | Content sometimes indicates an item originally published on a certain date or website.
Published archive | Legacy content that needs to remain publicly accessible even though it is not maintained is published as an archive edition. Such content commonly includes a conspicuous banner announcing that it is out-of-date or that the information has not been updated as of a specific date. It also sometimes includes a redirect link if there’s a more current version available.
Scheduled | While scheduled is commonly an internal status, sometimes websites indicate that content is scheduled to appear by stating, “Coming on X date at Y time.” This is most common for announcements, product releases, or sales promotions.
Offline temporarily | When published content is offline to address a bug or problem, it may be noted with a message announcing, “We are working on fixing issues.”
Previously live | Used for recordings of live-streamed content, especially video.
Deleted | When content is deleted and no longer available, many publishers simply provide a generic redirect. But when users expect to find the content item by searching for it specifically, it may be necessary to provide a page announcing the page is no longer available and provide a specific redirect link to the most relevant available content addressing the topic.
Unpublished | Unpublished content is available internally for republishing but externally will resemble deleted content.
Read-only | While most digital content is editable, some will be read only on publication and not human editable.  Examples are templated pages of financial data or robot-written stories about weather forecasts. While options for media editing are growing, much media, such as video, is difficult to edit after its publication.

Different content states

After content is published, many changes are possible. Sometimes, corrections are needed.

Screenshot: New York Times

Updates indicate a date of review and potentially the name of the reviewer.

Detailed update history. Screenshot: Healthline

Retiring old content involves decisions. Sometimes, entire websites are archived but still accessible.

An archived website showing an archived webpage about web archiving. Source: EPA 

When canonical content changes, such as standards, it is important to retain copies of prior versions that users may have relied upon.

Screenshot: W3C

Content items can transition between various statuses. The diagram below shows the different states or statuses content items can be in.  The dashed lines indicate some of the significant ways that content can change its state.

Content states and transitions (dotted lines).

The content’s state reflects the action taken on an item. The current state can influence what future actions are allowed. For example, when published content is taken offline, it is unpublished, though it remains in the repository. An unpublished item can be republished. 
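The idea that the current state limits which actions are allowed next can be sketched as a small transition table. This is a minimal sketch in Python; the state names and the permitted transitions are illustrative assumptions drawn loosely from the statuses described above, not the behavior of any particular CMS.

```python
# Illustrative states and transitions only; a real CMS would define its own.
LIVE_TARGETS = {"revised", "updated", "corrected", "unpublished", "archived"}

ALLOWED_TRANSITIONS = {
    "draft": {"scheduled", "published", "deleted"},
    "scheduled": {"draft", "published"},
    "published": LIVE_TARGETS,
    "revised": LIVE_TARGETS,
    "updated": LIVE_TARGETS,
    "corrected": LIVE_TARGETS,
    # Unpublished content stays in the repository, so it can be republished.
    "unpublished": {"published", "deleted"},
    "archived": {"deleted"},
    "deleted": set(),
}


def can_transition(current: str, target: str) -> bool:
    """Check whether the current state permits a move to the target state."""
    return target in ALLOWED_TRANSITIONS.get(current, set())


def transition(current: str, target: str) -> str:
    """Return the new state, or raise if the move is not allowed."""
    if not can_transition(current, target):
        raise ValueError(f"Cannot move content from {current!r} to {target!r}")
    return target


state = transition("published", "unpublished")
state = transition(state, "published")  # republishing is allowed
```

Keeping the rules in one table makes them easy to review with editors, who can challenge any row they disagree with.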

Most states are effective immediately, but a few are pending, where the system expects and announces changed content is forthcoming. Some will indicate the date of changes, but other states don’t indicate that publicly. 

Maintained content is subject to change

The biggest factor shaping a content item’s status is whether or not it is maintained. Only in a few circumstances will content not require maintenance.  

If the organization has opted to publish content and keep it published, it has implicitly decided to maintain it by continuing to make it available. Of course, the publishing organization may do a poor job of maintaining that content. Maintenance should always be intentional, not an unplanned consequence of random choices to change or neglect items. But never confuse poor maintenance with no maintenance: they are separate statuses. 

A maintained item can potentially change. Its details are subject to change because the content addresses issues that might change; the item is in a maintained phase whether or not it has been changed recently, or ever. Some people mistakenly believe that items that haven’t been updated or otherwise changed recently are unmaintained and thus no longer relevant. But unless there is a cause to change the content, there’s no reason to assume the content has lost relevance. Sometimes, the recency of changes will predict current relevance, but not always.

Some published content, such as read-only or published archival content, will not be subject to change. What such content describes or relates to is no longer active.  But no-maintenance content is rare.

Content will no longer be subject to change when it has been frozen or removed.  Only then will the content be no longer maintained.  Depending on the value of such legacy content, it can either remain published for a defined time period or be deleted immediately once it is no longer maintained. Like software and other products, content needs an “end-of-life” process.

Why does content change?

When content managers discover content that needs to be changed, they create a task to fix the problem. Content maintenance often involves a backlog of tasks that are managed through routine prioritization.  

Content managers would benefit from more visibility into why content items require changes so they can estimate the effort involved with different types of changes.  They need a root-cause analysis of their content bugs.

Some changes are planned, but even unplanned changes can be anticipated to some degree. Changes also vary in their urgency and timescale. Some require immediate attention but are quick to fix.  Others are more involved but may be less urgent.  Unfortunately, in many cases, changes that are not considered urgent are deemed unimportant.  By understanding the drivers of change, content managers can estimate the need and effort involved with various content changes and plan accordingly.

Content change triggers.

Planned changes include those related to product and business announcements, scheduled projects involving content, new initiatives, and substitutions based on current relevance.

Internal errors and external surprises can prompt unplanned changes.  

Events, whether planned or unplanned, generate a gap between the existing content and what is needed.  Details may now be:

  • Missing
  • Inaccurate
  • Mismatched with user expectations
  • No longer conformant with organizational guidelines
  • Confusing
  • Obsolete

Changes in items can cascade. More than one cycle of changes may be needed. For example, updating items may introduce new errors. Errors such as misspellings, wrong capitalization and punctuation, and inadvertent deletions are as likely to arise when editing as when drafting.  Changes in certain content items may cause the details in other related items to become out of sync, so those items need to change as well. 

While content maintenance centers on changing content, it also involves preserving the intent of the content.  Maintenance can preserve two critical dimensions:

  1. The item’s traceability
  2. Its value

Poorly managed content is difficult to trace.  Many changes happen stealthily – someone fixes a problem in the content after spotting an error without logging this change anywhere.  Maybe the author hopes no one else noticed the mistake and decides that it’s no longer a concern because it’s fixed. But suppose a customer took a screenshot of the content before the fix and perhaps shared it on social media. Can the organization trace how the content appeared then? Versioning is essential for content traceability over time, because it provides a timestamped snapshot of content.  Autogenerated versions announce that changes have occurred. 
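The screenshot scenario comes down to one lookup: given a moment in time, which version was live? A minimal sketch in Python follows, assuming each saved version carries a timestamp and that the history is kept in chronological order. The `Version` class, the `version_at` function, and the sample data are all hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Version:
    saved_at: datetime
    body: str


def version_at(history: list[Version], moment: datetime) -> Optional[Version]:
    """Return the version that was current at `moment`.

    Assumes `history` is sorted by `saved_at`, oldest first.
    Returns None if nothing had been saved yet at that moment.
    """
    timestamps = [version.saved_at for version in history]
    index = bisect_right(timestamps, moment)
    return history[index - 1] if index else None


# Made-up history for illustration.
history = [
    Version(datetime(2024, 1, 10, 9, 0), "Price: $10"),
    Version(datetime(2024, 3, 2, 14, 30), "Price: $12"),
]
seen_by_customer = version_at(history, datetime(2024, 2, 1))
```

Without timestamped versions, this lookup is impossible, which is why stealth fixes undermine traceability.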

Content changes are essential for maintaining the value of published content. Consider so-called evergreen content, which has enduring value and will stay published for an extended time. Despite its name, evergreen content requires maintenance.  The lifespan of such content is determined by its traction: whether it is relevant and current. The utility of the content depends on more than whether or not the content needs to be updated. Up-to-date content may no longer be relevant to audiences or the business. Goals age, as does content. If the content no longer supports current goals because those goals have morphed, then the content may need to be unpublished and deleted. 

Content variants and ‘content drift’  

A shift in the goals for the original content can produce a different kind of change: a pivot in the content’s focus.  

How far can the content change before its identity changes so much that it is no longer what was originally published? At what point do revisions and updates result in the content talking about something different from what was initially published?

It’s important to distinguish between content versions and variants. They have different intents and need to be tracked separately.

Versions refer to changes to content items over time that do not change the focus of the content. An item is tracked according to its version. 

Variants refer to changes that introduce a pivot in the emphasis of the content by changing its focus or making it more specific. A variant does not simply change wording or images but essentially reconfigures the original content. A variant creates a new draft that is tracked separately. 

Unlike versions, which happen serially, variants can occur in multiples simultaneously.  Only one version can be current at a given time, but many variants can be current at once. 
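One way to keep the two concepts apart is to model them differently: versions as an ordered list inside an item, and variants as separate items that record what they were derived from. The Python sketch below is an assumption about how this could be modeled; `ContentItem`, `derived_from`, and the sample titles are hypothetical names, not fields from any CMS.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Optional

_next_id = count(1)  # simple id generator for the sketch


@dataclass
class ContentItem:
    title: str
    versions: list[str] = field(default_factory=list)  # serial; last entry is current
    derived_from: Optional[int] = None  # id of the item a variant was based on
    item_id: int = field(default_factory=lambda: next(_next_id))

    @property
    def current(self) -> str:
        return self.versions[-1]

    def new_version(self, body: str) -> None:
        """Same item, same focus: the new text replaces the current version."""
        self.versions.append(body)

    def new_variant(self, title: str, body: str) -> "ContentItem":
        """Different focus: a separate item that remembers its source."""
        return ContentItem(title=title, versions=[body], derived_from=self.item_id)


guide = ContentItem("Setup guide", ["v1 text"])
guide.new_version("v2 text")
mac = guide.new_variant("Setup guide for macOS", "macOS text")
windows = guide.new_variant("Setup guide for Windows", "Windows text")
```

Here only one version of the guide is current, while both variants are current at once, and each can still be traced back to its source.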

Variants arise when organizations need to address a different need or change the initial message. Writers often refer to this process as “repurposing” content. With the adoption of GenAI, repurposing existing content has become easy.

However, the unmanaged publication of repurposed content can generate a range of challenges. Content managers can have trouble keeping “derivative content” current when it is unclear what that content is based on.  

When pivots happen gradually, content changes are hard to notice. Various writers and editors continually change the item, subtly altering the content’s purpose and goals. The changes behave like revisions, where only one version is current. But they also resemble variations, where the emphasis of the content shifts to the point that it has assumed a separate identity from its initial one. Such single-item fluidity is known as “content drift.”

A recent study by Harvard Law School (“The Paper of Record Meets an Ephemeral Web”) examined the “problem of content drift, or the often-unannounced changes––retractions, additions, replacement––to the content at a particular URL.”  The URL is a persistent identifier of the content item, but the details associated with that URL have substantively changed without visitors knowing the changes occurred. 

Examining sources cited by the New York Times, the Harvard team “noted two distinct types of drift, each with different implications. First, a number of sites had drifted because the domain containing the linked material had changed hands and been repurposed….More common and less immediately obvious, however, were web pages that had been significantly updated since they were originally included in the article. Such updates are a useful practice for those visiting most web sites – easy access to of-the-moment information is one of the Web’s key offerings. Left entirely static, many web pages would become useless in short order. However, in the context of a news article’s link to a page, updates often erase important evidence and context.”  

Watch out for the ever-morphing page. Various authors can change content items over months or years. As old references are deleted and new buzzwords are introduced, the changes produce the illusion that the content is current. But the original message of the content, motivated by a specific purpose at a particular time, is compromised in the process. 

The phenomenon of content drift highlights the importance of precisely tracking content changes. Many organizations maintain zombie pages that continually change because the URL is considered more valuable than the content. A better practice is to create new items when the focus shifts. 

Practices that content management can learn from data management

Even though content involves many distinct nuances, its maintenance shares challenges facing other digital resources such as data and software code. Content management can learn from data management practices.

Diff checking versions and variants

Diff checking is a common utility for comparing file contents. Although it is most widely used to compare lines of text, it can also compare blocks of text and even images. 

While diff checking is most associated with tracking changes in software code, it is also well established for checking content changes. Some common diff checking use cases include detecting:

  • Plagiarism 
  • Alteration of legal text
  • Omissions
  • Duplication of text in different files

The primary use of diff checking in content management is to compare two versions of the same content item. The process is easiest to see when presenting two versions side-by-side, clearly showing additions and deletions between the original and subsequent versions.
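A word-level comparison of two versions can be sketched with `difflib` from the Python standard library. This is only an illustration of the idea, not how any CMS implements its comparison view; the `word_changes` helper and the sample sentences are made up.

```python
import difflib


def word_changes(old: str, new: str) -> tuple[list[str], list[str]]:
    """Return (removed_words, added_words) between two versions of a text."""
    old_words, new_words = old.split(), new.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    removed, added = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            removed.extend(old_words[i1:i2])
        if tag in ("insert", "replace"):
            added.extend(new_words[j1:j2])
    return removed, added


removed, added = word_changes(
    "Returns are accepted within 30 days",
    "Returns are accepted within 14 days of delivery",
)
# removed == ["30"]; added == ["14", "of", "delivery"]
```

A side-by-side view is essentially this output rendered visually, with removals marked in the older version and additions in the newer one.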

This is an example of using diff checking to compare different versions of content, which highlights both word changes and the deletion of the second section (“aufgehoben”). Screenshot: Diffcheck.net

Organizations can use diff checking to compare different content items.  Cross-item comparisons can help teams identify what parts of content variants should be consistent and which should be unique. 

Screenshot of the Copyscape app comparing two variants of weather reports. Different wording is grey, while common wording is in blue.

Cross-item diff checking can identify:

  • Duplication
  • Points of differentiation
  • The presence of non-standard language in one of the items
  • The provenance of content, for forensic investigation

Unfortunately, cross-item comparison is not a standard functionality in CMSs. Yet it is an essential capability for managing the maintenance of content variants. It can determine the degree of similarity between items. 
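The degree of similarity between items can be approximated with a simple score. The sketch below uses the ratio from Python's `difflib.SequenceMatcher` on lowercased words; the 0.6 threshold, the `likely_variants` helper, and the sample items are assumptions for illustration. A production tool would likely need something more robust to rephrasing.

```python
import difflib


def similarity(a: str, b: str) -> float:
    """Word-level similarity from 0.0 (nothing shared) to 1.0 (identical)."""
    matcher = difflib.SequenceMatcher(
        a=a.lower().split(), b=b.lower().split(), autojunk=False
    )
    return matcher.ratio()


def likely_variants(
    items: dict[str, str], threshold: float = 0.6
) -> list[tuple[str, str, float]]:
    """List pairs of items whose wording overlaps enough to suggest a variant."""
    names = sorted(items)
    pairs = []
    for position, first in enumerate(names):
        for second in names[position + 1:]:
            score = similarity(items[first], items[second])
            if score >= threshold:
                pairs.append((first, second, round(score, 2)))
    return pairs


# Made-up items for illustration.
items = {
    "forecast-north": "Rain is expected in the north on Monday morning",
    "forecast-south": "Rain is expected in the south on Monday evening",
    "returns": "Items can be returned within thirty days",
}
pairs = likely_variants(items)
```

In this example only the two forecasts are flagged as a likely pair, which is the signal a content manager would want before deciding what should stay consistent between them.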

Comparison tools are no longer limited to checking for identical wording.  Newer capabilities incorporating AI can identify image differences and spot rephrasing in text.  They can compare not only known variants but also locate hidden variants that arose from the copying and rewriting of existing items. 

Understanding the pace of changes

Content managers sometimes describe content as either static or dynamic. These concepts help to define the user experience and delivery of the content. Can the content be cached where it is instantly available, or will it need to fetch updates from a server, which takes longer?   

The static/dynamic dichotomy alludes to a broader issue.  Updates impact not only the technical delivery of the content but also the behavior of content developers and users.

Data managers classify data according to its “temperature”—how actively it is used. They do this to decide how to store the data. Frequently changing data needs to be accessed more quickly, which is more expensive. 

Content managers can borrow and adapt the concept of temperature to classify how frequently content is updated or otherwise changed. Update frequency doesn’t necessarily influence how content is stored, but it does influence operational processes.

Update frequency will shape how content is accessed internally and externally. The demand for content updates is related to the frequency of updating. Publishers push content to users when updating it; the act of updating generates audience demand. Users pull content that has changed. They seek content that offers information or perspectives that are more useful than were available before the change.

We can understand the pace of changes to content by classifying content changes into temperature tiers.

Temperature | Content relevance
Hot | The most “dynamic” content in terms of changes. Includes transactional data (product prices and availability), customer submission of reviews and comments, streaming, and liveblogging. Also covers “fresh” (newly published) content and possibly top content requests – as these items are least stable because they’ve often iterated.
Warm | Content that changes irregularly, such as active recent (rather than just-published) content. Sometimes only a subset of the item is subject to change.
Cold | Content that is infrequently accessed and updated that is nearly static or archival. It may be kept for legal and compliance reasons.

Content temperature tiers

More ephemeral “hot” content will be “post and forget” and won’t require maintenance until it is purged. Other hot content will require vigilant review in the form of updates, corrections, or moderation. What all hot content shares is that it is top of mind and likely easily accessed.

“Warm” content is less top of mind and is sometimes neglected as a result. Given the prioritization of publishing over maintenance, warm content is changed when problems arise, often unexpectedly.  The timing and nature of changes are more difficult to predict. Maintenance happens on an ad hoc basis. 

“Cold” content is often forgotten. Because it isn’t active, it is often old and may not have an identifiable owner. However, managing such content still requires decisions, although organizations generally have poor processes for managing such content.
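Temperature tiers could be assigned automatically from an item's change dates. The Python sketch below counts changes within a trailing window; the 90-day window and the threshold of five changes are arbitrary assumptions that a team would need to tune, and the function name is hypothetical.

```python
from datetime import date, timedelta


def temperature(
    change_dates: list[date],
    today: date,
    window_days: int = 90,
    hot_min_changes: int = 5,
) -> str:
    """Classify an item as hot, warm, or cold from its recent change frequency."""
    window_start = today - timedelta(days=window_days)
    recent = [day for day in change_dates if window_start <= day <= today]
    if len(recent) >= hot_min_changes:
        return "hot"   # changed often within the window
    if recent:
        return "warm"  # changed, but only occasionally
    return "cold"      # no changes within the window


today = date(2024, 6, 30)
liveblog = [date(2024, 6, day) for day in (3, 9, 15, 21, 27)]
pricing_page = [date(2024, 5, 1)]
old_policy = [date(2022, 1, 1)]
```

Here `liveblog` would be classified as hot, `pricing_page` as warm, and `old_policy` as cold. A report like this can surface cold items that no longer have an identifiable owner.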

Versioning strategies for ‘Slowly Changing Dimensions’

Warm content corresponds to what data managers call slowly changing dimensions (SCD), another concept that can help content managers think about the versioning process. 

Wikipedia notes: “a slowly changing dimension (SCD) in data management and data warehousing is a dimension which contains relatively static data which can change slowly but unpredictably, rather than according to a regular schedule.” 

While software engineers developed SCD to manage the rows and columns of tabular data, content managers can adapt the concept to address their needs.  We can translate the tiering to describe how to manage content changes. Rows are akin to content items, while columns broadly correspond to content elements within an item.

SCD type | Equivalent content tracking process
Type 0 | Static single version. Always retain the original content as is. Never overwrite the original version. When information differs from existing content, create a new content item.
Type 1 | Changeable single version. Used for items when there’s only one source of truth that is mutable, for example, the current weather forecast. What’s been stated in the past is no longer relevant, either internally or externally.
Type 2 | Create distinct versions.  Each change, whether a revision, update, or correction, generates a new version that has a unique version number. Changes overwrite prior content, but status can be rolled back to an earlier version.
Type 3 | Version changes within an item. Rather than generating versions of the item overall, the versioning occurs at the component level. The content item will contain a patchwork of new and old, so that authors can see what’s most recently changed.
Type 4 | Create a change log that’s independent of the content item. It lists status changes, the scope of impact, and when the change occurred.

Slowly changing dimensions of content: management tiers

Types 0 and 1 don’t involve change tracking, but the higher tiers illustrate alternative approaches to tracking and managing content versions.
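The difference between the tracking approaches is easier to see side by side. The Python sketch below contrasts Type 1 (overwrite), Type 2 (numbered versions with rollback), and Type 4 (a separate change log). It is a simplified reading of the content equivalents described above, not a data warehousing implementation, and every name in it is hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class TrackedItem:
    body: str
    prior_versions: list[str] = field(default_factory=list)  # filled only by Type 2


def overwrite(item: TrackedItem, new_body: str) -> None:
    """Type 1: one mutable version; the old text is simply lost."""
    item.body = new_body


def add_version(item: TrackedItem, new_body: str) -> int:
    """Type 2: keep the old text as a prior version, then replace it."""
    item.prior_versions.append(item.body)
    item.body = new_body
    return len(item.prior_versions) + 1  # number of the version now current


def roll_back(item: TrackedItem) -> None:
    """Type 2: restore the most recent prior version."""
    item.body = item.prior_versions.pop()


def log_change(log: list[dict], item_id: str, status: str, scope: str, when: date) -> None:
    """Type 4: record the change in a log kept apart from the item."""
    log.append({"item": item_id, "status": status, "scope": scope, "when": when})


forecast = TrackedItem("Sunny")
overwrite(forecast, "Rain")  # no trace of "Sunny" remains

policy = TrackedItem("Returns within 30 days")
current_number = add_version(policy, "Returns within 14 days")
roll_back(policy)  # back to the 30-day text

change_log: list[dict] = []
log_change(change_log, "policy", "updated", "returns window", date(2024, 5, 1))
```

The approaches are not mutually exclusive: a team could keep numbered versions for each item and also maintain an independent change log across items.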

CMSs use varied implementations of version comparison.  

Kontent.ai illustrates an example of Type 2 version comparison. Their CMS allows an editor to compare any two versions within a single view. It distinguishes added text, removed text, and text with format changes. 

An example of comparing two versions (indicated with check marks) of an item in Kontent.ai. The example also shows an older abandoned draft

Optimizely has a feature supporting a Type 3 version comparison.  Their CMS has a limited ability to compare properties between versions.  

Optimizely interface

The Wikipedia platform provides content management functionality. Wikipedia’s page history is an example of a table of changes associated with a Type 4 approach. Some of these are automatic edit summaries. 

Wikipedia page history

An even more complete summary would go beyond a change log’s basic timeline to become a complete change history that lists:

  • When the content was changed, and how the timing relates to other events (publication event, corporate event, product development event, marketing campaign event)
  • Why it was changed (the reason)
  • What was changed (the delta)

Tracking content’s current and prior states

CMSs are largely indifferent about changes to published content. By default, they only track whether a content item is drafted, published, or archived. From the system’s perspective, this is all it needs to know: where to put the content.

Like many CMSs, Drupal tracks content according to whether it is in draft, published, or archived.

The CMS won’t remember what’s specifically happened. It doesn’t store the nature of changes to published items or reference them in subsequent actions. Its focus is on the content’s current high-level status.  The CMS only knows that the content is published, not that the most recent version was an update.

The cycle of draft-published-archive is known as state transition management. CMSs manage states in a rudimentary way that doesn’t capture important distinctions.

From a human perspective, content transitions are important to making decisions. The current state suggests potential transitions, but previous states can reveal more details about the history of the item and can inform what might be useful to do next. 

To help teams make better decisions, the CMS should be more “stateful”: recording the distinctions among different versions instead of only recording that a new version was published on a certain date. Such an approach would allow editors to revert to the last updated version or find items that haven’t been updated since a certain date, for example. 
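Both of those editor tasks become simple queries once each version records what kind of change produced it. A minimal Python sketch follows, assuming a hypothetical `ChangeEvent` record and made-up item histories; no existing CMS API is implied.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    when: date
    kind: str  # "published", "revised", "updated", or "corrected"
    version: int


def last_of_kind(history: list[ChangeEvent], kind: str) -> Optional[ChangeEvent]:
    """Find the most recent event of a given kind, e.g. the last update."""
    matches = [event for event in history if event.kind == kind]
    return max(matches, key=lambda event: event.when) if matches else None


def not_updated_since(items: dict[str, list[ChangeEvent]], cutoff: date) -> list[str]:
    """List items with no update on or after the cutoff.

    Items that were never updated are treated as stale in this sketch.
    """
    stale = []
    for name, history in items.items():
        last_update = last_of_kind(history, "updated")
        if last_update is None or last_update.when < cutoff:
            stale.append(name)
    return sorted(stale)


# Made-up histories for illustration.
items = {
    "pricing": [
        ChangeEvent(date(2023, 1, 5), "published", 1),
        ChangeEvent(date(2024, 2, 1), "updated", 2),
        ChangeEvent(date(2024, 3, 10), "revised", 3),
    ],
    "about": [
        ChangeEvent(date(2022, 6, 1), "published", 1),
        ChangeEvent(date(2024, 4, 1), "revised", 2),
    ],
}
stale = not_updated_since(items, date(2024, 1, 1))
revert_target = last_of_kind(items["pricing"], "updated")
```

Note that the "about" page was revised recently yet still counts as not updated, a distinction a system that only records "published on a date" cannot make.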

A substantive change, such as an update or correction, and a non-substantive change, such as a minor wording revision, can trigger different workflows. For example, minor copyedits shouldn’t trigger a review workflow if the content’s substance doesn’t change and has already been reviewed.  
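The routing rule can be stated in a few lines. This Python sketch assumes change kinds are labeled as in this post (revised, updated, corrected); the function name and the returned step names are hypothetical.

```python
SUBSTANTIVE_KINDS = {"updated", "corrected"}  # changes to the content's substance


def next_step(change_kind: str, previously_reviewed: bool) -> str:
    """Decide which workflow step a change to published content should trigger."""
    if change_kind in SUBSTANTIVE_KINDS:
        return "review"
    if change_kind == "revised" and previously_reviewed:
        return "publish"  # minor wording change to already-reviewed content
    return "review"  # default to review when in doubt
```

The rule depends on knowing whether the item was reviewed before, which is exactly the history that a simple draft/published status discards.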

The CMS should know about the prior life of content items.  Yet CMSs can treat changes to published content as new drafts that have no workflow history, potentially triggering redundant reviews.  

Because simple states don’t capture past actions, the provenance of content items can be murky. For example, how does a writer or editor know that one item is derived from another?  Many CMSs prompt writers to create a new draft from an old one, but the writer isn’t always clear when doing so if the new draft is replacing the old one (generating a new version) or creating a new item (generating a new variant). Whenever a new item is created based on an old one, the maintenance burden grows.  

Like many CMSs, Drupal allows a draft to be created from published content.

Content transitions are neither strictly linear nor entirely cyclical. Content doesn’t necessarily revert to a previous state. An unpublished item is not the same as a draft. What happened to published items previously can be of interest to editorial teams. 

CMSs would benefit from having a nested state mechanism that distinguishes various states within the offline state (draft, unpublished, deleted) from those in the online state (published original [editable], revised, updated, corrected).  In addition, the mechanism should recognize that multiple states can apply at once.  Old content can be unpublished and deleted, which may happen simultaneously or at different times. Existing content similarly can be revised for wording and updated for facts at the same or different times. 
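A nested arrangement can be sketched as a mapping from each specific state to its broader parent, with an item allowed to hold several specific states as long as they share a parent. This is an illustrative Python sketch using the state names from the paragraph above; the helper names are hypothetical.

```python
# Each specific state belongs to one broader state.
PARENT_STATE = {
    "draft": "offline",
    "unpublished": "offline",
    "deleted": "offline",
    "published": "online",
    "revised": "online",
    "updated": "online",
    "corrected": "online",
}


def parent_of(substate: str) -> str:
    """Return the broader state ("offline" or "online") of a specific state."""
    return PARENT_STATE[substate]


def is_consistent(substates: set[str]) -> bool:
    """Several specific states may apply at once if they share one parent."""
    return len({PARENT_STATE[substate] for substate in substates}) == 1
```

Under this sketch, an item can be both revised and updated, or both unpublished and deleted, but it cannot be a draft and updated at the same time.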

State transitions must be linked to version dates. The effective dates of changes are essential to understanding both the history of content items and their future disposition. For example, if a previously editable item is converted to read-only (a published archival version), it is helpful to know when that occurred.  It is unlikely that an item, once archived, would be edited again.

Even though most CMSs only manage simple states and transitions, IT standards support more complex behaviors. 

Statecharts, which the W3C has standardized as State Chart XML (SCXML) to describe state changes, can address behaviors such as:

  • Parallel states, where different transitions are happening concurrently
  • Compound or nested states, where more specific states exist within broader ones
  • History states capturing a “stored state configuration” to remember prior actions and statuses

These standards allow for more granular and enduring tracking of content changes. Instead of each edit regressing to a draft, the content can maintain a history of the actions that have happened to it.  A history state remembers the point at which it was last left so that processes don’t need to start over from the beginning.  
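To make the idea of a history state concrete, here is a hand-rolled sketch in Python. It is not an SCXML implementation; the review steps and method names are assumptions chosen for illustration. It shows a compound "in review" state that resumes from its remembered step rather than restarting.

```python
class ReviewWorkflow:
    """Sketch of a compound state with a history state (hypothetical steps)."""

    STEPS = ["fact_check", "legal_check", "copyedit"]  # substates of "in review"

    def __init__(self):
        self.state = "draft"
        self._history = None      # the review step that was last left
        self.log = ["draft"]      # every state the item has entered

    def _enter(self, state):
        self.state = state
        self.log.append(state)

    def start_review(self):
        # Resume where review was last left, or begin at the first step.
        self._enter(self._history or self.STEPS[0])

    def advance(self):
        if self.state not in self.STEPS:
            raise RuntimeError("item is not in review")
        index = self.STEPS.index(self.state)
        if index + 1 < len(self.STEPS):
            self._enter(self.STEPS[index + 1])
        else:
            self._history = None  # review finished, nothing left to resume
            self._enter("published")

    def pause(self):
        if self.state not in self.STEPS:
            raise RuntimeError("item is not in review")
        self._history = self.state  # remember the step before going offline
        self._enter("draft")


workflow = ReviewWorkflow()
workflow.start_review()   # enters fact_check
workflow.advance()        # enters legal_check
workflow.pause()          # back to draft, remembering legal_check
workflow.start_review()   # resumes at legal_check, not fact_check
```

The log keeps the full sequence of states, so returning to draft does not erase the fact that the fact check was already completed.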

A ‘Data Historian’ for content

Writers, editors, and content managers have trouble assessing the history of changes to content items, especially for items they didn’t create.  CMSs don’t provide an overview of historical changes to items.

Wikipedia, which is collectively written and edited, provides an at-a-glance dashboard showing the history of content items. It shows an overview of edits to a page, even distinguishing minor edits that don’t require review, such as changes in spelling, grammar, or formatting. 

Wikipedia page history dashboard (partial view)

Like Wikipedia, software code is collectively developed and changed. Software engineers can see an “activity overview” that summarizes the frequency and type of changes to software code. 

GitHub history dashboard

It’s a mistake to believe that the history of changes isn’t important just because systems and people routinely and quickly change digital resources. 

The value of recording status transitions goes beyond indicating whether the content is current.  The history of status transitions can help content managers understand how issues arose so they can be prevented or addressed earlier. 

Data managers don’t dismiss the value of history – they learn from it. They talk about the concept of historicizing data or “tracking data changes over time.”  Data history is the basis of predictive analytics. 

Some software hosts a “data historian.” Data historians are most common in industrial operations, which, like content operations, involve many processes and actions happening across teams and systems at various times. 

One vendor describes the role of the historian as follows: “A data historian is a software program that records the data of processes running in a computer system….The data that goes into a data historian is time-stamped and cataloged in an organized, machine-readable format. The data is analyzed to compare such things as day vs. night shifts, different work crews, production runs, material lots, and seasons. Organizations use data from data historians to answer many performance and efficiency-related questions. Organizations can gain additional insights through visual presentations of the data analysis called data visualization.”  

If automated industrial processes can benefit from having a data historian, then human-driven content processes can as well. History is derived from the same word as story (the Latin historia); history is storytelling. Data historians can support data storytelling. They can communicate the actions that teams have taken. 

Toward intelligent change management

Numerous variables can trigger content changes, and a single content item can undergo multiple changes during its lifespan.  Editors are expected to use their judgment to make changes.  But without well-defined rules, each editor will make different choices. 

How far can rules be developed to govern changes? 

A widely cited example of archiving rules is the US Department of Health and Human Services archive schedule, which keeps content published for “two full years” unless subject to other rules. 

HHS archiving timeline (partial table). Some rules are time- or event-driven, but others require an individual’s judgment.  Source: HHS

Even mature frameworks such as HHS still rely on guesswork when the archiving criteria are “outdated and/or no longer relevant.” 

It’s useful to distinguish fixed rules from variable ones. Fixed rules have the appeal of being simple and unambiguous. A fixed rule may state: After x months or years following publication, an item will be auto-archived or automatically deleted. But that’s a blunt rule which may not be prudent in all cases. So, the fixed rule becomes a guideline that requires human review on a case-by-case basis, which doesn’t scale, can be inconsistently followed, and limits the capacity to maintain content.

Content teams need variable rules that can cover more nuances yet provide consistency in decisions. Large-scale content operations entail diversity and require rules that can address complex scenarios.
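The difference between a fixed rule and a variable rule can be sketched in a few lines. The thresholds below (a two-year age limit, a one-year review window, 50 views in 90 days) and the function names are assumptions for illustration, not values taken from HHS or any other framework.

```python
from datetime import date


def fixed_rule(published: date, today: date, max_age_days: int = 730) -> str:
    """Blunt rule: archive anything older than a set age."""
    return "archive" if (today - published).days > max_age_days else "keep"


def variable_rule(
    published: date,
    last_reviewed: date,
    views_last_90_days: int,
    today: date,
    max_age_days: int = 730,
    review_window_days: int = 365,
    min_views: int = 50,
) -> str:
    """Same age limit, but tempered by review history and recent usage."""
    if (today - published).days <= max_age_days:
        return "keep"
    if (today - last_reviewed).days <= review_window_days:
        return "keep"        # old, but someone has checked it recently
    if views_last_90_days >= min_views:
        return "review"      # old and unreviewed, yet still used
    return "archive"         # old, unreviewed, and unused


today = date(2024, 6, 1)
published = date(2021, 1, 1)
```

The variable rule returns the same answer for the same inputs every time, which is what separates it from case-by-case judgment, while still treating a well-used page differently from an abandoned one.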

What can teams learn if content changes become easier to track, and how can they use that information to automate tasks? 

Data management practices again suggest possibilities.  The concept of change data capture (CDC) is “used to determine and track the data that has changed (the “deltas”) so that action can be taken using the changed data.” If a certain change has occurred, what actions should happen?  A mechanism like CDC can help automate the process of reviewing and changing content.

Basic version comparison tools are limited in their ability to distinguish stylistic changes from substantive ones. A misplaced comment or misspelled word is treated as equivalent to a retraction or significant update. Many diff checking utilities simply crunch files without awareness of what they contain.
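As a rough illustration of a comparison that is slightly more aware of content, the sketch below uses Python's standard difflib module to sort a change into one of three buckets. The bucket names and the heuristic (a changed token containing a digit counts as substantive) are assumptions, and a crude stand-in for real semantic analysis.

```python
import difflib
import re


def _tokens(text: str) -> list:
    # Lowercased words and figures; punctuation and spacing are discarded.
    return re.findall(r"[\w$%]+(?:[.,]\d+)*", text.lower())


def classify_change(old: str, new: str) -> str:
    """Return 'formatting', 'wording', or 'substantive' for a pair of versions."""
    old_tokens, new_tokens = _tokens(old), _tokens(new)
    if old_tokens == new_tokens:
        return "formatting"  # only case, spacing, or punctuation differ

    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    changed = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changed.extend(old_tokens[i1:i2])
            changed.extend(new_tokens[j1:j2])

    # Heuristic: a changed figure (price, date, quantity) is treated as substantive.
    if any(char.isdigit() for token in changed for char in token):
        return "substantive"
    return "wording"
```

A change-capture process could then route each bucket differently, for example by logging formatting changes silently while flagging substantive ones for review.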

Ways to automate changes at scale

Terminology and phrasing can be modified at scale using customized style-checking tools, especially ones trained on internal documents that incorporate custom word lists, phrase lists, and rules. 

Organizations can use various strategies to improve oversight of substantive statements:

  • Templated wording, enforced through style guidelines and text models, directs the focus of changes on substance rather than style.
  • Structured writing can separate factual material from generic descriptions that are used for many facts.
  • Named entity recognition (NER) tools can identify product names, locations, people, prices, quantities, and dates, to detect if these have been altered between versions or items.

Substantive changes can be tracked by looking at named entities. Suppose the paragraph below was updated to include data from the 2018 Consumer Reports. A NER scan could determine the date used in the ranking cited in the text without requiring someone to read the text.  

displaCy Named Entity Visualizer on Wikipedia text

NER can also be used to track brand and product names and determine if content incorporates current usage.
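The comparison step can be sketched without committing to a particular NER tool. In the example below, extract_entities is a stand-in that uses regular expressions to find only years and dollar amounts; in practice, a real NER library would supply the entities. The function names and sample sentences are invented for illustration.

```python
import re


def extract_entities(text: str) -> dict:
    """Stand-in for a NER tool: finds four-digit years and dollar amounts only."""
    years = set(re.findall(r"\b(?:19|20)\d{2}\b", text))
    money = set(re.findall(r"\$\d+(?:,\d{3})*(?:\.\d+)?", text))
    return {"DATE": years, "MONEY": money}


def entity_changes(old: str, new: str) -> dict:
    """Report which entities were removed or added between two versions."""
    before, after = extract_entities(old), extract_entities(new)
    changes = {}
    for label in before:
        removed = before[label] - after[label]
        added = after[label] - before[label]
        if removed or added:
            changes[label] = {"removed": removed, "added": added}
    return changes


# Invented sample text: only the year of the cited ranking differs.
old_version = "The model topped the 2017 ranking and sells for $24,000."
new_version = "The model topped the 2018 ranking and sells for $24,000."
delta = entity_changes(old_version, new_version)
```

Here the comparison reports that the year changed and the price did not, without anyone needing to read either version.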

Bots can perform many routine content maintenance operations to fix problems that degrade the quality and utility of content. The experience of Wikipedia shows that bots can be used for a range of remediation:

  • Copyediting
  • Adding generic boilerplate
  • Removing unwanted additions
  • Adding missing metadata

Ways to decide when content changes are needed

We’ve looked at some intelligent ways to track and change content. But how can teams use intelligence to know when change is needed, particularly in situations that don’t involve predictable events or timelines?

  • What situation has changed and who now needs to be involved?
  • What needs to change in the content as a result?

Let’s return to the content change trigger diagram shown earlier. We can identify a range of triggers that aren’t planned and are harder to anticipate. Many of these changes involve shifts in relevance. Some are gradual shifts, while others are sudden and unexpected. 

Teams need to connect the changes that need to be done to the changes that are already happening. They must be able to anticipate changes in content relevance. 

First, teams need to be able to see the relationships between items that are connected thematically.  In my recent post on content workflows, I advocated for adopting semantics that can connect related content items.  A less formal option is to follow Wikipedia’s approach and provide “page watchers” functionality that notifies authors of changes to pages of interest (somewhat similar to pull requests in software).  Downstream content owners want to notice when changes occur to the content they incorporate, link to, or reference. 
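A watch list of this kind needs very little machinery. The sketch below is a hypothetical in-memory version, not Wikipedia's implementation: people watch items directly, items can be registered as referencing other items, and a change to a source item reaches both groups. All names are invented.

```python
from collections import defaultdict


class WatchList:
    def __init__(self):
        self._watchers = defaultdict(set)       # item id -> people watching it
        self._referenced_by = defaultdict(set)  # source id -> items that cite it

    def watch(self, person: str, item_id: str) -> None:
        self._watchers[item_id].add(person)

    def add_reference(self, item_id: str, source_id: str) -> None:
        """Record that `item_id` incorporates, links to, or cites `source_id`."""
        self._referenced_by[source_id].add(item_id)

    def notify(self, changed_id: str) -> list:
        """People to alert when `changed_id` changes, including downstream watchers."""
        affected = {changed_id} | self._referenced_by.get(changed_id, set())
        people = set()
        for item_id in affected:
            people |= self._watchers.get(item_id, set())
        return sorted(people)


watchlist = WatchList()
watchlist.watch("ana", "returns-policy")
watchlist.watch("ben", "faq")
watchlist.add_reference("faq", "returns-policy")  # the FAQ quotes the policy
```

With this setup, a change to the returns policy alerts the FAQ's watcher as well as the policy's own, which is the downstream awareness described above.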

Second, teams need content usage data to inform the prioritization and scheduling of content changes. 

Teams must decide whether updating a content item is worthwhile. This decision is difficult because teams lack data to inform it. They don’t know whether the content was neglected because it was deemed no longer useful or whether the content hasn’t been effective because it was neglected. They need to cross-reference data on the internal history of the content with external usage, using content paradata to make decisions. 

Content paradata supporting decisions relating to content changes

Maintenance decisions depend on two kinds of insights:

  1. The cadence of changes to the content over time, such as whether the content has received sustained attention, erratic attention, or no attention at all
  2. The trends in the content’s usage, such as whether usage has flatlined, declined, grown, or been consistently trivial

Historical data clarifies whether problems emerged at some point after the organization published the item or if they have been present from the beginning. It distinguishes poor maintenance due to lapsed oversight from cases where items were never reviewed or changed. It differentiates persistent poor engagement (content attracting no views or conversions at all) from faltering engagement, where views or conversions have declined.
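These distinctions can be expressed as simple classifications over paradata. The sketch below assumes two inputs that a team would have to collect: a list of edit dates and a series of monthly view counts. The labels, thresholds (a one-year window, a 25 percent change, 10 views as trivial), and function names are assumptions for illustration.

```python
from datetime import date


def change_cadence(edit_dates: list, today: date, window_days: int = 365) -> str:
    """Separate items never changed from those whose oversight has lapsed."""
    if not edit_dates:
        return "never changed"
    days_since_last = (today - max(edit_dates)).days
    return "sustained" if days_since_last <= window_days else "lapsed"


def usage_trend(monthly_views: list, trivial: int = 10) -> str:
    """Compare average views in the earlier and later halves of the series."""
    if len(monthly_views) < 2:
        return "insufficient data"
    if max(monthly_views) < trivial:
        return "trivial"      # persistent poor engagement
    half = len(monthly_views) // 2
    early = sum(monthly_views[:half]) / half
    late = sum(monthly_views[half:]) / (len(monthly_views) - half)
    if late < early * 0.75:
        return "declining"    # faltering engagement
    if late > early * 1.25:
        return "growing"
    return "flat"


today = date(2024, 6, 1)
```

Cross-referencing the two labels gives staff a starting point: an item that is "lapsed" and "declining" raises a different question than one that is "never changed" and "trivial".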

Knowing the origin of problems is critical to fixing them. Did the content ever spark an ember of interest? Perhaps the original idea wasn’t quite right, but it was near enough to attract some interest. Should an alternative variant be tried? If an item once enjoyed robust engagement but suffers from declining views now, should it be revived?  When is it best to cut losses? 

Decisions about fixing long-term issues can’t be automated. Yet better paradata can help staff to make more informed and consistent decisions. 

– Michael Andrews