Categories
Content Operations

Tracking content’s history: the key to better content maintenance

To control how content changes, teams must be able to track the content’s history. A complete profile of changes in the content’s maintenance and usage can guide how and when to intervene. 

Content maintenance isn’t about maintaining the status quo.  Maintaining content requires change management. 

Maintenance has always been a vexing dimension of content operations. Some forms of content resist change, while others change organically in a messy ad hoc manner. 

Previously, I examined the digital transformation of content workflows to improve the accuracy of content as it is created.  I also looked at opportunities to develop content paradata to determine, among other things, how content has changed.  This post continues the discussion of how to track content changes to improve content maintenance.

The constant of change

The famous 20th-century economist John Maynard Keynes purportedly replied to someone who questioned the consistency of his views: “When the facts change, I change my mind. What do you do, sir?”

Does our content adjust to reflect how we’ve changed our perspectives, or is it frozen at the time it was published? Does it adapt when the facts change? 

Change involves both a recognition that circumstances have shifted and a willingness to reconsider a prior position. From a process perspective, that involves two distinct decisions: 

1. Determining that the content is not current 

2. Deciding to change the content

A body of content items resembles the proverbial forest of trees. If a tree falls without anyone noticing, will anyone know or care to clear the trunk blocking a pathway? Often, people notice content is outdated long after it has become so. The length of that lag can influence the perceived urgency to change the content. Outdated content that’s noticed quickly is more likely to be changed. 

Content change management requires awareness of all the changes in circumstances that influence the relevance of content and the ability to prioritize, invest, and execute in making appropriate content changes. 

Despite the strong emphasis on delivering consistent content, content is rarely static and will likely change. The challenge is to manage change in a consistent way.  

How content changes 

  • Must be discernible
  • Should be based on defined rules
  • Will shape what insights and actions are available

Content consistency requires internal consistency, not immutability. While it’s relatively easy to change a single webpage, managing changes at scale is challenging because the triggers and scope of changes are diverse. 

Content maintenance gets short shrift in Content Lifecycle Management

It makes little sense to talk about the lifecycle of content without reference to its lifespan. Ephemeral content tends to be deleted quickly. Lifecycle management often presumes the content will be short-lived and consequently focuses most attention on the content development process.

Content Lifecycle Management (CLM) discussions often lack specifics about what happens to content after publication.  They typically suggest that content should be maintained and then retired when it’s no longer needed, advice that is too general to be readily implemented. The advice doesn’t tell us what should be done with published content under what circumstances at what point in time.

Advice about content lifecycle management is often vague. One of Google’s top links related to content lifecycles is this AI-generated LinkedIn stub, which has no human-contributed content. 

Consider the basic existential question of whether out-of-date content should be maintained or retired. The question prompts further ones: How valuable would an updated version of the content be? How much effort would be involved to make the content up-to-date, especially if it hasn’t been updated in a while? 

Often, the guiding goal of keeping content up-to-date overshadows the practicalities of doing so. Should content have distinct versions or only one version? Should the content only reflect present circumstances, or does it need to state what it has presented previously?

The status or state of content needs specificity 

CMSs generally distinguish content items by whether they are in draft or published. While that distinction is essential, it doesn’t tell editors much about what has happened to content in the past. 

Even draft content can have a backstory. A surprising amount of content never leaves the draft state. Abandoned drafts are sometimes never deleted. Pre-publication content requires maintenance too.

Conversely, some published content never goes through a draft stage.  Autogenerated content (including some AI-generated text) can be automatically published. Even though this content was never human-reviewed prior to publication, it’s possible it will need maintenance after it’s been published if the automation generates errors or the material becomes dated.

Maintenance is a general phase rather than a specific state.  Maintenance can have many expressions:

  • Revision
  • Updating
  • Correction
  • Unpublishing because the item is not currently relevant
  • Archiving to freeze an older topic no longer current
  • Deleting superfluous or dated content that doesn’t deserve revision

How does content change?

Despite the importance of content maintenance, few people say they will maintain an item or group of items.  Content maintenance is not well-defined or operationalized. Instead, staff talk about changes in generic terms, such as editing items or getting rid of them.  They talk about making revisions or updates without distinguishing these concepts. 

Content changes involve a range of distinct activities. The following table enumerates the distinct statuses a content item can have and describes the changes associated with each.  

Status: Description and behavior
Published: Lists publication date. May indicate “new” if recent and not previously published. If content has been reviewed since publication but not changed, it may indicate a “last reviewed” date. 
Revised: Stylistic revisions (wording or imagery changes) are not typically announced publicly when they don’t impact the core information in the content. Each revision, however, will generate a new version.
Updated: Updates refer to content changes that add, delete, or change factual information within the content. They can be announced and indicated with an update date that’s separate from the original publication date. Some publishers overwrite the original publication date, which can be confusing if it gives the impression that the content is new. 
Corrected: Correction notices state what was previously published that was wrong and provide the correct information. Corrections commonly relate to spellings, attributions of people or dates, and factual statements. They are used when readers are likely to be confused by seeing conflicting statements appearing in an article at different times.
Republished: Content sometimes indicates that an item was originally published on a certain date or website.
Published archive: Legacy content that needs to remain publicly accessible even though it is not maintained is published as an archive edition. Such content commonly includes a conspicuous banner announcing that it is out-of-date or that the information has not been updated as of a specific date. It also sometimes includes a redirect link if a more current version is available.
Scheduled: While scheduled is commonly an internal status, sometimes websites indicate that content is scheduled to appear by stating, “Coming on X date at Y time.” This is most common for announcements, product releases, or sales promotions. 
Offline temporarily: When published content is taken offline to address a bug or problem, it may be noted with a message announcing, “We are working on fixing issues.”
Previously live: Used for recordings of live-streamed content, especially video.  
Deleted: When content is deleted and no longer available, many publishers simply provide a generic redirect. But when users expect to find the content item by searching for it specifically, it may be necessary to provide a page announcing that the content is no longer available, with a specific redirect link to the most relevant available content on the topic.
Unpublished: Unpublished content is available internally for republishing but externally resembles deleted content.
Read-only: While most digital content is editable, some is read-only on publication and not human-editable. Examples are templated pages of financial data or robot-written stories about weather forecasts. While options for media editing are growing, much media, such as video, is difficult to edit after publication.
Different content states

After content is published, many changes are possible. Sometimes, corrections are needed.

Screenshot: New York Times

Updates indicate a date of review and potentially the name of the reviewer.

Detailed update history. Screenshot: Healthline

Retiring old content involves decisions. Sometimes, entire websites are archived but still accessible.

An archived website showing an archived webpage about web archiving. Source: EPA 

When canonical content changes, such as standards, it is important to retain copies of prior versions that users may have relied upon.

Screenshot: W3C

Content items can transition between various statuses. The diagram below shows the different states or statuses content items can be in.  The dashed lines indicate some of the significant ways that content can change its state.

Content states and transitions (dotted lines).

The content’s state reflects the action taken on an item. The current state can influence what future actions are allowed. For example, when published content is taken offline, it is unpublished, though it remains in the repository. An unpublished item can be republished. 

Most states take effect immediately, but a few are pending, where the system expects and announces that changed content is forthcoming. Some states publicly indicate the date of changes; others don’t. 

Maintained content is subject to change

The biggest factor shaping a content item’s status is whether or not it is maintained. Only in a few circumstances will content not require maintenance.  

If the organization has opted to publish content and keep it published, it has implicitly decided to maintain it by continuing to make it available. Of course, the publishing organization may do a poor job of maintaining that content. Maintenance should always be intentional, not an unplanned consequence of random choices to change or neglect items. But never confuse poor maintenance with no maintenance: they are separate statuses. 

A maintained item can potentially change. Its details are subject to change because the content addresses issues that might change; the item is in a maintained phase whether or not it has been changed recently, or ever. Some people mistakenly believe that items that haven’t been updated or otherwise changed recently are unmaintained and thus no longer relevant. But unless there is a cause to change the content, there’s no reason to assume the content has lost relevance. Sometimes, the recency of changes will predict current relevance, but not always.

Some published content, such as read-only or published archival content, will not be subject to change. What such content describes or relates to is no longer active.  But no-maintenance content is rare.

Content will no longer be subject to change when it has been frozen or removed. Only then will the content no longer be maintained. Depending on the value of such legacy content, it can either remain published for a defined time period or be deleted immediately once it is no longer maintained. Like software and other products, content needs an “end-of-life” process.

Why does content change?

When content managers discover content that needs to be changed, they create a task to fix the problem. Content maintenance often involves a backlog of tasks that are managed through routine prioritization.  

Content managers would benefit from more visibility into why content items require changes so they can estimate the effort involved with different types of changes.  They need a root-cause analysis of their content bugs.

Some changes are planned, but even unplanned changes can be anticipated to some degree. Changes also vary in their urgency and timescale. Some require immediate attention but are quick to fix. Others are more involved but may be less urgent. Unfortunately, in many cases, changes that are not considered urgent are deemed unimportant. By understanding the drivers of change, content managers can estimate the need for and effort involved in various content changes and plan accordingly.

Content change triggers.

Planned changes include those related to product and business announcements, scheduled projects involving content, new initiatives, and substitutions based on current relevance.

Internal errors and external surprises can prompt unplanned changes.  

Events generate a gap between the existing content and what is needed, whether planned or unplanned.  Details may now be

  • Missing
  • Inaccurate
  • Mismatched with user expectations
  • No longer conformant with organizational guidelines
  • Confusing
  • Obsolete

Changes in items can cascade. More than one cycle of changes may be needed. For example, updating items may introduce new errors. Errors such as misspellings, wrong capitalization and punctuation, and inadvertent deletions are as likely to arise when editing as when drafting. Changes in certain content items may cause the details in other related items to become out of sync, necessitating changes to those items as well. 

While content maintenance centers on changing content, it also involves preserving the intent of the content.  Maintenance can preserve two critical dimensions:

  1. The item’s traceability
  2. Its value

Poorly managed content is difficult to trace.  Many changes happen stealthily – someone fixes a problem in the content after spotting an error without logging this change anywhere.  Maybe the author hopes no one else noticed the mistake and decides that it’s no longer a concern because it’s fixed. But suppose a customer took a screenshot of the content before the fix and perhaps shared it on social media. Can the organization trace how the content appeared then? Versioning is essential for content traceability over time, because it provides a timestamped snapshot of content.  Autogenerated versions announce that changes have occurred. 

Content changes are essential for maintaining the value of published content. Consider so-called evergreen content, which has enduring value and will stay published for an extended time. Despite its name, evergreen content requires maintenance.  The lifespan of such content is determined by its traction: whether it is relevant and current. The utility of the content depends on more than whether or not the content needs to be updated. Up-to-date content may no longer be relevant to audiences or the business. Goals age, as does content. If the content no longer supports current goals because those goals have morphed, then the content may need to be unpublished and deleted. 

Content variants and ‘content drift’  

A shift in the goals for the original content can produce a different kind of change: a pivot in the content’s focus.  

How far can the content change before its identity changes so much that it is no longer what was originally published? At what point do revisions and updates result in the content talking about something different from what was initially published?

It’s important to distinguish between content versions and variants. They have different intents and need to be tracked separately.

Versions refer to changes to content items over time that do not change the focus of the content. An item is tracked according to its version. 

Variations refer to changes that introduce a pivot in the emphasis of the content by changing its focus or making it more specific. A variation does not simply change wording or images but essentially reconfigures the original content. A variation creates a new draft that is tracked separately. 

Unlike versions, which happen serially, variations can occur in multiples simultaneously.  Only one version can be current at a given time, but many variants can be current at once. 

Variants arise when organizations need to address a different need or change the initial message. Writers often refer to this process as “repurposing” content. With the adoption of GenAI, repurposing existing content has become easy.

However, the unmanaged publication of repurposed content can generate a range of challenges. Content managers can have trouble keeping “derivative content” current when it is unclear on what that content is based.  

When pivots happen gradually, content changes are hard to notice. Various writers and editors continually change the item, subtly altering the content’s purpose and goals. The changes behave like revisions, where only one version is current. But they also resemble variations, where the emphasis of the content shifts to the point that it has assumed a separate identity from its initial one. Such single-item fluidity is known as “content drift.”

A recent study by Harvard Law School (“The Paper of Record Meets an Ephemeral Web”) examined the “problem of content drift, or the often-unannounced changes – retractions, additions, replacement – to the content at a particular URL.”  The URL is a persistent identifier of the content item, but the details associated with that URL have substantively changed without visitors knowing the changes occurred. 

Examining sources cited by the New York Times, the Harvard team “noted two distinct types of drift, each with different implications. First, a number of sites had drifted because the domain containing the linked material had changed hands and been repurposed….More common and less immediately obvious, however, were web pages that had been significantly updated since they were originally included in the article. Such updates are a useful practice for those visiting most web sites – easy access to of-the-moment information is one of the Web’s key offerings. Left entirely static, many web pages would become useless in short order. However, in the context of a news article’s link to a page, updates often erase important evidence and context.”  

Watch out for the ever-morphing page. Various authors can change content items over months or years. As old references are deleted and new buzzwords are introduced, the changes produce the illusion that the content is current. But the original message of the content, motivated by a specific purpose at a particular time, is compromised in the process. 

The phenomenon of content drift highlights the importance of precisely tracking content changes. Many organizations maintain zombie pages that continually change because the URL is considered more valuable than the content. A better practice is to create new items when the focus shifts. 

Practices that content management can learn from data management

Even though content involves many distinct nuances, its maintenance shares challenges with other digital resources such as data and software code. Content management can learn from data management practices.

Diff checking versions and variants

Diff checking is a common utility for comparing file contents. Although it is most widely used to compare lines of text, it can also compare blocks of text and even images. 

While diff checking is most associated with tracking changes in software code, it is also well established for checking content changes. Some common diff checking use cases include detecting:

  • Plagiarism 
  • Alteration of legal text
  • Omissions
  • Duplication of text in different files

The primary use of diff checking in content management is to compare two versions of the same content item. The process is easiest to see when presenting two versions side-by-side, clearly showing additions and deletions between the original and subsequent versions.

This is an example of using diff checking to compare different versions of content, which highlights both word changes and the deletion of the second section (“aufgehoben”). Screenshot: Diffcheck.net
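
For teams that want to script this kind of version comparison rather than use a web utility, here is a minimal sketch using Python’s standard difflib module; the sample content is invented for illustration.

```python
# Minimal version-to-version diff check with Python's standard library.
# The content snippets are hypothetical examples.
import difflib

old_version = """Our premium plan costs $40 per month.
Cancel anytime within the first 30 days for a full refund.""".splitlines()

new_version = """Our premium plan costs $45 per month.
Cancel anytime within the first 14 days for a full refund.""".splitlines()

# unified_diff prefixes deletions with "-" and additions with "+"
for line in difflib.unified_diff(old_version, new_version,
                                 fromfile="version 1", tofile="version 2",
                                 lineterm=""):
    print(line)
```

The same approach works for cross-item comparison: feed the tool two different items rather than two versions of one item.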

Organizations can use diff checking to compare different content items.  Cross-item comparisons can help teams identify what parts of content variants should be consistent and which should be unique. 

Screenshot of the Copyscape app comparing two variants of weather reports. Different wording is grey, while common wording is in blue.

Cross-item diff checking can identify:

  • Duplication
  • Points of differentiation
  • The presence of non-standard language in one of the items
  • Evidence for forensic investigation of content provenance

Unfortunately, cross-item comparison is not a standard functionality in CMSs. Yet it is an essential capability for managing the maintenance of content variants. It can determine the degree of similarity between items. 

Comparison tools are no longer limited to checking for identical wording.  Newer capabilities incorporating AI can identify image differences and spot rephrasing in text.  They can compare not only known variants but also locate hidden variants that arose from the copying and rewriting of existing items. 

Understanding the pace of changes

Content managers sometimes describe content as either static or dynamic. These concepts help to define the user experience and delivery of the content. Can the content be cached so it is instantly available, or will it need to fetch updates from a server, which takes longer?   

The static/dynamic dichotomy alludes to a broader issue. Updates affect not only the technical delivery of the content but also the behavior of content developers and users.

Data managers classify data according to its “temperature”—how actively it is used. They do this to decide how to store the data. Frequently changing data needs to be accessed more quickly, which is more expensive. 

Content managers can borrow and adapt the concept of temperature to classify the frequency that content is updated or otherwise changed. Update frequency doesn’t necessarily influence how content is stored, but it does influence operational processes.

Update frequency will shape how content is accessed internally and externally. The demand for content updates is related to the frequency of updating. Publishers push content to users when updating it; the act of updating generates audience demand. Users pull content that has changed. They seek content that offers information or perspectives that are more useful than were available before the change.

We can understand the pace of changes to content by classifying content changes into temperature tiers.

Temperature: Content relevance
Hot: The most “dynamic” content in terms of changes. Includes transactional data (product prices and availability), customer submissions of reviews and comments, streaming, and liveblogging. Also covers “fresh” (newly published) content and possibly top content requests, as these items are least stable because they are often iterated.   
Warm: Content that changes irregularly, such as active recent (rather than just-published) content. Sometimes only a subset of the item is subject to change.  
Cold: Content that is infrequently accessed and updated and that is nearly static or archival. It may be kept for legal and compliance reasons.
Content temperature tiers

More ephemeral “hot” content will be “post and forget” and won’t require maintenance until it is purged. Other hot content will require vigilant review in the form of updates, corrections, or moderation. What all hot content shares is that it is top of mind and likely easily accessed.

“Warm” content is less top of mind and is sometimes neglected as a result. Given the prioritization of publishing over maintenance, warm content is changed when problems arise, often unexpectedly. The timing and nature of changes are more difficult to predict. Maintenance happens on an ad hoc basis. 

“Cold” content is often forgotten. Because it isn’t active, it is often old and may not have an identifiable owner. However, managing such content still requires decisions, although organizations generally have poor processes for managing such content.

Versioning strategies for ‘Slowly Changing Dimensions’

Warm content corresponds to what data managers call slowly changing dimensions (SCD), another concept that can help content managers think about the versioning process. 

Wikipedia notes: “a slowly changing dimension (SCD) in data management and data warehousing is a dimension which contains relatively static data which can change slowly but unpredictably, rather than according to a regular schedule.” 

While software engineers developed SCD to manage the rows and columns of tabular data, content managers can adapt the concept to address their needs.  We can translate the tiering to describe how to manage content changes. Rows are akin to content items, while columns broadly correspond to content elements within an item.

SCD type: Equivalent content tracking process
Type 0: Static single version. Always retain the original content as is. Never overwrite the original version. When information differs from existing content, create a new content item.
Type 1: Changeable single version. Used for items where there’s only one mutable source of truth, for example, the current weather forecast. What’s been stated in the past is no longer relevant, either internally or externally. 
Type 2: Create distinct versions. Each change, whether a revision, update, or correction, generates a new version with a unique version number. Changes overwrite prior content, but the status can be rolled back to an earlier version.
Type 3: Version changes within an item. Rather than generating versions of the item overall, the versioning occurs at the component level. The content item will contain a patchwork of new and old, so that authors can see what’s most recently changed. 
Type 4: Create a change log that’s independent of the content item. It lists status changes, the scope of impact, and when the change occurred. 
Slowly changing dimensions of content — management tiers

Types 0 and 1 don’t involve change tracking, but the higher tiers illustrate alternative approaches to tracking and managing content versions.
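
To make the higher tiers concrete, here is a minimal sketch of a Type 2 approach in Python. The class and field names are illustrative assumptions, not a particular CMS’s data model.

```python
# Hypothetical Type 2 versioning: each change creates a new, numbered
# version; earlier versions are retained and can be rolled back to.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ContentVersion:
    item_id: str
    version: int
    body: str
    change_type: str          # "revision", "update", or "correction"
    created_at: datetime

@dataclass
class ContentItem:
    item_id: str
    versions: list = field(default_factory=list)

    def save_change(self, body: str, change_type: str) -> ContentVersion:
        new_version = ContentVersion(
            item_id=self.item_id,
            version=len(self.versions) + 1,
            body=body,
            change_type=change_type,
            created_at=datetime.now(timezone.utc),
        )
        self.versions.append(new_version)
        return new_version

    def current(self) -> ContentVersion:
        return self.versions[-1]

    def rollback(self, version: int) -> ContentVersion:
        # Rolling back republishes an earlier version as a new version
        prior = self.versions[version - 1]
        return self.save_change(prior.body, change_type="rollback")
```

A Type 4 approach would store comparable change records in a separate log, independent of the item itself.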

CMSs use varied implementations of version comparison.  

Kontent.ai illustrates an example of Type 2 version comparison. Their CMS allows an editor to compare any two versions within a single view. It distinguishes added text, removed text, and text with format changes. 

An example of comparing two versions (indicated with check marks) of an item in Kontent.ai. The example also shows an older abandoned draft.

Optimizely has a feature supporting a Type 3 version comparison.  Their CMS has a limited ability to compare properties between versions.  

Optimizely interface

The Wikipedia platform provides content management functionality. Wikipedia’s page history is an example of a table of changes associated with a Type 4 approach. Some of these are automatic edit summaries. 

Wikipedia page history

An even more complete summary would go beyond a change log’s basic timeline to become a full change history that lists (a sketch of such a record follows the list):

  • When the content was changed, and how the timing relates to other events (publication event, corporate event, product development event, marketing campaign event)
  • Why it was changed (the reason)
  • What was changed (the delta)
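
Here is a hypothetical shape for such a change-history record, expressed in Python. The field names and sample values are invented for illustration.

```python
# Hypothetical change-history record capturing when, why, and what changed.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ChangeRecord:
    item_id: str
    changed_at: datetime        # when the change occurred
    related_event: str          # e.g., "product release", "marketing campaign"
    reason: str                 # why the content was changed
    delta: str                  # what changed (summary or reference to a diff)
    new_status: str             # resulting state, e.g., "updated", "corrected"

change_history = [
    ChangeRecord(
        item_id="pricing-page",
        changed_at=datetime(2024, 3, 1, 9, 30),
        related_event="product release",
        reason="new plan tier announced",
        delta="added plan tier section; updated price table",
        new_status="updated",
    ),
]
```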

Tracking content’s current and prior states

CMSs are largely indifferent to changes in published content. By default, they only track whether a content item is drafted, published, or archived. From the system’s perspective, this is all they need to know: where to put the content.

Like many CMSs, Drupal tracks content according to whether it is in a draft, published, or archived state.

The CMS won’t remember what specifically happened. It doesn’t store the nature of changes to published items or reference them in subsequent actions. Its focus is on the content’s current high-level status. The CMS only knows that the content is published, not that the most recent version was updated.

The cycle of draft-published-archive is known as state transition management. CMSs manage states in a rudimentary way that doesn’t capture important distinctions.

From a human perspective, content transitions are important to making decisions. The current state suggests potential transitions, but previous states can reveal more details about the history of the item and can inform what might be useful to do next. 

To help teams make better decisions, the CMS should be more “stateful”: recording the distinctions among different versions instead of only recording that a new version was published on a certain date. Such an approach would allow editors to revert the most recent update or find items that haven’t been updated since a certain date, for example. 

A substantive change, such as an update or correction, and a non-substantive change, such as a minor wording revision, can trigger different workflows. For example, minor copyedits shouldn’t trigger a review workflow if the content’s substance doesn’t change and has already been reviewed.  

The CMS should know about the prior life of content items.  Yet CMSs can treat changes to published content as new drafts that have no workflow history, potentially triggering redundant reviews.  

Because simple states don’t capture past actions, the provenance of content items can be murky. For example, how does a writer or editor know that one item is derived from another? Many CMSs prompt writers to create a new draft from an old one, but the writer isn’t always clear when doing so whether the new draft is replacing the old one (generating a new version) or creating a new item (generating a new variant). Whenever a new item is created based on an old one, the maintenance burden grows.  

Like many CMSs, Drupal allows a draft to be created from published content.

Content transitions are neither strictly linear nor entirely cyclical. Content doesn’t necessarily revert to a previous state. An unpublished item is not the same as a draft. What happened to published items previously can be of interest to editorial teams. 

CMSs would benefit from having a nested state mechanism that distinguishes the various states within the offline state (draft, unpublished, deleted) from those in the online state (published original [editable], revised, updated, corrected). In addition, the mechanism should recognize that multiple states are possible at once. Old content can be unpublished and deleted, which may happen simultaneously or at different times. Existing content similarly can be revised for wording and updated for facts at the same or different times. 

State transitions must be linked to version dates. The effective dates of changes are essential to understanding both the history of content items and their future disposition. For example, if a previously editable item is converted to read-only (a published archival version), it is helpful to know when that occurred. It is unlikely that an item, once archived, would be edited again.

Even though most CMSs only manage simple states and transitions, IT standards support more complex behaviors. 

Statecharts, which the W3C has standardized as SCXML for describing state changes, can address behaviors such as:

  • Parallel states, where different transitions are happening concurrently
  • Compound or nested states, where more specific states exist within broader ones
  • History states capturing a “stored state configuration” to remember prior actions and statuses

These standards allow for more granular and enduring tracking of content changes. Instead of each edit regressing to a draft, the content can maintain a history of the actions that have happened to it previously. A history state knows the point at which it was last left so that processes don’t need to start over from the beginning.  
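
As a rough illustration of how compound and history states could apply to content, here is a hand-rolled Python sketch. It is not an implementation of the SCXML standard; the states and transitions are simplified assumptions.

```python
# A minimal sketch of compound (nested) states plus a history state
# for content status. State names are illustrative, not a standard.
CONTENT_STATES = {
    "offline": {"substates": ["draft", "unpublished", "deleted"]},
    "online": {"substates": ["published", "revised", "updated", "corrected"]},
}

class ContentStateMachine:
    def __init__(self):
        self.parent = "offline"
        self.state = "draft"
        self.history = []          # remembers prior states ("history state")

    def transition(self, new_state: str):
        for parent, config in CONTENT_STATES.items():
            if new_state in config["substates"]:
                self.history.append((self.parent, self.state))
                self.parent, self.state = parent, new_state
                return
        raise ValueError(f"Unknown state: {new_state}")

    def resume_last_online_state(self):
        # A history state lets a republished item pick up where it left off
        # instead of regressing to "draft".
        for parent, state in reversed(self.history):
            if parent == "online":
                self.parent, self.state = parent, state
                return
        self.parent, self.state = "online", "published"

item = ContentStateMachine()
item.transition("published")
item.transition("updated")
item.transition("unpublished")   # taken offline
item.resume_last_online_state()  # returns to "updated", not "draft"
```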

A ‘Data Historian’ for content

Writers, editors, and content managers have trouble assessing the history of changes to content items, especially for items they didn’t create.  CMSs don’t provide an overview of historical changes to items.

Wikipedia, which is collectively written and edited, provides an at-a-glance dashboard showing the history of content items. It shows an overview of edits to a page, even distinguishing minor edits that don’t require review, such as changes in spelling, grammar, or formatting. 

Wikipedia page history dashboard (partial view)

Like Wikipedia, software code is collectively developed and changed. Software engineers can see an “activity overview” that summarizes the frequency and type of changes to software code. 

Github history dashboard

It’s a mistake to believe that because systems and people routinely and quickly change digital resources, the history of those changes isn’t important. 

The value of recording status transitions goes beyond indicating whether the content is current.  The history of status transitions can help content managers understand how issues arose so they can be prevented or addressed earlier. 

Data managers don’t dismiss the value of history – they learn from it. They talk about the concept of historicizing data or “tracking data changes over time.”  Data history is the basis of predictive analytics. 

Some software hosts a “data historian.” Data historians are most common in industrial operations, which, like content operations, involve many processes and actions happening across teams and systems at various times. 

One vendor describes the role of the historian as follows: “A data historian is a software program that records the data of processes running in a computer system….The data that goes into a data historian is time-stamped and cataloged in an organized, machine-readable format. The data is analyzed to compare such things as day vs. night shifts, different work crews, production runs, material lots, and seasons. Organizations use data from data historians to answer many performance and efficiency-related questions. Organizations can gain additional insights through visual presentations of the data analysis called data visualization.”  

If automated industrial processes can benefit from having a data historian, then human-driven content processes can as well. History is derived from the same word as story (the Latin historia); history is storytelling. Data historians can support data storytelling. They can communicate the actions that teams have taken. 

Toward intelligent change management

Numerous variables can trigger content changes, and a single content item can undergo multiple changes during its lifespan.  Editors are expected to use their judgment to make changes.  But without well-defined rules, each editor will make different choices. 

How far can rules be developed to govern changes? 

A widely cited example of archiving rules is the US Department of Health and Human Services archive schedule, which keeps content published for “two full years” unless subject to other rules. 

HHS archiving timeline (partial table). Some rules are time- or event-driven, but others require an individual’s judgment.  Source: HHS

Even mature frameworks such as HHS’s still rely on guesswork when the archiving criteria are “outdated and/or no longer relevant.” 

It’s useful to distinguish fixed rules from variable ones. Fixed rules have the appeal of being simple and unambiguous. A fixed rule may state: after x months or years following publication, an item will be auto-archived or automatically deleted. But that’s a blunt rule that may not be prudent in all cases. So the fixed rule becomes a guideline that requires human review on a case-by-case basis, which doesn’t scale, can be inconsistently followed, and limits the capacity to maintain content.

Content teams need variable rules that can cover more nuances yet provide consistency in decisions. Large-scale content operations entail diversity and require rules that can address complex scenarios.
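
A hypothetical example of moving from a fixed rule to a variable one might look like the sketch below. The thresholds and criteria are invented; real rules would reflect an organization’s own retention policies.

```python
# A hypothetical archiving rule that combines a fixed time threshold with
# variable conditions. Thresholds and criteria are invented examples.
from datetime import datetime, timedelta
from typing import Optional

def should_archive(published_at: datetime,
                   last_reviewed_at: datetime,
                   monthly_views: int,
                   retention_required: bool,
                   now: Optional[datetime] = None) -> bool:
    now = now or datetime.utcnow()
    if retention_required:
        return False                              # legal retention overrides age
    older_than_two_years = now - published_at > timedelta(days=730)
    stale_review = now - last_reviewed_at > timedelta(days=365)
    low_traffic = monthly_views < 50
    # Archive only when age, staleness, and low demand all point the same way
    return older_than_two_years and stale_review and low_traffic

# Example: an item published in 2021, last reviewed in 2022, with little traffic
print(should_archive(datetime(2021, 1, 15), datetime(2022, 2, 1),
                     monthly_views=12, retention_required=False))
```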

What can teams learn if content changes become easier to track, and how can they use that information to automate tasks? 

Data management practices again suggest possibilities.  The concept of change data capture (CDC) is “used to determine and track the data that has changed (the “deltas”) so that action can be taken using the changed data.” If a certain change has occurred, what actions should happen?  A mechanism like CDC can help automate the process of reviewing and changing content.
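
Adapted to structured content, a CDC-style mechanism might capture field-level deltas and map them to follow-up actions. The sketch below is a simplified assumption, with invented field names and actions.

```python
# A hypothetical CDC-style check for structured content: compare field-level
# snapshots, capture the deltas, and route them to follow-up actions.
def capture_deltas(old: dict, new: dict) -> dict:
    return {
        field: {"old": old.get(field), "new": new.get(field)}
        for field in set(old) | set(new)
        if old.get(field) != new.get(field)
    }

ACTIONS = {
    "price": "notify product owner; refresh snippets that reuse this value",
    "legal_disclaimer": "trigger compliance review workflow",
    "summary": "no review needed; stylistic field",
}

old_item = {"price": "$40/month", "summary": "Our plan.", "legal_disclaimer": "v3"}
new_item = {"price": "$45/month", "summary": "Our plan, explained.", "legal_disclaimer": "v3"}

for field, delta in capture_deltas(old_item, new_item).items():
    print(field, delta, "->", ACTIONS.get(field, "log only"))
```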

Basic version comparison tools are limited in their ability to distinguish stylistic changes from substantive ones. A misplaced comma or misspelled word is treated as equivalent to a retraction or significant update. Many diff checking utilities simply crunch files without awareness of what they contain.

Ways to automate changes at scale

Terminology and phrasing can be modified at scale using customized style-checking tools, especially ones trained on internal documents that incorporate custom word lists, phrase lists, and rules. 
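
As a simple illustration, a custom word list can be applied programmatically. The sketch below uses a small, invented term map; dedicated style-checking tools offer far richer rule sets.

```python
# A minimal sketch of a custom word-list check. The term map is invented.
import re

PREFERRED_TERMS = {
    r"\bsign[- ]?on\b": "sign-in",
    r"\bweb site\b": "website",
    r"\be-mail\b": "email",
}

def flag_terminology(text: str) -> list[tuple[str, str]]:
    """Return (found, preferred) pairs for non-standard terms in the text."""
    findings = []
    for pattern, preferred in PREFERRED_TERMS.items():
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            findings.append((match.group(0), preferred))
    return findings

print(flag_terminology("Visit our web site and sign on to check e-mail."))
# [('sign on', 'sign-in'), ('web site', 'website'), ('e-mail', 'email')]
```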

Organizations can use various strategies to improve oversight of substantive statements:

  • Templated wording, enforced through style guidelines and text models, directs the focus of changes toward substance rather than style.
  • Structured writing can separate factual material from generic descriptions that are used for many facts.
  • Named entity recognition (NER) tools can identify product names, locations, people, prices, quantities, and dates, to detect if these have been altered between versions or items.

Substantive changes can be tracked by looking at named entities. Suppose the paragraph below was updated to include data from the 2018 Consumer Reports rankings. An NER scan could determine the date of the ranking cited in the text without requiring someone to read it.  

displaCy Named Entity Visualizer on Wikipedia text

NER can also be used to track brand and product names and determine if content incorporates current usage.
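
A brief sketch of this kind of NER scan, using the open-source spaCy library (whose displaCy visualizer is shown above), might look like the following. The text samples are invented, and the sketch assumes the small English model has been installed.

```python
# Sketch: compare DATE entities between two versions of a passage with spaCy.
# Assumes: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

old_text = "The ranking draws on the 2016 Consumer Reports survey."
new_text = "The ranking draws on the 2018 Consumer Reports survey."

def dates(text: str) -> set[str]:
    # Keep only entities spaCy labels as DATE
    return {ent.text for ent in nlp(text).ents if ent.label_ == "DATE"}

# Entities that differ between versions, without anyone reading the full text
print(dates(old_text) ^ dates(new_text))   # e.g., {'2016', '2018'}
```

The same pass could collect ORG or PRODUCT entities to check whether brand and product names reflect current usage.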

Bots can perform many routine content maintenance operations to fix problems that degrade the quality and utility of content. The experience of Wikipedia shows that bots can be used for a range of remediation:

  • Copyediting
  • Adding generic boilerplate
  • Removing unwanted additions
  • Adding missing metadata

Ways to decide when content changes are needed

We’ve looked at some intelligent ways to track and change content. But how can teams use intelligence to know when change is needed, particularly in situations that don’t involve predictable events or timelines?

  • What situation has changed and who now needs to be involved?
  • What needs to change in the content as a result?

Let’s return to the content change trigger diagram shown earlier. We can identify a range of triggers that aren’t planned and are harder to anticipate. Many of these changes involve shifts in relevance. Some shifts are gradual, while others are sudden and unexpected. 

Teams need to connect the changes that need to be done to the changes that are already happening. They must be able to anticipate changes in content relevance. 

First, teams need to be able to see the relationships between items that are connected thematically. In my recent post on content workflows, I advocated for adopting semantics that can connect related content items. A less formal option is to adopt the approach used by Wikipedia to provide “page watchers” functionality that allows authors to be notified of changes to pages of interest (which is somewhat similar to pull requests in software). Downstream content owners want to notice when changes occur to the content they incorporate, link to, or reference. 

Second, teams need content usage data to inform the prioritization and scheduling of content changes. 

Teams must decide whether updating a content item is worthwhile. This decision is difficult because teams lack data to inform it. They don’t know whether the content was neglected because it was deemed no longer useful or whether the content hasn’t been effective because it was neglected. They need to cross-reference data on the internal history of the content with external usage, using content paradata to make decisions. 

Content paradata supporting decisions relating to content changes

Maintenance decisions depend on two kinds of insights:

  1. The cadence of changes to the content over time, such as whether the content has received sustained attention, erratic attention, or no attention at all
  2. The trends in the content’s usage, such as whether usage has flatlined, declined, grown, or been consistently trivial

Historical data clarifies whether problems emerged at some point after the organization published the item or if they have been present from the beginning. It distinguishes poor maintenance due to lapsed oversight from cases where items were never reviewed or changed. It differentiates persistent poor engagement (content attracting no views or conversions at all) from faltering engagement, where views or conversions have declined.

Knowing the origin of problems is critical to fixing them. Did the content ever spark an ember of interest? Perhaps the original idea wasn’t quite right, but it was near enough to attract some interest. Should an alternative variant be tried? If an item once enjoyed robust engagement but suffers from declining views now, should it be revived?  When is it best to cut losses? 

Decisions about fixing long-term issues can’t be automated. Yet better paradata can help staff to make more informed and consistent decisions. 

– Michael Andrews 

Categories
Content Engineering

Paradata: where analytics meets governance

Organizations aspire to make data-informed decisions. But can they confidently rely on their data? What does that data really tell them, and how was it derived? Paradata, a specialized form of metadata, can provide answers.

Many disciplines use paradata

You won’t find the word paradata in a household dictionary and the concept is unknown in the content profession.  Yet paradata is highly relevant to content work. It provides context showing how the activities of writers, designers, and readers can influence each other.

Paradata provides a unique and missing perspective. A forthcoming book on paradata defines it as “data on the making and processing of data.” Paradata extends beyond basic metadata — “data about data.” It introduces the dimensions of time and events. It considers the how (process) and the what (analytics).

Think of content as a special kind of data that has a purpose and a human audience. Content paradata can be defined as data on the making and processing of content.

Paradata can answer:

  • Where did this content come from?
  • How has it changed?
  • How is it being used?

Paradata differs from other kinds of metadata in its focus on the interaction of actors (people and software) with information. It provides context that helps planners, designers, and developers interpret how content is working.

Paradata traces activity during various phases of the content lifecycle: how it was assembled, interacted with, and subsequently used. It can explain content from different perspectives:

  • Retrospectively 
  • Contemporaneously
  • Predictively

Paradata provides insights into processes by highlighting the transformation of resources in a pipeline or workflow. By recording the changes, it becomes possible to reproduce those changes. Paradata can provide the basis for generalizing the development of a single work into a reusable workflow for similar works.

Some discussions of paradata refer to it as “processual meta-level information on processes” (processual here refers to the process of developing processes). Knowing how activities happen provides the foundation for sound governance.

Contextual information facilitates reuse. Paradata can enable the cross-use and reuse of digital resources. A key challenge for reusing any content created by others is understanding its origins and purpose. It’s especially challenging when wanting to encourage collaborative reuse across job roles or disciplines. One study of the benefits of paradata notes: “Meticulous documentation and communication of contextual information are exceedingly critical when (re)users come from diverse disciplinary backgrounds and lack a shared tacit understanding of the priorities and usual practices of obtaining and processing data.“

While paradata isn’t currently utilized in mainstream content work, a number of content-adjacent fields use paradata, pointing to potential opportunities for content developers. 

Content professionals can learn from how paradata is used in:

  • Survey and research data
  • Learning resources
  • AI
  • API-delivered software

Each discipline looks at paradata through different lenses and emphasizes distinct phases of the content or data lifecycle. Some emphasize content assembly, while others emphasize content usage. Some emphasize both, building a feedback loop.

Conceptualizing paradata
Different perspectives of paradata. Source: Isto Huvila

Content professionals should learn from other disciplines, but they should not expect others to talk about paradata in the same way.  Paradata concepts are sometimes discussed using other terms, such as software observability. 

Paradata for surveys and research data

Paradata is most closely associated with developing research data, especially statistical data from surveys. Survey researchers pioneered the field of paradata several decades ago, aware of the sensitivity of survey results to the conditions under which they are administered.

The National Institute of Statistical Sciences describes paradata as “data about the process of survey production” and as “formalized data on methodologies, processes and quality associated with the production and assembly of statistical data.”  

Researchers realize that how information is assembled can influence what can be concluded from it. In a survey, confounding factors could be a glitch in a form or a leading question that prompts people to answer disproportionately in a given way. 

The US Census Bureau, which conducts a range of surveys of individuals and businesses, explains: “Paradata is a term used to describe data generated as a by-product of the data collection process. Types of paradata vary from contact attempt history records for interviewer-assisted operations, to form tracing using tracking numbers in mail surveys, to keystroke or mouse-click history for internet self-response surveys.”  For example, the Census Bureau uses paradata to understand and adjust for non-responses to surveys. 

Paradata for surveys
Source: NDDI 

As computers become more prominent in the administration of surveys, they become actors influencing the process. Computers can record an array of interactions between people and software.

 Why should content professionals care about survey processes?

Think about surveys as a structured approach to assembling information about a topic of interest. Paradata can indicate whether users could submit survey answers and under what conditions people were most likely to respond. Researchers use paradata to measure user burden. Paradata helps illuminate the work required to provide information – a topic relevant to content professionals interested in the authoring experience of structured content.

Paradata supports research of all kinds, including UX research. It’s used in archaeology and archives to describe the process of acquiring and preserving assets and changes that may happen to them through their handling. It’s also used in experimental data in the life sciences.

Paradata supports reuse. It provides information about the context in which information was developed, improving its quality, utility, and reusability.

Researchers in many fields are embracing what is known as the FAIR principles: making data Findable, Accessible, Interoperable, and Reusable. Scientists want the ability to reproduce the results of previous research and build upon new knowledge. Paradata supports the goals of FAIR data.  As one study notes, “understanding and documentation of the contexts of creation, curation and use of research data…make it useful and usable for researchers and other potential users in the future.”

Content developers similarly should aspire to make their content findable, accessible, interoperable, and reusable for the benefit of others. 

Paradata for learning resources

Learning resources are specialized content that needs to adapt to different learners and goals. How resources are used and changed influences the outcomes they achieve. Some education researchers have described paradata as “learning resource analytics.”

Paradata for instructional resources is linked to learning goals. “Paradata is generated through user processes of searching for content, identifying interest for subsequent use, correlating resources to specific learning goals or standards, and integrating content into educational practices,” notes a Wikipedia article. 

Data about usage isn’t represented in traditional metadata. A document prepared for the US Department of Education notes: “Say you want to share the fact that some people clicked on a link on my website that leads to a page describing the book. A verb for that is ‘click.’ You may want to indicate that some people bookmarked a video for a class on literature classics. A verb for that is ‘bookmark.’ In the prior example, a teacher presented resources to a class. The verb used for that is ‘taught.’ Traditional metadata has no mechanism for communicating these kinds of things.”

“Paradata may include individual or aggregate user interactions such as viewing, downloading, sharing to other users, favoriting, and embedding reusable content into derivative works, as well as contextualizing activities such as aligning content to educational standards, adding tags, and incorporating resources into curriculum.” 

Usage data can inform content development.  One article expresses the desire to “establish return feedback loops of data created by the activities of communities around that content—a type of data we have defined as paradata, adapting the term from its application in the social sciences.”

Unlike traditional web analytics, which focuses on web pages or user sessions and doesn’t consider the user context, paradata focuses on the user’s interactions in a content ecosystem over time. The data is linked to content assets to understand their use. It resembles social media metadata that tracks the propagation of events as a graph.

“Paradata provides a mechanism to openly exchange information about how resources are discovered, assessed for utility, and integrated into the processes of designing learning experiences. Each of the individual and collective actions that are the hallmarks of today’s workflow around digital content—favoriting, foldering, rating, sharing, remixing, embedding, and embellishing—are points of paradata that can serve as indicators about resource utility and emerging practices.”

Paradata for learning resources utilizes Activity Streams JSON, which can track the interaction between actors and objects according to predefined verbs in an “Activity Schema” that can be measured. The approach can be applied to any kind of content.
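
As a rough sketch, an Activity Streams 2.0-style paradata statement could be expressed as JSON (shown here as a Python dict). The actor, object, and the “taught” verb echo the examples quoted above; “taught” would be an extension term rather than a core Activity Streams type, and the identifiers are invented.

```python
# Hypothetical Activity Streams 2.0-style paradata event for a learning resource
paradata_event = {
    "@context": "https://www.w3.org/ns/activitystreams",
    "type": "taught",                      # extension verb from the example above
    "actor": {"type": "Person", "name": "A high-school teacher"},
    "object": {
        "type": "Document",
        "id": "https://example.org/resources/literature-classics-video",
    },
    "target": {"type": "Group", "name": "11th-grade literature class"},
    "published": "2024-03-01T09:30:00Z",
}
```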

Paradata for AI

AI has a growing influence over content development and distribution. Paradata is emerging as a strategy for producing “explainable AI” (XAI).  “Explainability, in the context of decision-making in software systems, refers to the ability to provide clear and understandable reasons behind the decisions, recommendations, and predictions made by the software.”

The Association for Intelligent Information Management (AIIM) has suggested that a “cohesive package of paradata may be used to document and explain AI applications employed by an individual or organization.” 

Paradata provides a manifest of the AI training data. AIIM identifies two kinds of paradata: technical and organizational.

Technical paradata includes:

  • The model’s training dataset
  • Versioning information
  • Evaluation and performance metrics
  • Logs generated
  • Existing documentation provided by a vendor

Organizational paradata includes:

  • Design, procurement, or implementation processes
  • Relevant AI policy
  • Ethical reviews conducted
Paradata for AI
Source: Patricia C. Franks

The provenance of AI models and their training has become a governance issue as more organizations use machine learning models and LLMs to develop and deliver content. AI models tend to be “black boxes” that users are unable to untangle and understand. 

How AI models are constructed has governance implications, given their potential to be biased or to contain unlicensed copyrighted or other proprietary data. Developing paradata for AI models will be essential if they are to gain wide adoption.

Paradata and document observability

Observing the unfolding of behavior helps to debug problems to make systems more resilient.

Fabrizio Ferri-Benedetti, whom I met some years ago in Barcelona at a Confab conference, recently wrote about a concept he calls “document observability” that has parallels to paradata.

Content practices can borrow from software practices. As software becomes more API-focused, firms are monitoring API logs and metrics to understand how various routines interact, a field called observability. The goal is to identify and understand unanticipated occurrences. “Debugging with observability is about preserving as much of the context around any given request as possible, so that you can reconstruct the environment and circumstances that triggered the bug.”

Observability utilizes a profile called MELT: Metrics, Events, Logs, and Traces. MELT is essentially paradata for APIs.

Software observability pattern.  Source: Karumuri, Solleza, Zdonik, and Tatbul

Content, like software, is becoming more API-enabled. Content can be tapped from different sources and fetched interactively. The interaction of content pieces in a dynamic context showcases the content’s temporal properties.

When things behave unexpectedly, systems designers need the ability to reverse engineer behavior. An article in IEEE Software states: “One of the principles for tackling a complex system, such as a biochemical reaction system, is to obtain observability. Observability means the ability to reconstruct a system’s internal state from its outputs.”  

Ferri-Benedetti notes, “Software observability, or o11y, has many different definitions, but they all emphasize collecting data about the internal states of software components to troubleshoot issues with little prior knowledge.”  

Because documentation is essential to the software’s operation, Ferri-Benedetti  advocates treating “the docs as if they were a technical feature of the product,” where the content is “linked to the product by means of deep linking, session tracking, tracking codes, or similar mechanisms.”

He describes document observability (“do11y”) as “a frame of mind that informs the way you’ll approach the design of content and connected systems, and how you’ll measure success.”

In contrast to observability, which relies on incident-based indexing, paradata is generally defined by a formal schema. A schema allows stakeholders to manage and change the system instead of merely reacting to it and fixing its bugs. 

Applications of paradata to content operations and strategy

Why introduce a new concept that most people have never heard of? Because content professionals must expand their toolkit.

Content is becoming more complex. It touches many actors: employees in various roles, customers with multiple needs, and IT systems with different responsibilities. Stakeholders need to understand the content’s intended purpose and use in practice and if those orientations diverge. Do people need to adapt content because the original does not meet their needs? Should people be adapting existing content, or should that content be easier to reuse in its original form?

Content continuously evolves and changes shape, acquiring emergent properties. People and AI customize, repurpose, and transform content, making it more challenging to know how these variations affect outcomes. Content decisions involve more people over extended time frames. 

Content professionals need better tools and metrics to understand how content behaves as a system. 

Paradata provides contextual data about the content’s trajectory. It builds on two kinds of metadata that connect content to user action:

  • Administrative metadata capturing the actions of the content creators or authors, intended audiences, approvers, versions, and when last updated
  • Usage metadata capturing the intended and actual uses of the content, both internal (asset role, rights, where item or assets are used) and external (number of views, average user rating)

Paradata also incorporates newer forms of semantic and blockchain-based metadata that address change over time:

  • Provenance metadata
  • Actions schema types

Provenance metadata has become essential for image content, which can be edited and transformed in multiple ways that change what it represents. Organizations need to know the source of the original and what edits have been made to it, especially with the rise of synthetic media. Metadata can indicate what an image was based on or derived from, who made changes, and what software generated those changes. Two corporate initiatives focused on provenance metadata are the Content Authenticity Initiative and the Coalition for Content Provenance and Authenticity.

Actions are an established — but underutilized — dimension of metadata. The widely adopted schema.org vocabulary has a class of actions that address both software interactions and physical world actions. The schema.org actions build on the W3C Activity Streams standard, which was upgraded in version 2.0 to semantic standards based on JSON-LD types.
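
As a rough illustration, action-based paradata for a single content revision could be expressed with schema.org’s UpdateAction type in JSON-LD. The Python sketch below assembles one possible record; the property selection is illustrative, not a prescribed paradata profile.

```python
import json

# A sketch of action-based paradata for one content revision, expressed
# with schema.org's UpdateAction type in JSON-LD. The property selection
# is illustrative rather than a prescribed profile.
update_action = {
    "@context": "https://schema.org",
    "@type": "UpdateAction",
    "actionStatus": "CompletedActionStatus",
    "agent": {"@type": "Person", "name": "Dana Reyes", "jobTitle": "Content designer"},
    "object": {"@type": "Article", "identifier": "refund-policy-page", "version": "2.3"},
    "instrument": {"@type": "SoftwareApplication", "name": "CMS editorial workflow"},
    "endTime": "2024-05-01T16:45:00Z",
    "description": "Updated the refund window from 30 to 60 days after a policy change.",
}

print(json.dumps(update_action, indent=2))
```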

Content paradata can clarify common issues such as:

  • How can content pieces be reused?
  • What was the process for creating the content, and can one reuse that process to create something similar?
  • When and how was this content modified?

Paradata can help overcome operational challenges such as:

  • Content inventories where it is difficult to distinguish similar items or versions
  • Content workflows where it is difficult to model how distinct content types should be managed
  • Content analytics, where the performance of content items is bound up with channel-specific measurement tools

Implementing content paradata must be guided by a vision. The most mature application of paradata – for survey research – has evolved over several decades, prompted by the need to improve survey accuracy. Other research fields are adopting paradata practices as research funders insist that data be “FAIR.” Change is possible, but it doesn’t happen overnight. It requires having a clear objective.

It may seem unlikely that content publishing will embrace paradata anytime soon. However, the explosive growth of AI-generated content may provide the catalyst for introducing paradata elements into content practices. The unmanaged generation of content will be a problem too big to ignore.

The good news is that online content publishing can take advantage of existing metadata standards and frameworks that provide paradata. What’s needed is to incorporate these elements into content models that manage internal systems and external platforms.

Online publishers should introduce paradata into systems they directly manage, such as their digital asset management system or customer portals and apps. Because paradata can encompass a wide range of actions and behaviors, it is best to prioritize tracking actions that are difficult to discern but likely to have long-term consequences. 

Paradata can provide robust signals to reveal how content modifications impact an organization’s employees and customers.  

– Michael Andrews

Categories
Content Efficiency

Supporting content compliance using Generative AI

Content compliance is challenging and time-consuming. Surprisingly, one of the most interesting use cases for Generative AI in content operations is to support compliance.

Compliance shouldn’t be scary

Compliance can seem scary. Authors must use the right wording lest things go haywire later, be it bad press or social media exposure, regulatory scrutiny, or even lawsuits. Even when the odds of mistakes are low because the compliance process is rigorous, satisfying compliance requirements can seem arduous. It can involve rounds of rejections and frustration.

Competing demands. Enterprises recognize that compliance is essential and touches ever more content areas, but scaling compliance is hard. Lawyers and other experts know what’s compliant but often lack visibility into what writers will be creating, which makes review challenging for compliance teams as well.

Both writers and reviewers need better tools to make compliance easier and more predictable.

Compliance is risk management for content

Because words matter, they carry risks. The wrong phrasing or missing wording can expose firms to legal liability. The growing volume of content places big demands on the legal and compliance teams that must review it.

A major issue in compliance is consistency. Inconsistent content is risky. Compliance teams want consistent phrasing so that the message complies with regulatory requirements while aligning with business objectives.

Compliant content is especially critical in fields such as finance, insurance, pharmaceuticals, medical devices, and the safety of consumer and industrial goods. Content about software faces more regulatory scrutiny as well, such as privacy disclosures and data rights. All kinds of products can be required to disclose information relating to health, safety, and environmental impacts.  

Compliance involves both what’s said and what’s left unsaid. Broadly, compliance looks at four thematic areas:

  1. Truthfulness
    1. Factual precision and accuracy 
    2. Statements would not reasonably be misinterpreted
    3. Not misleading about benefits, risks, or who is making a claim
    4. Product claims backed by substantial evidence
  2. Completeness
    1. Everything material is mentioned
    2. Nothing is undisclosed or hidden
    3. Restrictions or limitations are explained
  3. Whether impacts are noted
    1. Anticipated outcomes (future obligations and benefits, timing of future events)
    2. Potential risks (for example, potential financial or health harms)
    3. Known side effects or collateral consequences
  4. Whether the rights and obligations of parties are explained
    1. Contractual terms of parties
    2. Supplier’s responsibilities
    3. Legal liabilities 
    4. Voiding of terms
    5. Opting out
Example of a proposed rule from the Federal Trade Commission. Source: Federal Register

Content compliance affects more than legal boilerplate. Many kinds of content can require compliance review, from promotional messages to labels on UI checkboxes. Compliance can be a concern for any content type that expresses promises, guarantees, disclaimers, or terms and conditions.  It can also affect content that influences the safe use of a product or service, such as instructions or decision guidance. 

Compliance requirements will depend on the topic and intent of the content, as well as the jurisdiction of the publisher and audience.  Some content may be subject to rules from multiple bodies, both governmental regulatory agencies and “voluntary” industry standards or codes of conduct.

“Create once, reuse everywhere” is not always feasible. Historically, compliance teams have relied on prevetted legal statements that appear in the footer of web pages or in terms and conditions linked from a web page. Such content is comparatively easy to lock down and reuse where needed.

Governance, risk, and compliance (GRC) teams want consistent language, which helps them keep tabs on what’s been said and where it’s been presented. Reusing the same exact language everywhere provides control.

But as the scope of content subject to compliance concerns has widened and touches more types of content, the ability to quarantine compliance-related statements in separate content items is reduced. Compliance-touching content must match the context in which it appears and be integrated into the content experience. Not all such content fits a standardized template, even though the issues discussed are repeated. 

Compliance decisions rely on nuanced judgment. Authors may not think a statement appears deceptive, but regulators might have other views about what constitutes “false claims.” Compliance teams have expertise in how regulators might interpret statements.  They draw on guidance in statutes, regulations, policies, and elaborations given in supplementary comments that clarify what is compliant or not. This is too much information for authors to know.

Content and compliance teams need ways to handle recurring issues in contextually relevant ways.

Generative AI points to possibilities to automate some tasks to accelerate the review process. 

Strengths of Generative AI for compliance

Generative AI may seem like an unlikely technology to support compliance. It’s best known for its stochastic behavior, which can produce hallucinations – the stuff of compliance nightmares.  

Compliance tasks reframe how GenAI is used.  GenAI’s potential role in compliance is not to generate content but to review human-developed content. 

Because content generation produces so many hallucinations, researchers have been exploring ways to use LLMs to check GenAI outputs to reduce errors. These same techniques can be applied to the checking of human-developed content to empower writers and reduce workloads on compliance teams.

Generative AI can find discrepancies and deviations from expected practices. It trains its attention on patterns in text and other forms of content. 

While GenAI doesn’t understand the meaning of the text, it can locate places in the text that match other examples–a useful capability for authors and compliance teams needing to make sure noncompliant language doesn’t slip through.  Moreover, LLMs can process large volumes of text. 

GenAI focuses on wording and phrasing.  Generative AI processes sequences of text strings called tokens. Tokens aren’t necessarily full words or phrases but subparts of words or phrases, making them more granular than larger content units such as sentences or paragraphs. That granularity allows LLMs to compare text at a fine-grained level.

LLMs can compare sequences of strings and determine whether two pairs are similar or not. Tokenization allows GenAI to identify patterns in wording. It can spot similar phrasing even when different verb tenses or pronouns are used. 

LLMs can support compliance by comparing a string of text against other texts and judging how similar they are. They can compare drafted text to either a good example to follow or a bad example to avoid. Because wording is highly contextual, matches may not be exact, but they will consist of highly similar text patterns.
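
Here is a minimal sketch of that comparison step. The embed() function is a crude hashed bag-of-words stand-in for a real sentence-embedding model, and the example clauses are invented, but the workflow of scoring a draft clause against approved and prohibited wording would be the same.

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Stand-in embedding: a hashed bag of lowercase word tokens.
    A production system would use a sentence-embedding model instead."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    return vec

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

draft = "Your deposit is fully refundable at any time."
examples = {
    "approved": "Deposits are refundable within 60 days of purchase.",
    "prohibited": "Get your money back whenever you want, guaranteed.",
}

draft_vec = embed(draft)
for label, example in examples.items():
    score = cosine(draft_vec, embed(example))
    print(f'{label}: similarity {score:.2f} to "{example}"')

# High similarity to prohibited wording flags the draft for review;
# high similarity to approved wording suggests the clause conforms.
```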

GenAI can provide an X-ray view of content. Not all words are equally important. Some words carry more significance due to their implied meaning. But it can be easy to overlook special words embedded in the larger text or not realize their significance.

Generative AI can identify words or phrases within the text that carry very specific meanings from a compliance perspective. These terms can then be flagged and linked to canonical authoritative definitions so that writers understand how these words are understood from a compliance perspective. 

Generative AI can also flag vague or ambiguous words that have no reference defining what they mean in context. For example, if the text mentions the word “party,” a definition of that term should be available in the immediate context where it is used.
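
The sketch below is a deliberately simplified, deterministic stand-in for that term-flagging step: it scans a draft against a small hypothetical glossary and attaches canonical definitions. A GenAI version could additionally catch inflections, synonyms, and vague terms that lack definitions in the surrounding context.

```python
import re

# Hypothetical glossary of compliance-sensitive terms and canonical definitions.
GLOSSARY = {
    "free": "No payment, now or later, may be required to obtain the offer.",
    "guarantee": "A binding promise; its conditions and remedies must be disclosed.",
    "party": "A person or entity bound by the agreement; define it in context.",
}

def flag_terms(draft: str) -> list[dict]:
    """Return each glossary term found in the draft, with its definition."""
    found = []
    for term, definition in GLOSSARY.items():
        if re.search(rf"\b{re.escape(term)}\b", draft, flags=re.IGNORECASE):
            found.append({"term": term, "definition": definition})
    return found

draft = "Sign up for a free trial. Either party may cancel at any time."
for flag in flag_terms(draft):
    print(f"Flagged '{flag['term']}': {flag['definition']}")
```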

GenAI’s “multimodal” capabilities help evaluate the context in which the content appears. Generative AI is not limited to processing text strings. It is becoming more multimodal, allowing it to “read” images. This is helpful when reviewing visual content for compliance, given that regulators insist that disclosures must be “conspicuous” and located near the claim to which they relate.

GenAI is incorporating large vision models (LVMs) that can process images containing text and layout. LVMs accept images as input prompts and identify their elements. Multimodal evaluation can assess three critical compliance factors relating to how content is displayed:

  1. Placement
  2. Proximity
  3. Prominence

Two writing tools suggest how GenAI can improve compliance.  The first, the Draft Analyzer from Bloomberg Law, can compare clauses in text. The second, from Writer, shows how GenAI might help teams assess compliance with regulatory standards.

Use Case: Clause comparison

Clauses are the atomic units of content compliance–the most basic units that convey meaning. When read by themselves, clauses don’t always represent a complete sentence or a complete standalone idea. However, they convey a concept that makes a claim about the organization, its products, or what customers can expect. 

While structured content management tends to focus on whole chunks of content, such as sentences and paragraphs, compliance staff focus on clauses–phrases within sentences and paragraphs. To an LLM, clauses are short sequences of tokens.

Clauses carry legal implications. Compliance teams want to verify the incorporation of required clauses and to reuse approved wording.

While the use of certain words or phrases may be forbidden, in other cases, words can be used only in particular circumstances.  Rules exist around when it’s permitted to refer to something as “new” or “free,” for example.  GenAI tools can help writers compare their proposed language with examples of approved usage.

Giving writers a pre-compliance vetting of their draft. Bloomberg Law has created a generative AI plugin called Draft Analyzer that works inside Microsoft Word. While the product is geared toward lawyers drafting long-form contracts, its technology principles are relevant to anyone who drafts content that requires compliance review.

Draft Analyzer provides “semantic analysis tools” to “identify and flag potential risks and obligations.”   It looks for:

  • Obligations (what’s promised)
  • Dates (when obligations are effective)
  • Trigger language (under what circumstances the obligation is effective)

For clauses of interest, the tool compares the text to other examples, known as “precedents.”  Precedents are examples of similar language extracted from prior language used within an organization or extracted examples of “market standard” language used by other organizations.  It can even generate a composite standard example based on language your organization has used previously. Precedents serve as a “benchmark” to compare draft text with conforming examples.

Importantly, writers can compare draft clauses with multiple precedents since the words needed may not match exactly with any single example. Bloomberg Law notes: “When you run Draft Analyzer over your text, it presents the Most Common and Closest Match clusters of linguistically similar paragraphs.”  By showing examples based on both similarity and salience, writers can see if what they want to write deviates from norms or is simply less commonly written.
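
The sketch below approximates the “closest match” idea: rank a set of precedent clauses by similarity to the draft and surface the nearest ones. It is not Bloomberg Law’s method, just an illustration reusing the same crude embedding stand-in as the earlier example; a fuller tool would also cluster precedents to show the most common phrasing.

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Stand-in embedding (hashed bag of words); a real tool would use a
    trained language model to judge linguistic similarity."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    return vec

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

draft = "Either party may terminate this agreement with 30 days written notice."
precedents = [
    "Either party may terminate the agreement upon thirty days written notice.",
    "This agreement may be terminated by either party with 30 days notice.",
    "The supplier may suspend service for non-payment after 10 days notice.",
]

draft_vec = embed(draft)
ranked = sorted(((cosine(draft_vec, embed(p)), p) for p in precedents), reverse=True)
for score, precedent in ranked:
    print(f"{score:.2f}  {precedent}")
```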

Bloomberg Law cites four benefits of their tool.  It can:

  • Reveal how “standard” some language is.
  • Reveal if language is uncommon with few or no source documents and thus a unique expression of a message.
  • Promote learning by allowing writers to review similar wording used in precedents, enabling them to draft new text that avoids weaknesses and includes strengths.
  • Spot “missing” language, especially when precedents include language not included in the draft. 

While clauses often deal with future promises, other statements that must be reviewed by compliance teams relate to factual claims. Teams need to check whether the statements made are true. 

Use Case: Claims checking

Organizations want to put a positive spin on what they’ve done and what they offer. But sometimes they make claims that are debatable or even false. 

Writers need to be aware of when they make a contestable claim and whether they offer proof to support such claims.

For example, how can a drug maker use the phrase “drug of choice”? The FDA notes: “The phrase ‘drug of choice,’ or any similar phrase or presentation, used in an advertisement or promotional labeling would make a superiority claim and, therefore, the advertisement or promotional labeling would require evidence to support that claim.” 

The phrase “drug of choice” may seem like a rhetorical device to a writer, but to a compliance officer, it represents a factual claim. Rhetorical phrases often don’t stand out as factual claims because they are used widely and casually. Fortunately, GenAI can help check for the presence of claims in text.

Using GenAI to spot factual claims. The development of AI fact-checking techniques has been motivated by the need to see where generative AI may have introduced misinformation or hallucinations. These same techniques can also be applied to human-written content.

The discipline of prompt engineering has developed a prompt that can check if statements make claims that should be factually verified.  The prompt is known as the “Fact Check List Pattern.”  A team at Vanderbilt University describes the pattern as a way to “generate a set of facts that are contained in the output.” They note: “The user may have expertise in some topics related to the question but not others. The fact check list can be tailored to topics that the user is not as experienced in or where there is the most risk.” They add: “The Fact Check List pattern should be employed whenever users are not experts in the domain for which they are generating output.”  

The fact check list pattern helps writers identify risky claims, especially ones about issues for which they aren’t experts.
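
Here is a hedged sketch of how a Fact Check List-style prompt might be assembled. The wording is modeled on the pattern’s published description, and the draft text is invented; the prompt would be sent to whatever LLM client an organization uses.

```python
# Sketch of a Fact Check List-style prompt. The exact wording is an
# assumption modeled on the pattern's published description.
draft = (
    "Our new sensor is the most accurate on the market and is trusted "
    "by 9 out of 10 clinicians."
)

fact_check_prompt = f"""
Review the following draft. After reviewing it, list every factual claim
the draft makes that should be verified before publication, focusing on
claims about superiority, statistics, endorsements, and guarantees.

Draft:
{draft}

Fact check list:
"""

print(fact_check_prompt)  # send this prompt to an LLM client of your choice
```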

The fact check list pattern is implemented in a commercial tool from the firm Writer. The firm states that its product “eliminates [the] risk of ‘plausible BS’ in highly regulated industries” and “ensures accuracy with fact checks on every claim.”

Screenshot of Writer screen
Writer functionality evaluating claims in an ad image. Source: VentureBeat

Writer illustrates claim checking with a multimodal example, where a “vision LLM” assesses visual images such as pharmaceutical ads. The LLM can assess the text in the ad and determine if it is making a claim. 

GenAI’s role as a support tool

Generative AI doesn’t replace writers or compliance reviewers.  But it can help make the process smoother and faster for all by spotting issues early in the process and accelerating the development of compliant copy.

While GenAI won’t write compliant copy, it can be used to rewrite copy to make it more compliant. Writer advertises that its tool can let users transform copy and “rewrite in a way that’s consistent with an act” such as the Military Lending Act.

While Regulatory Technology tools (RegTech) have been around for a few years now, we are in the early days of using GenAI to support compliance. Because of compliance’s importance, we may see options emerge targeting specific industries. 

Screenshot Federal Register formats menu
Formats for Federal Register notices

It’s encouraging that regulators and their publishers, such as the Federal Register in the US, provide regulations in developer-friendly formats such as JSON or XML. The same is happening in the EU. This open access will encourage the development of more applications.
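
As a small illustration, recent rulemaking documents can be pulled programmatically. The Python sketch below queries the Federal Register’s public v1 API; the endpoint and parameter names follow its published documentation, but verify them against the current docs before relying on them.

```python
import requests

# Query the Federal Register's public API for recent documents on a topic.
# Endpoint and parameter names follow the published v1 documentation;
# confirm against current docs before depending on them.
response = requests.get(
    "https://www.federalregister.gov/api/v1/documents.json",
    params={
        "conditions[term]": "negative option marketing",
        "per_page": 5,
        "order": "newest",
    },
    timeout=30,
)
response.raise_for_status()

for doc in response.json().get("results", []):
    print(doc.get("publication_date"), doc.get("title"))
```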

– Michael Andrews