Categories
Content Engineering

Coping with copies: Strategic dimensions of reuse and duplication

When are copies of content appropriate, and how should you manage copies? Should content ever be repetitive?  Is duplicative content always bad?

Answers to these questions are typically provided by specialists: CMS implementers (developers skilled in PHP or another CMS programming language), SEO experts, or webmasters. Specialists tend to focus on technical effort or performance—the technical penalties—rather than strategic issues of how people interact with messages and information—the users’ goals. Discussions become overly narrow, with important issues taken off the table. 

But if we only consider the technical dimensions, we can lose sight of the human factors at play. Content exists to be read. Authors and readers continually judge content according to whether it seems familiar or different. People often need to see things more than once. They even choose to re-read some content. 

Though technology is important, it’s always in flux. Technology doesn’t impose fixed rules and shouldn’t dictate strategy. 

Acknowledging the repetitiveness of content

A good amount of content repeats itself—and always has. Repetition allows content to be disseminated more widely.  Humans have copied text as long as they’ve been writing. Text reuse is part of the human condition.

Scholars analyze “different types of text reuse, such as jokes, adverts, boilerplates, speeches, or religious texts, but also short stories and reprints of book segments. Each of them is tied to a different logic and motivation.”

As one researcher studying the historical development of news stories notes, “Articles emerge through a process of creative re-use and re-appropriation. Whole fragments, sentences and quotations are often transferred to novel contexts. In this sense, newspaper content emerges through a process of what could be called bricolage, in which content is soldered together from existing fragments and textual patterns. In other words, newspaper content is often harvested from a wide range of available textual material.”

Source: Romanello and Hengchen

Such research can help us to understand consequential issues such as:

  • The virality and spread of narratives 
  • The prevalence of quotations from a particular source
  • The reliance of a publication on external sources

Content propagation in the real world is messy. It happens organically through numerous small decisions made on a decentralized basis.  Some decisions are opportunistic (such as plagiarism or repeating rumors), while others are motivated by a desire to spread credible information.  No solution can be viable if it ignores the complex motivations of people conveying information.

Content professionals are generally wary of repeated content. They caution organizations to “avoid duplication” because “it’s bad.” Their goal is to prevent duplication and remediate it when it occurs.

The content professional’s alternative to duplication is content reuse. Unlike duplication, content reuse is considered virtuous. Duplication and reuse are distinct approaches to repeating text, but they share similarities. They are not exact opposites. It doesn’t follow that one is absolutely bad while the other is always good. 

Before we can consider the merits and behaviors of reuse, it’s important to first understand the various manifestations of duplication, some of which overlap with content reuse.  

Good and bad reasons for duplicate content

Duplicate web pages on a website are almost always bad. A web page should live in only one place on a website. When the same page exists in several places on a website, it’s fairly easy for software to locate such pages. Numerous tools can scan your website for duplicate pages by computing a checksum, a compact fingerprint of each page’s contents.
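As a rough sketch of how such tools work, the following Python snippet fingerprints pages with a hash (SHA-256 here, standing in for whatever checksum a given tool actually uses) so that exact copies surface immediately; the URLs and page contents are invented for illustration:

```python
import hashlib

def page_fingerprint(html: str) -> str:
    # Normalize whitespace so trivial formatting differences
    # don't hide an otherwise identical page
    normalized = " ".join(html.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Hypothetical crawl results: two URLs serve the same page
pages = {
    "/products/widget": "<h1>Widget</h1> <p>Our best widget.</p>",
    "/catalog/widget":  "<h1>Widget</h1>\n<p>Our best widget.</p>",
    "/products/gadget": "<h1>Gadget</h1> <p>A different item.</p>",
}

seen = {}
for url, html in pages.items():
    fp = page_fingerprint(html)
    if fp in seen:
        print(f"{url} duplicates {seen[fp]}")
    else:
        seen[fp] = url
```

Because a checksum only matches exact copies (after normalization), this approach finds identical pages cheaply but says nothing about near-duplicates, which need fuzzier techniques.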

When the same page exists across distinct web domains, the advisability of having the same content appear in multiple places gets more complicated. Sometimes, such behavior indicates a poorly governed publishing process, where a page is copied to various domains without either tracking this copying or asking if it is necessary.  But not all situations are problems. There are legitimate use cases for publishing the same content on distinct pages on different websites.  Content may be repeated across localized web domains or domains for subbrands of an organization.  

Content syndication allows the same page to be republished on multiple domains so audiences can find it where they are already looking, rather than having to hunt for it on an unfamiliar website. Organizations syndicate content throughout their own web properties or make it available to third parties.

The audience’s needs should determine whether the content should be placed on multiple websites. 

When identical web pages appear on multiple websites, this can be implemented in several ways.  The pages can be shared either through RSS or an API that other websites can access. But often the original page is copied to a new website. The existence of multiple copies that are independent of one another introduces many content management inefficiencies and risks. 

The copying of webpages is often a consequence of the way CMSs are designed. Traditional CMSs support a single website, relying on folders and sitemaps to organize pages. Each additional website that needs the page must have the page copied into that site’s page organization. While CMSs that support multiple websites have emerged recently, some still don’t allow the original content to be organized independently of where on a website it will appear.  

Duplicated content results from both human decisions and automated ones.  

  • Collateral duplication on a website can happen when pages are autogenerated and are expected to “belong” in multiple places as part of different collections.  
  • Web aggregators duplicate content by republishing some or all of content items from multiple sources. Aggregators are common for news, customer reviews, hotels, food delivery, and other topics.
  • Website mirroring, copying an entire website to another URL, may be set up to ensure the availability of content. Mirrors can enable faster access for users or preserve content that might otherwise be blocked or taken down.

When organizations intend to duplicate content, they can do so for either good or bad faith motives. 

Good faith motivations reflect users’ interests by making content available where they are looking for that content. Republishing of content is allowed and encouraged. The US Department of Health and Human Services encourages the syndication of its content: “Content syndication allows you to place content from HHS websites onto your own site. It allows you to offer high-quality HHS content in the look and feel of your site. The syndicated content is automatically updated in real-time, requiring no effort from your staff to keep the pages up to date.”

Bad faith motivations include the intention to spam the user by blanketing them everywhere they might be. “‘Copypasta’ (a reference to copy-and-paste functionality to duplicate content) is an Internet slang term that refers to an attempt by multiple individuals to duplicate content from an original source and share it widely across social platforms or forums,” noted a well-known social media platform that subsequently changed its ownership and name. Of course, people alone aren’t responsible for copypasta; nowadays, bots do most of the work.

In other cases, duplication involves efforts to disguise the author or the organization publishing the content. Bad actors can steal content and republish it through adversarial proxy mirroring (the wholesale copying of a website that is rebranded) and web scraping (lifting published content and republishing it elsewhere without permission). Such copy-theft is illegal but technically easy to perform.

Near-duplicates: a pervasive phenomenon

While identical duplicate web pages are not uncommon, an even more pervasive situation is “near dupes” or items that duplicate some content but also contain unique content.

Near-duplicate content can be planned or incidental. Similarity in content items signals thematic repetition across multiple items. Near-duplicate content often represents variations on a core set of messages or information.

Templates in e-commerce sites generate many pages of near duplicate content. They combine data feeds of product descriptions with boilerplate copy. Each product page has some identical wording it shares with other pages. 

Unlike checks for exact duplicates, auditing for near-duplicates involves noting both what’s the same and what’s unique. The audit needs to determine where items are dissimilar and whether that is intentional. Sometimes, copies of items are updated unevenly so that there are different versions of what should be identical text. Any variation among a set of near-duplicates should convey distinct information or messages.

Also, note that near-duplicates aren’t necessarily the repetition of exact prose. They may be summarizations or extensions. “A near-duplicate is, in some cases, a mere paraphrasing of a previous article; in other cases, it contains corrections or added content as a follow-up.” Both publishers and readers can find value in extending what’s been previously said.

Related content: the repetition of fragments

Related content may duplicate strings or passages of text but doesn’t replicate enough of the body of the content to appear as a near-duplicate. It emerges in various situations.

Recurring phrases can signal that content items belong to a common content type.  Content style guides may specify patterns for writing headlines, calls-to-action, and other strings.  A recurring pattern might signify that the content item is a help topic or a hero.

Related content is also the product of repeating segments of content across items to support continuity in the user’s content experience. Content chunks might be repeated to provide “signposts,” such as a preview or a takeaway. 

Repeating fragments of content support continuity across content items over time and through a customer journey.

More content management tools are focusing on repeatable content components. An example of this trend is the ubiquitous WordPress platform. WordPress’ updated authoring interface, Gutenberg, manages content chunks it calls “blocks.”  The interface allows authors to “duplicate” or “share” blocks in one item for use in another item.  Shared blocks can be edited in any item where they are used, which will change them everywhere, though users report this behavior can be confusing and result in unanticipated changes. Because the blocks have no independent identity, their messages can be strongly influenced by the context in which they are edited.  

Looking at duplication from internal and external perspectives

Duplicated content can trigger a range of problems and consequences. Duplicated published content may be bad or not. Duplicated unpublished content is almost always problematic.

Let’s start by looking at the internal consequences of duplicative content. Multiple versions of the same item are confusing to authors, editors, and content managers. No one can be sure which is the “right” version. Ironically, the latest version may not be the right one if someone creates a new copy and starts editing it without completing a full review. Abandoned drafts can also obscure which version is active. An unapproved version could be delivered to customers.

The simple guideline to follow is that you shouldn’t have exact copies of items in your content repository.  Any near duplicates in your content inventory should be managed as content variants.  (For a discussion of the distinction between versions and variants, see my post on content history.)

Now, let’s consider the situation of published content that’s been duplicated. Is it bad for audiences?  It can be, but won’t necessarily be.  

A wrong assumption often made about duplicated published content is that audiences will encounter it all at once. Many organizations rely on web crawls to simulate how audiences encounter their content.  Web crawls often turn up duplicate pages.  It doesn’t follow that an individual will necessarily encounter these duplicates. Ironically, “duplicated pages can even be introduced by the crawler itself, when different links point to the same page.”

An old myth in the SEO industry proclaimed that Google penalized duplicate content. But Google acknowledges that duplicate content, while potentially confusing to users, does not present a problem for Google’s search indexing: “Some duplicate content on a site is normal and it’s not a violation of Google’s spam policies. However, having the same content accessible through many different URLs can be a bad user experience (for example, people might wonder which is the right page and whether there’s a difference between the two), and it may make it harder for you to track how your content performs in search results.”

Duplicate content is often a symptom of other user experience issues, such as poor journey mapping or content labeling. No reader wants multiple links that all lead to the same item. When titles or links look similar, readers can’t be sure whether equivalent options are identical and equally useful or are really different content items. For example, users frequently choose the wrong product support link because they are unable to understand and define distinctions between product variants. 

Reuse: How different is it from duplication?

Content reuse is widely advocated but sometimes loosely defined. It’s often not clear whether it refers to the internal reuse of content prior to publication or the external republication of content. Without making that distinction, it isn’t clear when or whether duplication of content occurs. How does one apply the famous programming adage “DRY” (Don’t Repeat Yourself) to content practice? Should content not be repeated externally, or only internally?

People may advocate reuse for a range of reasons: 

  1. Reuse for message and information consistency
  2. Reuse for internal sharing and joint collaboration
  3. Reuse to save content development effort
  4. Reuse to promote messages and information more widely externally

Content reuse implies that one copy of a content item can appear many times in various guises. The reality behind the scenes is more complicated, and it is perhaps more accurate to think about content reuse as managed duplication.

Reuse implies one original content item will serve as the basis for published content that’s delivered in various contexts. When implemented in publishing toolchains, there will likely be more than one copy. If you care about business continuity, your repository will likely have a mirror and backup, and it’s possible an item will be cached in other systems involved in the publishing and delivery process. But while copies may exist, there’ll only be one original. 

The original copy is sometimes referred to as the canonical one. Any changes are made only to the original; the other copies are read-only. Importantly, all changes are reversible, since the copies are dependent on the original or are stored only temporarily. When duplicated copies are unmanaged, by contrast, separate instances each require updating, which often doesn’t happen.
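The distinction can be sketched in a few lines of illustrative Python (the class names and content are invented): deliveries hold a reference to the canonical item rather than a copy, so an edit to the original propagates everywhere, and nothing but the original is ever edited:

```python
class CanonicalItem:
    """The single editable original."""
    def __init__(self, body: str):
        self.body = body

class Delivery:
    """A read-only view of the canonical item for one channel.

    It stores a reference to the original, not a copy of its text.
    """
    def __init__(self, item: CanonicalItem, channel: str):
        self._item = item
        self.channel = channel

    @property
    def body(self) -> str:
        return self._item.body  # always reflects the current original

original = CanonicalItem("Contact support at any time.")
web = Delivery(original, "web")
app = Delivery(original, "app")

original.body = "Contact support 24/7."  # one edit to the canonical item...
print(web.body, "|", app.body)           # ...is reflected in every delivery
```

Unmanaged duplication would be the opposite: each delivery holding its own copied string, each needing a separate (and easily forgotten) update.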

It’s useful to distinguish delivery reuse (one item delivered to many places) from assembly reuse (one item incorporated into many other items). Most rationales for content reuse focus on internal content management requirements rather than external customer access benefits, but both are valid goals.

A wider perspective on reuse considers its role in contextualizing information and messages. Reused content can change the temporal and topical context.

Sometimes, reused content is standalone items: information or messages that need to be repeated in diverse scenarios. Such reuse allows target messages to be delivered at the right moment.

Other times, reused content is inserted into a larger item. But when reused content is incorporated into larger content items, content reuse can generate near-duplicates. Templated content, for example, repeats wording on multiple pages, making it hard for users to distinguish various items.  From an external user’s perspective, reused content can be indistinguishable from duplicated content. 

Reuse can support content customization. Organizations are expected to generate many variations of core content.  Reuse has its roots in document management, the assembling of long-form documents that are built from both repeated text and customized text.  But as online content moves away from long-form documents like product manuals and becomes more granular and on-demand, content customization is changing. Reuse in content assembly is still important, but more content is now reused directly by delivering standalone snippets or chunks.

The value of de-duplicating content

Detecting duplicate content has become a mini-industry.  Numerous technical approaches can identify duplicated content, and a range of vendors offer de-duplication solutions.  

One vendor focuses on monitoring repetition in what’s published online, asserting, “There’s a wide variety of use cases for duplicate detection in the field of media monitoring, ranging from virality analyses and content distribution tracking to plagiarism detection and web crawling.”

Content aggregators need to filter duplicates. Another vendor sells a “content deduplication/travel content mapping solution” that gives customers “the opportunity to create your own hotel database and write original material.” 

When organizations create content, they need to avoid creating redundant content. One firm offers a tool to prevent writers from creating duplicate content on intranets. The problem is not trivial: how do writers know what’s already been created? They may create a new item that doesn’t repeat the exact wording of an existing one but has a nearly identical focus.

Governance based on well-defined content types (indicating a clear purpose for the content) and accurate, descriptive metadata (indicating the content’s scope) is essential to preventing redundant content.  Authors should be prompted to answer what the content is about before starting to create it.  The inventory can check to see what existing content might be similar.

Since near-duplicates are more difficult to identify than exact ones, tools need to do “fuzzy” searches to find overlapping items. Techniques include “shingling,” which chops text into overlapping word sequences, and “MinHash,” which efficiently estimates how much those sets of shingles overlap against a similarity threshold.
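A minimal sketch of the idea in Python, assuming word-level shingles: here the Jaccard similarity of the shingle sets is computed exactly, whereas MinHash would only estimate it (far more efficiently at scale); the sample sentences and the 0.3 threshold are invented for illustration:

```python
def shingles(text: str, k: int = 3) -> set:
    # k-word "shingles": overlapping word sequences used to compare items
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    # Proportion of shingles the two items share;
    # MinHash approximates this value without comparing full sets
    return len(a & b) / len(a | b) if a | b else 0.0

original = "The hotel offers free parking and a rooftop pool for all guests"
near_dupe = "The hotel offers free parking and a heated rooftop pool for guests"
unrelated = "Our return policy covers items purchased within thirty days"

# A near-duplicate scores well above an unrelated item;
# anything over a chosen threshold (say 0.3) gets flagged for review
print(f"near-duplicate: {jaccard(shingles(original), shingles(near_dupe)):.2f}")
print(f"unrelated:      {jaccard(shingles(original), shingles(unrelated)):.2f}")
```

The threshold is a governance decision: set it too high and paraphrased redundancy slips through; too low and legitimate variants get flagged.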

While readers don’t want to wade through duplicate items or have to disambiguate them, the same is true for machines – only at a larger scale. Software programs can behave oddly if the inventory of content emphasizes certain items too much.  Duplication can introduce bias in software algorithms because programs are more inclined to select from duplicated information when performing searches or generating answers. Duplication of content has emerged as a concern in large language models.  

Recent research by Amazon suggests that duplication can interfere with the relevance of answers provided by LLMs.

If many similar items exist, which one should be canonical? In some cases, no one item will be a “best” representative. LLMs can generate a cross-item summarization of the near-duplicates, providing a composite of multiple items that are similar but not identical.

Deduplication is emerging as an important requirement for the internal governance of content.

– Michael Andrews

Categories
Content Integration

The Benefits of Hacking Your Own Content

How can content strategy help organizations break down the silos that bottle up their content?  The first move may be to encourage organizations to hack their own content.

Silos are the villains of content strategists. To slay the villain, the hero or heroine must follow three steps to enlightenment:

  1. Transcend organizational silos that hinder the coordination and execution of content
  2. Adopt an omnichannel approach that provides customers with content wherever and however they need it, so that they aren’t hostage to incoherent internal organizational processes and separately managed channels that fragment their journey and experience
  3. Reuse content across the organization to achieve a more cost-effective and revenue-enhancing utilization of content

The path that connects these steps is structured content. Each of these rationales is a powerful argument to change fractured activities.  Taken together, they form a compelling motivation to de-silo content.

“Content silo trap: Situation created by authors working in isolation from other authors within the organization. Walls are erected among content areas and even within content areas, which leads to content being created and recreated and recreated, often with changes or differences in each iteration.”  Ann Rockley and Charles Cooper in Managing Enterprise Content: Unified Content Strategy.

The definition of a content silo trap emphasizes the duplication of effort.  But the problems can manifest in other ways.  When groups don’t share content with each other, the result divides the haves from the have-nots.  Those who must create content with finite resources need to prioritize what content to create.  They may forego providing their target audiences with content relating to a facet of a topic if it involves more work than the available staff can handle.  Often organizational units devote most of their time to revising existing content rather than creating new content, so what they offer to audiences is highly dependent on what they already have.  Even when it seems like a good idea to incorporate content related to one’s own area of responsibility that’s being used elsewhere, it can be difficult to get it in a timely manner.  It may not be clear whether it is worth the effort to reproduce this content oneself.

What Silos Look Like from the Inside

Let’s imagine a fictional company that serves two kinds of customers: consumers and businesses.  The products that the firm offers to consumers and businesses are nearly identical, but are packaged differently, with slightly different prices, sales channels, warranties, etc.  Importantly, the consumer and B2B businesses are run as separate operating units, each responsible for its own expenses and revenues.  The consumer unit has a higher profit margin and is growing faster, and decided a couple of years ago to upgrade its CMS to a new system that’s not compatible with the legacy system the entire company had used.  The B2B division is still on the old CMS, hoping to upgrade in the near future.

A while ago, a product manager in the B2B division asked her counterpart in the consumer division if she’d be able to get some of the punchy creative copy that the consumer division’s digital agency was producing.  It seemed like it could enhance the attractiveness of the B2B offering as well.   Obviously only parts were relevant, but the product manager asked to receive the consumer product copy as it was being produced, so it could be incorporated into the B2B product pages.  After some discussion, the consumer division product manager realized that sharing the content involved too much work for his team.  It would suck up valuable time from his staff, and hinder his team’s ability to meet its objectives.  In fact, making the effort to do the laborious work of sending each item of content on a regular basis wouldn’t bring any tangible benefit to his team’s performance metrics.

This scenario may seem like a caricature of a dysfunctional company.  But many firms face these kinds of internal frictions, even if the most prevalent cases happen more subtly.

Many organizations know on a visceral level that silos are a burden and hinder their capability to serve customers and grow revenues. But they may not have a vivid understanding of what specific frictions exist, and the costs associated with these frictions. Sometimes they’ve outlined a generic high-level business case for adopting structured content across their organization that talks in terms of big themes such as delivery to mobile devices and personalization.  But they often don’t have a granular understanding of what exact content to prioritize for structuring.

The Dilemma of Moving to Structured Content

Many organizations that try to adopt structured content in a wholesale manner find the process more involved than they anticipated.  It can be complex and time-consuming, involving much organizational process change, and can seem to jeopardize their ability to meet other, more immediate goals.  Some early, earnest attempts at structured content failed when the enthusiasm for a game-changing future collided with the enormity of the task.  De-siloing projects also run the risk of being ruthlessly de-scoped and scaled back, to the point where the original goal loses its potency.  When the effort involved comes to the foreground, the benefits may seem abstract and distant, receding to the background. Consultant Joe Pairman speaks about “structured content management project failure” as a problem that arises when the expectations driving the effort are fuzzy.

Achieving a unified content strategy based on coordinated, structured content involves a fundamental dilemma.  Firms  with the most organizational complexity and that stand to benefit most are the ones that have the most silos to overcome.  They frequently have the most difficulty transitioning to a unified structured content approach.  The more diverse your content, the more challenging it is to do a total redesign of it based on modular components.

“The big bang approach can be difficult,” Rebecca Schneider, President of Azzard Consulting, noted during the panel discussion [at the Content Strategy Applied conference]. “But small successes can yield broad results,” according to a Content Science blog post.

Content Hacking as an Alternative to Wholesale Restructuring

If wholesale content restructuring is difficult to do quickly in a complex organization, what is the alternative?  One approach is to borrow ideas from the Create Once, Publish Everywhere (COPE) paradigm by using APIs to get content to more places.

Over the past two years, a number of new tools have emerged that make shifting content easier.  First, there are simple web scraping tools, some browser-based, that can lift content from sections of a page.  Second, there are build-your-own API services such as IFTTT and Zapier that require little or no programming knowledge.

Particularly interesting are newer services such as Import.IO and Kimono that combine web scraping with API creation.  Both these services suggest that programming is not required, though the services of a competent developer are useful to get their full benefits.  Whereas previously developers needed to hand-code using, say, PHP, to scrape a web page and then translate the results into an API, now much of this background work can be done by third-party services.  That means that scraping and republishing content is now easier, faster, and cheaper.  This opens new applications.

Screenshots of Kimono (via Kimono Labs)

Lowering the Barriers to Sharing Content

The goal for the B2B division product manager is to be able to reuse content from the consumer division without having to rely on that division’s staff, or on access to their systems.  Ideally, she wants to be able to scrape the parts she needs, and insert them in her content.  Tools that combine web scraping and API creation can help.

Generic process of web scraping/content extraction and API tools

The process for scraping content involves highlighting sections of pages you want to scrape, labeling these sections, then training the scraper to identify the same sorts of items on related pages you want to scrape.  The results are stored in a simple database table.  These results are then available to an API that can be created to pull elements and insert them onto other pages.  The training can sometimes be fiddly, depending on the original content characteristics.  But once the content is scraped, it can be filtered and otherwise refined (such as given a defined data type) before republishing.  The API can specify what content to use and its source in a range of coding languages compatible with different content delivery set-ups.
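A toy Python sketch of the scraping half of this process, using only the standard library: the markup, the labeled field names, and the API endpoint mentioned in the comment are all hypothetical, and real services wrap this training and extraction in a point-and-click interface:

```python
from html.parser import HTMLParser

# Hypothetical source markup: product pages sharing a recognizable layout,
# which is what lets a scraper generalize from one labeled example.
PAGE = """
<div class="product">
  <h2 class="name">Widget Pro</h2>
  <p class="blurb">Punchy creative copy about the Widget Pro.</p>
</div>
"""

class ProductScraper(HTMLParser):
    """Collects the text of elements whose class matches a labeled field."""
    FIELDS = {"name", "blurb"}  # the sections a user highlighted and labeled

    def __init__(self):
        super().__init__()
        self.record = {}
        self._current = None

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in self.FIELDS:
            self._current = cls

    def handle_data(self, data):
        if self._current and data.strip():
            self.record[self._current] = data.strip()
            self._current = None

scraper = ProductScraper()
scraper.feed(PAGE)
# The scraped record would be stored in a simple table, then served by an
# API endpoint (e.g. GET /api/products/widget-pro) for reuse on other pages.
print(scraper.record)
```

The payoff of the pattern is the structured record at the end: once the labeled elements live in a table behind an API, the reusing party no longer depends on the source site’s layout at delivery time.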

The scrape + API approach mimics some of the behavior of structured content.  The party needing the content identifies what they need, and essentially tags it.  They define the meaning of specific elements.   (The machine learning in the background still needs the original source to have some recognizable, repeating markup or layout to learn the elements to scrape, even if it doesn’t yet know what the elements represent.)

While a common use case would be scraping content from another organizational unit, the approach can also be used to reuse content within one’s own unit.  If a unit doesn’t have well-structured content itself, it is likely to have trouble reusing its own content in different contexts.  It may want to reuse elements for content that addresses different stages of a customer journey, or different audience variations.

Benefits of Content Hacking

This approach can benefit a party that needs to use content published elsewhere in the organization.  It can help bridge organizational silos, technical silos, and channel silos that customers encounter when accessing content.  The approach can even be used to jump across the boundaries that separate different firms.  The creators of Import.IO, for example, are targeting app developers who make price comparison apps.  While scraping and republishing other firms’ content without permission may not be welcomed, there could be cases where two firms agree to share content as part of a joint business project, and a scraping + API approach could be a quick and pragmatic way to amplify a common message.

As a fast, cheap, and dirty method, the scrape + API approach excels at highlighting what content problems need to be solved in a more rigorous way, with true content structuring and a common, well-defined governance process.  One of the biggest hurdles to adopting a unified, structured approach to content is knowing where to start, and knowing what the real value of the effort will be.  By prototyping content reuse through a scrape + API approach, organizations can get tangible data on the potential scope and utilization of content elements.  APIs make it possible for content elements to be sprinkled in different contexts.  One can test if content additions enhance outcomes: for example, driving more conversions. One can A/B test content with and without different elements to learn their value to different segments in different scenarios.

Ultimately, prototyping content reuse can provide a mapping of what elements should be structured, and prioritize when to do that.  It can identify use cases where content reuse (and supporting content structure) is needed, which can be associated with specific audience segments (revenue-generating customers) and internal organizational sponsors (product owners).

Why Content Hacking is a Tactic and not a Strategy

If content hacking sounds easy, then why bother with a more methodical and time-consuming approach to formal content structuring?  The answer is that though content hacking may provide short-term benefits, it can be brittle — it’s a duct tape fix.  Relying on it too much can eventually cause issues.  It’s not a best practice: it’s a tactic, a way to use “lean” thinking to cut through the Gordian knot of siloed content.

Content hacking may not be efficient for content that needs frequent, quick revision, since the content must go through the extra steps of being scraped and stored. It also may not be efficient if multiple parties need the same content but want to do different things with it — a single API might not serve all stakeholder needs.  Unlike semantically structured content, scraped content doesn’t enable semantic manipulation, such as the advanced application of business logic against metadata, or detailed analytics tracking of semantic entities.  And importantly, even a duct tape approach requires coordination between the content producer and the person who reuses the content, so that the party reusing content doesn’t get an unwelcome surprise concerning the nature and timing of the content available.

But as a tactic, content hacking may provide the needed proof of value for content reuse to get your organization to embark on dismantling silos and embracing a unified approach.

— Michael Andrews