Content Engineering Archives - Page 4 of 18

When are copies of content appropriate, and how should you manage copies? Should content ever be repetitive? Is duplicative content always bad?

Answers to these questions are typically provided by specialists: CMS implementers (developers skilled in PHP or another CMS programming language), SEO experts, or webmasters. Specialists tend to focus on technical effort or performance—the technical penalties—rather than strategic issues of how people interact with messages and information—the users’ goals. Discussions become overly narrow, with important issues taken off the table.

But if we only consider the technical dimensions, we can lose sight of the human factors at play. Content exists to be read. Authors and readers continually judge content according to whether it seems familiar or different. People often need to see things more than once. They even choose to re-read some content.

Though technology is important, it’s always in flux. Technology doesn’t impose fixed rules and shouldn’t dictate strategy.

Acknowledging the repetitiveness of content

A good amount of content repeats itself—and always has. Repetition allows content to be disseminated more widely. Humans have copied text as long as they’ve been writing. Text reuse is part of the human condition.

Scholars analyze “different types of text reuse, such as jokes, adverts, boilerplates, speeches, or religious texts, but also short stories and reprints of book segments. Each of them is tied to a different logic and motivation.”

As one researcher studying the historical development of news stories notes, “Articles emerge through a process of creative re-use and re-appropriation. Whole fragments, sentences and quotations are often transferred to novel contexts. In this sense, newspaper content emerges through a process of what could be called bricolage, in which content is soldered together from existing fragments and textual patterns. In other words, newspaper content is often harvested from a wide range of available textual material.”

Such research can help us to understand consequential issues such as:

The virality and spread of narratives
The prevalence of quotations from a particular source
The reliance of a publication on external sources

Content propagation in the real world is messy. It happens organically through numerous small decisions made on a decentralized basis. Some decisions are opportunistic (such as plagiarism or repeating rumors), while others are motivated by a desire to spread credible information. No solution can be viable if it ignores the complex motivations of people conveying information.

Content professionals are generally wary of repeated content. They caution organizations to “avoid duplication” because “it’s bad.” Their goal is to prevent duplication and remediate it when it occurs.

The content professional’s alternative to duplication is content reuse. Unlike duplication, content reuse is considered virtuous. Duplication and reuse are distinct approaches to repeating text, but they share similarities. They are not exact opposites. It doesn’t follow that one is absolutely bad while the other is always good.

Before we can consider the merits and behaviors of reuse, it’s important to first understand the various manifestations of duplication, some of which overlap with content reuse.

Good and Bad reasons for duplicate content

Duplicate web pages on a website are almost always bad. A web page should live in only one place on a website. When the same page exists in several places on a website, it’s fairly easy for software to locate such pages. Numerous tools can scan your website for duplicate pages using a mathematical technique called checksum.

When the same page exists across distinct web domains, the advisability of having the same content appear in multiple places gets more complicated. Sometimes, such behavior indicates a poorly governed publishing process, where a page is copied to various domains without either tracking this copying or asking if it is necessary. But not all situations are problems. There are legitimate use cases for publishing the same content on distinct pages on different websites. Content may be repeated across localized web domains or domains for subbrands of an organization.

Content syndication allows the same page to be republished on multiple domains to make it available to audiences so they can find it where they are looking for it rather than expecting they’ll be hunting for it on an unfamiliar website. Organizations syndicate content throughout their own web properties or make it available to third parties.

The audience’s needs should determine whether the content should be placed on multiple websites.

When identical web pages appear on multiple websites, this can be implemented in several ways. The pages can be shared either through RSS or an API that other websites can access. But often the original page is copied to a new website. The existence of multiple copies that are independent of one another introduces many content management inefficiencies and risks.

The copying of webpages is often a consequence of the way CMSs are designed. Traditional CMSs support a single website, relying on folders and sitemaps to organize pages. Each additional website that needs the page must have the page copied into that site’s page organization. While CMSs that support multiple websites have emerged recently, some still don’t allow the original content to be organized independently of where on a website it will appear.

Duplicated content results from both human decisions and automated ones.

Collateral duplication on a website can happen when pages are autogenerated and are expected to “belong” in multiple places as part of different collections.
Web aggregators duplicate content by republishing some or all of content items from multiple sources. Aggregators are common for news, customer reviews, hotels, food delivery, and other topics.
Website mirroring, copying an entire website to another URL, may be set up to ensure the availability of content. Mirrors can enable faster access for users or preserve content that might otherwise be blocked or taken down.

When organizations intend to duplicate content, they can do so for either good or bad faith motives.

Good faith motivations reflect users’ interests by making content available where they are looking for that content. Republishing of content is allowed and encouraged. The US Department of Health and Human Services encourages the syndication of its content: “Content syndication allows you to place content from HHS websites onto your own site. It allows you to offer high-quality HHS content in the look and feel of your site. The syndicated content is automatically updated in real-time, requiring no effort from your staff to keep the pages up to date.”

Bad faith motivations include the intention to spam the user by blanketing them everywhere they might be. “‘Copypasta’ (a reference to copy-and-paste functionality to duplicate content) is an Internet slang term that refers to an attempt by multiple individuals to duplicate content from an original source and share it widely across social platforms or forums,” noted a well known social media platform that subsequently changed its ownership and name. Of course, people alone aren’t responsible for copypasta–nowadays, bots do most of the work.

In other cases, duplication involves efforts to deceive who the author is or disguise the organization that is publishing the content. Bad actors can steal content and republish it through adversarial proxy mirroring (the wholesale copying of a website that is rebranded) and web scraping (lifting published content and republishing it elsewhere without permission). Such copy-theft is illegal but technically easy to perform.

Near-duplicates: a pervasive phenomenon

While identical duplicate web pages are not uncommon, an even more pervasive situation is “near dupes” or items that duplicate some content but also contain unique content.

Near duplicate content can be planned or incidental. Similarity in content items signals thematic repetition across multiple items. Near duplication content often represents variations on a core set of messages or information.

Templates in e-commerce sites generate many pages of near duplicate content. They combine data feeds of product descriptions with boilerplate copy. Each product page has some identical wording it shares with other pages.

Unlike checks for exact duplicates, auditing for near-duplicates involves noting both what’s the same and what’s unique. The audit needs to determine where items are dissimilar and whether that is intentional. Sometimes, copies of items are updated unevenly so that there are different versions of what should be identical text. Any variations within a copy of near-duplicates should convey distinct information or messages.

Also, note that near-duplicates aren’t necessarily the repetition of exact prose. They may be summarizations or extensions. “A near-duplicate is, in some cases, a mere paraphrasing of a previous article; in other cases, it contains corrections or added content as a follow-up.” Both publishers and readers can find value in extending what’s been previously said.”

Looking at duplication from internal and external perspectives

Duplicated content can trigger a range of problems and consequences. Duplicated published content may be bad or not. Duplicated unpublished content is almost always problematic.

Let’s start by looking at the internal consequences of duplicative content. Multiple versions of the same item are confusing to authors, editors, and content managers. No one can be sure which is the “right” version. Ironically, the latest version may not be the right one if someone creates a new copy and starts editing it without completing a full review. Abandoned drafts can also cloud which one is the active one. An unapproved version could be delivered to customers.

The simple guideline to follow is that you shouldn’t have exact copies of items in your content repository. Any near duplicates in your content inventory should be managed as content variants. (For a discussion of the distinction between versions and variants, see my post on content history.)

Now, let’s consider the situation of published content that’s been duplicated. Is it bad for audiences? It can be, but won’t necessarily be.

A wrong assumption often made about duplicated published content is that audiences will encounter it all at once. Many organizations rely on web crawls to simulate how audiences encounter their content. Web crawls often turn up duplicate pages. It doesn’t follow that an individual will necessarily encounter these duplicates. Ironically, “duplicated pages can even be introduced by the crawler itself, when different links point to the same page.”

An old myth in the SEO industry proclaimed that Google penalized duplicate content. But Google acknowledges that duplicate content, while potentially confusing to users, does not present a problem for Google’s search indexing: “Some duplicate content on a site is normal and it’s not a violation of Google’s spam policies. However, having the same content accessible through many different URLs can be a bad user experience (for example, people might wonder which is the right page and whether there’s a difference between the two), and it may make it harder for you to track how your content performs in search results.”

Duplicate content is often a symptom of other user experience issues, such as poor journey mapping or content labeling. No reader wants multiple links that all lead to the same item. When titles or links look similar, readers can’t be sure whether equivalent options are identical and equally useful or are really different content items. For example, users frequently choose the wrong product support link because they are unable to understand and define distinctions between product variants.

Reuse: How different is it from duplication?

Content reuse is widely advocated but sometimes loosely defined. It’s often not clear whether it refers to the internal reuse of content prior to publication or the external republication of content. Without making that distinction, it isn’t clear when or whether duplication of content occurs. How does one apply the famous adage in content practice to be “DRY” (Don’t Repeat Yourself)? Should content not be repeated externally or only internally?

People may advocate reuse for a range of reasons:

Reuse for message and information consistency
Reuse for internal sharing and joint collaboration
Reuse to save content development effort
Reuse to promote messages and information more widely externally

Content reuse implies that one copy of a content item can appear many times in various guises. The reality behind the scenes is more complicated, and it is perhaps more accurate to think about content reuse as managed duplication.

Reuse implies one original content item will serve as the basis for published content that’s delivered in various contexts. When implemented in publishing toolchains, there will likely be more than one copy. If you care about business continuity, your repository will likely have a mirror and backup, and it’s possible an item will be cached in other systems involved in the publishing and delivery process. But while copies may exist, there’ll only be one original.

The original copy is sometimes referred to as the canonical one. Any changes are made only to the original; the other copies are read-only. Importantly, all changes are reversible since the copies are dependent on the original or are stored temporarily. With duplicated copies are unmanaged, by contrast, separate instances would each require updating, which often doesn’t happen.

It’s useful to distinguish delivery reuse (one item delivered to many places) from assembly reuse (one item incorporated into many other items). Most rationales for content reuse focus on internal content management requirements rather than external customer access benefits, but both are valid goals.

A wider perspective on reuse considers its role in contextualizing information and messages. Reused content can change the temporal and topical context.

Sometimes, reused content is standalone items: information or messages that need to be repeated in diverse scenarios. Such reuse allows target messages to be delivered at the right moment.

Other times, reused content is inserted into a larger item. But when reused content is incorporated into larger content items, content reuse can generate near-duplicates. Templated content, for example, repeats wording on multiple pages, making it hard for users to distinguish various items. From an external user’s perspective, reused content can be indistinguishable from duplicated content.

Reuse can support content customization. Organizations are expected to generate many variations of core content. Reuse has its roots in document management, the assembling of long-form documents that are built from both repeated text and customized text. But as online content moves away from long-form documents like product manuals and becomes more granular and on-demand, content customization is changing. Reuse in content assembly is still important, but more content is now reused directly by delivering standalone snippets or chunks.

The value of de-duplicating content

Detecting duplicate content has become a mini-industry. Numerous technical approaches can identify duplicated content, and a range of vendors offer de-duplication solutions.

One vendor focuses on monitoring repetition in what’s published online, asserting, “There’s a wide variety of use cases for duplicate detection in the field of media monitoring, ranging from virality analyses and content distribution tracking to plagiarism detection and web crawling.”

Content aggregators need to filter duplicates. Another vendor sells a “content deduplication/travel content mapping solution” that gives customers “the opportunity to create your own hotel database and write original material.”

When organizations create content, they need to preclude making redundant content. One firm offers a tool to prevent writers from creating duplicate content on intranets. The problem is not trivial: how do writers know what’s already been created? They may create a new item that doesn’t have the exact wording of an existing one, but with a focus that’s nearly identical.

Governance based on well-defined content types (indicating a clear purpose for the content) and accurate, descriptive metadata (indicating the content’s scope) is essential to preventing redundant content. Authors should be prompted to answer what the content is about before starting to create it. The inventory can check to see what existing content might be similar.

Since near-duplicates are more difficult to identify than exact ones, tools need to do “fuzzy” searches to find overlapping items. Techniques include “MinHash” and “shingling” that chop up strings to measure similarity thresholds.

While readers don’t want to wade through duplicate items or have to disambiguate them, the same is true for machines – only at a larger scale. Software programs can behave oddly if the inventory of content emphasizes certain items too much. Duplication can introduce bias in software algorithms because programs are more inclined to select from duplicated information when performing searches or generating answers. Duplication of content has emerged as a concern in large language models.

Recent research by Amazon suggests that duplication can interfer with the relevancy of answers provided by LLMs.

If many similar items exist, which one should be canonical? In some cases, no one item will be a “best” representative. LLMs can generative a cross-item summarization of the near duplicates, providing a composite of multiple items that are similar but not identical.

Deduplication is emerging as an important requirement for the internal governance of content.

– Michael Andrews