
Identifiers in Content

One of the central challenges of content strategy is tracking all the content being created.  So much content is available about so many different things.  If you’ve ever done a content inventory, you know that different URLs may refer to the same content. It’s even possible for the same content to exist with two different titles.  And sometimes it isn’t clear if two items of content are talking about the same thing, or simply talking about things that sound similar.

Identifiers are the solution to this chaos. Identifiers are alphanumeric strings associated with an item. They don’t seem very exciting, but they will play an increasingly important role in content moving forward. We are finding that relying on titles and URLs to identify content is not enough. We need something more robust.

It’s hard to relate to something as abstract as an alphanumeric string.  Fortunately, some real-world examples show how identifiers can support content. They show that identifiers can provide such important things as:

  • The provenance of an item
  • A persistent way to refer to something
  • An indication of whether something is unique or a copy
  • A way to listen to changes in the thing described.

Who Moved My Cheese?

One basic need is to know where content comes from.  There is much pilfering of content online these days: it’s become a big industry to rip off other people’s content and republish it as one’s own.

The problem of impostors and lookalikes is not limited to web content.  People who produce cheese worry about the confusion that can arise from similar looking and sounding products. Parmigiano Reggiano is a famous Italian cheese, colloquially known in English as parmesan.  It can be very expensive: a wheel of Parmigiano Reggiano typically weighs 38 kilos and will cost several hundred dollars. Parmigiano Reggiano is similar to another Italian cheese called Grana Padano, and is the original inspiration for various cheeses called parmesan made outside Italy.  The makers of Parmigiano Reggiano work to distinguish their cheese from the rest through identifiers.  Each cheese house (caseificio) has a unique number that it applies to the outside rind of a cheese wheel, together with the month and year of production. These identifiers let the consumer know the provenance of the cheese.

A wheel of Parmigiano-Reggiano with identifiers indicating cheese house and production date. Image via Wikipedia.

At the supermarket it can be hard to figure out where products come from.  Online it can be hard to know where content comes from. Increasingly people get content not from the producer, but indirectly through a channel like Facebook.  As content gets promoted and aggregated across a growing range of platforms and channels, the provenance of the content will be increasingly important to track. Content requires identifiers that can reveal the originator of the content. The Federal Trade Commission issued guidance recently rejecting vague statements that content is “sponsored”. Publishers need a process that can track and identify who that sponsor is.

Deposed Content

Another challenge for content arises when it is remixed.  Titles and URLs are designed to identify pages, not content components that might show up in a multitude of delivered content.

The challenge of remixed content is similar to a situation facing trial lawyers. As part of the pretrial discovery process, lawyers collect volumes of information. This information needs to be shared between opposing parties, and may not have any intrinsic order to it. Lawyers solved the problem of identifying all these random bits with something called a Bates number.  Originally a Bates number was produced by an elaborate mechanical ink stamp that would sequentially number each page of any documentation with a unique alphanumeric string.  Today, lawyers will scan documents into PDFs, which can render Bates numbers for each page automatically.

A Bates Numbering Machine. Image via US Patent Office.

The elegance of the Bates number is that it provides a persistent identifier for a piece of information that is independent of its source and its context.  No matter how different items of content are shuffled around, a specific item can be located by any party according to its unique Bates number.

Having persistent identifiers for content components is valuable when content is assembled from different components, and components are reused in many contexts.
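
The idea behind Bates numbering can be sketched in a few lines of Python. This is only an illustration, not a description of how any legal or PDF software actually works: the prefix "ABC", the six-digit width, and the sample pages are all invented for the example.

```python
from itertools import count


def bates_stamper(prefix, start=1, width=6):
    """Return a function that pairs each page with the next sequential number.

    The prefix and zero-padded width are assumptions made for this sketch.
    """
    counter = count(start)

    def stamp(page_text):
        number = f"{prefix}{next(counter):0{width}d}"
        return number, page_text

    return stamp


# Hypothetical pages, in no particular order.
pages = ["Email from March", "Invoice", "Meeting notes"]

stamp = bates_stamper("ABC")
index = dict(stamp(page) for page in pages)

# Once stamped, a page is found by its number, wherever it ends up.
print(index["ABC000002"])  # prints "Invoice"
```

The number says nothing about what the page contains or where it came from. That is the point: the identifier stays the same no matter how the pages are later sorted, bundled, or shared.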

In the Matrix

Another inevitable dimension of content is that there can be many versions of a content item.  Sometimes this is unintended: organizations have generated duplicate content. But other times organizations have purposefully made different versions of the same underlying content to meet slightly different needs. Either way, it can be hard to sort out what is master content, and what is the derivative.

Distinguishing the original content is an old problem. Enthusiasts of early jazz recordings faced this problem when they wanted to trace the recordings of a famous musician such as Louis Armstrong. Early recordings on 78 rpm records didn’t supply much information about the full orchestra.  And sometimes the masters of these recordings were rented to other record companies, who released the recording on their own label.  Licensees even sometimes put false information on record labels to disguise that they were re-releasing an existing recording (done sometimes to get around labor contracts).  To complicate matters even more, the same artist might release several versions of the same tune. Jazz is after all about improvisation, and each different version can be interesting in its own right.  So even knowing the song title and the artist wasn’t sufficient to know if the recording was unique or not.

Fans who developed discographies of early jazz found a key to solving the problem of unreliable information on record labels. They tracked recordings according to their matrix number.  Each matrix used to press records contained a hand-inscribed number indicating the master recording.  No matter who subsequently used the master to release the recording, the same number was stamped into the record.  As a result, one could see that a French record was the same recording as an American one, because they shared the same matrix number, while two records with the same title and performers were in fact different recordings.
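
The discographers' method amounts to grouping releases by matrix number rather than by title and artist. A rough Python sketch follows. The labels, titles, and matrix numbers in it are made up for illustration and do not refer to real records.

```python
from collections import defaultdict

# Hypothetical releases: (label, title, artist, matrix_number).
releases = [
    ("US label", "Tune A", "Artist X", "M-1001"),
    ("French label", "Tune A", "Artist X", "M-1001"),
    ("US label", "Tune A", "Artist X", "M-1002"),
]


def group_by_matrix(records):
    """Group release labels by the matrix number of the master recording."""
    groups = defaultdict(list)
    for label, _title, _artist, matrix in records:
        groups[matrix].append(label)
    return dict(groups)


by_matrix = group_by_matrix(releases)

# Same title and artist throughout, but two distinct recordings.
for matrix, labels in by_matrix.items():
    print(matrix, labels)
```

All three releases share a title and an artist, so neither field can separate them. The matrix number shows that the French release is a reissue of the first American one, and that the third record is a different take.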

Content variation is a phenomenon driven by the desire of audiences to have choice.  People want versions of content that match their needs: that are shorter or longer depending on their interests, or are formatted for a larger or smaller screen depending on their device. To track all these variations, organizations need identifiers that can let them know how content is being repurposed, and where.

Tuning In

Broadcast radio stations often identify themselves by number.  They broadcast at a certain frequency, and use that frequency as an identifier: “101.3 FM” or whatever.  RFID is a different kind of radio broadcast, one specifically designed to identify objects. Identifiers have morphed into stickers that we can listen to.

Last year I visited an exhibit at Expo Milan featuring an MIT prototype of the supermarket of the future. The premise of the exhibit was that RFID tags can track produce and other food items, to give consumers information about where the products are from, when they were harvested, how they were shipped, and so forth.  What’s intriguing about this vision is that products can now have biographies. No longer does one need to talk about the product generically.  One can now talk about a specific instance of the product: this orange, or this batch of pesto. Products now have real stories that can be told.

RFID allows us to listen to things: to know what’s been going on with them. We are starting to move toward creating specific content that tells stories about specific instances of items. To do this, we will need the ability to be very specific about what we refer to.

Conclusion

Identifiers give us the ability to make statements about things. They let us pin down exactly what we are saying, and exactly what we are saying it about.  That capability will be important as content and products become more varied and customized.  Identifiers support accountability in the face of growing complexity.

— Michael Andrews

4 replies on “Identifiers in Content”

Kudos for tying the awesome power of (globally) unique identification to components of content. There are so many examples where the establishment and acceptance of reference identifiers achieves a kind of ‘symbol grounding’ that enables new, higher levels of information management to take place – think URIs, Facebook’s single IDs, Social Security Numbers, zip codes, and even language itself – that it’s easy to imagine uniquely identified content components enabling truly revolutionary smart content systems and products.

To me a logical extreme of this argument is global URIs on defined/structured semantic units – a kind of URI scheme for meaningful units of semantic information. For example the fact of, say, ‘Twitter appointing Jack Dorsey as CEO’ becomes a single defined URI, with defined structure, properties and roles. Not easy, but if we got to established, accepted ‘symbol grounding’ at that level what would that enable?

Thanks very much David for your comments. I didn’t get much into the wonky details of how to place identifiers in content, and you are right that there are many approaches available, including URIs. Because there can be many layers of identification needed, I think a hybrid strategy makes sense. We need machine-intelligible identifiers (URIs), human-usable unique identifiers for authors (ID numbers), and human viewable and understandable identifiers for audiences (reasonably unambiguous names/titles). These can be tied together of course. The other need is to offer identifiers for different levels of information. We need to identify the source, the complete statement (which may be complex), and specific entities and properties within the statement (which can be numerous, and not necessarily a 1:1:1 relationship).

In terms of identifying statements, the approach endorsed by nanopublication would seem to offer what you are aiming for. It seems complementary to your work on structured stories.

Thank you – the nanopublication approach is very relevant to my work, and, I think, to structured journalism more generally. I had not been aware of this project, and will study it carefully.
