A recent post on Google’s webmaster blog illustrates how metadata needs to address both the structure of web content, and the meaning of that content.
People who work in SEO talk about structured data a lot, while those who work in content strategy talk about structured content. These topics are obviously related, but the terminology used by each party obscures how each topic relates to the other. My take: structured data and structured content are different dimensions of metadata. Structured data is generally descriptive metadata identifying entities discussed in the content. Structured content provides the foundation for structural metadata that indicates the logic and organization of the content. Both descriptive and structural metadata are important in content, and ideally they should be integrated.
The Google blog advises publishers to include structured data in their content. The screenshot below shows how this advice is presented.
The advice presented follows a pattern:
Advice to follow
Best practices to implement advice (shown in green)
Actions not to do (shown in pink)
Some other items of advice in the post include another element:
Practices to avoid when implementing advice (shown in yellow)
We can see that the post follows good structure that is easy to scan and understand, and provides a foundation to reuse the information in other contexts. Now, let’s look at the post’s source code. This is where we’d expect to see the structured data associated with the content.
Disappointingly, no structured data is associated with the specific items of advice. The details of the advice are marked up with “class” attributes intended to style the content, but not to identify the meaning of the content. The only structured data on the page relates to the blog post in general (such as its author).
Imagine how the content could be reused if structured data identified the meaning of the advice. Someone might type a search looking for tips on “mistakes when using schema.org,” “why use schema.org,” or “schema.org best practices” and get specific bullets of content relating to their query.
In this example, the post’s author has done nothing wrong, though an opportunity has been missed nonetheless. Currently, schema.org doesn’t have any entity types that address advice statements that would contain sub-elements such as Rationale, Do, Avoid, and Don’t. The closest types are related to Questions and Answers, which are slightly different in their structure.
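As a rough sketch of what such markup might carry, the Python below models one item of advice with the sub-elements named above and serializes it as JSON-LD-style data. The `AdviceStatement` type, its property names, and the `https://example.org/vocab/` namespace are all hypothetical: none of them exists in schema.org, and the advice text is placeholder wording rather than a quote from the Google post.

```python
from __future__ import annotations

import json
from dataclasses import dataclass, field


@dataclass
class AdviceStatement:
    """One item of advice with Rationale, Do, Avoid, and Don't sub-elements."""

    advice: str
    rationale: str = ""
    do: list[str] = field(default_factory=list)
    avoid: list[str] = field(default_factory=list)
    dont: list[str] = field(default_factory=list)

    def to_jsonld(self) -> dict:
        # Hypothetical vocabulary: schema.org defines no such type or properties.
        data = {
            "@context": "https://example.org/vocab/",
            "@type": "AdviceStatement",
            "advice": self.advice,
        }
        # Only emit the sub-elements that actually have content.
        for key in ("rationale", "do", "avoid", "dont"):
            value = getattr(self, key)
            if value:
                data[key] = value
        return data


# Placeholder text for illustration only.
example = AdviceStatement(
    advice="Placeholder advice statement",
    do=["Placeholder best practice"],
    dont=["Placeholder action not to do"],
)
print(json.dumps(example.to_jsonld(), indent=2))
```

With the role of each bullet stated explicitly, a consumer of the data could retrieve only the "dont" items for a query about mistakes, instead of inferring meaning from styling classes.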
Because the structured data used in SEO, particularly schema.org, tends to focus on descriptive metadata, it has less coverage of other dimensions of metadata such as structural metadata indicating the role of content elements, or technical, administrative and rights metadata. All these kinds of metadata are important to address, to allow content to be shared and reused across different platforms and in different contexts. Fortunately, schema.org has been evolving quickly, and its coverage is improving every month. This expansion will allow for genuinely integrated metadata that indicates both the meaning and the structure of the content.
Metadata is a rich and important topic for everyone concerned with content published on the web. If you are interested in learning more about the many dimensions of metadata, you may be interested in my forthcoming book, Metadata Basics for Web Content, which will be available in early 2017 from XML Press.
One of the central challenges of content strategy is tracking all the content being created. So much content is available about so many different things. If you’ve ever done a content inventory, you know that different URLs may refer to the same content. It’s even possible for the same content to exist with two different titles. And sometimes it isn’t clear if two items of content are talking about the same thing, or simply talking about things that sound similar.
Identifiers are the solution to this chaos. Identifiers are alphanumeric strings associated with an item. They don’t seem very exciting, but they will play an increasingly important role in content moving forward. We are finding that relying on titles and URLs to identify content is not enough. We need something more robust.
It’s hard to relate to something as abstract as an alphanumeric string. Fortunately, some real world examples point to how identifiers can support content. Real world identifiers show how they can indicate such important things as:
The provenance of an item
A persistent way to refer to something
Whether something is unique or a copy
A way to listen to changes about something described
Who Moved My Cheese?
One basic need is to know where content comes from. There is much pilfering of content online these days: it’s become a big industry to rip off other people’s content and republish it as one’s own.
The problem of impostors and lookalikes is not limited to web content. People who produce cheese worry about the confusion that can arise from similar looking and sounding products. Parmigiano Reggiano is a famous Italian cheese, colloquially known in English as parmesan. It can be very expensive: a wheel of Parmigiano Reggiano typically weighs 38 kilos and will cost several hundred dollars. Parmigiano Reggiano is similar to another Italian cheese called Grana Padano, and is the original inspiration for various cheeses called parmesan made outside Italy. The makers of Parmigiano Reggiano work to distinguish their cheese from the rest through identifiers. Each cheese house (caseificio) has a unique number that it applies to the outside rind of a cheese wheel, together with the month and year of production. These identifiers let the consumer know the provenance of the cheese.
At the supermarket it can be hard to figure out where products come from. Online it can be hard to know where content comes from. Increasingly people get content not from the producer, but indirectly through a channel like Facebook. As content gets promoted and aggregated across a growing range of platforms and channels, the provenance of the content will be increasingly important to track. Content requires identifiers that can reveal the originator of the content. The Federal Trade Commission issued guidance recently rejecting vague statements that content is “sponsored”. Publishers need a process that can track and identify who that sponsor is.
Another challenge for content arises when it is remixed. Titles and URLs are designed to identify pages, not content components that might show up in a multitude of delivered content.
The challenge of remixed content is similar to a situation facing trial lawyers. As part of the pretrial discovery process, lawyers collect volumes of information. This information needs to be shared between opposing parties, and may not have any intrinsic order to it. Lawyers solved the problem of identifying all these random bits with something called a Bates number. Originally a Bates number was produced by an elaborate mechanical ink stamp that would sequentially number each page of any documentation with a unique alphanumeric string. Today, lawyers scan documents into PDFs, which can render Bates numbers for each page automatically.
The elegance of the Bates number is that it provides a persistent identifier for a piece of information that is independent of its source and its context. No matter how different items of content are shuffled around, a specific item can be located by any party according to its unique Bates number.
Having persistent identifiers for content components is valuable when content is assembled from different components, and components are reused in many contexts.
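The idea can be sketched in a few lines of Python. This is only an illustration of Bates-style numbering applied to content components, not a description of any real system: the `ComponentRegistry` class, the `ACME` prefix, and the six-digit width are assumptions made for the example.

```python
from __future__ import annotations

import itertools


class ComponentRegistry:
    """Assigns sequential, Bates-style identifiers to content components."""

    def __init__(self, prefix: str, width: int = 6) -> None:
        self._prefix = prefix
        self._width = width
        self._counter = itertools.count(1)
        self._components: dict[str, str] = {}

    def register(self, text: str) -> str:
        # The identifier depends only on registration order, never on
        # the page or channel where the component later appears.
        identifier = f"{self._prefix}{next(self._counter):0{self._width}d}"
        self._components[identifier] = text
        return identifier

    def lookup(self, identifier: str) -> str:
        return self._components[identifier]


registry = ComponentRegistry(prefix="ACME")  # hypothetical prefix
intro_id = registry.register("Welcome paragraph")
price_id = registry.register("Pricing table")

# The same component can be reused in any number of assembled outputs.
landing_page = [intro_id, price_id]
newsletter = [price_id]

print(intro_id, price_id)
print(registry.lookup(newsletter[0]))
```

However the components are shuffled between the landing page and the newsletter, each one can still be located by its identifier alone.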
In the Matrix
Another inevitable dimension of content is that there can be many versions of a content item. Sometimes this is unintended: organizations have generated duplicate content. But other times organizations have purposefully made different versions of the same underlying content to meet slightly different needs. Either way, it can be hard to sort out what is master content, and what is the derivative.
Distinguishing what’s the original content is an old problem. Enthusiasts of early jazz recordings faced this problem when they wanted to trace the recordings of a famous musician such as Louis Armstrong. Early recordings on 78 records didn’t supply much information about the full orchestra. And sometimes the masters of these recordings were rented to other record companies, who released the recording on their own label. Licensees even sometimes put false information on record labels to disguise that they were re-releasing an existing recording (done sometimes to get around labor contracts). To complicate matters even more, the same artist might release several versions of the same tune. Jazz is after all about improvisation, and each different version can be interesting in its own right. So even knowing the song title and the artist wasn’t sufficient to know if the recording was unique or not.
Fans who developed discographies of early jazz found a key to solving the problem of unreliable information on the labels on records. They tracked recordings according to their matrix number. Each matrix used to press records contained a hand inscribed number indicating the master recording. No matter who subsequently used the master to release the recording, the same number was stamped into the record. As a result, one could see that a French record was the same recording as an American one, because they shared the same matrix number, while two records with the same title and performers were in fact different recordings.
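A minimal sketch of the same logic in Python: releases are grouped by the identifier of the master rather than by title and artist. The records below are made-up placeholders, not real discography entries, and the `Release` fields are assumptions chosen for the example.

```python
from collections import defaultdict
from typing import NamedTuple


class Release(NamedTuple):
    title: str
    artist: str
    label: str
    matrix: str  # identifier of the master recording


# Illustrative data only; these are not real records.
releases = [
    Release("Example Blues", "Example Orchestra", "US Label", "M-1001"),
    Release("Example Blues", "Example Orchestra", "French Label", "M-1001"),
    Release("Example Blues", "Example Orchestra", "US Label", "M-1002"),
]

# Title and artist are identical in all three rows, so only the
# matrix number can tell a re-release apart from a different take.
by_master = defaultdict(list)
for release in releases:
    by_master[release.matrix].append(release.label)

for matrix, labels in by_master.items():
    print(matrix, labels)
```

Two of the three releases share a master, so they are the same recording on different labels; the third is a distinct recording despite its matching title and performers.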
Content variation is a phenomenon driven by the desire of audiences to have choice. People want versions of content that match their needs: that are shorter or longer depending on their interests, or are formatted for a larger or smaller screen depending on their device. To track all these variations, organizations need identifiers that can let them know how content is being repurposed, and where.
Broadcast radio stations often identify themselves by number. They broadcast at a certain frequency, and use that frequency as an identifier: “101.3 FM” or whatever. RFID is a different kind of radio broadcast, one specifically designed to identify objects. Identifiers have morphed into stickers that we can listen to.
Last year I visited an exhibit at Expo Milan featuring an MIT prototype of the supermarket of the future. The premise of the exhibit was that RFID tags can track produce and other food items, to give consumers information about where the products are from, when they were harvested, how they were shipped, and so forth. What’s intriguing about this vision is that products can now have biographies. No longer does one need to talk about the product generically. One can now talk about a specific instance of the product: this orange, or this batch of pesto. Products now have real stories that can be told.
RFID allows us to listen to things: to know what’s been going on with them. We are starting to move toward creating specific content that tells stories about specific instances of items. To do this, we will need the ability to be very specific about what we refer to.
Identifiers give us the ability to make statements about things. They allow us to distinguish what specifically we are saying, and about what specifically we are making a statement. That capability will be important as content and products become more varied and customized. Identifiers support accountability in the face of growing complexity.
If you put two things together side-by-side, what do they have in common? The answer depends on the point of view. Alternative viewpoints mold content identity differently. Designers of content experiences, such as content strategists and information architects, can use these viewpoints to surface different kinds of content relationships.
Three actors shape the identity of content: the author or curator; the audience; and the thing or things discussed in the content. Each brings its own perspective to what content is about:
Content identity as interpreted by an author or curator
Content identity as interpreted by the audience
Content about things that reveal dimensions of themselves
Each perspective plays a different role in framing the content experience.
Scene setting: the Curatorial Perspective
Scene setting lets people understand common themes in content that aren’t obvious. An author or curator draws on their unique knowledge to construct a theme that unifies different content items. Such themes set expectations about the relationship of content to other content. Scene setting is didactic in orientation.
A common label used to announce a theme is the series: for instance, a TV series, or a narrative trilogy. Sometimes the series is just a way to divide up something into smaller parts, but keep them connected: an article becomes a two-part article. A content series can express how different items are related according to the intentions of the author or the interpretation of a curator. A series can be a sequence of items presented on a common theme. It may also present the evolution of an item over time, such as its versions. A building architect might show a series of images starting with a sketch, then a foam model, and finally a photo of the finished building.
A series presents a collection of items and shows how they belong together. The author/curator draws on their intimate knowledge of the content to point out connections between different content items, which may not be self-evident. We find this in the museum world: an item presented is said to have originally belonged with other items that have since been dispersed. A curator might indicate how several items embody a common theme, such as when similar paintings express a recurrent motif.
Any time items are defined by the values and judgments of the author (or curator), the audience must be willing to accept that valuation as relevant. So if a curator identifies items as “new and notable,” then the intended audience needs to buy that labeling.
Mirroring: the Audience Perspective
When mirroring, content reflects themes as seen by the audience. It represents concepts the way audiences think about them to support attraction to the content. Mirroring is different from the authorial perspective, which expresses the content’s intention. The audience perspective expresses how content is imagined.
Brand names are perhaps the purest example of imagined content. Brands have no intrinsic identity: they depend entirely on the perceptions of customers to define what they mean. Even a conglomerate that sells many brand products can’t dictate how consumers view these brands. The French brand house LVMH, which sells numerous luxury brand products, can’t control whether consumers consider Dior more similar to Givenchy or to Louis Vuitton, even though it owns all three brands. In reality, Chinese consumers may have different opinions about these relationships than Italian consumers would.
High-level concepts that are meaningful to audiences should reflect how audiences perceive them. For example, people associate different kinds of experiences with different vacation activities. Is bungee jumping active-fun, adventurous, or extreme? It is best to work with the audiences’ framework of values, rather than trying to impose one on them. Card sorting is useful for eliciting subjective perceptions about the identity of things. Yet card sorting is less reliable when defining the identity of concrete things, since it shifts the attention away from the object’s specific properties. Better, more empirical approaches are available to classify concrete items.
Discovery: Perspectives based on Item Properties
Features of items can suggest themes. Object-defined themes let the things featured in the content speak for themselves. This involves more showing, and less telling. Properties can define identities, and reveal commonalities between different items. This approach promotes discovery of content relationships.
Faceted search interfaces, such as those found on e-commerce sites, are the most familiar implementation of property-driven identification. People choose values for various facets (properties) of items, and get a list of items matching these values. Using properties to identify items is especially valuable for non-text content. Some Digital Asset Management systems allow people to find images that match a certain shade of a color, regardless of what the subject of the image is. Properties can identify similarities and relationships that might not be expected from a higher level label. They can support more criteria-based consideration of identity. For example, when we think of travel items (things to pack), we generally have standard things in mind: toiletries, articles of clothing, etc. But if we start with properties, the universe of travel items expands. We might define travel items as things that are both small and lightweight. We discover small and lightweight versions of things we might not ordinarily pack for travel, but might enjoy having once we become aware of the option.
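The travel example can be sketched as a simple facet filter in Python. The catalog, its property names, and the `facet_filter` helper are all invented for illustration; a real faceted search would sit on top of a search index or database.

```python
from typing import Any

Item = dict[str, Any]

# Illustrative catalog; names and property values are made up.
catalog: list[Item] = [
    {"name": "toothbrush", "size": "small", "weight": "light", "category": "toiletries"},
    {"name": "folding kettle", "size": "small", "weight": "light", "category": "kitchen"},
    {"name": "cast-iron pan", "size": "large", "weight": "heavy", "category": "kitchen"},
]


def facet_filter(items: list[Item], **facets: Any) -> list[Item]:
    """Return the items whose properties match every selected facet value."""
    return [
        item
        for item in items
        if all(item.get(prop) == value for prop, value in facets.items())
    ]


# Define "travel items" by properties rather than by category label.
travel_items = facet_filter(catalog, size="small", weight="light")
print([item["name"] for item in travel_items])
```

Filtering on properties surfaces the folding kettle, which a category label such as "toiletries" or "clothing" would never have suggested as something to pack.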
Leveraging Diverse Viewpoints
There’s more than one way to define the relationship between items of content. I sometimes see people try to make a single hierarchical taxonomy serve as both an authoritative or objective classification of content, and a user-centric classification that reflects the subjective perceptions of users, without realizing they are forcing together different kinds of content identities — one relatively stable, the other contextual and subject to change.
Content can be considered objectively as it is; authoritatively as it is intended; and subjectively as it seems to various audiences. These differences offer thematic lenses for looking at content. They can be used to help audiences connect different items of content together in different ways: setting the scene for audiences so they understand relationships better, reflecting their existing attitudes to promote attraction to items of interest, and helping them discover things they didn’t know.