Categories
Intelligent Content

A Visual Approach to Learning Schema.org Metadata

Everyone involved with publishing web content, whether a writer, designer, or developer, should understand how  metadata can describe content. Unfortunately, web metadata has a reputation, not entirely undeserved, for being a beast to understand. My book, Metadata Basics for Web Content, explains the core concepts of metadata. This post is for those ready to take the next step: to understand how a metadata standard relates to their specific content.

Visualizing Metadata

How can web teams make sense of voluminous and complex metadata documentation?  Documentation about web metadata is generally written from a developer perspective, and can be hard for non-techies to comprehend. When relying on detailed documentation, it can be difficult for the entire web team to have a shared understanding of what metadata is available.  Without such a shared understanding, teams can’t have a meaningful discussion of what metadata to use in their content, and how to take advantage of it to support their content goals.

The good news is that metadata can be visualized.  I want to show how anyone can do this, with specific reference to schema.org, the most important web metadata standard today. The technique can be useful not only for content and design team members who lack a technical background, but also for developers.

Everyone who works with a complex metadata standard such as schema.org faces common challenges:

  1. A large and growing volume of entities and properties to be aware of
  2. Cases where entities and properties sometimes have overlapping roles that may not be immediately apparent
  3. Terminology that can be misunderstood unless the context is comprehended correctly
  4. The prevalence of many horizontal linkages between entities and properties, making navigation through documentation a pogo-like experience.

First, team members need to understand what kinds of things associated with their content can be described by a metadata standard.  Things mentioned in content are called entities.  Entities have properties.  Properties describe values, or  they express the relationship of one entity to another.

Entities are classified according to types, which range from general to specific.  Entity types form a hierarchy that can be expressed as a tree.  All entities derive from the parent entity, called Thing.  Currently, schema.org has over 600 entity types.  Dan Brickley, an engineer at Google who is instrumental in the development of schema.org, has helpfully developed an interactive visualization in D3 (a JavaScript library for data visualization), presented as a radial tree, which shows the distribution of entity types within schema.org.  The tool is a helpful way to explore the scope of entities addressed, and the different levels of granularity available.

Screenshot of entity tree, available at http://bl.ocks.org/danbri/raw/1c121ea8bd2189cf411c/
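For readers who prefer text to pictures, the same tree idea can be sketched in a few lines of Python.  The mapping below is a small, hand-picked subset of schema.org types chosen for illustration, not the full hierarchy, and the function name lineage is my own.  It also assumes each type has a single parent, which is a simplification: some schema.org types (LocalBusiness, for example) sit under more than one parent.

```python
# A hand-picked subset of the schema.org type hierarchy (child -> parent).
# Assumption: one parent per type, which does not hold for every schema.org type.
PARENT = {
    "Product": "Thing",
    "Organization": "Thing",
    "Person": "Thing",
    "Intangible": "Thing",
    "Offer": "Intangible",
    "Rating": "Intangible",
    "AggregateRating": "Rating",
}


def lineage(entity_type):
    """Return the path from Thing down to the given entity type."""
    path = [entity_type]
    while path[-1] != "Thing":
        path.append(PARENT[path[-1]])
    return list(reversed(path))


for name in ("Product", "Offer", "AggregateRating"):
    print(" > ".join(lineage(name)))
```

Walking the path this way shows the level of granularity of each type: Product sits directly under Thing, while AggregateRating is three levels down.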

D3 is a great visualization library, but it requires both knowledge and time to code.  For our  second kind of visualization, we’ll rely on a much simpler tool.

Graphs of Linked Data

Web metadata can connect or link different items of information together, forming a graph of knowledge.  Graphs are ideal to visualize.  By visualizing this structure, content teams can see how entities have properties that relate to other entities, or that have different kinds of values.  This kind of visualization is known as a concept map.

Let’s visualize a common topic for web content: product information.  Many things can be said about a product: who it is from, what it is like, and how much it costs.  I’ve created the graph below using an affordable and easy-to-use concept mapping app called Conceptorium (though other graphic tools can be used).  Working from the schema.org documentation for products, I’ve identified some common properties and relationships for products.  Entities (things described with metadata) are in green boxes, while literal values (data you might see about them) are in salmon-colored boxes.  Properties (attributes or qualities of things) are represented by lines with arrows, with the name of the property next to the line.

Concept map of schema.org entities and properties related to products

The graph illustrates some key issues in schema.org that web teams need to understand:

  • The boundary between different entity types that address similar properties
  • The difference between different instances of the same entity type
  • The directional relationships of properties.

Entity Boundaries

Concept maps help us see the boundaries between related entity types.  A product, shown in the center of our graph, has various properties, such as a name, a color, and an average user rating (AggregateRating).  But when the product is offered for sale, properties associated with the conditions of sale need to be expressed through the Offer entity.  So in schema.org, we can see that products don’t have prices or warranties; offers have prices or warranties.  Schema.org allows publishers to express an offer without providing granular details about a product.  Publishers can note the name and product code (referred to as gtin14) in the offer together with the price, without using the Product entity type at all.  The Offer and Product entity types both use the name and product code (gtin14) properties.  So when discussing a product, the team needs to decide if the content is mostly about the terms of sale (the Offer), or about the features of the product (the Product), or both.
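To make the boundary concrete, here is a minimal sketch of the two choices written as JSON-LD, built from Python dictionaries.  The product name, code, price, and rating values are placeholders I made up for illustration; only the type and property names (Offer, Product, name, gtin14, price, priceCurrency, color, aggregateRating, offers) come from schema.org.

```python
import json

# Option 1: describe only the terms of sale. No Product entity is used.
# All values are hypothetical placeholders.
offer_only = {
    "@context": "https://schema.org",
    "@type": "Offer",
    "name": "Example Widget",
    "gtin14": "00012345678905",
    "price": "19.99",
    "priceCurrency": "USD",
}

# Option 2: describe the product's features, and nest the Offer inside it.
# Note that the price lives on the Offer, not on the Product.
product_with_offer = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Example Widget",
    "color": "Blue",
    "gtin14": "00012345678905",
    "aggregateRating": {
        "@type": "AggregateRating",
        "ratingValue": "4.4",
        "reviewCount": "89",
    },
    "offers": {
        "@type": "Offer",
        "price": "19.99",
        "priceCurrency": "USD",
    },
}

# Properties that both entity types share in this sketch.
shared = (set(offer_only) & set(product_with_offer)) - {"@context", "@type"}
print(sorted(shared))

print(json.dumps(product_with_offer, indent=2))
```

The first dictionary suits content that is mostly about the terms of sale; the second suits content that is about the product itself and also mentions how to buy it.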

Instances and Entity Types

Concept maps help us distinguish different instances of entities, as well as cases where instances are performing different roles. From the graph, we can see that a product can be related to other products.  This can be hard to grasp in the documentation, where an entity type is presented as both the subject and the object of various properties.  Graphs can show how there can be different product instances that may have different values for the same properties (e.g., all products have a name, but each product has a different name).  In our example, we can see that one product at the bottom right is a competitive product to the product in the center.  We can compare the average rating of the competitor product with the average rating of the main product.  We can also see another related product, which is an accessory for the main product.  This relationship can help identify products to display as complements.

An entity type provides a list of properties available to describe something.  Web content may discuss numerous, related things that all belong to the same entity type.  In our example, we see several instances of the Organization entity type.  In one case, an organization owns a product (perhaps a tractor).  In another case, the Organization is a seller.  In a third case, the Organization is a manufacturer of the product. Organizations can have different roles relating to an entity.

Content teams need to identify in their metadata which Organizations are responsible for which role.  Is the seller the manufacturer of the product, or are two different Organizations involved?  Our example illustrates how a single Person can be both an owner and a seller of a Product.
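One way to keep roles straight is to give each party an identifier and point to it from the relevant property.  The sketch below assumes a JSON-LD graph in which one Person both owns and sells a tractor, while a separate Organization manufactures it.  The names, identifiers, and price are hypothetical, and find_references is a helper I wrote for illustration, not part of any library.

```python
# Hypothetical parties and product, linked by "@id" references.
acme = {"@type": "Organization", "@id": "#acme", "name": "Acme Tractors"}
pat = {
    "@type": "Person",
    "@id": "#pat",
    "name": "Pat Example",
    "owns": {"@id": "#tractor"},
}
tractor = {
    "@type": "Product",
    "@id": "#tractor",
    "name": "Example Tractor",
    "manufacturer": {"@id": "#acme"},
    "offers": {
        "@type": "Offer",
        "price": "12000.00",
        "priceCurrency": "USD",
        "seller": {"@id": "#pat"},
    },
}
graph = {"@context": "https://schema.org", "@graph": [acme, pat, tractor]}


def find_references(node, target_id, found=None):
    """Collect the names of properties whose value points at target_id."""
    if found is None:
        found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if isinstance(value, dict) and value.get("@id") == target_id:
                found.append(key)
            find_references(value, target_id, found)
    elif isinstance(node, list):
        for item in node:
            find_references(item, target_id, found)
    return found


print(find_references(graph, "#acme"))     # properties pointing at the Organization
print(find_references(graph, "#pat"))      # properties pointing at the Person
print(find_references(graph, "#tractor"))  # properties pointing at the Product
```

Notice that the ownership role is stated from the Person's side (owns points at the Product), while the seller and manufacturer roles are stated from the Offer and Product side.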

What Properties Mean

Concept maps can help web teams see what properties really represent.  Each line with an arrow has a label, which is the name of the property associated with an entity type.  Properties have a direction, indicated by the arrow.  The names of properties don’t always directly translate into an English verb, even when they at first appear to.  For example, in English, Product > manufacturer > Organization doesn’t make much sense. The product doesn’t make the organization, but rather the organization manufactures the product.  It’s important to pay attention to the direction of a property and to which entity type it expects, especially when these relationships seem inverted relative to how we normally think about them.

Many properties are adjectives or even nouns, and need helper verbs such as “has” to make sense.  If the property describes another entity, then that entity can involve many more properties to describe additional dimensions of that entity.  So we might say that “a Product has a manufacturer which is an Organization (having a name, address, etc.)”  That’s not very elegant in English, but the diagram keeps the focus on the nature of the relationships described.
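The "has" reading can be applied mechanically.  As a rough sketch, each arrow in the concept map can be written as a (subject, property, object) triple, with the arrow running from the subject to the object.  The three triples below are taken from the product example; the helper names article and read_aloud are my own.

```python
# Each arrow in the concept map as (subject type, property, object type).
TRIPLES = [
    ("Product", "manufacturer", "Organization"),
    ("Product", "aggregateRating", "AggregateRating"),
    ("Offer", "seller", "Organization"),
]


def article(word):
    """Pick 'a' or 'an' using a simple first-letter rule."""
    return "an" if word[0].lower() in "aeiou" else "a"


def read_aloud(subject, prop, obj):
    """Render a triple using the helper verb 'has' instead of treating the property as a verb."""
    sentence = (
        f"{article(subject)} {subject} has {article(prop)} {prop}, "
        f"which is {article(obj)} {obj}."
    )
    return sentence[0].upper() + sentence[1:]


for triple in TRIPLES:
    print(read_aloud(*triple))
```

Reading every arrow this way is a quick check that the direction on the map matches the direction in the schema.org documentation.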

Broader Benefits of Concept Mapping for Content Strategy

So far, we’ve discussed how concept maps can help web teams understand what the metadata means, and how they need to organize their metadata descriptions.  Concept maps can also help web teams plan their content.  Teams can use maps to decide what content to present to audiences, and even what content to create that audiences may be interested in.

Content Planning

Jarno van Driel, a Dutch SEO expert, notes that many publishers treat schema.org as “an afterthought.”  Instead, Jarno argues, publishers should consult the properties available in schema.org to plan their content.  Schema.org is a collective project, where different contributors identify properties relating to entities they would like to mention that they feel would be of interest to audiences.  Schema.org can be thought of as a blueprint for information you can provide audiences about different things you publish.  While our example concept map for product properties is simplified to conserve space, a more complete map would show many more properties, some of which you might decide to address in your content.  For example, audiences might want to know about the material, the width, or the weight of the product — properties available in schema.org that publishers may not have considered including in their content.

Content Design and Interaction Design

Concept maps can also reveal relationships between different levels of information that publishers can present.  Consider how this information is displayed on the screen.  Audiences may want to compare different values. They may want to know all the values for a specific property (such as all the colors available), or they may want to compare the values of a property for two different instances (such as the average ratings of two different products).

Concept maps can reveal qualifications about the content (e.g., an Offer may be qualified by an area served).  Values (shown in salmon) can be sorted and ranked.  Concept maps also help web teams decide on the right level of detail to present.  Do they want to show average ratings for a specific product, or a brand overall?  By consulting the map, they can consider what data is available, and what data would be most useful to audiences.

Concept map app shows columns of entities and values, which allow exploration of relationships

Conclusion

Creating a concept map requires effort, but is rewarding.  It requires you to compare the specification of the standard with your representation of it, to check that relationships are known and understood correctly.  It allows you to see some characteristics, such as properties used by more than one entity. It can help content teams see the bigger picture of what’s available in schema.org to describe their content, so that the team can collectively agree to metadata requirements relating to their web content.  If you want to understand schema.org more completely, to know how it relates to the content you publish, creating a concept map is a good place to start.

— Michael Andrews


Why Standards Compliance is a Tricky Notion

I just published a book about metadata, called Metadata Basics for Web Content.  The book refers to many standards, and provides samples of code illustrating metadata (or structured data, if you prefer) using these standards.  To locate good code examples, I relied on international organizations such as the W3C, industry working groups such as schema.org, and prominent companies such as Google.

All these sources are important ones for publishers to consult.  But if you pay very close attention, you may notice that the various sources aren’t always completely aligned with one another. This is a bit disconcerting. Publishers, after all, are expected to comply with standards. Various standards reference and build on each other. But certain details are different as you move between different actors in the standards arena. How can that be, that standards aren’t completely aligned?  To answer that question one must consider the governance, mission, and adoption goals of various parties involved with standards.

Publishers should recognize that no one party is in charge of metadata standards. Many parties are involved.  Decisions and practices evolve organically through a combination of planning and adaptation.  Different parties offer different choices.

The W3C is the largest standards body addressing web content.  It has a fairly open structure.  If there is sufficient interest in a topic, and enough people volunteer to work on a standards issue, then a group can be started, which can begin a process of drafting notes, recommendations, and eventually standards.  The W3C doesn’t always initiate standards.  Sometimes it embraces standards that have been developed by other groups.  And sometimes the W3C has different groups addressing broadly similar issues, but in different ways.  While W3C recommendations and standards carry tremendous weight, they do not always represent a single consensus about priorities.  Generally, they skew toward accommodating a diverse range of needs, rather than enforcing a narrow set of practices.  As a nonprofit body, the W3C isn’t marketing anything, or promoting adoption of one standard over another.

Many industry groups develop standards as well.  An important one in the area of web content metadata is called schema.org.  This group started out as a partnership between search engine companies, namely Google, Bing, Yahoo and Yandex.  These companies developed a core set of standards for describing common web content with metadata.  Now that the core standard has been developed, schema.org has subsequently transformed to become a W3C community group.  Google remains the single most important driver of schema.org’s development.  But as a community, the standard has accepted contributions from many parties, and the scope of the standard is expanding.

In addition to international bodies and industry groups, certain companies, on account of their size and reach, shape standards practices through the implementation choices they make.  They may set trends of what are deemed “best practices” or they may recommend to others how to do things.  Google again is a leading example of a single firm having a big influence on standards.  As a private company, it recommends guidelines to its customers, the publishers who want their content to display in Google’s search results.  These guidelines seem like standards, though they are specific to one company.

Let’s consider how different levels of standards interact with each other.

Metadata needs to be encoded using a syntax. One widely used syntax is called RDFa, which is a W3C standard.

Metadata also needs a schema to indicate entities and properties within the content.  Schema.org metadata can be encoded using RDFa syntax.  So we have one standard relying on another.  But schema.org only uses part of the RDFa specification.  There are some features in RDFa that aren’t needed when implementing schema.org.  Other metadata schemas also use the RDFa syntax, and some of these take advantage of the additional features.  The group designing schema.org decided to pare down what was needed to implement schema.org in RDFa.  They chose to keep things as simple as they could to help promote adoption of their schema.
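To give a feel for how little of RDFa is needed, here is a small sketch.  The pared-down subset is published by the W3C as RDFa Lite, which is limited to the attributes vocab, typeof, property, resource, and prefix.  The HTML snippet uses three of them with placeholder product values I made up.  The Python class is a toy reader written for illustration only; it is not a conforming RDFa processor and ignores nesting, resource, and prefix.

```python
from html.parser import HTMLParser

# Hypothetical product markup using only vocab, typeof, and property.
SNIPPET = """
<div vocab="https://schema.org/" typeof="Product">
  <span property="name">Example Widget</span>
  <span property="color">Blue</span>
</div>
"""


class RdfaLiteSketch(HTMLParser):
    """Toy reader: records typeof values and the text of property elements."""

    def __init__(self):
        super().__init__()
        self.types = []
        self.properties = {}
        self._current = None  # property name awaiting its text value

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "typeof" in attrs:
            self.types.append(attrs["typeof"])
        self._current = attrs.get("property")

    def handle_data(self, data):
        if self._current and data.strip():
            self.properties[self._current] = data.strip()
            self._current = None

    def handle_endtag(self, tag):
        self._current = None


reader = RdfaLiteSketch()
reader.feed(SNIPPET)
print(reader.types)
print(reader.properties)
```

Other schemas that rely on fuller RDFa features would need a real RDFa parser; for flat schema.org markup like this, a handful of attributes carries all the meaning.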

As mentioned earlier, Google is a key player as both a developer of schema.org, and as a consumer of schema.org metadata.  Google evangelizes the use of schema.org metadata, and they offer guidelines and tools to help webmasters learn what they need to do.  Publishers often take this advice as gospel.  They presume they need to comply with Google’s standards, at least as they understand them.   What they may not realize is that Google’s tools and guidelines are often advice rather than rigid rules.  When developing its advice and tools, Google has chosen to focus on high priority content that many organizations produce, and provide guidelines to help webmasters ensure that they don’t make mistakes when creating metadata for such content.  Google’s guidelines only cover a subset of the range of content addressed by schema.org.  In effect, Google has chosen to simplify schema.org further to encourage wider adoption of it.

Google’s guidelines provide assurance that if complied with, the metadata will work with Google.  However, it does not follow that a publisher’s metadata is wrong if it deviates from Google’s guidelines.  Many publishers use Google’s structured data testing tool (SDTT) to validate their metadata.  It’s a useful tool, but it validates only some dimensions of schema.org metadata, not all dimensions.

Google’s structured data testing tool “complaining” about a webpage on the schema.org website

We can see the limitations of Google’s structured data testing tool by looking at how it assesses the schema.org website.  We can find pages where the schema.org website, which Google is involved with developing, fails Google’s own SDTT.  How can that be?  The schema.org website and Google’s SDTT serve different purposes, and even different audiences.  The SDTT is trying to encourage certain practices, and in an almost gamified manner, gives a thumbs up if the metadata code conforms to the advice.  Schema.org continually develops to cover a range of needs.  Some of these needs will be more specialized, and publishers may decide to implement metadata in a standards-compliant manner that doesn’t pass inspection by Google’s SDTT.  I would not assume, however, that Google’s search algorithms are incapable of interpreting standards-compliant metadata that fails Google’s SDTT.  I’d guess that Google’s search algorithms are probably more sophisticated than the code used in the SDTT.  Sometimes the SDTT is playing catch-up with new developments in schema.org.

Google is trying to do two things at once: expand the coverage of schema.org to make it even more useful in a wider range of domains and scenarios, and popularize schema.org by presenting a simple set of guidelines for publishers to follow.  It’s a difficult balance to strike: managing and evolving standards over time while promoting easy-to-follow guidelines that publishers consider reliable.  I would not expect Google to encourage publishers to adopt complicated metadata implementations that some would struggle to code correctly.  If less sophisticated publishers fail, they might fault Google for encouraging them to try something that exceeded their understanding or abilities.

Sometimes publishers gripe that they’ve created logically valid schema.org metadata that nonetheless fails Google’s SDTT.  But publishers seem more upset when they’ve created metadata that passes the SDTT, yet they fail to see how it shines in Google’s search results.  “Where’s the rich snippet I was expecting?” they complain.  For many publishers, seeing the rich snippet payoff is the reward for using schema.org structured data, and for using the SDTT.  The SDTT is not just a technical tool: it is a marketing and public relations tool for Google.

A representative rich snippet as shown in Google’s SDTT. For some publishers, seeing their structured data in search results provides tangible proof they are correct and compliant with standards.

So does metadata compliance mean that one follows the pages of details in W3C standards, or that one gets a snippet to show in Google’s search results? Standards compliance can involve many layers. There is no one standard to follow: there can be various permutations of a standard that are sanctioned or encouraged by different parties. Publishers need to rely on the standards guidance that best supports the goals they are trying to achieve with their metadata.

— Michael Andrews


Why Structured Data needs to talk to Structured Content

A recent post on Google’s webmaster blog illustrates how metadata needs to address both the structure of web content, and the meaning of that content.

People who work in SEO talk about structured data a lot, while those who work in content strategy talk about structured content. These topics are obviously related, but the terminology used by each party obscures how each topic relates to the other. My take: structured data and structured content are different dimensions of metadata. Structured data is generally descriptive metadata identifying entities discussed in the content. Structured content provides the foundation for structural metadata that indicates the logic and organization of the content. Both descriptive and structural metadata are important in content, and they should ideally be integrated.

The Google blog advises publishers to include structured data in their content. The screenshot below shows how this advice is presented.

(source: Google Central Webmaster Blog)

The advice presented follows a pattern:

  • Advice to follow
  • Rationale
  • Best practices to implement advice (shown in green)
  • Actions not to do (shown in pink)

Some other items of advice in the post include another element:

  • Practices to avoid when implementing advice (shown in yellow)

We can see that the post follows good structure that is easy to scan and understand, and provides a foundation to reuse the information in other contexts. Now, let’s look at the post’s source code. This is where we’d expect to see the structured data associated with the content.

Source code for Blog post.

Disappointingly, no structured data is associated with the specific items of advice. The details of the advice are marked up with “class” attributes intended to style the content, but not to identify the meaning of the content. The only structured data on the page relates to the blog post in general (such as its author).
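The difference between styling hooks and structured data can be checked mechanically.  The sketch below tallies the two kinds of attributes in a fragment of HTML.  The fragment is a hypothetical stand-in, not the actual source of the Google post: its class names and text are placeholders I invented.  The attribute list covers the inline Microdata and RDFa attributes only, so structured data supplied in a separate JSON-LD script block would not be counted.

```python
from html.parser import HTMLParser

# Inline attributes that carry meaning (Microdata and RDFa).
STRUCTURED_DATA_ATTRS = {
    "itemscope", "itemtype", "itemprop",  # Microdata
    "vocab", "typeof", "property",        # RDFa
}

# Hypothetical markup: styled with classes, but with no structured data.
STYLED_ONLY = """
<div class="advice">
  <p class="do">Example best practice.</p>
  <p class="dont">Example practice to avoid.</p>
</div>
"""


class AttributeTally(HTMLParser):
    """Count styling attributes versus structured data attributes."""

    def __init__(self):
        super().__init__()
        self.styling = 0
        self.semantic = 0

    def handle_starttag(self, tag, attrs):
        for name, _value in attrs:
            if name == "class":
                self.styling += 1
            elif name in STRUCTURED_DATA_ATTRS:
                self.semantic += 1


tally = AttributeTally()
tally.feed(STYLED_ONLY)
print(f"styling attributes: {tally.styling}, structured data attributes: {tally.semantic}")
```

A fragment like this tells a browser how the advice should look, but tells a search engine nothing about which sentence is the recommendation and which is the practice to avoid.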

Imagine how the content could be reused if structured data identified the meaning of the advice. Someone might type a search looking for tips on “mistakes when using schema.org,” “why use schema.org,” or “schema.org best practices” and get specific bullets of content relating to their query.

In this example, the post’s author has done nothing wrong, though an opportunity has been missed nonetheless. Currently, schema.org doesn’t have any entity types that address advice statements that would contain sub-elements such as Rationale, Do, Avoid, and Don’t. The closest types are related to Questions and Answers, which are slightly different in their structure.

Because the structured data used in SEO, particularly schema.org, tends to focus on descriptive metadata, it has less coverage of other dimensions of metadata such as structural metadata indicating the role of content elements, or technical, administrative and rights metadata. All these kinds of metadata are important to address, to allow content to be shared and reused across different platforms and in different contexts. Fortunately, schema.org has been evolving quickly, and its coverage is improving every month. This expansion will allow for genuinely integrated metadata that indicates both the meaning and the structure of the content.

Metadata is a rich and important topic for everyone concerned with content published on the web. If you are interested in learning more about the many dimensions of metadata, you may be interested in my forthcoming book, Metadata Basics for Web Content, which will be available in early 2017 on Amazon.

— Michael Andrews