Content Engineering

All models reflect a point of view

We sometimes talk about having a “bird’s eye” perspective of a place or about a topic.  We can see how details connect to form a larger whole.  The distance highlights the patterns that may be hard to see up close.  In many ways, a content model is a bird’s eye view of content addressing topics or activities.  It’s a map of our content landscape. 

This year, many of us, having been shut indoors involuntarily with our travel limited, have marveled at the freedom of travel that birds enjoy. Bird watching has become a popular pastime during the pandemic, encouraging people to consult content about birds to understand what they are newly noticing.

Content about birds provides an excellent view into the structure of content, even for people not interested in birds. Nearly 20 years ago, JoAnne Hackos discussed what field guides to birds can teach us about structuring content in her book, Content Management for Dynamic Web Delivery. Information about birds offers a rich topic for exploring content structure.

Last year, I conducted a workshop exploring content structuring decisions by comparing how field guides describe bird species. I have a collection of field guides about Indian birds from my time living in India. They all broadly cover the same material, but the precise coverage of each varies. Some talk about habitats; others discuss diet. They go into different levels of detail.

Various guides to Indian birds

As someone interested in birds, I noticed how these field guides differed when discussing the same bird species. Even the images of a bird varied greatly: whether they are paintings or photos, whether the bird is shown in context, and whether distinguishing features are described in the text or as annotations. Image choices have profound implications for what information is conveyed.

All this variation reveals something that’s obvious when you see it: there’s no one way to describe a bird.  People make editorial choices when they structure content.  That’s a very different way of thinking about content structure than the notion of domain modeling, which assumes that things we describe have intrinsic characteristics that can be modeled objectively.  Domain modeling presumes there’s a platonic ideal that we can discover to describe things in our world. It can provide some conceptual scaffolding for identifying the relationships between concrete facts associated with a topic. Domain modeling is best approached from the viewpoint of different personas.  Why do they care about any of these facts?  And importantly, might they care about different facts?

Without doubt, it’s valuable to get outside of our subjective ways of seeing to consider other viewpoints.  But that doesn’t imply there’s a single viewpoint that is definitive or optimal.  Content modeling is not the same thing as data modeling, just as content isn’t simply data.  The limitation of domain modeling is that it doesn’t provide any means to distinguish more important facts from less important ones.  And it treats content as data, where facts are readily reduced to a few concrete objective statements rather than involving descriptive interpretations or analysis.  With a domain model, it’s not obvious why people care about any of the data.  Before we can build experiences from content, we need the basic content to be interesting and relevant. Relevance and interest have been overlooked in many discussions about content modeling.   

While I see richness in small editorial decisions among different field guides, a person not interested in birding may find these distinctions unimportant. To a casual viewer, all field guides look similar. There’s a generic template that field guides seem to follow to describe birds. Here, the editorial decisions have emerged over time as a common framework that’s widely accepted and expected. Some people will view the task of structuring content as one of finding an existing framework that’s known, and copying it. Instead of searching for the intrinsic properties of things that can be described, the search is for patterns already used in the content. Adopting vernacular structural patterns has much to commend it. These patterns are familiar, and in many cases work fine. But relying on habit can also cause us to miss opportunities to enrich how we describe things. Familiar patterns may satisfy a basic need, but they don’t necessarily do so optimally.

The most disruptive development to influence field guides is the smartphone app. These apps shake up how field guides approach birds and can incorporate image and audio recognition to aid in identifying a species. They can be marvelously clever in what they can do, but they too represent editorial decisions: a focus on a transactional task rather than a deeper look into context and comparison. The more that content exists to support a repetitive transactional task, the more tightly prescribed the content will be in a model. If you consider content as existing only to support narrow transactional needs, the structure of the model might seem obvious, because what else would someone care about? Such a model presumes zero motivation on the part of the reader: they only want to view content that is necessary for completing a task.

Field guides exist to aid the identification of birds.  Deciding what information is important — or is readily available — to aid the identification of birds still involves an editorial judgment.  

Topics don’t have intrinsic structures.  The structure depends in part on the tasks associated with the topic.  And the more that a topic requires the motivation of the reader — their interest, preferences, and choices outside of a narrow task supported within the UI — the more important the editorial dimensions of the content model become.    The model needs to express elements that tell readers why they would care about the topic: what they should learn or see as important and relevant.

Importantly, the discussion of a topic is not limited to a specific genre.  Species of birds can be discussed in ways other than field guides.  Within a genre,  you can change the structure associated with it.  But you can switch genres as well, and embrace a different structure entirely.

When we consider structures within genres, we may be inclined to confuse a genre’s form with its intent. The importance of genre is that it provides a point of view to elucidate a topic.

Picture books provide an alternative genre to present content about birds. They involve the tight juxtaposition of words and images, often to provide a richer narrative about a topic. Beyond that, the structure involved is not standardized or well defined. The genre is most often associated with children’s books and exemplified by outstanding writer-illustrators such as Maurice Sendak, Dr. Seuss, and Quentin Blake.

But picture books are not limited to children’s entertainment.  They can potentially be used for any sort of topic and any audience.

This spring, a new picture book, What It’s Like to Be a Bird, was released and became an instant bestseller. It was produced by David Sibley, a renowned ornithologist and illustrator. “My original idea, in the early 2000s, was to produce a bird guide for kids. Then I started thinking about it as a bird guide for beginners of any age.  But having created a comprehensive North American bird guide, the concept of a ‘simplified’ guide never clicked for me. Instead, I wanted to make a broader introduction to birds.”

His goal is to “give readers some sense of what it’s like to be a bird…My growing sense as I worked on this book is that instinct must motivate a bird by feelings — of satisfaction, anxiety, pride, etc…how else do we explain the complex decision that birds make everyday…”  Sibley wants to capture the bird’s experience making decisions on its life journey.  A wonderful backdrop as we think about the reader’s experience on the journey through his book.

“Each essay focuses on one particular detail…they are meant to be read individually, not necessarily in sequence — everything is interconnected, and there are frequent cross-references suggesting which essay to read next.”    We can see how Sibley has planned a content model for his material.

Bolstered by his talents as a subject expert, writer, and illustrator, he’s been able to rethink how to present content about birds.  While he’s also written a conventional field guide to birds, his new book explores species through their behavior.  His is not the only recent book looking at bird behavior, but the perspective he offers is unique.  Rather than organizing content around behavior themes such as mating, he focuses the content on species of birds and then talks about two or three key behaviors they have that are interesting.  The birds act as protagonists in stories about their life situation.  

Sibley’s book profiles various species of birds, something field guides do as well. But Sibley’s profiles explain the bird from its own point of view, instead of from an external viewpoint of concrete properties such as feather markings or song calls. The stories of these species encapsulate actions that happen over a period involving a motivation and outcome. They aren’t data.

He introduces species of birds by presenting two or three stories about each.  Each story is a short essay explaining an illustration of an activity the bird is engaged in.  Frequently, the illustration is puzzling, prompting the reader to want to understand it.  In some cases, he breaks the story into several small paragraphs, each with its own illustration, when he wants to describe a sequence of events over time.

A simplified content model will show how Sibley explains birds.  The diagram reveals a highly connected structure.  It doesn’t look like the hierarchical structure of a book. There’s no table of contents or index.  Though the content is manifested as a book, it could be delivered to alternative platforms and channels.  

A simplified content model for Sibley’s What it’s like to be a bird

On the far left of the diagram, we see themes about birdlife that are explored. These themes may be broken into sub-themes. For example, the theme of survival has two sub-themes, each of which has even more specific sub-themes:

  • Survival
    • Birds and weather
      • Keeping cool
      • Keeping warm
    • Avoiding predators
      • Be inconspicuous
      • Be alert
      • Create a distraction

Each theme or sub-theme presents a range of related factual statements. For example, he presents a series of facts about how birds create distractions. These facts represent some important highlights about a theme: an index of knowledge that offers a range of perspectives. Each fact points to a profile of a bird species, where a story essay will provide context about the statement and make it more understandable.

In this example, we see how the theme of how birds use smell is revealed through a series of facts that point to essays about how different species use smell.

Thematically grouped factual highlights about birdlife (source: David Sibley’s What it’s like to be a Bird)

When we visit a profile of a bird species, we encounter several stories, which are a combination of picture and essay.  The illustration shows  a starling holding a cigarette in its beak.  The situation has the makings of a story — we want to know more.  The essay tells us.

Illustration and essay providing a story relating to a behavior of a bird species (source: David Sibley’s What it’s like to be a Bird)

Even if the story is enjoyable, a part of us may wonder if it’s just an entertaining yarn. Picture books are most often associated with fantasy, after all.  And unlike a nature program on TV, we don’t see a museum expert in a talking head interview to make it seem more credible.  Instead, we get a list of recent scientific references relating to the issue.  All these references are arranged thematically, like our facts.  They provide an overview of the focus on recent scientific research about birds, giving us a sense of how much scientists are still learning about these ubiquitous creatures that are as old as dinosaurs. 

Source references of recent discoveries from scientific research relating to birds, arranged thematically (source: David Sibley’s What it’s like to be a Bird)

The structuring of the content helps us to understand and explore.  The stories make each bird more real: something living we can become interested in.  We can understand their dilemmas and how they seek to solve them.  We are up-close.  But we can also step back and understand the broader behaviors of birds that influence their lives.  

While each species is described through stories, we are not limited to those.  How do other birds work with smell?  What have we learned recently about smell and birds?  These other pathways allow us to follow our interests and find different connections.  
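
The structure described above (themes that nest sub-themes, facts that point to species profiles, and stories that pair an illustration with an essay) can be sketched as a small data model. This is only an illustrative sketch, not Sibley’s own scheme: the class names, the `facts_under` and `stories_for` helpers, and every fact, species, and story label below are placeholders invented for the example. Only the theme names come from the list of survival sub-themes earlier in this post.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Story:
    """An illustration paired with a short essay about one behavior."""
    title: str
    essay: str


@dataclass
class SpeciesProfile:
    """A bird species, introduced through a few stories."""
    name: str
    stories: List[Story] = field(default_factory=list)


@dataclass
class Fact:
    """A factual highlight that points to a story in a species profile."""
    statement: str
    species: str       # key into the profiles dictionary
    story_title: str   # title of the story that explains the fact


@dataclass
class Theme:
    """A theme about birdlife; may hold sub-themes, facts, and references."""
    name: str
    subthemes: List["Theme"] = field(default_factory=list)
    facts: List[Fact] = field(default_factory=list)
    references: List[str] = field(default_factory=list)


def facts_under(theme: Theme) -> List[Fact]:
    """Collect the facts of a theme and all of its sub-themes, depth first."""
    found = list(theme.facts)
    for sub in theme.subthemes:
        found.extend(facts_under(sub))
    return found


def stories_for(theme: Theme, profiles: Dict[str, SpeciesProfile]) -> List[Story]:
    """Follow each fact under a theme to the story that gives it context."""
    result = []
    for fact in facts_under(theme):
        profile = profiles[fact.species]
        result.extend(s for s in profile.stories if s.title == fact.story_title)
    return result


# Placeholder content: species, stories, and facts are invented labels.
profiles = {
    "Species A": SpeciesProfile("Species A", [Story("Story 1", "Essay text")]),
    "Species B": SpeciesProfile(
        "Species B",
        [Story("Story 2", "Essay text"), Story("Story 3", "Essay text")],
    ),
}

survival = Theme("Survival", subthemes=[
    Theme("Birds and weather", subthemes=[
        Theme("Keeping cool",
              facts=[Fact("Placeholder fact 1", "Species A", "Story 1")]),
        Theme("Keeping warm"),
    ]),
    Theme("Avoiding predators", subthemes=[
        Theme("Be inconspicuous"),
        Theme("Be alert"),
        Theme("Create a distraction",
              facts=[Fact("Placeholder fact 2", "Species B", "Story 3")]),
    ]),
])

titles = [story.title for story in stories_for(survival, profiles)]
print(titles)  # prints ['Story 1', 'Story 3']
```

Because facts hold pointers to stories rather than containing them, the same story can be reached from a theme, from a species profile, or from a reference list, which is the kind of multiple pathway the book offers.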

Sibley’s model shows how to transcend the top-down hierarchies that dictate how we learn about a topic and the bottom-up collections of random facts that leave us with no structure to guide us. Sibley’s model of content can be approached in different ways, but it is always deliberate. There’s no feeling of being lost in hyperlinks.

Content models should reflect what readers want to get and how they might want to get it.  They are more than a technical specification.  They are an essential tool in editorial planning.  Developing a content model can be a creative act.

— Michael Andrews

Big Content

Who benefits from schema.org?

Schema.org shapes the behavior of thousands of companies and millions of web users. Even though few users of the internet have heard of it, the metadata standard exerts a huge influence on how people get information, and from whom they get it. Yet the question of who precisely benefits from the standard gets little attention. Why does the standard exist and does everyone materially benefit equally from it? While a metadata standard may sound like a technical mechanism generating impartial outcomes, metadata is not always fair in its implementation.

Google has a strong vested interest in the fate of schema.org, but it is not the only party affected. Other parties need to feel there are incentives to support schema.org. Should they feel they experience disincentives, that sentiment could erode support for the standard.

As schema.org has grown in influence over the past decade, that growth has been built upon a contradictory dynamic. Google needs to keep two constituencies satisfied to grow the usage of Google products. It needs publishers to use schema.org so it has content for its products. Consumers need that content to be enticing enough to keep using Google products. But to monetize its products, Google needs to control how content is acquired from publishers and presented to customers. Both these behaviors by Google act as disincentives to usage by publishers and consumers.

To a large extent, Google has managed this contradiction by making it difficult for various stakeholders to see how it influences their behavior. Google uses different terminology, different rationales, and even different personas to manage the expectations of stakeholders. How information about schema.org is communicated does not necessarily match the reality of it in practice.

Although schema.org still has a low public profile, more stakeholders are starting to ask questions about it. Should they use structured data at all? Is how Google uses this structured data unfair?

Assessing the fairness of schema.org involves looking at several inter-related issues:

  • What schema.org is
  • Who is affected by it 
  • How it benefits or disadvantages different stakeholders in various ways.  

What kind of entity is schema.org?

Before we can assess the value of schema.org to different parties, we need to answer a basic question: What is it, really? If everyone can agree on what it is we are referring to, it should be easier to see how it benefits various stakeholders. What seems like a simple question defies a clear, simple answer. Yet there are multiple definitions of schema.org out there, supplied by schema.org, Google, and the W3C. The schema.org website refers to it as a “collaborative, community activity,” which doesn’t offer much precision. The most candid answer is that schema.org is a chameleon. It changes its color depending on its context.

Those familiar with the vocabulary might expect that schema.org’s structured data would provide us with an unambiguous answer to that question. A core principle of the vocabulary is to indicate an entity type. The structured data would reveal the type of entity we are referring to. Paradoxically, the schema.org website doesn’t use structured data. While it talks about schema.org, it never reveals through metadata what type of entity it is.
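
To make the idea of declaring an entity type concrete, here is a minimal sketch of what such structured data can look like. It assumes JSON-LD, one of the syntaxes the vocabulary can be written in, and it describes a made-up publisher: the name “Example Publisher” and the example.com URL are placeholders, and the choice of `Organization` as the type is purely illustrative, not a claim about how schema.org ought to describe itself.

```python
import json

# Hypothetical metadata for an invented publisher (placeholder name and URL).
# "@type" is the property that states what kind of entity is being described.
markup = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Example Publisher",
    "url": "https://example.com",
}

# Serialize the metadata and wrap it in the script element that web pages
# use to embed JSON-LD.
json_ld = json.dumps(markup, indent=2)
script_tag = '<script type="application/ld+json">\n' + json_ld + "\n</script>"

print(script_tag)
```

A consumer of this data reads the `@type` value to learn that the page is about an organization rather than, say, a website or a product.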

We are forced to ascertain the meaning of schema.org by reading its texts. When looking at the various ways it’s discussed, we can see schema.org as embodying four distinct identities. It can be a:

  1. Brand
  2. Website
  3. Organization
  4. Standard

Who is affected by schema.org?

A second basic question: Who are the stakeholders affected by schema.org? This includes not only who schema.org is supposed to be for, but also who gets overlooked or disempowered. We can break these stakeholders into segments:

  • Google (the biggest contributor to schema.org and the biggest user of data utilizing the metadata)
  • Google’s search engine competitors who are partners (“sponsors”) in the schema initiative (Microsoft, Yahoo/Verizon, and Russia’s Yandex)
  • Firms that develop IT products or services other than search engines (consumer apps, data management tools) that could be competitive with search engines
  • Publishers of web content, which includes
    •  Commercial publishers who rely on search engines for revenues and in some cases may be competitors to Google products
    •  Non-commercial publishers (universities, non-profits, religious organizations, etc.)
  • Consumers and the wider public that encounter and rely on schema.org data
  • Professional service providers that advise others on using schema.org, such as SEO consultants and marketing agencies
  • The W3C, which has lent its reputation and accommodation to schema.org

By looking at the different dimensions of schema.org and the different stakeholders, we can consider how each interacts. Schema.org is very much a Google project, one over which Google exerts considerable control. But Google cannot appear to be doing so, and therefore relies on various ways of distancing itself from appearing to mandate decisions.

Schema.org as a brand

Schema.org may not seem like an obvious brand. There’s no logo. The name, while reassuringly authoritative-sounding, does not appear to be trademarked. Even how it is spelled is poorly managed. It is unclear if it is meant to be lowercase or uppercase: both are used in the documentation, in some cases within the same paragraph. (I mostly use lowercase, following examples from the earliest discussions of the standard.)

Brands are about the identity of people, products, or organizations.  A brand foremost is a story that generates an impression of the desirability and trustworthiness of a tangible thing.  The value of a brand is to attract favorable awareness and interest.   

Like many brands, schema.org has a myth about its founding, involving a journey of its heroes. Schema.org’s mythic story involves three acts: life before schema.org, the creation of schema.org, and the world after schema.org.

Life before schema.org is presented as chaotic. Multiple semantic web standards were competing with one another. Different search engines prioritized different standards. Publishers such as Best Buy made choices on their own about which standard to adopt. But many publishers were waiting on the sidelines, wondering what the benefits would be for them.

The founders present this period as confusing and grim.  But an alternate interpretation is that this early period of commercializing semantic web metadata was full of fresh ideas and possibilities.  Publishers rationally asked how they would benefit eventually, but there’s little to suggest they feared they would never benefit.  In short, the search engines were the ones complaining about having to deal with competition and diversity in standards. Complaints by publishers were few.

With the formation of schema.org, the search engines announced a new standard to end the existence of other general-coverage semantic web metadata standards. This standard would vanquish the others and end the confusion about which one to follow. Schema.org subsumed or weeded out competing standards. With the major search engines no longer competing with one another and agreeing to a common standard, publishers would be clear about expectations relating to what they were supposed to do. The consolidation of semantic web standards into one is presented as inevitable. This outcome is rationalized with the “TINA” justification: there is no alternative. And there was no alternative for publishers, once the search engines collectively seized control of the semantic metadata standards process.

After schema.org consolidated the semantic web metadata universe, everyone has benefited, in this narrative. The use of semantic metadata has expanded dramatically. The coverage of schema.org has become more detailed over time. These metrics demonstrate its success. Moreover, the project has become a movement where many people can now participate. Schema.org positions itself as a force of enlightenment rising above the petty partisan squabbles that bedeviled other vocabularies in the past. A semi-official history of schema.org states: “It would also be unrecognizable without the contributions made by members of the wider community who have come together via W3C.”

 The brand story never questions other possibilities.  It assumes that competition was bad, rather than seeing it as representing a diversity of viewpoints that might have shaped things differently.  It assumes that the semantic web would never have managed to become commercialized, instead of recognizing the alternative commercial possibilities that might have emerged from the activity and interest by other parties.  

Any retrospective judgment that the commercialization of the semantic web would have failed to happen without consolidating things under the direction of search engines is speculative history. It’s possible that multiple vocabularies could have existed side-by-side and could have been understood. Humans speak many languages. There’s no inherent reason why machines can’t as well. Language diversity fosters expressive diversity.

Schema.org as a website

Schema.org is a rare entity whose name is also a web address.

If you want to visit schema.org, you head to the website. There’s no convention or headquarters people can visit. If it isn’t clear who runs schema.org or how it works, at least the website provides palpable evidence that schema.org exists. Even if it’s just a URL, it provides an address and promises answers.

At times, schema.org emphasizes that it is just a website, and no more than that: “Schema.org is not a formal standards body. Schema.org is simply a site where we document the schemas that several major search engines will support.”

In its domain level naming, schema.org is a “dot-org,” about which Wikipedia notes “the domain is commonly used by schools, open-source projects, and communities, but also by some for-profit entities.” Schema.org shares a TLD with good samaritan organizations such as the Red Cross and the World Wildlife Fund. On first impression, schema.org appears to be a nonprofit charity of some sort.

While the majority of schema.org’s documentation appears on its website, it sometimes has used the “WebSchemas” wiki on the W3C’s domain. The W3C is well regarded for its work as a nonprofit organization. The not-for-profit image of the W3C’s hosting lends a halo of trust to the project.

In reality, the website is owned by Google.  All the content on the website is subject to the approval of Google employees involved with the project.  Google also provides the internal search engine for the site, the Google Programmable Search Engine.

“Who is” for schema.org

Schema.org as an organization

Despite schema.org’s disavowal of being a standards body, it does in fact create standards and needs an organizational structure to allow that to happen. Schema.org’s formal organization involves two tiers:

  1. A steering group of the four sponsoring companies 
  2. A W3C community group

Once again, the appearances and realities of these arrangements can be quite different.

The steering group

While the W3C community group gets the most attention, one needs to understand the steering group first. The steering group predates the community group and oversees it. “The day to day operations of schema.org, including decisions regarding the schema, are handled by a steering group” notes a FAQ. The ultimate decision-making authority for schema.org rests with this steering group.

The steering group was formed at the start of the initiative.  According to steering group members writing in the ACM professional journal, “in the first year, these sponsor companies made most decisions behind closed doors. It incrementally opened up…”  

There are conflicting accounts about who can participate in the steering group. The 2015 ACM article talks about “a steering committee [sic] that includes members from the sponsor companies, academia, and the W3C.” The website focuses on search engines as the stakeholders who steer the initiative: “Schema.org is a collaboration between Google, Microsoft, Yahoo! and Yandex – large search engines who will use this marked-up data from web pages. Other sites – not necessarily search engines – might later join.” A FAQ asks: “Can other websites join schema.org as partners and help decide what new schemas to support?” and the answer points to the steering committee governing this. “The regular Steering Group participants from the search engines” oversee the project. There have been at least two invited outside experts who have participated as non-regular participants, but the current involvement by outside participants in the steering group is not clear.

Schema.org projects the impression that it is a partnership of equals in the search field, but the reality belies that image. Even though the four search engines describe the steering group as a “collaboration,” the participation by sponsors seems unbalanced. With a 90% market share, Google dominates search overwhelmingly, and it has a far bigger interest in the outcomes than the other sponsors. Since schema.org was formed nearly a decade ago, Microsoft has shifted its focus away from consumer products, dropping smartphones and discontinuing its Cortana voice search, both products that would have used schema.org. Yahoo has ceased being an independent company and has been absorbed by Verizon, which is not focused on search. Without having access to the original legal agreement between the sponsors, it’s unclear why either of these companies continues to be involved in schema.org from a business perspective.

The steering group is chaired by a Google employee: Google Fellow R.V. Guha. “R.V. Guha of Google initiated schema.org and is one of its co-founders. He currently heads the steering group,” notes the website. Guha’s Wikipedia entry also describes him as being the creator of schema.org.

Concrete information on the steering group is sparse. There’s no information published about who is eligible to join, how schema.org is funded, and what criteria it uses to make decisions about what’s included in the vocabulary.

What is clear is that the regular steering group participation is limited to established search engines, and that Google has been chair.  Newer search engines such as DuckDuckGo aren’t members.  No publishers are members.  Other firms exploring information retrieval technologies such as knowledge graphs aren’t members either.  

The community group

In contrast to the sparse information about the steering group, there’s much more discussion about the W3C community group, which is described as “the main forum for the project.”  

The community group, unlike the steering group, has open membership. It operates under the umbrella of the W3C, “the main international standards organization for the World Wide Web,” in the words of Wikipedia. Google Vice President and Chief Internet Evangelist, Vint Cerf, referred to as a “father” of the internet, brokered the ties between schema.org and the W3C. “Vint Cerf helped establish the relations between schema.org and the W3C.” If schema.org does not wish to be viewed as a standard, it chose an odd partner in the W3C.

The W3C’s expectations for community groups are inspiring: “Community Groups enable anyone to socialize their ideas for the Web at the W3C for possible future standardization.” In the W3C’s vision, anyone can influence standards.

W3C’s explanation of community groups

The sponsors also promote the notion that the community group is open, saying the group “make[s] it easy for publishers/developers to participate.” (ACM)  

The vague word “participation” appears multiple times in schema.org literature: “In addition to people from the founding companies (Google, Microsoft, Yahoo and Yandex), there is substantial participation by the larger Web community.” The suggestion implied is that everyone is a participant with equal ability to contribute and decide.

While communities are open for all to join, that doesn’t mean that everyone is equal in decision making in schema.org’s case, notwithstanding the W3C’s vision. Everyone can participate, but not everyone can make decisions.

Publishers are inherently disadvantaged in the community process. Their suggestions are less important than those of search engines, which are the primary consumers of structured data. “As always we place high emphasis on vocabulary that is likely to be consumed, rather than merely published.”

Schema.org as a standard

Schema.org does not refer to itself as a standard, even though in practice it is one. Instead, schema.org relies on more developer-focused terminology: vocabulary, markup, and data models. It presents itself as a helpful tool for developers rather than as a set of rules they need to follow.

Schema.org aims to be monolithic, where no other metadata standard is needed or used. The totalistic name chosen, schema.org, suggests that no other schema is required. “For adoption, we need a simpler model, where publishers can be sure that a piece of vocabulary is indeed part of schema.org.”

The search engine sponsors discourage publishers from incorporating other semantic vocabularies together with schema.org. This means that only certain kinds of entities can be described and only certain details. So while schema.org aims to be monolithic, it can’t describe many of the kinds of details that are discussed in Wikipedia. The overwhelming focus is on products and services that promote the usage of search engines. The tight hold prevents other standards from emerging that are outside of the influence of schema.org’s direction.

Schema.org’s operating model is to absorb any competing standard that gains popularity. “We strongly encourage schema developers to develop and evangelize their schemas. As these gain traction, we will incorporate them into schema.org.” In doing this, groups avoid the burdens of developing on their own coverage of large domains involving fine details and requiring domain-specific expertise. Schema.org gets public recognition for offering coverage of these if it decides it would benefit schema.org’s sponsors. Schema.org has absorbed domain-specific vocabularies relating to autos, finance, and health, which allows search engines to present detailed information relating to these fields.

How Google shapes schema.org adoption

Google exerts enormous power over web publishing. Many webmasters and SEO specialists devote the majority of their time to satisfying the requirements that Google imposes on publishers and other businesses that need an online presence.

Google shapes the behavior of web publishers and other parties through a combination of carrots and sticks.

Carrots: Google the persuader

Because Google depends on structured data to attract users who will see its ads, it needs to encourage publishers to adopt schema.org.  The task is twofold:

  1. Encourage more adoption, especially by publishers that may not have had much reason to use schema.org
  2. Maintain the use of schema.org by existing publishers and keep up interest

Notably, how schema.org describes its benefits to publishers is different from how Google does.

According to schema.org, the goal is to “make it easier for webmasters to provide us with data so that we may better direct users to their sites.”  The official rationale for schema.org is to help search engines “direct” users to “sites” that aren’t owned by the search engine.

“When it is easier for webmasters to add markup, and search engines see more of the markup they need, users will end up with better search results and a better experience on the web.”   The official rationale, then, is that users benefit because they get better results from their search.  Because webmasters are seeking to direct people to come to their site, they will expect that the search results will direct users there.

“Search engines are using on-page markup in a variety of ways. These projects help you to surface your content more clearly or more prominently in search results.”  Again, the implied benefit of using schema.org is about search results: links people can click on to take them to other websites.

Finally, schema.org dangles a vaguer promise that parties other than search engines may use the data for the benefit of publishers: “since the markup is publicly accessible from your web pages, other organizations may find interesting new ways to make use of it as well.”  The ability of organizations other than search engines to use schema.org metadata is indeed a genuine possibility, though it’s one that hasn’t happened to any great extent.

When Google talks about the benefits, it is far more oblique.  The first appeal is to understanding: make sure Google understands your content.  “If you add schema.org markup to your HTML pages, many companies and products—including Google search—will understand the data on your site. Likewise, if you add schema.org markup to your HTML-formatted email, other email products in addition to GMail might understand the data.”  Google presents itself as one of “many” firms in the mix, rather than the dominant one.  Precisely what the payoff is from understanding is not explicitly stated.

The most tangible incentive that Google dangles to publishers to use schema.org is cosmetic: they gain a prettier display in search.  “Once Google understands your page data more clearly, it can be presented more attractively and in new ways in Google Search.”

Google refers to having a more desirable display as “rich snippets,” among other terms.  It has been promoting this benefit from the start of schema.org: “The first application to use this markup was Google’s Rich Snippets, which switched over to schema.org vocabulary in 2011.”

The question is how enticing this carrot is.  

Sticks: Google the enforcer

Google encourages adoption through more coercive measures as well.  It does so in three ways:

  1. Setting rules that must be followed
  2. Limiting pushback through vague guidelines and uncertainty
  3. Invoking penalties

Even though the schema.org standard is supposedly voluntary and not “normative” about what must be adopted, Google’s implementation is much more directive.

Google sets rules about how publishers must use schema.org.  Broadly, Google lays down three ultimatums:

  1. Publishers must use schema.org to appear favorably in search
  2. They can only use schema.org and no other standards that would be competitive with it
  3. They must supply data to Google in certain ways  

An example of a Google ultimatum relates to its insistence that only the schema.org vocabulary be used, and no others.  Google has even banned the use of another vocabulary that it once backed: the data-vocabulary.org vocabulary.  Google prefers to consolidate all web metadata descriptions into schema.org, which it actively controls.  “With the increasing usage and popularity of schema.org we decided to focus our development on a single SD [structured data] scheme.”  Publishers who continue to use the now-disfavored vocabulary face a series of threats.  Google here is making a unilateral decision about what kinds of metadata are acceptable to it.

Google imposes a range of threats and penalties for non-compliance.  These tactics are not necessarily specific to structured data.  Google has used such tactics to promote the use of its “AMP” standard for mobile content.  But these tactics are more significant in the context of the schema.org vocabulary, which is supposed to be a voluntary and public standard.

Google is opaque about how schema.org could influence rankings.  If it is used incorrectly, might your ranking be hurt, or might your site even disappear from results?

Screenshot of article on structured data in search
Example of anxiety about search rankings and schema.org usage

Google never suggests that schema.org can positively influence search rankings.  But it leaves open the possibility that not using it could negatively influence rankings.

Google’s threats and penalties relating to schema.org usage can be categorized into four tactics:

1. Warnings — messages in yellow that the metadata aren’t what Google expects

2. Errors — messages in red that Google won’t accept the structured data

3. Being ignored — a threat that the content won’t be prioritized by Google

4. Manual actions — a stern warning that the publisher will be sanctioned by Google

Manual actions are the death sentence that Google applies.  Publishers can appeal to Google to change its decision.  But ultimately Google decides what it wants, and unless Google reverses its prior decision, the publisher is ostracized from Google and won’t be found by anyone searching for them.  The publisher becomes persona non grata.

An example of a “manual action” sanction is if a publisher posts a job vacancy but there’s “no way to apply for the job” via the schema.org mechanism.  That doesn’t imply there’s no job.  It simply means that the poster of the job decided not to agree to Google’s terms: that they had to allow Google to let people apply from Google’s product, without the benefit of additional information that Google doesn’t allow to be included.

While publishers may not like how they are treated, Google makes sure they have no grounds to protest. Google manages publisher expectations around fairness by providing vague guidance and introducing uncertainty.

 A typical Google statement: “Providing structured data to enable an enhanced search feature is no guarantee that the page will appear with that designated feature; structured data only enables a feature to be displayed. Google tries to display the most appropriate and engaging results to a user, and it might be the case that the feature you code for is not appropriate for a particular user at a particular time.” Google is the sole arbiter of fairness.  

In summary, Google imposes rules on publishers while making sure those rules impose no obligations on Google.  Its market power allows it to do this.

How Google uses schema.org to consolidate market dominance

While there’s been much discussion by regulators, journalists, economists, and others about Google’s dominance in search, little of this discussion has focused on the role of schema.org in enabling this dominance.  But as we have seen, Google has rigged the standards-making process and pressed publishers to use schema.org under questionable pretenses.  It has been able to leverage the use of structured data based on the schema.org metadata standard to solidify its market position.

Google has used schema.org for web scraping and to build vertical products.  In both these cases, it is taking away opportunities from publishers who rely on Google and who in many cases are customers of Google’s ad business.

Web scraping 2.0 and content acquisition

Google uses schema.org to harvest content from publisher websites.  It promotes the benefits to publishers of having this content featured as a “rich snippet” or other kinds of “direct answer,” even if there’s no link to the website of the publisher contributing the information.

For many years governments and publishers around the world have accused Google of scraping content without consent.  This legally fraught area has landed Google in political trouble.  A far more desirable option for Google would be to scrape web content with implied consent.  Schema.org metadata is a perfect vector for Google to acquire vast quantities of content easily.  The publishers do the work of providing the data in a format that Google can easily find and process.  And they do this voluntarily, in the belief that it will help them attract more customers to their websites.  In many ways, this is a big con.  There’s growing evidence that publishers are disadvantaged when their content appears as a direct answer.  And it’s troubling to think they are doing so much work only for Google to take advantage of them.  In the absence of negotiating power toward Google, they consent to do things that may harm them.

In official financial disclosures filed with the US government, Google has acknowledged that it wants to avoid presenting links to other firms’ websites in its search results.  Instead, it uses schema.org data to provide “direct answers” (such as Rich Snippets) so that people stay on Google products.  “Instead of just showing ten blue links in our search results, we are increasingly able to provide direct answers — even if you’re speaking your question using Voice Search — which makes it quicker, easier and more natural to find what you’re looking for,” Google noted in an SEC filing last year.

The Markup notes how publishers may be penalized when their content appears as a “rich snippet” if the user is looking for external links: “Google further muddied the waters recently when it decided that if a non-Google website link appears in a scraped “featured snippet” module, it would remove that site from traditional results below (if it appeared there) to avoid duplication.” 

One study by noted SEO expert Rand Fishkin found that half of all Google searches result in zero clicks.


Google has developed extensive coverage of details relating to specific “verticals” or industry sectors.  These sectors typically have data-rich transactional websites that customers use to buy services, find things, or manage their accounts.  Google has been interested in limiting consumer  use of these websites and keeping consumers on Google products.  For example, instead of encouraging people who are searching for a job to visit a job listing website, Google would prefer them to explore job opportunities while staying on Google’s search page— whether or not all the relevant jobs are available without visiting another website.  Google is directly competing with travel booking websites, job listing websites, hotel websites, airline websites, and so on.  Nearly all these sites need to pay Google for ads to appear near the top of the search page, even though Google is working to undermine the ability of users to find and access links to these sites.

According to internal Google emails submitted to the US Congress, Google decided to compete directly with firms in specific industry sectors.  To quote these Google emails:

“What is the real threat if we don’t execute on verticals?”

a) “Loss of traffic from … .

b) Related revenue loss to high spend verticals like travel.

c) Missing oppty if someone else creates the platform to build verticals.

d) If one of our big competitors builds a constellation of high quality verticals, we are hurt badly.”

Examples of verticals where Google has encouraged schema.org to build out detailed metadata include:

  • Jobs and employers 
  • Books 
  • Movies 
  • Courses 
  • Events 
  • Products 
  • Voice-enabled content 
  • Subscription content 
  • Appointments at local businesses (restaurants, spas, health clubs)

Google employees are recognized as contributing the key work in the schema.org vocabulary in two areas pertaining to verticals:

  • “Major topics”  — coverage of verticals mentioned previously
  • The “actions vocabulary” which enables the bypassing of vertical websites to do tasks.

The actions vocabulary is an overlooked dimension of Google’s vertical strategy.  Actions let you complete transactions from within Google products without needing to access the websites of the service provider.  (Actions in the schema.org vocabulary are not related to Google’s Manual Actions sanctions discussed earlier.)  The schema.org website notes: “Sam Goto of Google did much of the work behind schema.org’s Actions vocabulary.”  This Google employee, who was responsible for the problem-space of actions, explained on his blog the barriers that consumers (and Google) face in completing actions across websites:

  • “the opentable/urbanspoon/grubhub can’t reserve *any* arbitrary restaurant, they represent specific ones”
  • “Netflix/Amazon/Itunes can’t stream *any* arbitrary movie, there is a specific set of movies available”
  • “Taxis have their own coverage/service area”
  • “ can’t check-in into UA flights”
  • “UPS can’t track USPS packages, etc.”
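
For readers who have not seen actions markup, here is a minimal sketch of how a publisher might describe a reservation action in schema.org terms.  The restaurant name and URL are hypothetical, and the Python helper is only an illustration of how such markup could be generated; it is not how Google or any particular publisher produces it.  The schema.org terms themselves (Restaurant, potentialAction, ReserveAction, EntryPoint, urlTemplate) are part of the published vocabulary.

```python
import json

# Hypothetical restaurant and reservation URL, for illustration only.
restaurant = {
    "@context": "https://schema.org",
    "@type": "Restaurant",
    "name": "Example Bistro",
    # potentialAction tells a consuming platform what can be done with this entity.
    "potentialAction": {
        "@type": "ReserveAction",
        "target": {
            "@type": "EntryPoint",
            "urlTemplate": "https://example.com/reserve",
        },
    },
}


def to_json_ld_script(data: dict) -> str:
    """Wrap a dict in a JSON-LD script element for embedding in a web page."""
    body = json.dumps(data, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


markup = to_json_ld_script(restaurant)
print(markup)
```

Once a page declares where and how a reservation can be made, whoever reads the markup can offer that reservation from its own interface, and the user never needs to visit the publisher’s site.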

These transactions target access to the most economically significant service industries.  While consumers will like a unified way to choose things, they still expect options about the platform they can use to make their choices. In principle, the actions could have opened up competition.  But given Google’s domination of search and smartphone platforms, the benefits of actions have filtered up to Google, not down to consumers.  They may not see all the relevant options they need.    Google has decided what choices are available to the user.

There’s a difference between having the option of using a unified access point, and having only one access point.  When a unified access point becomes the only access point, it becomes a monopoly.

Schema.org and the illusion of choice

Google relies on tactics of “self-favoring” — a concept that current antitrust rules don’t regulate effectively.  But a recent analysis noted the need to address the problem: “Google’s business model rests on recoupment against monetization of data blended with the selling of adjunct services. Deprived of access to this, Google might not have offered these services, nor maintained them, or even developed them in the first place.”

Google cares about schema.org because it is profitable to do so, both in the immediate term and in the long term.  Google makes money by directing customer focus to its ad-supported products, and it consolidates its stranglehold on the online market as potential competitors get squeezed by its actions.

But if Google couldn’t exploit its monopoly market position to make unfair profits from schema.org, would it care about schema.org?  That’s the question that needs to be tested.  If it did, it would be willing to fund schema.org with third party oversight, and allow its competitors and others economically impacted by schema.org to have a voice in its decision-making.  It would allow governments to review how Google uses the data it harvests from schema.org to present choices to customers.

— Michael Andrews