
Who benefits from schema.org?

Schema.org shapes the behavior of thousands of companies and millions of web users.  Even though few internet users have heard of it, the metadata standard exerts a huge influence on how people get information, and from whom they get it.  Yet the question of who precisely benefits from the standard gets little attention.  Why does the standard exist, and does everyone materially benefit from it equally?  While a metadata standard may sound like a technical mechanism generating impartial outcomes, metadata is not always fair in its implementation.

Google has a strong vested interest in the fate of schema.org — but it is not the only party affected.  Other parties need to feel there are incentives to support schema.org.  Should they feel they experience disincentives, that sentiment could erode support for the standard.

As schema.org has grown in influence over the past decade, that growth has been built upon a contradictory dynamic.  Google needs to keep two constituencies satisfied to grow the usage of Google products.  It needs publishers to use schema.org so that it has content for its products, and it needs that content to be enticing enough that consumers keep using Google products.  But to monetize its products, Google needs to control how schema.org content is acquired from publishers and presented to customers.  Both of these behaviors by Google act as disincentives to schema.org usage by publishers and consumers.

To a large extent, Google has managed this contradiction by making it difficult for various stakeholders to see how it influences their behavior. Google uses different terminology, different rationales, and even different personas to manage the expectations of stakeholders. How information about  schema.org is communicated does not necessarily match the reality of it in practice.  

Although schema.org still has a low public profile, more stakeholders are starting to ask questions about it. Should they use schema.org structured data at all?  Is how Google uses this structured data unfair?  

Assessing the fairness of schema.org involves looking at several inter-related issues:

  • What schema.org is
  • Who is affected by it 
  • How it benefits or disadvantages different stakeholders

What kind of entity is schema.org?

Before we can assess the value of schema.org to different parties, we need to answer a basic question: What is it, really? If everyone can agree on what it is we are referring to, it should be easier to see how it benefits various stakeholders.  What seems like a simple question defies a clear, simple answer.  There are multiple definitions of schema.org in circulation, supplied by schema.org, Google, and the W3C. The schema.org website refers to it as a “collaborative, community activity,” which doesn’t offer much precision. The most candid answer is that schema.org is a chameleon. It changes its color depending on its context.

Those familiar with the schema.org vocabulary might expect that schema.org’s structured data would provide us with an unambiguous answer to that question. A core principle of the schema.org vocabulary is to indicate an entity type.  The structured data would reveal the type of entity we are referring to. Paradoxically, the schema.org website doesn’t use structured data.  While it talks about schema.org, it never reveals through metadata what type of entity it is.
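
For illustration, a minimal, hypothetical declaration of the kind the vocabulary makes possible would look something like this (the type and values are my own guesses, since schema.org publishes no such markup about itself):

  <script type="application/ld+json">
  {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Schema.org",
    "url": "https://schema.org/",
    "description": "A collaborative, community activity that develops schemas for structured data."
  }
  </script>

Even a single @type statement like this would settle what kind of entity schema.org considers itself to be.  No such statement appears.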

We are forced to ascertain the meaning of schema.org by reading its texts.  When looking at the various ways it’s discussed, we can see that schema.org embodies four distinct identities.  It can be a:

  1. Brand
  2. Website
  3. Organization
  4. Standard

Who is affected by schema.org?

A second basic question: Who are the stakeholders affected by schema.org?  This includes not only who schema.org is supposed to be for, but also who gets overlooked or disempowered.  We can break these stakeholders into segments:

  • Google (the biggest contributor to schema.org and the biggest user of data described with the metadata)
  • Google’s search engine competitors who are partners (“sponsors”) in the schema initiative (Microsoft, Yahoo/Verizon, and Russia’s Yandex)
  • Firms that develop IT products or services other than search engines (consumer apps, data management tools) that could be competitive with search engines
  • Publishers of web content, which includes
    •  Commercial publishers who rely on search engines for revenues and in some cases may be competitors to Google products
    •  Non-commercial publishers (universities, non-profits, religious organizations, etc.)
  • Consumers and the wider public that encounter and rely on schema.org-described data
  • Professional service providers that advise others on using schema.org such as SEO consultants and marketing agencies
  • The W3C, which has lent its reputation and its hosting to schema.org

By looking at the different dimensions of schema.org and the different stakeholders, we can consider how each interacts.  

Schema.org is very much a Google project — one over which Google exerts considerable control.  But Google cannot appear to be doing so, and therefore relies on various ways of distancing itself from appearing to mandate decisions.

Schema.org as a brand

Schema.org may not seem like an obvious brand. There’s no logo.  The name, while reassuringly authoritative-sounding, does not appear to be trademarked.  Even the spelling is poorly managed: it is unclear whether the name is meant to be lowercase or capitalized, and both forms appear in the documentation, in some cases within the same paragraph.  (I mostly use lowercase, following examples from the earliest discussions of the standard.)

Brands are about the identity of people, products, or organizations.  A brand foremost is a story that generates an impression of the desirability and trustworthiness of a tangible thing.  The value of a brand is to attract favorable awareness and interest.   

Like many brands, schema.org has a myth about its founding, involving a journey of its heroes.

 Schema.org’s mythic story involves three acts: life before schema.org, the creation of schema.org, and the world after schema.org.

Life before schema.org is presented as chaotic.  Multiple semantic web standards were competing with one another.  Different search engines prioritized different standards.  Publishers such as Best Buy made choices on their own about which standard to adopt.  But many publishers were waiting on the sidelines, wondering what the benefits would be for them.

The schema.org founders present this period as confusing and grim.  But an alternate interpretation is that this early period of commercializing semantic web metadata was full of fresh ideas and possibilities.  Publishers rationally asked how they would benefit eventually, but there’s little to suggest they feared they would never benefit.  In short, the search engines were the ones complaining about having to deal with competition and diversity in standards. Complaints by publishers were few.

With the formation of schema.org, the search engines announced a new standard meant to end the existence of other general-coverage semantic web metadata standards.  This standard would vanquish the others and end the confusion about which one to follow. Schema.org subsumed or weeded out competing standards. With the major search engines no longer competing with one another and agreeing on a common standard, publishers would have clear expectations about what they were supposed to do.  The consolidation of semantic web standards into one is presented as inevitable.  This outcome is rationalized with the “TINA” justification: there is no alternative.  And there was no alternative for publishers, once the search engines collectively seized control of the semantic metadata standards process.

After schema.org consolidated the semantic web metadata universe, everyone benefits, in this narrative.  The use of semantic metadata has expanded dramatically.  The coverage of schema.org has become more detailed over time. These metrics demonstrate its success.  Moreover, the project has become a movement in which many people can now participate.  Schema.org positions itself as a force of enlightenment rising above the petty partisan squabbles that bedeviled other vocabularies in the past.  A semi-official history of schema.org states: “It would also be unrecognizable without the contributions made by members of the wider community who have come together via W3C.”

The schema.org brand story never questions other possibilities.  It assumes that competition was bad, rather than seeing it as representing a diversity of viewpoints that might have shaped things differently.  It assumes that the semantic web would never have been commercialized otherwise, instead of recognizing the alternative commercial possibilities that might have emerged from the activity and interest of other parties.

Any retrospective judgment that the commercialization of the semantic web would have failed to happen without schema.org consolidating things under the direction of the search engines is speculative history.  It’s possible that multiple vocabularies could have existed side by side and could have been understood.  Humans speak many languages.  There’s no inherent reason why machines can’t as well.  Language diversity fosters expressive diversity.

Schema.org as a website

Schema.org is a rare entity whose name is also a web address. 

If you want to visit schema.org, you head to the website.  There’s no schema.org convention or schema.org headquarters people can visit. If it isn’t clear who runs schema.org or how it works, at least the website provides palpable evidence that schema.org exists.   Even if it’s just a URL, it provides an address and promises answers.

At times, schema.org emphasizes that it is just a website — and no more than that: “Schema.org is not a formal standards body. Schema.org is simply a site where we document the schemas that several major search engines will support.”

In its domain-level naming, schema.org is a “dot-org.”  Wikipedia notes that “the domain is commonly used by schools, open-source projects, and communities, but also by some for-profit entities.”  Schema.org shares the TLD with such good-samaritan organizations as the Red Cross and the World Wildlife Fund.  On first impression, schema.org appears to be a nonprofit charity of some sort.

While the majority of schema.org’s documentation appears on its website, it has sometimes used the “WebSchemas” wiki on the W3C’s domain: https://www.w3.org/wiki/WebSchemas.  The W3C is well regarded for its work as a nonprofit organization.  The not-for-profit image of the W3C’s hosting lends a halo of trust to the project.

In reality, the website is owned by Google.  All the content on the schema.org website is subject to the approval of Google employees involved with the schema.org project.  Google also provides the internal search engine for the site, the Google Programmable Search Engine.

The “whois” listing for the schema.org domain

Schema.org as an organization

Despite schema.org’s disavowal of being a standards body, it does in fact create standards and needs an organizational structure to allow that to happen.  

Schema.org’s formal organization involves two tiers:

  1. A steering group of the four sponsoring companies 
  2. A W3C community group

Once again, the appearances and realities of these arrangements can be quite different.

The steering group

While the W3C community group gets the most attention, one needs to understand the steering group first.  The steering group predates the community group and oversees it.  “The day to day operations of Schema.org, including decisions regarding the schema, are handled by a steering group” notes a FAQ.  The ultimate decision-making authority for schema.org rests with this steering group.

The steering group was formed at the start of the schema.org initiative.  According to steering group members writing in the ACM professional journal, “in the first year, these sponsor companies made most decisions behind closed doors. It incrementally opened up…”  

There are conflicting accounts about who can participate in the steering group.  The 2015 ACM article talks about “a steering committee [sic] that includes members from the sponsor companies, academia, and the W3C.”   The schema.org website focuses on search engines as the stakeholders who steer the initiative: “Schema.org is a collaboration between Google, Microsoft, Yahoo! and Yandex – large search engines who will use this marked-up data from web pages. Other sites – not necessarily search engines – might later join.”  A schema.org FAQ asks: “Can other websites join schema.org as partners and help decide what new schemas to support?” and the answer points to the steering committee governing this.  “The regular Steering Group participants from the search engines” oversee the project.  There have been at least two invited outside experts who have participated as non-regular participants, but the current involvement by outside participants in the steering group is not clear.

Schema.org projects the impression that it is a partnership of equals in the search field, but the reality belies that image. Even though the four search engines describe the steering group as a “collaboration,” the participation by sponsors seems unbalanced. With a 90% market share, Google’s dominance of search is overwhelming, and it has a far bigger interest in the outcomes than the other sponsors.  Since schema.org was formed nearly a decade ago, Microsoft has shifted its focus away from consumer products: dropping smartphones and discontinuing its Cortana voice search — both products that would have used schema.org.  Yahoo has ceased being an independent company and has been absorbed by Verizon, which is not focused on search.  Without access to the original legal agreement between the sponsors, it’s unclear why either of these companies continues to be involved in schema.org from a business perspective.

The steering group is chaired by a Google employee: Google Fellow R.V. Guha. “R.V. Guha of Google initiated schema.org and is one of its co-founders. He currently heads the steering group,” notes the schema.org website.  Guha’s Wikipedia entry also describes him as being the creator of schema.org. 

Concrete information on the steering group is sparse.  There’s no information published about who is eligible to join, how schema.org is funded, and what criteria it uses to make decisions about what’s included in the vocabulary.  

What is clear is that the regular steering group participation is limited to established search engines, and that Google has been chair.  Newer search engines such as DuckDuckGo aren’t members.  No publishers are members.  Other firms exploring information retrieval technologies such as knowledge graphs aren’t members either.  

The community group

In contrast to the sparse information about the steering group, there’s much more discussion about the W3C community group, which is described as “the main forum for the project.”  

The community group, unlike the steering group, has open membership.  It operates under the umbrella of the W3C, “the main international standards organization for the World Wide Web,” in the words of Wikipedia.  Google Vice President and Chief Internet Evangelist Vint Cerf, referred to as a “father” of the internet, brokered the ties between schema.org and the W3C: “Vint Cerf helped establish the relations between Schema.org and the W3C.”  If schema.org does not wish to be viewed as a standard, it chose an odd partner in the W3C.

The W3C’s expectations for community groups are inspiring: ”Community Groups enable anyone to socialize their ideas for the Web at the W3C for possible future standardization. “  In the W3C’s vision, anyone can influence standards.  

The W3C’s explanation of community groups

The sponsors also promote the notion that the community group is open, saying the group “make[s] it easy for publishers/developers to participate.” (ACM)  

The vague word “participation” appears multiple times in schema.org literature:  “In addition to people from the founding companies (Google, Microsoft, Yahoo and Yandex), there is substantial participation by the larger Web community.”  The implication is that everyone is a participant with equal ability to contribute and decide.

While the community group is open for anyone to join, that doesn’t mean that everyone is equal in decision-making in schema.org’s case — notwithstanding the W3C’s vision.  Everyone can participate, but not everyone can make decisions.

Publishers are inherently disadvantaged in the community process.  Their suggestions are less important than those of search engines, who are the primary consumer of schema.org structured data.  “As always we place high emphasis on vocabulary that is likely to be consumed, rather than merely published.”

Schema.org as a standard

Schema.org does not refer to itself as a standard, even though in practice it is one.  Instead, schema.org relies on more developer-focused terminology: vocabulary, markup, and data models.  It presents itself as a helpful tool for developers rather than as a set of rules they need to follow.  

Schema.org aims to be monolithic, where no other metadata standard is needed or used. The totalistic name chosen — schema.org — suggests that no other schema is required.  “For adoption, we need a simpler model, where publishers can be sure that a piece of vocabulary is indeed part of Schema.org.”

The search engine sponsors discourage publishers from incorporating other semantic vocabularies alongside schema.org.  This means that only certain kinds of entities, and only certain details, can be described.  So while schema.org aims to be monolithic, it can’t describe many of the kinds of details that are discussed in Wikipedia.  The overwhelming focus is on products and services that promote the usage of search engines.  This tight hold prevents other standards from emerging outside the influence of schema.org’s direction.

Schema.org’s operating model is to absorb any competing standard that gains popularity.  “We strongly encourage schema developers to develop and evangelize their schemas. As these gain traction, we will incorporate them into schema.org.”  In doing this, schema.org avoids the burden of developing, on its own, coverage of large domains that involve fine details and require domain-specific expertise.  Schema.org gets public recognition for offering coverage of these domains when it decides doing so would benefit its sponsors.  Schema.org has absorbed domain-specific vocabularies relating to autos, finance, and health, which allows search engines to present detailed information relating to these fields.

How Google shapes schema.org adoption

Google exerts enormous power over web publishing.  Many webmasters and SEO specialists devote the majority of their time to satisfying the requirements that Google imposes on publishers and other businesses that need an online presence.

Google shapes the behavior of web publishers and other parties through a combination of carrots and sticks.

Carrots: Google the persuader

Because Google depends on structured data to attract users who will see its ads, it needs to encourage publishers to adopt schema.org.  The task is twofold: 

  1. Encourage more adoption, especially by publishers that may not have had much reason to use schema.org 
  2. Maintain the use of schema.org by existing publishers and keep up interest

Notably, how schema.org describes its benefits to publishers is different from how Google does.  

According to schema.org, the goal is to “make it easier for webmasters to provide us with data so that we may better direct users to their sites.”  The official rationale for schema.org is to help search engines “direct” users to “sites” that aren’t owned by the search engine.   

“When it is easier for webmasters to add markup, and search engines see more of the markup they need, users will end up with better search results and a better experience on the web.”   The official schema.org rationale, then, is that users benefit because they get better results from their search.  Because webmasters are seeking to direct people to come to their site, they will expect that the search results will direct users there.

“Search engines are using on-page markup in a variety of ways. These projects help you to surface your content more clearly or more prominently in search results.”   Again, the implied benefit of using schema.org is about search results — links people can click on to take them to other websites.  

Finally, schema.org dangles a vaguer promise that parties other than search engines may use the data for the benefit of publishers: “since the markup is publicly accessible from your web pages, other organizations may find interesting new ways to make use of it as well.”  The ability of organizations other than search engines to use schema.org metadata is indeed a genuine possibility, though it’s one that hasn’t happened to any great extent.

When Google talks about the benefits, it is far more oblique.  The first appeal is to understanding: make sure Google understands your content.  “If you add schema.org markup to your HTML pages, many companies and products—including Google search—will understand the data on your site. Likewise, if you add schema.org markup to your HTML-formatted email, other email products in addition to GMail might understand the data.”  Google positions itself as one of “many” firms in the mix, rather than the dominant one.  Precisely what the payoff of being understood might be is never explicitly stated.

The most tangible incentive that Google dangles to publishers to use schema.org is cosmetic: they gain a prettier display in search.  “Once Google understands your page data more clearly, it can be presented more attractively and in new ways in Google Search.”

Google refers to having a more desirable display as “rich snippets,” among other terms.  It has been promoting this benefit from the start of schema.org: “The first application to use this markup was Google’s Rich Snippets, which switched over to Schema.org vocabulary in 2011.” 
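
To make the carrot concrete, this is roughly the kind of markup involved (a hedged sketch using the Recipe type, one of the earliest rich snippet formats; all of the values are invented):

  <script type="application/ld+json">
  {
    "@context": "https://schema.org",
    "@type": "Recipe",
    "name": "Simple Banana Bread",
    "author": { "@type": "Person", "name": "Example Baker" },
    "aggregateRating": { "@type": "AggregateRating", "ratingValue": "4.7", "reviewCount": "214" },
    "totalTime": "PT1H15M"
  }
  </script>

With markup like this, Google may show the rating stars and preparation time beneath the page’s listing, though, as discussed below, nothing obliges it to.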

The question is how enticing this carrot is.  

Sticks: Google the enforcer

Google encourages schema.org adoption through more coercive measures as well.  It does so in three ways:

  1. Setting rules that must be followed
  2. Limiting pushback through vague guidelines and uncertainty
  3. Invoking penalties

Even though the schema.org standard is supposedly voluntary and not “normative” about what must be adopted, Google’s implementation is much more directive.  

Google sets rules about how publishers must use schema.org.  Broadly, Google lays down three ultimatums:

  1. Publishers must use schema.org to appear favorably in search
  2. They can only use schema.org and no other standards that would be competitive with it
  3. They must supply data to Google in certain ways  

An example of a Google ultimatum relates to its insistence that only the schema.org vocabulary be used — and no others.  Google has even banned the use of another vocabulary that it once backed: data-vocabulary.org.  Google prefers to consolidate all web metadata descriptions into schema.org, which it actively controls.  “With the increasing usage and popularity of schema.org we decided to focus our development on a single SD [structured data] scheme.”  Publishers who continue to use the now-disfavored vocabulary face a series of threats.  Google here is making a unilateral decision about what kinds of metadata are acceptable to it.

Google imposes a range of threats and penalties for non-compliance.  These tactics are not necessarily specific to schema.org structured data.   Google has used such tactics to promote the use of its “AMP” standard for mobile content.  But these tactics are more significant in the context of the schema.org vocabulary, which is supposed to be a voluntary and public standard.  

Google is opaque about how schema.org could influence rankings.  If it is used incorrectly, might your ranking be hurt, or your site even disappear?

Example of anxiety about search rankings and schema.org usage

Google never suggests that schema.org can positively influence search rankings.  But it leaves open the possibility that not using it could negatively influence rankings.  

Google’s threats and penalties relating to schema.org usage can be categorized into four tactics:

1. Warnings — messages in yellow that the metadata aren’t what Google expects

2. Errors — messages in red that Google won’t accept the structured data

3. Being ignored — a threat that the content won’t be prioritized by Google

4. Manual actions — a stern warning that the publisher will be sanctioned by Google

Manual actions are the death sentence that Google applies.   Publishers can appeal to Google to change its decision.  But ultimately Google decides what it wants and without a reversal of Google’s prior decision, the publisher is ostracized from Google and won’t be found by anyone searching for them.  The publisher becomes persona non grata. 

An example of a “manual action” sanction is when a publisher posts a job vacancy but there’s “no way to apply for the job” via the schema.org mechanism.  That doesn’t mean there’s no job — it simply means that the poster of the job decided not to agree to Google’s terms: that they had to allow people to apply from within Google’s product, without the benefit of additional information that Google doesn’t allow to be included.
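
For reference, a hedged sketch of the kind of markup at issue (the property names come from schema.org’s JobPosting type; the organization, dates, location, and URL are invented):

  <script type="application/ld+json">
  {
    "@context": "https://schema.org",
    "@type": "JobPosting",
    "title": "Content Strategist",
    "datePosted": "2021-03-01",
    "validThrough": "2021-04-30",
    "hiringOrganization": { "@type": "Organization", "name": "Example Co" },
    "jobLocation": { "@type": "Place", "address": { "@type": "PostalAddress", "addressLocality": "Lisbon", "addressCountry": "PT" } },
    "url": "https://www.example.com/jobs/content-strategist"
  }
  </script>

Whether a posting like this earns a manual action hinges not on the markup itself but on whether the destination meets Google’s expectations about how applicants can apply.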

While publishers may not like how they are treated, Google makes sure they have no grounds to protest. Google manages publisher expectations around fairness by providing vague guidance and introducing uncertainty.

 A typical Google statement: “Providing structured data to enable an enhanced search feature is no guarantee that the page will appear with that designated feature; structured data only enables a feature to be displayed. Google tries to display the most appropriate and engaging results to a user, and it might be the case that the feature you code for is not appropriate for a particular user at a particular time.” Google is the sole arbiter of fairness.  

In summary, Google imposes rules on publishers while making sure those rules impose no obligations on Google.  Its market power allows it to do this.

How Google uses schema.org to consolidate market dominance

While there’s been much discussion by regulators, journalists, economists, and others about Google’s dominance in search, little of this discussion has focused on the role of schema.org in enabling this dominance.  But as we have seen, Google has rigged the standards-making process and pressed publishers to use schema.org under questionable pretenses.  It has been able to leverage the use of structured data based on the schema.org metadata standard to solidify its market position.

Google has used schema.org for web scraping and to build vertical products.  In both these cases, they are taking away opportunities from publishers who rely on Google and who in many cases are customers of Google’s ad business.  

Web scraping 2.0 and content acquisition

Google uses schema.org to harvest content from publisher websites.  It promotes the benefits to publishers of having this content featured as a “rich snippet” or other kinds of “direct answer” — even if there’s no link to the website of the publisher contributing the information.  

For many years, governments and publishers around the world have accused Google of scraping content without consent.  This legally fraught area has landed Google in political trouble.  A far more desirable option for Google is to scrape web content with implied consent.  Schema.org metadata is a perfect vector for Google to acquire vast quantities of content easily.  The publishers do the work of providing the data in a format that Google can easily find and process.  And they do this voluntarily — in the belief that it will help them attract more customers to their websites.  In many ways, this is a big con.  There’s growing evidence that publishers are disadvantaged when their content appears as a direct answer.  And it’s troubling that they do so much work only for Google to take advantage of them.  In the absence of negotiating power with Google, they consent to do things that may harm them.

In official financial disclosures filed with the US government, Google has acknowledged that it wants to avoid presenting links to other firms’ websites in its search results.  Instead, it uses schema.org data to provide “direct answers” (such as Rich Snippets) so that people stay on Google products.  “Instead of just showing ten blue links in our search results, we are increasingly able to provide direct answers — even if you’re speaking your question using Voice Search — which makes it quicker, easier and more natural to find what you’re looking for,” Google noted in an SEC filing last year.

The Markup notes how publishers may be penalized when their content appears as a “rich snippet” if the user is looking for external links: “Google further muddied the waters recently when it decided that if a non-Google website link appears in a scraped “featured snippet” module, it would remove that site from traditional results below (if it appeared there) to avoid duplication.” 

One study by noted SEO expert Rand Fishkin found that half of all Google searches result in zero clicks.

Verticals

Google has developed extensive coverage of details relating to specific “verticals” or industry sectors.  These sectors typically have data-rich transactional websites that customers use to buy services, find things, or manage their accounts.  Google has been interested in limiting consumer use of these websites and keeping consumers on Google products.  For example, instead of encouraging people who are searching for a job to visit a job listing website, Google would prefer them to explore job opportunities while staying on Google’s search page — whether or not all the relevant jobs are available without visiting another website.  Google is directly competing with travel booking websites, job listing websites, hotel websites, airline websites, and so on.  Nearly all these sites need to pay Google for ads to appear near the top of the search page, even though Google is working to undermine the ability of users to find and access links to these sites.

According to internal Google emails submitted to the US Congress, Google decided to compete directly with firms in specific industry sectors.  To quote these Google emails:

“What is the real threat if we don’t execute on verticals?”

a) “Loss of traffic from google.com … .

b) Related revenue loss to high spend verticals like travel.

c) Missing oppty if someone else creates the platform to build verticals.

d) If one of our big competitors builds a constellation of high quality verticals, we are hurt badly.”

Examples of verticals where Google has encouraged schema.org to build out detailed metadata include: 

  • Jobs and employers 
  • Books 
  • Movies 
  • Courses 
  • Events 
  • Products 
  • Voice-enabled content 
  • Subscription content 
  • Appointments at local businesses (restaurants, spas, health clubs)

Google employees are recognized as contributing the key work in the schema.org vocabulary in two areas pertaining to verticals:

  • “Major topics”  — coverage of verticals mentioned previously
  • The “actions vocabulary,” which enables the bypassing of vertical websites to complete tasks.

The actions vocabulary is an overlooked dimension of Google’s vertical strategy.  Actions let you complete transactions from within Google products without needing to access the websites of the service provider.  (Actions in the schema.org vocabulary are not related to Google’s Manual Actions sanctions discussed earlier.)  The schema.org website notes: “Sam Goto of Google did much of the work behind schema.org’s Actions vocabulary.”  This Google employee, who was responsible for the problem space of actions, explained on his blog the barriers that prevent consumers (and Google) from completing actions across websites (a sketch of the Actions markup itself follows this list):

  • “the opentable/urbanspoon/grubhub can’t reserve *any* arbitrary restaurant, they represent specific ones”
  • “Netflix/Amazon/Itunes can’t stream *any* arbitrary movie, there is a specific set of movies available”
  • “Taxis have their own coverage/service area”
  • “AA.com can’t check-in into UA flights”
  • “UPS can’t track USPS packages, etc.”
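
A hedged sketch of how the Actions vocabulary addresses the first of those barriers (the names and URLs are invented): the potentialAction property tells a consuming application, such as Google, where a reservation can be completed on the user’s behalf.

  <script type="application/ld+json">
  {
    "@context": "https://schema.org",
    "@type": "Restaurant",
    "name": "Example Bistro",
    "potentialAction": {
      "@type": "ReserveAction",
      "target": {
        "@type": "EntryPoint",
        "urlTemplate": "https://reservations.example.com/book?restaurant=123",
        "actionPlatform": "https://schema.org/DesktopWebPlatform"
      },
      "result": { "@type": "FoodEstablishmentReservation" }
    }
  }
  </script>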

These transactions target access to the most economically significant service industries.  While consumers may like a unified way to choose things, they still expect options about the platform they use to make their choices.  In principle, actions could have opened up competition.  But given Google’s domination of search and smartphone platforms, the benefits of actions have filtered up to Google, not down to consumers.  Consumers may not see all the relevant options they need.  Google has decided what choices are available to the user.

There’s a difference between having the option of using a unified access point, and having only one access point.  When a unified access point becomes the only access point, it becomes a monopoly.

Schema.org and the illusion of choice

Google relies on tactics of “self-favoring” — a concept that current antitrust rules don’t regulate effectively.  But a recent analysis noted the need to address the problem: “Google’s business model rests on recoupment against monetization of data blended with the selling of adjunct services. Deprived of access to this, Google might not have offered these services, nor maintained them, or even developed them in the first place.”

Google cares about schema.org because it is profitable to do so — both in the immediate term and in the long term.  Google makes money by directing customer attention to its ad-supported products, and it consolidates its stranglehold on the online market as potential competitors get squeezed by its actions.

But if Google couldn’t exploit its monopoly market position to make unfair profits from schema.org, would it care about schema.org?  That’s the question that needs to be tested.  If it genuinely cared, it would be willing to fund schema.org with third-party oversight, and to allow its competitors and others economically impacted by schema.org to have a voice in schema.org decision-making.  It would allow governments to review how Google uses the data it harvests from schema.org to present choices to customers.

— Michael Andrews


Multi-source Publishing: the Next Evolution

Most organizations that create web content primarily focus on how to publish and deliver the content to audiences directly.  In this age where “everyone is a publisher,” organizations have become engrossed in how to form a direct relationship with audiences, without a third party intermediary.  As publishers try to cultivate audiences, some are noticing that audience attention is drifting away from their website.  Increasingly, content delivery platforms are collecting and combining content from multiple sources, and presenting such integrated content to audiences to provide a more customer-centric experience.  Publishers need to consider, and plan for, how their content will fit in an emerging framework of integrated, multi-source publishing.

The Changing Behaviors of Content Consumption: from bookmarks to snippets and cards

Bookmarks were once an important tool to access websites. People wanted to remember great sources of content, and how to get to them.  A poster child for the Web 2.0 era was a site called Delicious, which combined bookmarking with a quaint labelling approach called a folksonomy.  Earlier this year, Delicious, abandoned and forgotten, was sold at a fire sale for a few thousand dollars for the scrap value of its legacy data.

People have largely stopped bookmarking sites.  I don’t even know how to use them on my smartphone.  It seems unnecessary to track websites anymore.  People expect information they need to come to them.  They’ve become accustomed to seeing snippets and cards that surface in lists and timelines within their favorite applications.

Delicious represents the apex of the publisher centric era for content.  Websites were king, and audiences collected links to them.

Single Source Publishing: a publisher centric approach to targeting information

In the race to become the best source of information — the top bookmarked website — publishers have struggled with how a single website can successfully please a diverse range of audience needs.  As audience expectations grew, publishers sought to create more specific web pages that would address the precise informational needs of individuals.  Some publishers embraced single source publishing.  Single source publishing assembles many different “bundles” of content that all come from the same publisher.  The publisher uses a common content repository (a single source) to create numerous content variations.  Audiences benefit when able to read custom webpages that address their precise needs.  Provided the audience locates the exact variant of information they need, they can bookmark it for later retrieval.

By using single source publishing, publishers have been able to dramatically increase the volume of webpages they produce.  That content, in theory, is much more targeted.  But the escalating volume of content has created new problems.  Locating specific webpages with relevant information in a large website can be as challenging as finding relevant information on more generic webpages within a smaller website.  Single source publishing, by itself, doesn’t solve the information hunting problem.

The Rise of Content Distribution Platforms: curated content

As publishers focused on making their websites king of the hill, audiences were finding new ways to avoid visiting websites altogether.  Over the past decade, content aggregation and distribution platforms have become the first port of call for audiences seeking information.  Such platforms include social media such as Facebook, Snapchat, Instagram and Pinterest, aggregation apps such as Flipboard and Apple News, and a range of Google products and apps.  In many cases, audiences get all the information they need while within the distribution or aggregation platform, with no need to visit the website hosting the original content.

Hipmunk aggregates content from other websites, as well as from other aggregators.

The rise of distribution platforms mirrors broader trends toward customer-driven content consumption. Audiences are reluctant to believe that any single source of content provides comprehensive and fully credible information.  They want easy access to content from many sources.  An early example of this trend was travel aggregators that allow shoppers to compare airfares and hotel rates from different vendor websites.  The travel industry has fought hard to counter this trend, with limited success.  Audiences are reluctant to rely on a single source such as an airline or hotel website to make choices about their plans.  They want options.  They want to know what different websites are offering, and to compare these options.  They also want to know the range of perspectives on a topic. Various review and opinion websites such as Rotten Tomatoes present the judgments from different websites.

The movie review site Rotten Tomatoes republishes snippets of reviews from many websites.

Another harbinger of the future has been the evolution of Google search away from its original purpose of presenting links to websites, and toward providing answers.  Consider Google’s “featured snippets,” which interpret user queries and provide a list of related questions and answers.   Featured snippets are significant in two respects:

  1. They present answers on the Google platform, instead of taking the user to the publisher’s website.
  2. They show different related questions and answers, meaning the publisher has less control over framing how users consider a topic.

Google’s “featured snippets” presents related questions together, with answers using content extracted directly from different websites.

Google draws on content from many different websites, and combines the content together.  Google scrapes the content from different webpages, and reuses content as it decides will be in the best interest of Google searchers.  Website publishers can’t ask Google to be in a featured snippet.  They need to opt out with a <meta name="googlebot" content="nosnippet"> tag if they don’t want their content used by Google in such snippets.  These developments illustrate how publishers no longer control exactly how their content is viewed.

A Copernican Revolution Comes to Publishing

Despite lip service to the importance of the customer, many publishers still have a publisher centric mentality that imagines customers orbiting around them.  The publisher considers itself as the center of the customer’s universe.  Nothing has changed: customers are seeking out the publisher’s content, visiting the publisher’s website.  Publishers still expect customers to come to them. The customer is not at the center of the process.

Publishers do acknowledge the role of Facebook and Google in driving traffic, and more of them publish directly on these platforms.  Yet such measures fall short of genuine customer-centricity.  Publishers still want to talk uninterrupted, instead of contributing information that will fill in the gaps in the audience’s knowledge and understanding.  They expect audiences to read or view an entire article or presentation, even if that content contains information the audience already knows.

A publisher-centric mentality assumes the publisher can be, and will be, the one best source of information, covering everything important about the topic.  The publisher decides what it believes the audience needs to know, then proceeds to tell the audience about all those things.

A customer-centric approach to content, in contrast, expects and accepts that audiences will be viewing many sources of content.  It recognizes that no one source of content will be complete or definitive.  It assumes that the customer already has prior knowledge about a topic, which may have been acquired from other sources.  It also assumes that audiences don’t want to view redundant information.

Let’s consider content needs from an audience perspective.  Earlier this month I was on holiday in Lisbon.  I naturally consulted travel guides to the city from various sources such as Lonely Planet, Rough Guides and Time Out.  Which source was best?  While each source did certain things slightly better than their rivals, there wasn’t a big difference in the quality of the content.  Travel content is fairly generic: major sources approach information in much the same way.  But while each source was similar, they weren’t identical.  Lisbon is a large enough city that no one guide could cover it comprehensively.  Each guide made its own choices about what specific highlights of the city to include.

As a consumer of this information, I wanted the ability to merge and compare the different entries from each source.  Each source has a list of “must see” attractions.  Which attractions are common to all sources (the standards), and which are unique to one source (perhaps more special)?  For the specific neighborhood where I was staying, each guide could only list a few restaurants.  Did any restaurants get multiple mentions, which perhaps indicated exquisite food, but also possibly signaled a high concentration of tourists? As a visitor to a new city, I want to know about what I don’t know, but also want to know about what others know (and plan to do), so I can plan with that in mind.  Some experiences are worth dealing with crowds; others aren’t.

The situation with travel content applies to many content areas.  No one publisher has comprehensive and definitive information, generally speaking.  People by and large want to compare perspectives from different sources.  They find it inconvenient to bounce between different sources.  As the Google featured snippets example shows, audiences gravitate toward sources that provide convenient access to content drawing on multiple sources.

A publisher-centric attitude is no longer viable. Publishers that expect audiences to read through monolithic articles on their websites will find audiences less inclined to make that effort.  The publishers that will win audience attention are those who can unbundle their content, so that audiences can get precisely what they want and need (perhaps as a snippet on a card on their smartphone).

Platforms have re-intermediated the publishing process, inserting themselves between the publisher and the audience.  Audiences are now more loyal to a channel that distributes content than they are loyal to the source creating the content.  They value the convenience of one-stop access to content.  Nonetheless, the role of publishers remains important.  Customer-centric content depends on publishers. To navigate these changes, publishers need to understand the benefit of unbundling content, and how it is done.

Content Unbundling, and playing well with others

Audiences face a rich menu of choices for content. For most publishers, it is unrealistic to aspire to be the single best source of content, with the notable exception of when you are discussing your own organization and products.  Even in these cases, audiences will often be considering content from other organizations that competes with your own content.

CNN’s view of different content platforms where their audiences may be spending time. Screenshot via Tow Center report on the Platform Press.

Single source publishing is best suited for captive audiences, when you know the audience is looking for something specific, from you specifically.  Enterprise content about technical specifications or financial results are good candidates for single source publishing.  Publishers face a more challenging task when seeking to participate in the larger “dialog” that the audience is having about a topic not “owned” by a brand.  For most topics, audiences consult many sources of information, and often discuss this information among themselves. Businesses rely on social media, for example, finding forums where different perspectives are discussed, and inserting teasers with links to articles.  But much content consumption happens outside of active social media discussions, where audiences explicitly express their interests.  Publishers need more robust ways to deliver relevant information when people are scanning content from multiple sources.

Consumers want all relevant content in one place. Publishers must decide where that one place might be for their audiences.  Sometimes consumers will look to topic-specific portals that aggregate perspectives from different sources.  Other times consumers will rely on generic content delivery platforms to gather preliminary information. Publishers need their content to be prepared for both scenarios.

To participate in multi-source publishing, publishers need to prepare their content so it can be used by others.  They need to follow the Golden Rule: make it easy for others to incorporate your content in other content.  Part of that task is technical: providing the technical foundation for sharing content between different organizations.  The other part of the task is shifting  perspective, by letting go of possessiveness about content, and fears of loss of control.

Rewards and Risks of Multi-source publishing

Multi-source content involves a different set of risks and rewards than when distributing content directly.  Publishers must answer two key questions:

  1. How can publishers maximize the use of their content across platforms? (Pursue rewards)
  2. What conditions, if any, do they want to place on that use? (Manage risks)

More fundamentally, why would publishers want other platforms to display their content?  The benefits are manifold.  Other platforms:

  • Can increase reach, since these platforms will often get more traffic than one’s own website, and will generally offer incrementally more views of one’s content
  • May have better authority on a topic, since they combine information from multiple sources
  • May have superior algorithms that understand the importance of different informational elements
  • Can make it easier for audiences to locate specific content of interest
  • May have better contextual or other data about audiences, which can be leveraged to provide more precise targeting.

In short, multi-source publishing can reduce the information hunting problem that audiences face. Publishers can increase the likelihood that their content will be seen at opportune moments.

Publishers have a choice about what content to limit sharing, and what content to make easy to share.  If left unmanaged, some of their content will be used by other parties regardless, and not necessarily in ways the publisher would like.  If actively managed, the publisher can facilitate the sharing of specific content, or actively discourage use of certain content by others. We will discuss the technical dimensions shortly.  First, let’s consider the strategic dimensions.

When deciding how to position their content with respect to third party publishing and distribution, publishers need to be clear on the ultimate purpose of their content.  Is the content primarily about a message intended to influence a behavior?  Is the content primarily about forming a relationship with an audience and measuring audience interests?  Or is the content intended to produce revenues through subscriptions or advertising?

Publishers will want to control access to revenue-producing content, to ensure they capture the subscription or advertising revenues of that content, and do not let the revenue value benefit a free-rider.  They want to avoid unmanaged content reuse.

In the other two cases, more permissive access can make business sense.  Let’s call the first case the selective exposure of content highlights — for example, short tips that are related to the broader category of product you offer.  If the purpose of the content is to form a relationship, then it is important to attract interest in your perspectives, and to demonstrate the brand’s expertise and helpfulness.  Some information and messages can be highlighted by third-party platforms, and audiences can see that your brand is trying to be helpful.  Some of these viewers, who may not have been aware of your brand or website, may decide to click through to see the complete article.  Exposure through a platform to new audiences can be the start of new customer relationships.

The second case of promoted content relates to content about a brand, product or company. It might be a specification about a forthcoming product, a troubleshooting issue, or news about a store opening.  In cases where people are actively seeking out these details, or would be expected to want to be alerted to news about these issues, it makes sense to provide this information on whatever platform they are using directly.  Get their questions answered and keep them happy.  Don’t worry about trying to cross-sell them on viewing content about other things.  They know where to find your website if they need greater details.  The key metric to measure is customer satisfaction, not volume of articles read by customers. In this case, exposure through a platform to an existing audience can improve the customer relationship.

How to Enable Content to be Integrated Anywhere

Many pioneering examples of multi-source publishing, such as price comparison aggregators, job search websites, and Google’s featured snippets, have relied on a brute-force method of mining content from other websites.  They crawl websites, looking for patterns in the content, and extract relevant information programmatically.  Now, the rise of metadata standards for content, and their increased implementation by publishers, makes the task of assembling content derived from different sources easier.  Standards-based metadata can connect a publisher’s content to content elsewhere.

No one knows what new content distribution or aggregation platform will become the next Hipmunk or Flipboard.  But we can expect that aggregation platforms will continue to evolve and expand.  Data on content consumption behavior (e.g., hours spent each week by website, channel, and platform) indicates that customers increasingly favor consolidated and integrated content.  The technical effort needed to deliver content sourced from multiple websites is decreasing.  Platforms have a range of financial incentives to assemble content from other sources, including ad revenues, the development of comparative data metrics on customer interest in different products, and the opportunity to present complementary content about topics related to the content that’s being republished.  Provided your content is useful in some form to audiences, other parties will find opportunities to make money featuring your content.  Price comparison sites make money from vendors who pay for the privilege of appearing on their site.

To get in front of audiences as they browse content from different sources, a publisher needs to be able to merge its content into the audience’s feed or stream, whether that is a timeline, a list of search results, or a series of recommendations that appear as audiences scroll down their screen.  Two options are available to facilitate content merging:

  1. Planned syndication
  2. Discoverable reuse

Planned Syndication

Publishers can syndicate their content, and plan how they want others to use it.  The integration of content between different  publishers can be either tightly coupled, or loosely coupled.  For publishers who follow a single sourcing process, such as DITA, it is possible to integrate their content with content from other publishers, provided the other publishers follow the same DITA approach.  Seth Earley, a leading expert on content metadata, describes a use case for syndication of content using DITA:

“Manufacturers of mobile devices work through carriers like Verizon who are the distribution channels.   Content from an engineering group can be syndicated through to support who can in turn syndicate their content through marketing and through distribution partners.  In other words, a change in product support or technical specifications or troubleshooting content can be pushed off through channels within hours through automated and semi-automated updates instead of days or weeks with manual conversions and refactoring of content.”

While such tightly coupled approaches can be effective, they aren’t flexible, as they require all partners to follow a common, publisher-defined content architecture.  A more flexible approach is available when publisher systems are decoupled, and content is exchanged via APIs.  Content integration via APIs embraces a very different philosophy than  the single sourcing approach.  APIs define chunks of content to exchange flexibly, whereas single-sourcing approaches like DITA define chunks more formally and rigidly. While APIs can accommodate a wide range of source content based on any content architecture, single sourcing only allows content that conforms to a publisher’s existing content architecture.  Developers are increasingly using flexible microservices to make content available to different parties and platforms.

In the API model, publishers can expand the reach of their content two ways.  They can submit their content to other parties, and/or permit other parties to access and use their content.  The precise content they exchange, and the conditions under which it is exchanged, is defined by the API.  Publishers can define their content idiosyncratically when using an API, but if they follow metadata standards, the API will be easier to adopt and use.  The use of metadata standards in APIs can reduce the amount of special API documentation required.
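
As a hedged illustration (the endpoint and values are invented), an API response that reuses schema.org types and property names requires little extra explanation for any party that already knows the vocabulary:

  GET https://api.example-publisher.com/articles/4861

  {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "Ten Tips for Visiting Lisbon",
    "datePublished": "2017-10-12",
    "author": { "@type": "Person", "name": "Example Author" },
    "articleBody": "Lisbon rewards visitors who wander beyond the main squares..."
  }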

Discoverable Reuse

Many examples cited earlier involve the efforts of a single party, rather than the cooperation of two parties.  Platforms often acquire content from many sources without the active involvement of the original publishers.  When the original publisher of the content does not need to be involved with the reuse of their content, the content has the capacity to reach a wider audience, and be discovered in unplanned, serendipitous ways.

Aggregators and delivery platforms can bypass the original publisher two ways.  First, they can rely on crowdsourcing.  Audiences might submit content to the platform, such as Pinterest’s “pins”.  Users can pin images to Pinterest because the pages carrying these images include Open Graph or schema.org metadata.
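
A hedged example of the metadata involved (the titles and URLs are invented): Open Graph tags like these are what let a platform pull a usable title, image, and description from a page without any arrangement with the publisher.

  <meta property="og:title" content="Ten Tips for Visiting Lisbon" />
  <meta property="og:image" content="https://www.example.com/images/lisbon-tram.jpg" />
  <meta property="og:description" content="Highlights for a first visit to Lisbon." />
  <meta property="og:url" content="https://www.example.com/lisbon-tips" />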

Second, platforms and aggregators can discover content algorithmically. Programs can crawl websites to find interesting content to extract.  Web scraping, which was once done solely by search engines such as Google, has become easier and more widely available, due to the emergence of services such as Import.IO.  Aided by advances in machine learning, some web scraping tools don’t require any coding at all, though achieving greater precision requires some coding.  The content that is most easily discovered by crawlers is content described by metadata standards such as schema.org.  Tools can use simple regex or XPath expressions to extract specific content that is defined by metadata.
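
For instance (a hedged sketch; the itemprop name is only illustrative), two short XPath expressions are enough to pull structured content out of a crawled page once it carries metadata:

  • //script[@type='application/ld+json']/text() (returns every JSON-LD block on a page)
  • //*[@itemprop='price']/@content (returns a price declared with microdata markup)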

Influencing Third-party Re-use

Publishers can benefit when other parties want to re-publish their content, but they will also want to influence how their content is used by others.   Whether they actively manage this process by creating or accessing an API, or they choose not to directly coordinate with other parties, publishers can influence how others use their content through various measures:

  • They can choose what content elements to describe with metadata, which facilitates use of that content elsewhere
  • They can assert their authorship and copyright ownership of the content using metadata, to ensure that appropriate credit is given to the original source
  • They can indicate, using metadata, any content licensing requirements (see the sketch after this list)
  • For publishers using APIs, they can control access via API keys, and limit the usage allowed to a party
  • When the volume of re-use justifies it, publishers can explore revenue sharing agreements with platforms, as newspapers are doing with Facebook.
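
A hedged sketch of the attribution and licensing measures expressed as metadata (the names, URLs, and license choice are invented):

  <script type="application/ld+json">
  {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "Ten Tips for Visiting Lisbon",
    "author": { "@type": "Person", "name": "Example Author" },
    "copyrightHolder": { "@type": "Organization", "name": "Example Publisher" },
    "copyrightYear": "2017",
    "license": "https://creativecommons.org/licenses/by-nc/4.0/"
  }
  </script>

Any platform that reuses the content can read the author, the owner, and the licensing terms directly from the markup.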

Readers interested in these issues can consult my book, Metadata Basics for Web Content, for a discussion of rights and permissions metadata, which covers issues such as content attribution and licensing.

Where is Content Sourcing heading?

Digital web content in some ways is starting to resemble electronic dance music, where content gets “sampled” and “remixed” by others. The rise of content microservices, and of customer expectations for multi-sourced, integrated content experiences, are undermining the supremacy of the article as the defining unit of content.

For publishers accustomed to being in control, the rise of multi-source publishing represents a “who moved my cheese” moment.  Publishers need to adapt to a changing reality that is uncertain and diffuse. Unlike the parable about cheese, publishers have choices about how they respond.  New opportunities also beckon. This area is still very fluid, and eludes any simple list of best practices to follow.  Publishers would be foolish, however, to ignore the many signals that collectively suggest a shift away from individual websites and toward more integrated content destinations.  They need to engage with these trends to be able to capitalize on them effectively.

— Michael Andrews